Index and Search Your Sites with ht://Dig

Posted by Juliet Kemp on Fri, Jan 27, 2012
  
  

The venerable ht://Dig application lets you keep track of the text on a single website or a group of sites. Its various components create an index of search terms, provide an interface to search that index, and provide fuzzy matching algorithms and a handful of other options. While it's increasingly common to outsource website searching (often to Google), there are still advantages to controlling your own search engine for your website. Doing so gives you more control over the detail of how the search works, and you can search across multiple sub-sites, or not, as you prefer. Here's how to create a personal search engine for your personal domain.



Binary packages of ht://Dig are available for Debian- and Red Hat-based Linux distributions. When you install the application with your package manager, it should generate a basic /etc/htdig/htdig.conf file, which you'll need to edit to customize it for your site. The most important line to change specifies the URL that ht://Dig will start from when indexing:



 start_url:      http://my.site.com


The software begins spidering from this URL and follows the links it finds, which means that pages that aren't linked from elsewhere in your site (for example, entirely separate subsites) won't be indexed. You can intentionally avoid indexing certain pages too, as we'll see in a moment. The start_url attribute accepts a list of URLs, so if you want to spider other sites that are related but unlinked, you can specify them like this:




start_url: http://www.example1.com \
           http://www.example2.com


If you have multiple sites, another option is to set up separate configuration files for each – for instance, /etc/htdig/htdig-example1.conf and /etc/htdig/htdig-example2.conf. The format for these files is the same as for the default configuration file, but their databases will be stored separately. When you search these databases, as we'll see in the next section, you'll need to make sure that you specify which configuration file to use.
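As a rough sketch of the per-site approach, the script below writes one minimal config file per site into a scratch directory. The helper name write_site_conf, the scratch directory, and the per-site database_dir layout are assumptions for illustration; giving each file its own database_dir is one way to make sure the databases can't overwrite each other. A real file would more likely start from a copy of the packaged htdig.conf.

```shell
#!/bin/sh
# Sketch only: write one minimal ht://Dig config per site.
# write_site_conf is a hypothetical helper, not part of ht://Dig.
# Usage: write_site_conf NAME START_URL CONF_DIR DB_ROOT
write_site_conf() {
    name=$1 url=$2 conf_dir=$3 db_root=$4
    # \${start_url} is escaped so the literal text ${start_url} lands in the file
    cat > "$conf_dir/htdig-$name.conf" <<EOF
database_dir:   $db_root/$name
start_url:      $url
limit_urls_to:  \${start_url}
EOF
}

# Demo in a scratch directory so nothing under /etc is touched.
demo_dir=$(mktemp -d)
write_site_conf example1 http://www.example1.com "$demo_dir" /var/lib/htdig
write_site_conf example2 http://www.example2.com "$demo_dir" /var/lib/htdig
cat "$demo_dir/htdig-example1.conf"
```

Once you're happy with the generated files, you would copy them into /etc/htdig and create the matching database directories.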



Another important line, which limits the scope of the search, is this one:



limit_urls_to:     ${start_url}


This is the default setting, which indexes only pages on the site specified by the ${start_url} variable. Alternatively, again, you can specify a list of URLs to use for limiting. Any external pages (i.e. pages that don't match the specified URL or URLs) that your site links to won't be spidered, so you won't wind up indexing the whole Internet.



Even when limited to a single site or sites, the databases that ht://Dig generates can get quite large. Make sure that the directory specified in the configuration file's database_dir attribute has plenty of space. The default database location is /var/lib/htdig, but you can of course keep the databases anywhere you fancy.
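If you want a quick check before a big indexing run, something like the following sketch reports how much room is left on the filesystem that holds the databases. The helper name free_kib is hypothetical, and the demo line queries / only so that the script runs anywhere; on a real system you would pass your database_dir.

```shell
#!/bin/sh
# Sketch: print the free space, in KiB, on the filesystem holding a directory.
# free_kib is a hypothetical helper name.
free_kib() {
    # -P forces the portable one-line-per-filesystem format;
    # column 4 of that format is the available space
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example for a real install: free_kib /var/lib/htdig
free_kib /
```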



To improve performance, you can tell ht://Dig to spider local files where possible (rather than using HTTP) by telling it which directory corresponds to particular URLs:



local_urls:    http://my.site.com=/var/www/


This means that where possible the software will read files straight from disk, falling back to HTTP when it can't find a particular file locally.
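To make the mapping concrete, here is a small sketch of the prefix substitution that a local_urls entry describes. It is not ht://Dig's own code; the function name url_to_local is hypothetical, and the example assumes the URL prefix and the directory are both written with a trailing slash.

```shell
#!/bin/sh
# Sketch: map a URL to a local path by swapping a URL prefix for a directory.
# Usage: url_to_local URL URL_PREFIX LOCAL_DIR
url_to_local() {
    url=$1 prefix=$2 dir=$3
    case $url in
        "$prefix"*)
            rest=${url#"$prefix"}      # strip the matching prefix
            printf '%s%s\n' "$dir" "$rest"
            ;;
        *)
            return 1                   # no local mapping; such a URL goes over HTTP
            ;;
    esac
}

# Prints /var/www/docs/index.html
url_to_local http://my.site.com/docs/index.html http://my.site.com/ /var/www/
```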



Once your config is tweaked to your liking, you can generate the database by running the command



sudo rundig -a -c /etc/htdig/htdig.conf


The -a flag tells rundig to build the new databases as alternate work files, so the existing databases stay searchable while ht://Dig is spidering content. You'll need to run this command once for every config file you've set up.
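With several config files, a small wrapper saves some typing. The sketch below is one way to do it; reindex_all and the RUNDIG variable are illustrative names, and setting RUNDIG=echo gives you a dry run that only prints the arguments.

```shell
#!/bin/sh
# Sketch: rebuild the databases for several config files in turn.
# RUNDIG and reindex_all are illustrative names; set RUNDIG=echo for a dry run.
RUNDIG=${RUNDIG:-rundig}

reindex_all() {
    status=0
    for conf in "$@"; do
        if [ ! -r "$conf" ]; then
            echo "skipping unreadable config: $conf" >&2
            status=1
            continue
        fi
        # -a builds alternate work files so the current databases stay usable
        "$RUNDIG" -a -c "$conf" || status=1
    done
    return "$status"
}

# Example (run as root or via sudo):
#   reindex_all /etc/htdig/htdig-example1.conf /etc/htdig/htdig-example2.conf
```

The function keeps going after a failure and returns a non-zero status at the end, which is handy if you call it from cron and want one bad config not to block the others.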



rundig runs the ht://Dig components htdig to create the main database, htmerge to generate a document index and word database, and then htnotify and htfuzzy. htfuzzy allows ht://Dig to do fuzzy matching, in which the search algorithm also looks for synonyms and alternative endings for the terms searched for. By default, only synonym and ending searches are used, and these alternative searches are weighted less strongly when the search results are returned. To use more of the fuzzy search options, and to change their weighting, edit the search_algorithm directive in htdig.conf. The available options include substring searching and spelling error matching.



To keep your index current as your site changes, you should run rundig regularly. If you've installed ht://Dig from a Debian or Red Hat package, the installer should have installed a daily cronjob to do the trick for you. Otherwise, for a basic cronjob, save these lines as /etc/cron.daily/htdig and make the file executable:



#!/bin/sh
/usr/bin/rundig -a -c /etc/htdig/htdig.conf


Setting Up and Customizing Website Search


At this point you have the search database, but no way for a user to get at it. The next step is to set up your Apache web server appropriately so that users can search the site. For a single site, just add this line to your /etc/apache2/apache2.conf file (or create a file called htdig that contains this line in /etc/apache2/conf.d if you run Debian or have a similar multi-file config setup):



 Alias /htdig /usr/share/htdig


Reload Apache with the command /etc/init.d/apache2 reload, then try this URL out in your browser:



http://www.example.com/cgi-bin/htsearch?config=htdig&words=test


The search should now work! Of course you'll also want to add a search box somewhere on your website, rather than expecting people to use this long URL. The sample HTML code below, taken from the default wrapper HTML at /etc/htdig/wrapper.html, will generate a search box with various dropdowns:




<form method="post" action="/cgi-bin/htsearch">
<font size="-1">
<input type="hidden" name="config" value="htdig"/>
<input type="hidden" name="restrict" value=""/>
<input type="hidden" name="exclude" value=""/> Match: <select name="method">
<option value="and">All</option>
<option value="or">Any</option>
<option value="boolean">Boolean</option>
</select> Format: <select name="format">
<option value="builtin-long">Long</option>
<option value="builtin-short">Short</option>
</select> Sort by: <select name="sort">
<option value="score">Score</option>
<option value="time">Time</option>
<option value="title">Title</option>
<option value="revscore">Reverse Score</option>
<option value="revtime">Reverse Time</option>
<option value="revtitle">Reverse Title</option>
</select>
<br /> Search:
<input type="text" size="30" name="words" value=""/>
<input type="submit" value="Search"/>
</font>
</form>


Note the hidden form field that specifies the config file to use. Here it's /etc/htdig/htdig.conf; a value of htdig1 would look for the file /etc/htdig/htdig1.conf, and so forth. You can of course edit this form if you want to, by editing wrapper.html or the other default HTML files. If you do this, you'll notice that a lot of the values in those files are variables filled in by htsearch; the htsearch results template documentation lists more variables you can use.
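The same config parameter works in a plain URL, which is useful for quick tests against each of your config files. Here is a minimal sketch that builds such URLs; search_url is a hypothetical helper, and it encodes only spaces, so treat it as a testing aid rather than a proper URL encoder.

```shell
#!/bin/sh
# Sketch: build an htsearch URL for a given config name and query.
# search_url is a hypothetical helper; only spaces are encoded (as '+'),
# so queries containing &, #, % and similar characters need real URL-encoding.
search_url() {
    base=$1 config=$2 words=$3
    encoded=$(printf '%s' "$words" | tr ' ' '+')
    printf '%s/cgi-bin/htsearch?config=%s&words=%s\n' "$base" "$config" "$encoded"
}

# Prints http://www.example.com/cgi-bin/htsearch?config=htdig&words=site+search
search_url http://www.example.com htdig 'site search'
# Prints http://www.example.com/cgi-bin/htsearch?config=htdig-example1&words=test
search_url http://www.example.com htdig-example1 test
```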



You can and should customize the header and footer pages that you get back from ht://Dig to match the appearance of your site. Edit the files in /etc/htdig/ (or /usr/share/htdig in some setups), or add a directive in your htdig.conf file to point elsewhere:



search_results_header:  /my/www/dir/htdig/header.html


Excluding and Including Files


ht://Dig won't parse absolutely everything it finds on your site. Indexing certain types of content is turned off by default; the bad_extensions setting in htdig.conf sets a list of file extensions that will be skipped, including various media files (such as .jpg, .wav, and .mpg) and compressed files (.gz, .tar, and .zip). You can add other extensions if you have other types of files that you never want to be indexed.
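To decide which extensions are worth adding, it can help to see what is actually sitting under your document root. The sketch below counts files by extension; count_extensions is a hypothetical helper, and /var/www in the usage comment is just the document root assumed elsewhere in this article.

```shell
#!/bin/sh
# Sketch: count files by extension under a document root,
# to help decide what else belongs in bad_extensions.
# count_extensions is a hypothetical helper name.
count_extensions() {
    find "$1" -type f -name '*.*' |
        sed 's/.*\.//' |                  # keep only the text after the last dot
        tr '[:upper:]' '[:lower:]' |      # treat .JPG and .jpg as the same
        sort | uniq -c | sort -rn         # most common extensions first
}

# Example: count_extensions /var/www
```

File names containing newlines will confuse the count, which is unlikely to matter for a web tree.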



The /etc/htdig/bad_words file has a list of words that ht://Dig will ignore when indexing. The default set includes words like "the," "and," "was," and "this"; again, you can add anything you want (or, indeed, remove words from it) to further customize the search for your site if you have particular common words that you don't want to index.



You can also ignore single pages by adding this line to the <head> section of each page to be ignored:



<meta name="robots" content="noindex, follow">


This will (or should) also mean that other spidering bots, such as Google's, don't index that page.
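If you lose track of which pages carry the tag, a quick scan of the document root will tell you. This is only a sketch: list_noindex_pages is a hypothetical helper, and the pattern assumes the tag sits on one line with name= before content=, as in the example above.

```shell
#!/bin/sh
# Sketch: list HTML files under a directory that carry a robots "noindex" meta tag.
# list_noindex_pages is a hypothetical helper. The pattern only matches tags
# written on a single line with name= ahead of content=.
list_noindex_pages() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -liE '<meta[^>]*name="robots"[^>]*noindex' {} +
}

# Example: list_noindex_pages /var/www
```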



In the other direction, you can include more material if you set ht://Dig up to index PDF and DOC files, using the external_parsers directive in htdig.conf:



 
external_parsers: application/pdf    /usr/share/htdig/parse_doc.pl \
                  application/msword /usr/share/htdig/parse_doc.pl


The Perl script parser parse_doc.pl comes with the Debian install, or can be downloaded from the contrib section of the ht://Dig website, along with various other parsers. Note that you will need to install the application catdoc for this script to handle Word files, and pstotext for it to handle PDF files.



ht://Dig has a lot more options for you to tweak if you want to customize its results further, but you should now have enough information to index and search a production site. ht://Dig is configurable, robust, and usable. For more information and to get more into the options available for personalizing ht://Dig, check out the comprehensive FAQ and the online documentation.

This work is licensed under a Creative Commons Attribution 3.0 Unported License
Tags: Linux, Technical, Web Server, Review, htdig

© 2013 OpenLogic, Inc. All rights reserved.