The venerable ht://Dig application lets you keep track of the text on a single website or a group of sites. Its various components create an index of search terms, provide an interface to search that index, and provide fuzzy matching algorithms and a handful of other options. While it's increasingly common to outsource website searching (often to Google), there are still advantages to controlling your own search engine for your website. Doing so gives you more control over the detail of how the search works, and you can search across multiple sub-sites, or not, as you prefer. Here's how to create a personal search engine for your personal domain.
ht://Dig provides binary installation packages for Debian- and Red Hat-based Linux distributions. When you install the application with your package manager, it should generate a basic /etc/htdig/htdig.conf file, which you'll need to edit to customize it for your site. The most important line to change specifies the URL that ht://Dig will start from when indexing:
start_url: http://my.site.com
The software begins spidering from this URL, which means that pages that aren't linked from elsewhere in your site (for example, entirely separate subsites) won't be indexed. You can also intentionally exclude certain pages, as we'll see later. The start_url attribute can hold a list of URLs, so if you want to spider other sites that are related but unlinked, you can specify them like this, using a trailing backslash to continue the line:
start_url: http://www.example1.com \
           http://www.example2.com
If you have multiple sites, another option is to set up separate configuration files for each – for instance, /etc/htdig/htdig-example1.conf and /etc/htdig/htdig-example2.conf. The format for these files is the same as for the default configuration file, but their databases will be stored separately. When you search these databases, as we'll see in the next section, you'll need to make sure that you specify which configuration file to use.
Another important line, which limits the scope of the search, is this one:
limit_urls_to: ${start_url}
This is the default setting, which indexes only pages on the site specified by the ${start_url} variable. You can also give an explicit list of URLs to limit the search to. Any external pages that your site links to (i.e. pages that don't match the specified URL or URLs) won't be spidered, so you won't wind up indexing the whole Internet.
Even when limited to a single site or sites, the databases that ht://Dig generates can get quite large. Make sure that the directory specified in the configuration file's database_dir attribute has plenty of space. The default database location is /var/lib/htdig, but you can of course keep the databases anywhere you fancy.
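The three settings covered so far can be pulled together into one small file per site. The sketch below is only an illustration: the helper name write_site_conf is made up, and giving each site its own directory under /var/lib/htdig is an assumption about how you might keep the databases apart.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal per-site ht://Dig config to stdout.
# $1 = start URL, $2 = database directory (both supplied by the caller).
write_site_conf() {
    printf 'start_url: %s\n' "$1"
    # Restrict spidering to the start URL, as in the default config.
    # The single quotes keep ${start_url} literal for ht://Dig to expand.
    printf 'limit_urls_to: ${start_url}\n'
    printf 'database_dir: %s\n' "$2"
}
```

You could then run something like write_site_conf http://www.example1.com /var/lib/htdig/example1 | sudo tee /etc/htdig/htdig-example1.conf. In practice it is probably safer to copy the packaged htdig.conf and change just these lines, since the default file sets other attributes too.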
To improve performance, you can tell ht://Dig to spider local files where possible (rather than using HTTP) by telling it which directory corresponds to particular URLs:
local_urls: http://my.site.com=/var/www/
With this setting the software reads files straight from the local filesystem where it can, and falls back to fetching a page over HTTP if it can't find the corresponding local file.
Once your config is tweaked to your liking, you can generate the database by running the command:
sudo rundig -a -c /etc/htdig/htdig.conf
The -a flag tells rundig to build the new databases in alternate work files, so users can keep searching the existing index while ht://Dig is spidering content. You'll need to run this command once for every config file you've set up.
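If you keep several config files, a small loop saves typing the command for each one. This is a minimal sketch, assuming every *.conf file in /etc/htdig is an ht://Dig config; the function name rundig_all and the DRY_RUN variable are inventions for this example, not part of ht://Dig.

```shell
#!/bin/sh
# Hypothetical helper: run rundig once per *.conf file in a directory.
# $1 = config directory (defaults to /etc/htdig).
# Set DRY_RUN=1 to print the commands instead of executing them.
rundig_all() {
    conf_dir=${1:-/etc/htdig}
    for conf in "$conf_dir"/*.conf; do
        # Skip the unexpanded pattern when no config files exist.
        [ -f "$conf" ] || continue
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "rundig -a -c $conf"
        else
            rundig -a -c "$conf"
        fi
    done
}
```

Setting DRY_RUN=1 before calling rundig_all lets you check which files would be processed before running the real thing as root.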
rundig runs the ht://Dig components htdig to create the main database, htmerge to generate a document index and word database, and then htnotify and htfuzzy. htfuzzy allows ht://Dig to do fuzzy matching, in which the search algorithm also looks for synonyms and alternative endings for the terms searched for. By default, only synonym and ending searches are used, and these alternative searches are weighted less strongly when the search results are returned. To use more of the fuzzy search options, and to change their weighting, edit the search_algorithm directive in htdig.conf. The available options include substring searching and spelling error matching.
To keep your index current as your site changes, you should run rundig regularly. If you've installed ht://Dig from a Debian or Red Hat package, the installer should have installed a daily cronjob to do the trick for you. Otherwise, for a basic cronjob, save these lines as /etc/cron.daily/htdig and make the file executable:
#!/bin/sh
/usr/bin/rundig -a -c /etc/htdig/htdig.conf
At this point you have the search database, but no way for a user to get at it. The next step is to set up your Apache web server appropriately so that users can search the site. For a single site, just add this line to your /etc/apache2/apache2.conf file (or create a file called htdig that contains this line in /etc/apache2/conf.d if you run Debian or have a similar multi-file config setup):
Alias /htdig /usr/share/htdig
Reload Apache with the command /etc/init.d/apache2 reload, then try this URL out in your browser:
http://www.example.com/cgi-bin/htsearch?config=htdig&words=test
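If you'd rather check the search from a terminal, you can build the same URL in a script. The helper below is a rough sketch: search_url is a made-up name, and it only turns spaces into plus signs, so queries containing other special characters would need proper URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: build an htsearch query URL.
# $1 = site base URL, $2 = config name, $3 = search words.
search_url() {
    # Encode spaces as '+'; other special characters are NOT escaped here.
    words=$(printf '%s' "$3" | tr ' ' '+')
    printf '%s/cgi-bin/htsearch?config=%s&words=%s\n' "$1" "$2" "$words"
}
```

Assuming curl is installed, curl -s "$(search_url http://www.example.com htdig test)" should then print the HTML of the results page.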
The search should now work! Of course you'll also want to add a search box somewhere on your website, rather than expecting people to use this long URL. The sample HTML code below, taken from the default wrapper HTML at /etc/htdig/wrapper.html, will generate a search box with various dropdowns:
<form method="post" action="/cgi-bin/htsearch">
<font size="-1">
<input type="hidden" name="config" value="htdig"/>
<input type="hidden" name="restrict" value=""/>
<input type="hidden" name="exclude" value=""/>
Match: <select name="method">
<option value="and">All</option>
<option value="or">Any</option>
<option value="boolean">Boolean</option>
</select>
Format: <select name="format">
<option value="builtin-long">Long</option>
<option value="builtin-short">Short</option>
</select>
Sort by: <select name="sort">
<option value="score">Score</option>
<option value="time">Time</option>
<option value="title">Title</option>
<option value="revscore">Reverse Score</option>
<option value="revtime">Reverse Time</option>
<option value="revtitle">Reverse Title</option>
</select>
<br />
Search: <input type="text" size="30" name="words" value=""/>
<input type="submit" value="Search"/>
</font>
</form>
Note the hidden form field that specifies the config file to use. Here the value htdig refers to /etc/htdig/htdig.conf; a value of htdig1 would look for the file /etc/htdig/htdig1.conf, and so forth. You can of course change this form by editing wrapper.html or the other default HTML files. If you do, you'll notice that many of the values in those files are variables made available to the htsearch results templates; the htsearch documentation on results templates lists more variables you can use.
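Since the config value is just the file name without its .conf suffix, you can list the values your form could use straight from the config directory. This is a small sketch; config_names is a hypothetical helper, and it assumes your configs live in /etc/htdig as described above.

```shell
#!/bin/sh
# Hypothetical helper: list the names usable in the hidden "config" field.
# $1 = config directory (defaults to /etc/htdig).
config_names() {
    conf_dir=${1:-/etc/htdig}
    for conf in "$conf_dir"/*.conf; do
        # Skip the unexpanded pattern when no config files exist.
        [ -f "$conf" ] || continue
        # Strip the directory and the .conf suffix.
        basename "$conf" .conf
    done
}
```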
You can and should customize the header and footer pages that you get back from ht://Dig to match the appearance of your site. Edit the files in /etc/htdig/ (or /usr/share/htdig in some setups), or add a directive in your htdig.conf file to point elsewhere:
search_results_header: /my/www/dir/htdig/header.html
ht://Dig won't parse absolutely everything it finds on your site. Indexing certain types of content is turned off by default; the bad_extensions setting in htdig.conf sets a list of file extensions that will be skipped, including various media files (such as .jpg, .wav, and .mpg) and compressed files (.gz, .tar, and .zip). You can add other extensions if you have other types of files that you never want to be indexed.
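Before extending that list, it can help to see which file types your site actually contains. The sketch below counts files by extension under a document root; count_extensions is a made-up name and /var/www is only an assumed default, so point it at wherever your site lives.

```shell
#!/bin/sh
# Hypothetical helper: count files by extension under a directory.
# $1 = document root (defaults to /var/www).
count_extensions() {
    # Only consider files whose name contains a dot, keep the text after
    # the last dot, then tally and sort with the most common first.
    find "${1:-/var/www}" -type f -name '*.*' \
        | sed 's/.*\.//' \
        | sort | uniq -c | sort -rn
}
```

Extensions that show up here but hold nothing worth searching are candidates for bad_extensions.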
The /etc/htdig/bad_words file has a list of words that ht://Dig will ignore when indexing. The default set includes words like "the," "and," "was," and "this"; again, you can add anything you want (or, indeed, remove words from it) to further customize the search for your site if you have particular common words that you don't want to index.
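Adding a word is just a matter of appending a line to that file, but a small helper avoids duplicates. This is a sketch under the assumption that the file holds one word per line; add_bad_word is a hypothetical name, and you would need to run it as root against the real file.

```shell
#!/bin/sh
# Hypothetical helper: append a word to a bad_words file if not present.
# $1 = path to the bad_words file, $2 = word to ignore.
add_bad_word() {
    # -q quiet, -x match the whole line, -F treat the word as a fixed string.
    if ! grep -qxF -- "$2" "$1"; then
        printf '%s\n' "$2" >> "$1"
    fi
}
```

Because the list is applied when indexing, a change should only show up in search results after the next rundig run.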
You can also ignore single pages by adding this line to the <head> section of each page to be ignored:
<meta name="robots" content="noindex, follow">
This will (or should) also mean that other spidering bots, such as Google's, don't index that page.
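To keep track of which pages you have excluded this way, you can search your document root for the tag. The sketch below is deliberately simple: list_noindex is a made-up name, /var/www is an assumed default, and the pattern only matches tags written on one line with the name attribute before noindex, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: list files carrying a robots "noindex" meta tag.
# $1 = document root (defaults to /var/www).
list_noindex() {
    # -r recurse, -l print file names only, -i ignore case, -E extended regex.
    grep -rliE '<meta[^>]*name="robots"[^>]*noindex' "${1:-/var/www}"
}
```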
In the other direction, you can include more material if you set ht://Dig up to index PDF and DOC files, using the external_parsers directive in htdig.conf:
external_parsers: application/pdf /usr/share/htdig/parse_doc.pl \
                  application/msword->text/html /usr/share/htdig/parse_doc.pl
The Perl script parser parse_doc.pl comes with the Debian install, or can be downloaded from the contrib section of the ht://Dig website, along with various other parsers. Note that you will need to install the application catdoc for this script to handle Word files, and pstotext for it to handle PDF files.
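A quick way to confirm those helper programs are on your PATH before re-indexing is sketched below. The function name missing_tools is an invention for this example; it simply prints any command it can't find.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running missing_tools catdoc pstotext should print nothing if both are installed.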
ht://Dig has a lot more options for you to tweak if you want to customize its results further, but you should now have enough information to index and search a production site. ht://Dig is configurable, robust, and usable. For more detail on the options available for customizing ht://Dig, see the comprehensive FAQ and the online documentation.