Re: [htdig] htdig and the '-i' option

According to Ahmon Dancy:
> >> Generally, checking to see if the documents have been modified is much
> >> quicker than re-fetching and re-parsing all documents every time.
> 
> Great.  I think I understand.   For what I'm doing, -i is fine because
> all the output is CGI generated (which, presumably, means that it will
> always be "newer" than previous information).

Yes, for dynamic content that doesn't generate Last-Modified headers,
you pretty much have to reindex every time.  With 3.1.5 and older,
though, htdig (without -i) didn't do this by default unless you set
modification_time_is_now to true.  In 3.1.6, this attribute is set to true
by default, and in the 3.2 betas the attribute is gone altogether: the
code always uses the current time as the modification time if it doesn't
get a Last-Modified header.
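
To make the version difference concrete, here is a rough Python sketch of
the decision described above.  It is not htdig's actual code: the function
names are made up for illustration, and treating a missing header with the
attribute off as "never newer" is only an assumption that models the
3.1.5-and-older default.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def effective_mtime(last_modified, now, modification_time_is_now=True):
    """Pick the modification time to record for a fetched document.

    last_modified is the raw Last-Modified header value, or None if the
    server did not send one (typical for CGI output).
    """
    if last_modified:
        return parsedate_to_datetime(last_modified)
    # No header: either stamp the document with the current time
    # (3.1.6 default, and the only behaviour in the 3.2 betas), or
    # report that no usable time is known (assumed older default).
    return now if modification_time_is_now else None


def needs_reindex(indexed_at, last_modified, now, modification_time_is_now=True):
    """Return True if the document looks newer than the indexed copy."""
    mtime = effective_mtime(last_modified, now, modification_time_is_now)
    if mtime is None:
        # Assumption: with no modification time, the update dig leaves
        # the document alone, so only -i would force a reindex.
        return False
    return mtime > indexed_at
```

With the attribute on, a headerless CGI page is always stamped "now" and
so always comes out newer than the indexed copy, which is why an update
dig ends up reindexing it every time.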

> Thank you for your help.  I wonder if you know the answer to this one
> as well:
> 
> Can you explain each of these files (i.e., what good are they)?
> 
> db.words.db
> db.docs.index
> db.docdb
> db.wordlist
> 
> db.wordlist and db.words.db seem to go together...  and db.docs.index
> and db.docdb seem to go together...  But I don't understand their
> relationship.

What good are they?  Nothing really, unless holding onto data is any good.
:-)  htdig generates db.docdb, which holds a big record containing
all the info needed for each document indexed, including a document
excerpt, and db.wordlist, which is an ASCII file of all the words in
the documents, indicating the DocID for each word, as well as weight
for scoring and a few other tidbits.  htmerge cleans these up.  From
db.docdb it generates db.docs.index, which maps DocIDs to URLs (the URLs
are the keys to db.docdb), and from db.wordlist it generates db.words.db.
When htsearch looks for words, it looks them up in db.words.db, and gets
the DocIDs for matched words.  It looks up these DocIDs in db.docs.index
to get the URLs of matches, and uses these URLs to look up the db.docdb
records to get all the info for the result summaries.
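
The lookup chain above can be sketched in Python with plain dicts standing
in for the three database files.  This only illustrates the
word -> DocID -> URL -> record path; the sample data, field names and
function name are invented, and the real files hold more than this (word
weights, for instance).

```python
# Stand-in for db.words.db: word -> DocIDs of documents containing it.
WORDS_DB = {
    "spinal": [1, 3],
    "cord": [1],
    "archive": [2],
}

# Stand-in for db.docs.index: DocID -> URL.
DOCS_INDEX = {
    1: "http://example.com/a.html",
    2: "http://example.com/b.html",
    3: "http://example.com/c.html",
}

# Stand-in for db.docdb: URL -> full document record.
DOCDB = {
    "http://example.com/a.html": {"title": "Page A", "excerpt": "Spinal cord..."},
    "http://example.com/b.html": {"title": "Page B", "excerpt": "Mail archive..."},
    "http://example.com/c.html": {"title": "Page C", "excerpt": "Spinal research..."},
}


def search(word, words_db=WORDS_DB, docs_index=DOCS_INDEX, docdb=DOCDB):
    """Follow one word through all three tables to result summaries."""
    results = []
    for doc_id in words_db.get(word.lower(), []):
        url = docs_index[doc_id]   # DocID -> URL
        record = docdb[url]        # URL -> document record
        results.append((url, record["title"], record["excerpt"]))
    return results
```

For example, search("spinal") walks DocIDs 1 and 3 and returns the
summaries for a.html and c.html, while a word that was never indexed
returns an empty list.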

According to Roman Maeder:
> if you want to speed up update digs even more, you could record the
> expiration of a document and only check for changes after it has
> expired and so avoid the http request entirely. Old mailing list
> archives, for example, never change and the http server can give those
> documents a very long expiration time. This would dramatically cut the
> work htdig performs every night or so.

This is an interesting suggestion!  It wouldn't help with dynamic content,
unless the dynamic content somehow generated an expiry date, but it
could be a big help for mail archives.  Right now, htdig doesn't keep
track of expiry dates, but this could be a worthwhile addition to 3.2.
Do you have any references to suggest, for the format of tags that give
expiry dates to search engines?  Is this a standard?
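
For the sake of discussion, the suggestion could look something like the
Python sketch below.  Everything in it is hypothetical: htdig stores no
expiry dates today, the function name is made up, and it assumes the
expiry would come from an HTTP Expires response header saved as-is at the
previous dig.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def can_skip_request(stored_expires, now):
    """Return True if the stored expiry date is still in the future.

    stored_expires is the raw Expires header value recorded at the last
    dig (hypothetical field), or None if the server sent none.
    """
    if not stored_expires:
        return False  # no expiry known: check the document as usual
    try:
        expires_at = parsedate_to_datetime(stored_expires)
    except (TypeError, ValueError):
        return False  # unparsable value: treat as already expired
    if expires_at.tzinfo is None:
        # Dates without a usable zone are assumed to be UTC.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now < expires_at
```

An update dig would call this before contacting the server at all, and
fall back to the normal modification check whenever it returns False.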

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930