From: Gilles D. <gr...@sc...> - 2001-12-07 18:48:49
According to Ahmon Dancy:
> >> Generally, checking to see if the documents have been modified is much
> >> quicker than re-fetching and re-parsing all documents every time.
>
> Great.  I think I understand.  For what I'm doing, -i is fine because
> all the output is CGI generated (which, presumably, means that it will
> always be "newer" that previous information).

Yes, for dynamic content that doesn't generate Last-Modified headers,
you pretty much have to reindex every time.  With htdig 3.1.5 and older,
though, htdig (without -i) didn't do this by default, unless you set
modification_time_is_now to true.  In 3.1.6, this attribute is set to
true by default, and in the 3.2 betas the attribute is gone altogether
and the code always uses the current time as the modification time if
it doesn't get a Last-Modified header.

> Thank you for your help.  I wonder if you know the answer to this one
> as well:
>
> Can you explain each of these files (i.e., what good are they).
>
> db.words.db
> db.docs.index
> db.docdb
> db.wordlist
>
> dn.wordlist and word.db seem to go together... and db.docs.index and
> db.docsb seem to go together...  But I don't understand their
> relationship.  What good are they?

Nothing really, unless holding onto data is any good.  :-)

htdig generates db.docdb, which holds a big record containing all the
info needed for each document indexed, including a document excerpt,
and db.wordlist, which is an ASCII file of all the words in the
documents, indicating the DocID for each word, as well as weight for
scoring and a few other tidbits.

htmerge cleans these up, and from db.docdb it generates db.docs.index
which maps DocIDs to URLs, which are used as keys to db.docdb, and it
generates db.words.db from db.wordlist.

When htsearch looks for words, it looks them up in db.words.db, and
gets the DocIDs for matched words.
It looks up these DocIDs in db.docs.index to get the URLs of matches,
and uses these URLs to look up the db.docdb records to get all the info
for the result summaries.

According to Roman Maeder:
> if you want to speed up update digs even more, you could record the
> expiration of a document and only check for changes after it has
> expired and so avoid the http request entirely.  Old mailing list
> archives, for example, never change and the http server can give those
> documents a very long expiration time.  This would dramatically cut the
> work htdig performs every night or so.

This is an interesting suggestion!  It wouldn't help with dynamic
content, unless the dynamic content somehow generated an expiry date,
but it could be a big help for mail archives.  Right now, htdig doesn't
keep track of expiry dates, but this could be a worthwhile addition to
3.2.  Do you have any references to suggest, for the format of tags
that give expiry dates to search engines?  Is this a standard?

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
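
The lookup chain described in the reply (word to DocID, DocID to URL,
URL to document record) can be sketched roughly in Python.  This is
only an illustration of how the three files relate, not htsearch's
actual code: plain dictionaries stand in for the files, and all of the
names (words_db, docs_index, doc_db, search), the record fields, the
sample data, and the sort by raw weight are assumptions made up for the
example.

```python
# Stand-ins for the three files used at search time. Every key, field
# name, and value below is hypothetical and exists only for illustration.
words_db = {            # plays the role of db.words.db: word -> (DocID, weight)
    "spinal": [(1, 5), (2, 1)],
    "cord": [(1, 3)],
}
docs_index = {          # plays the role of db.docs.index: DocID -> URL
    1: "http://example.com/a.html",
    2: "http://example.com/b.html",
}
doc_db = {              # plays the role of db.docdb: URL -> document record
    "http://example.com/a.html": {"title": "A", "excerpt": "Spinal cord ..."},
    "http://example.com/b.html": {"title": "B", "excerpt": "Spinal tap ..."},
}


def search(word):
    """Follow the word -> DocID -> URL -> record chain for one word."""
    results = []
    for doc_id, weight in words_db.get(word, []):
        url = docs_index[doc_id]      # DocID -> URL
        record = doc_db[url]          # URL -> full document record
        results.append((weight, url, record["excerpt"]))
    # Assumption: rank by raw weight, highest first. Real scoring may differ.
    results.sort(key=lambda r: r[0], reverse=True)
    return results


for weight, url, excerpt in search("spinal"):
    print(weight, url, excerpt)
```

The point of the sketch is that db.docs.index is only an indirection:
the word database stores compact DocIDs, and the URL recovered from the
index is what actually keys the full record in db.docdb.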