From: Gilles D. <gr...@sc...> - 2002-01-02 20:45:51
According to Dan Langille:
> I've found an instance where a document contains in robots.txt is
> included in the final index.  Not sure if this is a bug or a feature.

It's a bug, but it's in your script...

...
> $ more rundig.merge
> #!/bin/sh
...
> $BINDIR/htdig -vvv -c ${CONFIGMERGE}
>
> $BINDIR/htmerge -vvv -c ${CONFIG} -m ${CONFIGMERGE}

This is the problem.  As I've mentioned many times on this list before,
you can't go straight from htdig to htmerge -m.  You need to run htmerge
in the standard way on the database from htdig before running htmerge -m.

What's happening is when htdig goes to fetch the disallowed document, it
puts a control record in db.wordlist to tell htmerge to purge this document.
But if you don't run htmerge in the normal way, it doesn't process this
control record so the document isn't purged from the database before you
merge it into the new database.  Even worse, because htmerge -m doesn't
expect these control records, it sometimes can put a junk record into the
new wordlist, which may in some cases cause the wrong document to be purged
from the database.

You must insert

    $BINDIR/htmerge -vvv -c ${CONFIGMERGE}

in your script after running htdig, and before running htmerge -m, to
properly clean up the CONFIGMERGE database before merging it into your
main one.

I think that if I can't easily fix htmerge -m to deal with control records,
I'll have to put a really big warning in the htmerge.html manual page
about this.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
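
For reference, the ordering described in this message can be sketched as a small shell function. This is only an illustration, not the poster's actual script: the function name run_dig_and_merge is hypothetical, and the default values for BINDIR, CONFIG and CONFIGMERGE are placeholder assumptions, since the original rundig.merge elides its own settings. The three commands and their flags are the ones quoted in the message.

```shell
#!/bin/sh
# Sketch of the corrected dig-then-merge sequence.
# ASSUMPTION: the default paths below are placeholders; the original
# rundig.merge script did not show its real values.
BINDIR="${BINDIR:-/usr/local/bin}"
CONFIG="${CONFIG:-/usr/local/etc/htdig/htdig.conf}"
CONFIGMERGE="${CONFIGMERGE:-/usr/local/etc/htdig/merge.conf}"

# Hypothetical helper name; it stops at the first step that fails.
run_dig_and_merge() {
    # 1. Dig the secondary site into the CONFIGMERGE database.
    "$BINDIR/htdig" -vvv -c "$CONFIGMERGE" || return 1

    # 2. Run htmerge the standard way on that same database, so the
    #    control records left by htdig (e.g. for documents disallowed
    #    by robots.txt) are processed and those documents are purged.
    "$BINDIR/htmerge" -vvv -c "$CONFIGMERGE" || return 1

    # 3. Only now merge the cleaned database into the main one.
    "$BINDIR/htmerge" -vvv -c "$CONFIG" -m "$CONFIGMERGE" || return 1
}
```

Calling run_dig_and_merge at the end of the script runs the three steps in order. Stopping on the first failure is a design choice of this sketch, so that a failed cleanup pass never leads into htmerge -m on an uncleaned database.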