From: Lachlan A. <lh...@us...> - 2004-06-14 09:58:02
Greetings all,

If htdig is interrupted, it writes the file named by url_log (db.log by default), containing the URLs it has seen but not yet visited. If this file exists, its URLs are added to the next pass of the dig (provided -i isn't used).

My question: is there a way to ensure that these URLs and their descendants are visited first? If so, that could be offered as a work-around for slow digging: every day, a script starts an incremental dig and kills it after X hours. If each run is guaranteed to continue where the last one left off, the daily digs could still be confined to off-peak times.

If the URLs are reordered too much, though, some pages might never get processed. Guarding against that would probably require the file to list two classes of URLs: those processed so far, and those seen but not yet processed. For large data sets, that might slow the exit time down considerably.

Thoughts?

Lachlan

-- 
lh...@us... ht://Dig developer DownUnder (http://www.htdig.org)
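P.S. Here is a minimal sketch of the kind of wrapper script I have in mind, in Python. The htdig path, the config file, and the time limit are my assumptions, not ht://Dig defaults; the only behaviour it relies on is the one described above, namely that an interrupted htdig writes url_log and that a later run without -i picks it up.

#!/usr/bin/env python3
# Hypothetical wrapper: run an incremental dig, interrupt it after
# MAX_HOURS, and rely on url_log to resume next time. The paths and
# the time limit below are assumptions.
import signal
import subprocess

HTDIG = "/usr/local/bin/htdig"    # assumed install location
CONF = "/etc/htdig/htdig.conf"    # assumed config file
MAX_HOURS = 4                     # arbitrary cut-off

# No -i, so this run first picks up any URLs left in url_log.
proc = subprocess.Popen([HTDIG, "-c", CONF])
try:
    proc.wait(timeout=MAX_HOURS * 3600)
except subprocess.TimeoutExpired:
    # Interrupt the dig; htdig then writes the unvisited URLs
    # to url_log so the next run can continue from here.
    proc.send_signal(signal.SIGINT)
    proc.wait()

Run from cron outside peak hours, each invocation would chip away at the queue, assuming the ordering question above is settled.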