From: Lachlan A. <lh...@us...> - 2004-06-14 09:58:02
Greetings all,

If htdig is interrupted, it writes the file named by url_log (db.log by default), containing the URLs it has seen but not yet visited. If this file exists, its URLs are added to the next pass of the dig (provided -i isn't used).

My question: is there a way to ensure that these URLs and their descendants are visited first? If so, that could be offered as a work-around for slow digging: every day, a script starts an incremental dig and kills it after X hours. If each run is guaranteed to continue where the last one left off, the daily digs could still be confined to off-peak times.

If the URLs are reordered too much, though, some pages might never get processed. Guarding against that would probably require the file to list two classes of URLs: those processed so far, and those seen but not yet processed. For large data sets, that might slow the exit time down considerably.

Thoughts?

Lachlan

-- 
lh...@us... ht://Dig developer DownUnder (http://www.htdig.org)
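P.S. Here is a minimal sketch of the kind of wrapper script I have in mind, in Python. The htdig path, the config file, and the time limit are my assumptions, not ht://Dig defaults; the only behaviour it relies on is the one described above, namely that an interrupted htdig writes url_log and that a later run without -i picks it up.

#!/usr/bin/env python3
# Hypothetical wrapper: run an incremental dig, interrupt it after
# MAX_HOURS, and rely on url_log to resume next time. The paths and
# the time limit below are assumptions.
import signal
import subprocess

HTDIG = "/usr/local/bin/htdig"    # assumed install location
CONF = "/etc/htdig/htdig.conf"    # assumed config file
MAX_HOURS = 4                     # arbitrary cut-off

# No -i, so this run first picks up any URLs left in url_log.
proc = subprocess.Popen([HTDIG, "-c", CONF])
try:
    proc.wait(timeout=MAX_HOURS * 3600)
except subprocess.TimeoutExpired:
    # Interrupt the dig; htdig then writes the unvisited URLs
    # to url_log so the next run can continue from here.
    proc.send_signal(signal.SIGINT)
    proc.wait()

Run from cron outside peak hours, each invocation would chip away at the queue, assuming the ordering question above is settled.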