From: Neal R. <ne...@ri...> - 2003-10-02 22:58:03
Hey all, I've got a question for all of you about how the htdig 'indexer' should function.

htdig.cc:

    337     List *list = docs.URLs();
    338     retriever.Initial(*list);
    339     delete list;
    340
    341     // Add start_url to the initial list of the retriever.
    342     // Don't check a URL twice!
    343     // Beware order is important, if this bugs you could change
    344     // previous line retriever.Initial(*list, 0) to Initial(*list,1)
    345     retriever.Initial(config->Find("start_url"), 1);

Note lines 337-339. This code loads the entire list of documents currently in the index and feeds it to the retriever object for retrieval and processing. The effect is that we potentially visit and keep web pages that we would never find via a link, and we keep revisiting a website even after removing it from 'start_url' in htdig.conf. The workaround is to use 'htdig -i', which has the disadvantage of revisiting and reindexing pages even if they haven't changed since the last run of htdig.

Here's the fix:

1) At the start of htdig, after we've opened the DBs, we 'walk' the docDB and mark EVERY document as Reference_obsolete. I wrote code to do this... very short.
2) Comment out htdig.cc lines 337-339.
3) When the indexer fires up and spiders a site, documents that are in the tree and marked Reference_obsolete are re-marked as Reference_normal.
4) When htpurge is run, the obsoleted docs are flushed. Documents that weren't revisited (because no link to them was found) are flushed.

This fix addresses two flaws:

1) Changing 'start_url' and removing a starting URL: the documents are still in the index after the next run of htdig (unless you use -i).
2) Pages that still exist on a web server at a given URL but are no longer linked to by any other page on the site.

I've tested this fix and it works. Eh?

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485