From: Neal R. <ne...@ri...> - 2003-10-02 22:58:03
Hey all, I've got a question for all of you about how the htdig 'indexer' should function.

htdig.cc:

    337     List *list = docs.URLs();
    338     retriever.Initial(*list);
    339     delete list;
    340
    341     // Add start_url to the initial list of the retriever.
    342     // Don't check a URL twice!
    343     // Beware order is important, if this bugs you could change
    344     // previous line retriever.Initial(*list, 0) to Initial(*list,1)
    345     retriever.Initial(config->Find("start_url"), 1);

Note lines 337-339. This code loads the entire list of documents currently in the index and feeds it to the retriever object for retrieval and processing. The effect is that we potentially visit and keep web pages that we would never find via a link, and we keep revisiting a website even after removing it from 'start_url' in htdig.conf. The workaround is to use 'htdig -i', which has the disadvantage of revisiting and reindexing pages even if they haven't changed since the last run of htdig.

Here's the fix:

1) At the start of htdig, after we've opened the DBs, we 'walk' the docDB and mark EVERY document as Reference_obsolete. I wrote code to do this... very short.
2) Comment out htdig.cc lines 337-339.
3) When the indexer fires up and spiders a site, documents that are in the tree and marked Reference_obsolete are re-marked as Reference_normal.
4) When htpurge is run, the obsoleted docs are flushed. Documents that weren't revisited (because no link to them was found) are flushed.

This fix addresses two flaws:

1) Changing 'start_url' and removing a starting URL: the documents are still in the index after the next run of htdig (unless you use -i).
2) Pages that still exist on a web server at a given URL but are no longer linked to by any other page on the site.

I've tested this fix and it works. Eh?

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485