Re: [htdig-dev] Re: Logical Error in Indexer???


>
> I think this only became an issue because of persistent connections.
> Correct me if I'm wrong, but I think htdig's behaviour in the past
> (i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET,
> and upon seeing the headers if it decided it didn't need to refetch the
> file, it would simply close the connection right away and not read the
> stream of data for the file.  No wasted bandwidth, but maybe it caused
> some unnecessary overhead on the server, which probably started serving
> up each file (including running CGI scripts if that's what made the page)
> before realising the connection was closed.

  True, but we can override the current setting if '-i' is given to
force head_before_get=false.
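
  A rough sketch of that decision, in Python for brevity. This is not
htdig's actual code: the function names, the headers dictionary and the
last_indexed timestamp are made up for illustration. Only '-i' and
head_before_get come from the discussion above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def use_head_before_get(configured, initial_dig):
    """Decide whether to send a HEAD request before the GET.

    configured  -- the head_before_get setting from the config file
    initial_dig -- True when '-i' was given (hypothetical flag name)
    """
    # With '-i' every document is refetched anyway, so the extra HEAD
    # round trip buys nothing and the setting is overridden.
    return configured and not initial_dig


def needs_refetch(headers, last_indexed, initial_dig=False):
    """Return True if the document body should be downloaded again.

    headers      -- response headers as a dict (hypothetical shape)
    last_indexed -- timezone-aware datetime of the previous dig
    """
    if initial_dig:
        return True
    last_modified = headers.get("Last-Modified")
    if last_modified is None:
        # No date to compare against, so play safe and refetch.
        return True
    return parsedate_to_datetime(last_modified) > last_indexed


# Example: a page last modified on 1 Jan 2003, previously dug in Feb 2003.
headers = {"Last-Modified": "Wed, 01 Jan 2003 00:00:00 GMT"}
previous_dig = datetime(2003, 2, 1, tzinfo=timezone.utc)
refetch = needs_refetch(headers, previous_dig)
```

  Over a persistent connection the HEAD lets us skip the body without
closing the connection, which is the overhead described above.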

>
> The critical part of the above, which I was trying to explain before, is
> point 4 (a).  If a document hasn't changed, htdig would need somehow to
> keep track of every link that document had to others, so that it could
> keep traversing the hierarchy of links as it crawls its way through
> to every "active" page on the site.  That would require additional
> information in the database that htdig doesn't keep track of right now.
> Right now, the only way to do a complete crawl is to reparse every
> document.

  Yep, this is true.  On the plus side, if we do keep and maintain that
list, I've got a stack of research papers talking about what can be done
with that list to make searching better.  It opens up a world of
possibilities for improving relevance ranking, learning relationships
between pages, etc.
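
  To make the idea concrete, here is a minimal sketch of a crawl that
reuses a stored link list for unchanged documents.  It is Python, not
htdig code, and everything in it is an assumption for illustration:
link_db stands in for the extra database information htdig doesn't keep
today, and is_modified and fetch_and_parse are hypothetical callbacks.

```python
from collections import deque


def crawl(start_urls, link_db, is_modified, fetch_and_parse):
    """Breadth-first crawl that reparses only new or modified documents.

    link_db         -- dict mapping URL -> list of outgoing links saved
                       by the previous dig (hypothetical store)
    is_modified     -- callable(url) -> bool, e.g. based on a HEAD request
    fetch_and_parse -- callable(url) -> list of links found in the page
    Returns (set of reachable URLs, list of URLs that were reparsed).
    """
    seen = set(start_urls)
    queue = deque(start_urls)
    reparsed = []
    while queue:
        url = queue.popleft()
        if url in link_db and not is_modified(url):
            # Unchanged document: follow the stored links, no reparse.
            links = link_db[url]
        else:
            # New or changed document: fetch, parse and update the store.
            links = fetch_and_parse(url)
            link_db[url] = links
            reparsed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen, reparsed
```

  With something like this the crawl still reaches every "active" page,
but only new or changed documents get reparsed.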

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485