From: Neal R. <ne...@ri...> - 2003-10-08 19:16:14
> I think this only became an issue because of persistent connections.
> Correct me if I'm wrong, but I think htdig's behaviour in the past
> (i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET,
> and upon seeing the headers if it decided it didn't need to refetch the
> file, it would simply close the connection right away and not read the
> stream of data for the file. No wasted bandwidth, but maybe it caused
> some unnecessary overhead on the server, which probably started serving
> up each file (including running CGI scripts if that's what made the page)
> before realising the connection was closed.

True, but if '-i' is given we can override the current setting and
force head_before_get=false.

> The critical part of the above, which I was trying to explain before, is
> point 4 (a). If a document hasn't changed, htdig would need somehow to
> keep track of every link that document had to others, so that it could
> keep traversing the hierarchy of links as it crawls its way through
> to every "active" page on the site. That would require additional
> information in the database that htdig doesn't keep track of right now.
> Right now, the only way to do a complete crawl is to reparse every
> document.

Yep, this is true. On the plus side, if we do keep and maintain that
list, I've got a stack of research papers talking about what can be done
with it to make searching better. It opens up a world of possibilities
for improving relevance ranking, learning relationships between pages,
etc.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
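
To make the link-list idea discussed above concrete, here is a rough sketch in Python of a crawl that reuses each unchanged document's saved links instead of reparsing it. This is not htdig code. The names incremental_crawl, stored_links, is_modified and fetch_links are all hypothetical, and the per-document link table stands in for the additional database information that htdig does not keep today.

```python
from collections import deque


def incremental_crawl(start_url, stored_links, is_modified, fetch_links):
    """Crawl from start_url, reparsing only documents that changed.

    stored_links: dict mapping URL -> list of outgoing links saved by a
                  previous crawl (hypothetical extra data, not something
                  htdig stores today).
    is_modified:  hypothetical callable, URL -> bool, e.g. backed by a
                  HEAD request or a modification-time comparison.
    fetch_links:  hypothetical callable, URL -> list of links obtained by
                  fetching and parsing the document.

    Returns (seen, reparsed): the set of every URL reached and the list
    of URLs that actually had to be fetched and parsed.
    """
    seen = {start_url}
    queue = deque([start_url])
    reparsed = []
    while queue:
        url = queue.popleft()
        if url in stored_links and not is_modified(url):
            # Unchanged document: follow the links saved last time.
            links = stored_links[url]
        else:
            # New or changed document: fetch, parse, refresh the table.
            links = fetch_links(url)
            stored_links[url] = list(links)
            reparsed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen, reparsed
```

With a table like this, a complete crawl still reaches every "active" page, but only new or changed documents cost a fetch and a parse. The same table is also the raw material for the link-based ranking work mentioned above.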