From: Gilles D. <gr...@sc...> - 2003-10-07 18:06:13
According to Gabriele Bartolini:
> > Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
> > server is kind enough to give us the timestamp on the document in the
> > header.  If the timestamps are the same we don't bother to download it.
>
> Yep, you are right. I remember that was one of the reasons why I wrote
> the code for the 'HEAD' method (also to avoid downloading an entire
> document whose content-type is not parsable).

I think this only became an issue because of persistent connections.
Correct me if I'm wrong, but I think htdig's behaviour in the past (i.e.
3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET, and
upon seeing the headers, if it decided it didn't need to refetch the file,
it would simply close the connection right away and not read the stream of
data for the file.  No wasted bandwidth, but maybe it caused some
unnecessary overhead on the server, which probably started serving up each
file (including running CGI scripts, if that's what made the page) before
realising the connection was closed.

Now, head_before_get=TRUE would add a bit of bandwidth for htdig -i,
because it would get the headers twice, but it's obviously a big advantage
for update digs, especially when using persistent connections.  Closing
the connection whenever htdig decides not to fetch a file would kill any
speed advantage you'd otherwise gain from persistent connections, not to
mention the extra load on the server.

> > > I think you misinterpreted what Lachlan suggested, i.e. the case
> > > where Y does NOT change.  If Y is the only document with a link to X,
> > > and Y does not change, it will still have the link to X, so X is
> > > still "valid".  However, if Y didn't change, and htdig (without -i)
> > > doesn't reindex Y, then how will it find the link to X to validate
> > > X's presence in the db?
>
> I must admit I am not very comfortable with the incremental indexing
> code.  Anyway, when I was thinking of the same procedure for ht://Check
> (not yet done, as I said) I came up with this (I will try to stay on a
> logical level):
>
> 1) As you said, mark all the documents as, let's say, 'Reference_obsolete'
> 2) Read the start URL and mark all the URLs in the start URL to be
>    retrieved (possibly adding them to the index of documents)
> 3) Loop until there are no URLs to be retrieved
> 4) For every URL, through a pre-emptive HEAD call, find out whether it
>    has changed:
>    a) not changed: get all the URLs it links to and mark them "to be
>       retrieved" or something like that
>    b) changed: download it again and mark all the new links as "to be
>       retrieved"
> 5) Purge all the obsolete URLs
>
> This approach would solve your second "flaw", Neal (I guess so).

The critical part of the above, which I was trying to explain before, is
point 4 (a).  If a document hasn't changed, htdig would need somehow to
keep track of every link that document had to others, so that it could
keep traversing the hierarchy of links as it crawls its way through to
every "active" page on the site.  That would require additional
information in the database that htdig doesn't keep track of right now.
Right now, the only way to do a complete crawl is to reparse every
document.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
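
To make the head_before_get discussion above concrete, here is a minimal
sketch, in Python rather than htdig's C++, of a HEAD-before-GET check over
a single persistent HTTP/1.1 connection.  The function name, host and
stored-timestamp variable are hypothetical, not htdig code; it only
illustrates why skipping the GET leaves the connection usable for the
next URL.

import http.client

def fetch_if_modified(conn, path, last_modified_seen):
    """HEAD first; GET the body only if Last-Modified changed or is absent."""
    conn.request("HEAD", path)
    head = conn.getresponse()
    head.read()                       # drain the (empty) response so the same
                                      # connection can carry the next request
    last_modified = head.getheader("Last-Modified")
    if last_modified is not None and last_modified == last_modified_seen:
        return None                   # unchanged: skip the GET, connection
                                      # stays open for the next URL

    conn.request("GET", path)         # changed, or no timestamp: fetch the body
    return conn.getresponse().read()

# Hypothetical usage:
# conn = http.client.HTTPConnection("www.example.com")
# body = fetch_if_modified(conn, "/index.html", stored_last_modified)

Doing a GET and closing the connection early, as in the old behaviour
described above, would save the same bandwidth but force a new TCP
connection (and possibly another CGI run on the server) for every
subsequent URL.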
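
And here is a rough sketch, under the same caveat, of the five-step
procedure Gabriele outlines, including the per-document list of outgoing
links that point 4 (a) would require htdig to keep.  The db dictionary and
the changed/download/parse_links callables are made-up stand-ins, not
htdig's actual data structures.

def incremental_crawl(db, start_url, changed, download, parse_links):
    # 1) mark every known document as obsolete
    for doc in db.values():
        doc["obsolete"] = True

    # 2) seed the list of URLs to be retrieved with the start URL
    queue = [start_url]

    # 3) loop until there are no URLs left to retrieve
    while queue:
        url = queue.pop()
        known = url in db
        doc = db.setdefault(url, {"links": [], "obsolete": True})
        if not doc["obsolete"]:
            continue                  # already validated on this pass
        doc["obsolete"] = False

        # 4) pre-emptive HEAD call: has the document changed?
        if known and not changed(url):
            links = doc["links"]      # 4a) walk the stored outgoing links
        else:
            body = download(url)      # 4b) new or modified: fetch and reparse
            links = parse_links(body)
            doc["links"] = links      # the extra per-document information
                                      # htdig would have to start keeping
        queue.extend(links)

    # 5) purge everything still marked obsolete
    for url in [u for u, d in db.items() if d["obsolete"]]:
        del db[url]

The doc["links"] list is the whole point: without it, step 4 (a) has
nothing to traverse when a document is unchanged, which is exactly why a
complete crawl currently has to reparse every document.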