From: Lachlan A. <lh...@us...> - 2003-10-05 08:08:14
Greetings Neal,
On Sat, 4 Oct 2003 11:00, Neal Richter wrote:
> If the timestamps are the same we don't bother to download it.
>
> > I think you misinterpreted what Lachlan suggested, i.e. the case
> > where Y does NOT change. If Y is the only document with a link
> > to X, and Y does not change, it will still have the link to X, so
> > X is still "valid". However, if Y didn't change, and htdig
> > (without -i) doesn't reindex Y, then how will it find the link to
> > X to validate X's presence in the db?
>
> Changing Y is the point!
Agreed, changing Y is what triggers the current bug. However, I
believe that a simple fix of the current bug will introduce a *new*
bug for the more common case that Y *doesn't* change. Reread
Gilles's scenario and try to answer his question. I'd explain it
more clearly, but I don't have a napkin handy :)
If we get around to implementing Google's link analysis, as Geoff
suggested, then we may be able to fix the problem properly. It seems
that any fix will have to look at all links *to* a page, and then
mark as "obsolete" those *links* where (a) the link-from page ("Y")
has changed and (b) it no longer contains the link. After the dig,
every page in the database must be checked, and any page left with
no non-obsolete links can itself be marked as obsolete.
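In rough C++, I picture something like this (a sketch only -- the
Link structure and function names here are invented for
illustration, not htdig's real data structures):

    #include <string>
    #include <set>
    #include <vector>

    // Hypothetical link record: one entry per link stored in the db.
    struct Link {
        std::string from;     // URL of the page containing the link ("Y")
        std::string to;       // URL of the page linked to ("X")
        bool        obsolete;
    };

    // Pass 1: when a *changed* page Y is re-parsed, mark as obsolete
    // every stored link from Y whose target no longer appears in Y's
    // current contents.
    void markObsoleteLinks(const std::string &y,
                           const std::set<std::string> &currentTargets,
                           std::vector<Link> &links)
    {
        for (size_t i = 0; i < links.size(); ++i)
            if (links[i].from == y &&
                currentTargets.find(links[i].to) == currentTargets.end())
                links[i].obsolete = true;
    }

    // Pass 2: after the dig, a page with no remaining non-obsolete
    // incoming links can itself be marked obsolete.
    bool pageIsObsolete(const std::string &x,
                        const std::vector<Link> &links)
    {
        for (size_t i = 0; i < links.size(); ++i)
            if (links[i].to == x && !links[i].obsolete)
                return false;
        return true;
    }

The point being that an *unchanged* Y never touches its stored
links, so X's link from Y stays valid without refetching Y.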
> However I would strongly recommend we enable head_before_get by
> default. We're basically wasting bandwidth like drunken sailors
> with it off!!!
Good suggestion. If we can tolerate some code bloat, we could have
an "auto" mode, which would use head_before_get except when -i is
specified (since with -i we'll always have to do the GET anyway)...
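Roughly (again a sketch with invented names; the real option
plumbing would live in the retriever/config code):

    #include <string>

    // Hypothetical "auto" logic for head_before_get.
    bool useHeadBeforeGet(const std::string &mode, bool initialDig /* -i */)
    {
        if (mode == "true")  return true;
        if (mode == "false") return false;
        // "auto": do a HEAD first so unchanged documents are never
        // fetched -- except with -i, where every document is GET'd
        // regardless, so the extra HEAD is just a wasted round trip.
        return !initialDig;
    }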
Cheers,
Lachlan
-- 
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)