From: Lachlan A. <lh...@us...> - 2003-10-05 08:08:14
Greetings Neal,
On Sat, 4 Oct 2003 11:00, Neal Richter wrote:
> If the timestamps are the same we don't bother to download it.
>
> > I think you misinterpreted what Lachlan suggested, i.e. the case
> > where Y does NOT change. If Y is the only document with a link
> > to X, and Y does not change, it will still have the link to X, so
> > X is still "valid". However, if Y didn't change, and htdig
> > (without -i) doesn't reindex Y, then how will it find the link to
> > X to validate X's presence in the db?
>
> Changing Y is the point!
Agreed, changing Y is what triggers the current bug. However, I
believe that a simple fix of the current bug will introduce a *new*
bug for the more common case that Y *doesn't* change. Reread
Gilles's scenario and try to answer his question. I'd explain it
more clearly, but I don't have a napkin handy :)
If we get around to implementing Google's link analysis, as Geoff
suggested, then we may be able to fix the problem properly. It seems
that any fix will have to look at all links *to* a page, and then
mark as "obsolete" those *links* where (a) the link-from page ("Y")
has changed and (b) it no longer contains the link. After the dig,
every page in the database must be checked, and any page left with
no non-obsolete links can itself be marked as obsolete.
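In rough C++, I picture something like this (a sketch only -- the
Link structure and function names here are invented for
illustration, not htdig's real data structures):

    #include <string>
    #include <set>
    #include <vector>

    // Hypothetical link record: one entry per link stored in the db.
    struct Link {
        std::string from;     // URL of the page containing the link ("Y")
        std::string to;       // URL of the page linked to ("X")
        bool        obsolete;
    };

    // Pass 1: when a *changed* page Y is re-parsed, mark as obsolete
    // every stored link from Y whose target no longer appears in Y's
    // current contents.
    void markObsoleteLinks(const std::string &y,
                           const std::set<std::string> &currentTargets,
                           std::vector<Link> &links)
    {
        for (size_t i = 0; i < links.size(); ++i)
            if (links[i].from == y &&
                currentTargets.find(links[i].to) == currentTargets.end())
                links[i].obsolete = true;
    }

    // Pass 2: after the dig, a page with no remaining non-obsolete
    // incoming links can itself be marked obsolete.
    bool pageIsObsolete(const std::string &x,
                        const std::vector<Link> &links)
    {
        for (size_t i = 0; i < links.size(); ++i)
            if (links[i].to == x && !links[i].obsolete)
                return false;
        return true;
    }

The point being that an *unchanged* Y never touches its stored
links, so X's link from Y stays valid without refetching Y.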
> However I would strongly recommend we enable head_before_get by
> default. We're basically wasting bandwidth like drunken sailors
> with it off!!!
Good suggestion. If we can tolerate some code bloat, we could have
an "auto" mode, which would use head_before_get except when -i is
specified (since with -i we'll always have to do the GET anyway)...
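Roughly (again a sketch with invented names; the real option
plumbing would live in the retriever/config code):

    #include <string>

    // Hypothetical "auto" logic for head_before_get.
    bool useHeadBeforeGet(const std::string &mode, bool initialDig /* -i */)
    {
        if (mode == "true")  return true;
        if (mode == "false") return false;
        // "auto": do a HEAD first so unchanged documents are never
        // fetched -- except with -i, where every document is GET'd
        // regardless, so the extra HEAD is just a wasted round trip.
        return !initialDig;
    }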
Cheers,
Lachlan
-- 
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)