From: Gilles D. <gr...@sc...> - 2003-10-03 18:13:12
|
According to Neal Richter:
> On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> > I'm not sure that I understand this. If a page 'X' is linked only by
> > a page 'Y' which isn't changed since the previous dig, do we parse
> > the unchanged page 'Y'? If so, why not run htdig -i? If not, how
> > do we know that page 'X' should still be in the database?
>
> X does not change, but Y does.. it no longer has a link to X.
>
> If the website is big enough, htdig -i is wasteful of network bandwidth.
>
> The logical error as I see it is that we revisit the list of documents
> currently in the index, rather than starting from the beginning and
> spidering... then removing all the documents we didn't find links for.

But if we need to re-spider everything, don't we need to re-index all documents, whether they've changed or not? If so, then we need to do htdig -i all the time. If we don't reparse every document, we need some other means to re-validate every document to which an unchanged document has links.

I think you misinterpreted what Lachlan suggested, i.e. the case where Y does NOT change. If Y is the only document with a link to X, and Y does not change, it will still have the link to X, so X is still "valid". However, if Y didn't change, and htdig (without -i) doesn't reindex Y, then how will it find the link to X to validate X's presence in the db?

> > I'd be inclined not to fix this until after we've released the next
> > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...

I'd be inclined to agree. If it comes down to the possibility of losing valid documents in the db vs. keeping invalid ones, I'd prefer the latter behaviour. Until we can find a way to ensure all currently linked documents remain in the db, without having to reparse them all, I think the current behaviour is the best compromise. If you want to reparse everything to ensure a clean db with accurate linkages, that's what -i is for.

A somewhat related problem/limitation in update digs is that the backlink count and link depth from start_url may not get properly updated for documents that aren't reparsed. If these matter to you, periodic full digs may be needed to restore the accuracy of these fields.

--
Gilles R. Detillieux                E-mail: <gr...@sc...>
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba    Winnipeg, MB  R3E 3J7  (Canada)
|
From: Neal R. <ne...@ri...> - 2003-10-03 17:54:00
|
On Fri, 3 Oct 2003, Steve Eidemiller wrote:
> Hi Lachlan,
>
> Thanks for the suggestion :-) I tried it in all the environments I listed earlier, but it didn't appear to work using the default settings for the compression flags. Here's what htdb_dump reports for all attempts:
>
> ================
> C:\htdig\bin>htdb_dump -Wz -p c:/htdig/var/htdig/db.words.db
> htdb_dump: open: c:/htdig/var/htdig/db.words.db: No such file or directory
>
> C:\htdig\bin>htdb_dump -W -p c:/htdig/var/htdig/db.words.db
> htdb_dump: c:/htdig/var/htdig/db.words.db: file size not a multiple of the pagesize

This is strange given the above error. IF this error is accurate, it's a harbinger of bad values in the DB.

I have fixed this and thought I checked it in! Basically I tracked the state of the file pointer, and at some point the system tweaks it to 'text' mode, and this hoses the DB. I'll check CVS to see if I got that fix in. If the fix is in CVS, then it's a new bug!

> On Fri, 3 Oct 2003 02:51, Steve Eidemiller wrote:
> > I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc
> > 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4.
> > db.words.db is always a zero length file
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
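The usual guard against this class of corruption on Win32/Cygwin builds is to force the descriptor behind the Berkeley DB file into binary mode so CR/LF translation can never touch page data. The sketch below only illustrates that idea; the fd handle and where such a hook would sit in htdig's DB layer are assumptions, and this is not the fix Neal refers to.

#if defined(_WIN32)
#include <io.h>
#include <fcntl.h>

// Illustration only: keep a DB file descriptor in binary mode on Win32.
// _setmode()/_O_BINARY are the standard MSVC/MinGW calls; the call site
// inside htdig's DB layer is an assumption.
static void force_binary_mode(int fd)
{
    // In text mode, 0x0A bytes are rewritten as 0x0D 0x0A on write, which can
    // leave the file size no longer a multiple of the page size -- the same
    // symptom htdb_dump reports above.
    _setmode(fd, _O_BINARY);
}
#endif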
|
From: Neal R. <ne...@ri...> - 2003-10-03 17:46:45
|
On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> Greetings Neal,
>
> I'm not sure that I understand this. If a page 'X' is linked only by
> a page 'Y' which isn't changed since the previous dig, do we parse
> the unchanged page 'Y'? If so, why not run htdig -i? If not, how
> do we know that page 'X' should still be in the database?

X does not change, but Y does.. it no longer has a link to X.

If the website is big enough, htdig -i is wasteful of network bandwidth.

The logical error as I see it is that we revisit the list of documents
currently in the index, rather than starting from the beginning and
spidering... then removing all the documents we didn't find links for.

> I'd be inclined not to fix this until after we've released the next
> "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
> Cheers,
> Lachlan
>
> On Fri, 3 Oct 2003 08:56, Neal Richter wrote:
> > The workaround is to use 'htdig -i'. This is a disadvantage as we
> > will revisit and index pages even if they haven't changed since the
> > last run of htdig.
> >
> > Here's the Fix:
> >
> > 1) At the start of Htdig, after we've opened the DBs we 'walk' the
> > docDB and mark EVERY document as Reference_obsolete. I wrote code
> > to do this.. very short.
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Steve E. <Ste...@ch...> - 2003-10-03 15:37:01
|
Hi Lachlan,
Thanks for the suggestion :-) I tried it in all the environments I listed earlier, but it didn't appear to work using the default settings for the compression flags. Here's what htdb_dump reports for all attempts:
================
C:\htdig\bin>htdb_dump -Wz -p c:/htdig/var/htdig/db.words.db
htdb_dump: open: c:/htdig/var/htdig/db.words.db: No such file or directory
C:\htdig\bin>htdb_dump -W -p c:/htdig/var/htdig/db.words.db
htdb_dump: c:/htdig/var/htdig/db.words.db: file size not a multiple of the pagesize
htdb_dump: open: c:/htdig/var/htdig/db.words.db: Invalid argument
C:\htdig\bin>
================
The db.words.db.work_weakcmpr file gets created now, and words.db has size to it, but it still seems like it's corrupt or something since I can't dump it. Perhaps I've used the wrong command? htsearch doesn't seem to like words.db either:
================
C:\htdig\bin>htsearch
Enter value for words: patients
WordDB: DB->cursor: method meaningless before open
Content-type: text/html
================
I'm working on the Win32 native build as Neal suggested.
Thanx
>>> Lachlan Andrew <lh...@us...> 10/03/03 08:18AM >>>
Greetings Steve,
Thanks for the very clear bug report. Someone else has the same
problem. It's bug #814268...
This may be my fault. What happens if you replace the NULL in line
806 of db/mp_cmpr.c by dbenv ? That is, make it
if(CDB_db_create(&dbp, dbenv, 0) != 0
That was changed to avoid the possibility of infinite loops, but is a
bit of a kludge. If making the change described above works, then
I'll try to fix it properly.
Cheers,
Lachlan
On Fri, 3 Oct 2003 02:51, Steve Eidemiller wrote:
> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc
> 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4.
> db.words.db is always a zero length file
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Gabriele B. <bar...@in...> - 2003-10-03 14:42:43
|
Hi guys,
well ... I really like your idea Neal (I got a similar one for
ht://Check, but I have never had the time to realise that!).
However, I agree with Lachlan. I'd prefer to wait until we release this
*blessed* 3.2.0b5 version, hopefully soon.
Any other opinions?
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Lachlan A. <lh...@us...> - 2003-10-03 13:21:00
|
Greetings Steve,
Thanks for the very clear bug report. Someone else has the same
problem. It's bug #814268...
This may be my fault. What happens if you replace the NULL in line
806 of db/mp_cmpr.c by dbenv ? That is, make it
if(CDB_db_create(&dbp, dbenv, 0) != 0
That was changed to avoid the possibility of infinite loops, but is a
bit of a kludge. If making the change described above works, then
I'll try to fix it properly.
Cheers,
Lachlan
On Fri, 3 Oct 2003 02:51, Steve Eidemiller wrote:
> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc
> 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4.
> db.words.db is always a zero length file
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Lachlan A. <lh...@us...> - 2003-10-03 13:10:03
|
Greetings Neal,

I'm not sure that I understand this. If a page 'X' is linked only by
a page 'Y' which isn't changed since the previous dig, do we parse
the unchanged page 'Y'? If so, why not run htdig -i? If not, how
do we know that page 'X' should still be in the database?

I'd be inclined not to fix this until after we've released the next
"archive point", whether that be 3.2.0b5 or 3.2.0rc1...

Cheers,
Lachlan

On Fri, 3 Oct 2003 08:56, Neal Richter wrote:
> The workaround is to use 'htdig -i'. This is a disadvantage as we
> will revisit and index pages even if they haven't changed since the
> last run of htdig.
>
> Here's the Fix:
>
> 1) At the start of Htdig, after we've opened the DBs we 'walk' the
> docDB and mark EVERY document as Reference_obsolete. I wrote code
> to do this.. very short.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
From: Jessica B. <jes...@ya...> - 2003-10-03 00:49:17
|
--- Neal Richter <ne...@ri...> wrote:
> Hey all,
> I've got a question for all of you about how the htdig 'indexer'
> should function.
> I've tested this fix and it works.
>
> Eh?

I felt like I was sharing a beer with you at the pub, and you just got done "schematicizing" the problem and fix on a napkin-coaster and ended it with, "Eh?"

Sounds like a good fix to a problem that I think (subconsciously) I knew existed.

How about this one -- does your patch help with the check_unique_md5 problem? Even when I use a "-i" option (or without), if the start_url's MD5 hash-sig matches the one from my previous index, it just says that it detected an MD5 duplicate and exits. Deleting db.md5hash.db seems to do the trick. But would that be sacrilege, removing the db.md5hash.db before a refresh?

-Jes
|
From: Neal R. <ne...@ri...> - 2003-10-02 22:58:03
|
Hey all,
I've got a question for all of you about how the htdig 'indexer'
should function.
htdig.cc
337 List *list = docs.URLs();
338 retriever.Initial(*list);
339 delete list;
340
341 // Add start_url to the initial list of the retriever.
342 // Don't check a URL twice!
343 // Beware order is important, if this bugs you could change
344 // previous line retriever.Initial(*list, 0) to Initial(*list,1)
345 retriever.Initial(config->Find("start_url"), 1);
Note lines 337-339. This code loads the entire list of documents
currently in the index and feeds this to the retriever object for
retrieval and processing.
The effect of this is that we potentially are visiting and keeping
webpages that we aren't about to find via a link, and we will keep
revisiting a website even if we remove it from the 'start_url' in
htdig.conf.
The workaround is to use 'htdig -i'. This is a disadvantage as we will
revisit and index pages even if they haven't changed since the last run of
htdig.
Here's the Fix:
1) At the start of Htdig, after we've opened the DBs we 'walk' the docDB
and mark EVERY document as Reference_obsolete. I wrote code to do this..
very short.
2) Comment out htdig.cc 337-339
3) When the indexer fires up and spiders a site, documents that are in
the tree and marked as Reference_obsolete are remarked as
Reference_normal.
4) When htpurge is run, the obsoleted docs are flushed.
Documents that aren't revisited (since a link isn't found) are flushed.
This fix addresses two flaws:
1) Changing 'start_url' and removing a starting URL: the documents are
still in the index after the next run of htdig (unless you use -i).
2) Pages that still exist on a webserver at a given URL, but are no longer
linked to by any other pages on the site.
I've tested this fix and it works.
Eh?
Thanks.
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
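The step-1 walk described above can be pictured roughly as follows. The sketch reuses docs.URLs() and Reference_obsolete from the quoted code, but the record lookup, the DocState() setter and the write-back call are assumed names for illustration; this is not the actual patch.

#include "DocumentDB.h"
#include "DocumentRef.h"
#include "List.h"
#include "htString.h"

// Sketch: flag every record in the docDB as obsolete before the dig starts.
// The spider then flips re-found documents back to Reference_normal, and
// htpurge drops whatever is still marked obsolete.
static void MarkAllObsolete(DocumentDB &docs)
{
    List *urls = docs.URLs();               // every URL currently in the index
    String *url;
    urls->Start_Get();
    while ((url = (String *) urls->Get_Next()) != 0)
    {
        DocumentRef *ref = docs[*url];      // assumed: fetch the record for this URL
        if (ref == 0)
            continue;
        ref->DocState(Reference_obsolete);  // assumed setter overload of DocState()
        docs.Add(*ref);                     // assumed: write the updated record back
        delete ref;
    }
    delete urls;
}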
|
|
From: Neal R. <ne...@ri...> - 2003-10-02 21:43:30
|
Hey,

I have produced a set of makefiles for native Windows binaries. You do need Cygwin to run 'make' (the makefiles are for GNU make). The makefiles use the Microsoft compiler.

Could you get a copy of the latest snapshot and try to do the build? I'll work with you to get it fixed if it's still broken. We've tested older snapshots of HtDig compiled Win32 native and run nearly a million documents through it....

If this doesn't satisfy your needs, I'd be willing to put in some time looking at the Cygwin build.

Neal Richter

On Thu, 2 Oct 2003, Steve Eidemiller wrote:
> I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation is not a problem. But db.words.db is always a zero length file after running htdig with the compression flags at their default values. After some profiling I also noticed that it wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file *is* created during the dig and db.words.db has size to it afterwards. However, I am not able to htdb_dump that file or use htsearch against it. It's corrupt or something. The other db files seem to get created fine under both sets of binaries, although I didn't try to dump them. And the same version related behavior occurs under both XP and 2000 OS's.
>
> After reading all the SF posts about compression and db issues, I decided to disable compression and see what happens:
>
> wordlist_compress: false
> wordlist_compress_zlib: false
> compression_level: 0
>
> With those settings, everything appears to work fine for both sets of binaries: I can dig pages and run htsearch. I haven't modified any of the code to try and address the problem yet, but it looks like others are having similar issues on other platforms? Is anybody else having trouble with db compression on Windows? I have tried different settings for compression_level with no success.
>
> Also, my initial attempts at changing the compression flag values failed with error messages from htdig while trying to read the configuration file. It seems that the htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are obvious choices for editing this file on Windows, but those don't work because both insert CRLF pairs to terminate lines in the file (e.g. DOS format). And then the parser apparently won't see flags at the bottom of the CRLF file. The solution was a simple JavaScript program to modify htdig.conf by removing all CR characters *before* running htdig. Is anybody else seeing this on Cygwin builds?
>
> Sorry for the long post :)
>
> PS - I'm running 3.1.6 in production on Windows at http://www.childrenshc.org/Search/ and it rocks!!
>
> Thanx
> -Steve

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Steve E. <Ste...@ch...> - 2003-10-02 16:52:20
|
I'm compiling htdig-3.2.0b4-20090928 under Cygwin 1.5.5 using gcc 3.3.1, on both Windows XP Pro SP1 and Windows 2000 Server SP4. Compiling and installation is not a problem. But db.words.db is always a zero length file after running htdig with the compression flags at their default values. After some profiling I also noticed that it wasn't creating the "db.words.db.work_weakcmpr" file during the dig. When compiled under Cygwin 1.3.22 using "gcc-3.2 20020927 (prerelease)", the work file *is* created during the dig and db.words.db has size to it afterwards. However, I am not able to htdb_dump that file or use htsearch against it. It's corrupt or something. The other db files seem to get created fine under both sets of binaries, although I didn't try to dump them. And the same version related behavior occurs under both XP and 2000 OS's.

After reading all the SF posts about compression and db issues, I decided to disable compression and see what happens:

wordlist_compress: false
wordlist_compress_zlib: false
compression_level: 0

With those settings, everything appears to work fine for both sets of binaries: I can dig pages and run htsearch. I haven't modified any of the code to try and address the problem yet, but it looks like others are having similar issues on other platforms? Is anybody else having trouble with db compression on Windows? I have tried different settings for compression_level with no success.

Also, my initial attempts at changing the compression flag values failed with error messages from htdig while trying to read the configuration file. It seems that the htdig.conf parser doesn't like CR (ASCII=13) characters. Notepad and Wordpad are obvious choices for editing this file on Windows, but those don't work because both insert CRLF pairs to terminate lines in the file (e.g. DOS format). And then the parser apparently won't see flags at the bottom of the CRLF file. The solution was a simple JavaScript program to modify htdig.conf by removing all CR characters *before* running htdig. Is anybody else seeing this on Cygwin builds?

Sorry for the long post :)

PS - I'm running 3.1.6 in production on Windows at http://www.childrenshc.org/Search/ and it rocks!!

Thanx
-Steve
|
From: Mirrors A. <mi...@so...> - 2003-10-02 09:37:02
|
Hi there,

We have just set up a new Ht://Dig mirror in the UK (London), which is updated nightly and is on a 2Mbit dedicated line.

Ht://Dig Web Site: http://www.sourcekeg.co.uk/htdig/
Ht://Dig Files Web Site: http://www.sourcekeg.co.uk/htdig/files/
Ht://Dig Patch Web Site: http://www.sourcekeg.co.uk/htdig/htdig-patches/
Ht://Dig Developer Web/FTP Site: http://www.sourcekeg.co.uk/htdig/dev/

Please add this to the Ht://Dig official mirror page accordingly and use "Onino" in the Organisation field, pointing to http://www.onino.co.uk

If you need a contact email address, you can use mi...@so...

Best Regards,
Jonathan Menmuir
mi...@so...
|
From: Geoff H. <ghu...@us...> - 2003-09-29 20:55:00
|
STATUS of ht://Dig branch 3-2-x

CHECKLIST FOR 3.2.0b5:
* Check bugs listed in bug-tracker...
* Polish release docs (Geoff)
* Must be able to (a) make check and (b) index www.htdig.org using "robotstxt_name: master-htdig" on all systems listed as "supported".

Systems tested so far:
- RH AdvancedServer 2.1 ItaniumII, gcc 2.96.x (David Bannon, 21 Sep)

Very out-of-date tests:
- Mandrake 8.2, gcc 3.2 (lha, 21 May)
- FreeBSD 4.6, gcc 2.95.3 (lha, 23 May)
- Debian, Linux kernel 2.2.19, gcc 2.95.4 (lha, 23 May)
- SunOS 5.8 = Solaris 2.8, gcc 3.1 (lha, 25 May)
- SunOS 5.8 = Solaris 2.8, Sun cc with g++ 3.1 (lha, 29 May)
- OS X (Jim, 30 May)

Partly tested:
- RedHat 8 (Jim, 1 June. make check requires tweaking for apache)
- SunOS 5.8 = Solaris 2.8, gcc 2.95.2 (lha. Makes check minus apache, Digs small htdig.org. 27 May)
- SunOS 5.8 = Solaris 2.8, Sun cc with g++ 2.95.2 (lha. Makes check minus apache, Digs small htdig.org. 2 June)
- RedHat 7.3 (lha. Makes check minus apache. Digs small htdig.org. 25 May)
- Alpha Debian (lha. Makes check minus apache. Digs small htdig.org. 25 May)

To be tested:
- HP-UX 10.20, gcc 2.8.1 (Jesse)
- RedHat, other versions anyone?

Known to have problems:
- SGI/Irix 6.5.3 using SGI compilers <http://www.geocrawler.com/mail/msg.php3?msg_id=8025827&list=8825>

RELEASES:
3.2.0b5: Next release, July 2003
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.

(Please note that everything added here should have a tracker PR# so we can be sure they're fixed. Geoff is currently trying to add PR#s for what's currently here.)

SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295) -- Does Neal's new zlib patch solve this for now?

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress set but work fine without wordlist_compress. (the date is definitely stored correctly, even with compression on so this must be some sort of weird htsearch bug) PR#618737.
* META descriptions are somehow added to the database as FLAG_TITLE, not FLAG_DESCRIPTION. (PR#618738)
  Can anyone reproduce this? I can't! -- Lachlan
  Me either. Let's remove the PR. -Geoff

PENDING PATCHES (available but need work):
* Additional support for Win32. (Neal)
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.

NEEDED FEATURES:
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump and compare)
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen, argument handling for parser/converter, allowing binary output from an external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Document all of htsearch's mappings of input parameters to config attributes to template variables. (Relates to PR#405278.) Should we make sure these config attributes are all documented in defaults.cc, even if they're only set by input parameters and never in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and defaults.cc.
* require.html is not updated to list new features and disk space requirements of 3.2.x (e.g. regex matching, database compression.) PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and completions.
  I've tried. Someone "official" please check and remove this -- Lachlan
* Htfuzzy could use more documentation on what each fuzzy algorithm does. PR#405714.
* Document the list of all installed files and default locations. PR#405715.

OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
From: Jim C. <li...@yg...> - 2003-09-27 06:51:07
|
On Friday, September 26, 2003, at 11:40 AM, Jessica Biola wrote:
> In any case, I compiled gcc-3.3.1 to not be the main
> native compiler, but rather, into the prefix:
> /usr/gcc-3.3.1.
>
> Can someone help me with the correct CPP CXX
> LIBRARY_PATH compiler environment settings that I
> should be setting?

I would start by setting CC and CXX to the paths of the alternate gcc and g++ compilers, respectively. Often that is all that is needed.

If you run into problems with deprecated headers, see http://www.htdig.org/FAQ.html#q3.8. However, it is my understanding that this should no longer be an issue if you are using a current snapshot.

Jim
|
From: Jessica B. <jes...@ya...> - 2003-09-26 17:41:14
|
I just downloaded the latest snapshot and am trying to compile with gcc-3.3.1. I was using 2.95.3 but had some thoughts regarding better compiler optimizations after reading some posts by Neal and the responses (re: the String.cc changes he made).

In any case, I compiled gcc-3.3.1 to not be the main native compiler, but rather, into the prefix: /usr/gcc-3.3.1.

Can someone help me with the correct CPP CXX LIBRARY_PATH compiler environment settings that I should be setting?

Thanks,
-Jes
|
From: Frank L. <Fr...@am...> - 2003-09-26 17:09:10
|
Gentlemen;

I was wondering if your HTdig search engine can be modified just to search a page of my web site rather than the whole site. If this is possible, which I feel it should be, could you email me instructions on how to do it?

<<ht--Dig WWW Search.htm>>

Frank Leff
Office Manager
AMW Ougheltree & Associates
197 Cedar Lane
Teaneck NJ 07666
www.amwocorp.com
tel: 201-836-6257 x100
fax: 201-836-6258

Please note: If your fax machine has a flash key or subaddress key, you can send a fax directly into my email. Dial 201-836-8544; then press the flash key; then dial 100 and start.
|
From: Lachlan A. <lh...@us...> - 2003-09-26 13:56:05
|
Greetings Jesse,
Most frustrating... Out of interest, what happens if you type
cd test
cp /bin/true testnet
make check
? That should cause the failure of all tests which require testnet,
but at least it may let you run the other tests, or uncover other
bugs.
The reason I asked about shared libraries was that Gabriele's recent
upgrade of configure has fixed them on the Mac. Have you tried
since the upgrade?
Cheers,
Lachlan
On Tue, 23 Sep 2003 23:22, Jesse op den Brouw wrote:
> Lachlan Andrew wrote:
> >Out of interest, can you compile it using shared libraries?
>
> Still doesn't work.. Same error.
> Shared libs won't work on UX. Not for ages......
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Lachlan A. <lh...@us...> - 2003-09-26 11:39:58
|
That sounds like a good strategy. However, I'd vote for keeping that
until after the release of 3.2.0 (or at least 3.2.0b5!)

Should we perhaps start a new branch in CVS so that development can
continue? I have a couple of patches that I have been sitting on for
ages, because of the pending release.

Any news on when the release is likely to be, or what I can do to
expedite it? I plan to test and commit the patch which Jesse says
works on HP-UX this weekend, unless we're already in code freeze.

Cheers,
Lachlan

On Fri, 26 Sep 2003 04:40, Neal Richter wrote:
> it would take a
> walk of the docdb to accomplish a fix! We would need to tag every
> document as 'obsolete' and let the spiderer set the values back to
> 'normal' as they see the pages. After it's finished and htpurge is
> run, the 'lost' pages are killed.

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
From: Gabriele B. <bar...@in...> - 2003-09-26 01:23:05
|
Hi!

At 20.59 24/09/2003 +0530, Sunil Raskar wrote:
> We are looking for reusable search code which has the features below.
> The search code must be configurable to search a set of static web pages
> located on a number of web sites which are hosted on a number of servers.
> It should have the capability to allow the user to search only certain subsets
> of the sites, for example to search only one of the counties served by the site.
> The search must be easily configurable so they can change the scope of the
> search as the network of sites served grows.
> Please let us know whether this can be implemented using your
> code (ht://Dig system) or not?

Yes ... ht://Dig answers all of your questions so far.

> If yes, can you provide us a demo version of it?

Well .... ht://Dig is probably the most used free search engine on the planet. Just take a look at the 'uses' section of the website and at some of the real uses of the system. Just click here: http://www.htdig.org/uses.html

> How much will the ht://Dig system cost us?

Absolutely nothing; of course, your time for getting to know it, and that's all. ht://Dig is open-source.

Ciao ciao
-Gabriele

--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno
|
From: Neal R. <ne...@ri...> - 2003-09-25 18:41:47
|
On Tue, 23 Sep 2003, Lachlan Andrew wrote:
> On Sun, 21 Sep 2003 07:55, Neal Richter wrote:
> > I've got a fix for it.. a couple lines of code in the section that
> > builds the linked list of search results...
>
> That sounds great. If it checks the search results, I take it that it
> doesn't purge the pages from the database itself. What is the patch?
Oops, I misspoke... I don't have a fix for that.. it would take a walk
of the docdb to accomplish a fix! We would need to tag every document as
'obsolete' and let the spiderer set the values back to 'normal' as they
see the pages. After it's finished and htpurge is run, the 'lost' pages
are killed.
I do have a short fix for another related issue:
Parser::parse(...)
There is a bug in this for loop:
    for (int i = 0; i < elements->Count(); i++)
    {
      dm = (DocMatch *) (*elements)[i];
      dm->collection = collection; // back reference
      if (dm->orMatches > 1)
        dm->score *= multimatch_factor;
      resultMatches.add(dm);
    }
If the query returned any Documents with a DocState of !=
Reference_normal, they are included in the linked-list of results. They
are filtered out on display... this is a bit inefficient. It also screws
up libhtdig results since I don't use display.
Here's the fix; it excludes any document that is not Reference_normal from
the results list.
    for (int i = 0; i < elements->Count(); i++)
    {
      dm = (DocMatch *) (*elements)[i];
      ref = collection->getDocumentRef(dm->GetId());
      if (ref->DocState() == Reference_normal)
      {
        dm->collection = collection; // back reference
        if (dm->orMatches > 1)
          dm->score *= multimatch_factor;
        resultMatches.add(dm);
      }
    }
Thanks
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Jim C. <li...@yg...> - 2003-09-25 05:48:23
|
The problem is not unique to your configuration. I see the same thing even if I delete my copy of the code entirely and start over from scratch.

Jim

On Sunday, September 21, 2003, at 10:20 AM, Ted Stresen-Reuter wrote:
> Should the permissions thing be logged as a bug or is this a problem
> with my personal configuration of CVS?
>
> Ted Stresen-Reuter
>
> On Sunday, September 21, 2003, at 03:24 AM, Jim Cole wrote:
>
>> On Saturday, September 20, 2003, at 01:04 PM, Ted Stresen-Reuter
>> wrote:
>>
>>> Following the same instructions on Mac OS X produced the following
>>> output. Not sure if this is an error or not, though, because I don't
>>> know what all the tests are doing...
>>>
>>> creating url
>>> make MAKE="make" check-TESTS
>>> PASS: t_wordkey
>>> PASS: t_wordlist
>>> PASS: t_wordskip
>>> PASS: t_wordbitstream
>>> PASS: t_search
>>> PASS: t_htdb
>>> PASS: t_rdonly
>>> PASS: t_trunc
>>> ../test/test_prepare: /Users/tedsr/htdig/test/./t_url: Permission denied
>>> ../test/test_prepare: exec: /Users/tedsr/htdig/test/./t_url: cannot execute: Undefined error: 0
>>> FAIL: t_url
>>
>> This is due to the fact that the execute permissions on the t_url
>> script in the test directory are not being maintained. If you change
>> the permissions on that file (e.g. chmod 754 t_url) all tests
>> currently pass under OS X. I think something needs to be tweaked in
>> CVS to correct this problem.
>>
>> Jim
|
From: Sunil R. <su...@ma...> - 2003-09-24 05:45:01
|
Hi,

We are looking for reusable search code which has the features below.

The search code must be configurable to search a set of static web pages located on a number of web sites which are hosted on a number of servers. It should have the capability to allow the user to search only certain subsets of the sites, for example to search only one of the counties served by the site. The search must be easily configurable so they can change the scope of the search as the network of sites served grows.

Please let us know whether this can be implemented using your code (ht://Dig system) or not? If yes, can you provide us a demo version of it? How much will the ht://Dig system cost us?

Please let us know as soon as possible.

Thank you,
Regards,
Sunil Raskar
Project Leader
Nathan Ark Software Pvt. Ltd.
|
From: Jim C. <li...@yg...> - 2003-09-23 16:19:03
|
I ran into the same problem under OS X at one point. If I recall correctly, I was able to work around the problem by rearranging the ordering of the libraries. I don't recall the ordering that worked for me; it has been some time since this was an issue with my system.

Jim

On Tuesday, September 23, 2003, at 07:27 AM, Lachlan Andrew wrote:
> Thanks for that, Jesse. It looks like WordType::instance is
> definitely in there. Any luck with
>
> make check
> cd test
> g++ -g -O2 -Wall -fno-rtti -fno-exceptions -o testnet testnet.o
> -L/opt/htdig/lib/zlib/lib ../htnet/.libs/libhtnet.a
> ../htcommon/.libs/libcommon.a ../htword/.libs/libhtword.a
> ../db/.libs/libhtdb.a ../htlib/.libs/libht.a
> ../htword/.libs/libhtword.a -lz
> make check
>
> ?
>
> Cheers,
> Lachlan
>
> On Tue, 23 Sep 2003 23:18, Jesse op den Brouw wrote:
>> [msql@chaos htdig-3.2.0b4-20030914]$ nm */.libs/*.a | grep
>> WordType\$instance _8WordType$instance | |undef |data |
>> _8WordType$instance | |undef |data |
>> _8WordType$instance |1073741828|extern|data |$DATA$
>> [msql@chaos htdig-3.2.0b4-20030914]$
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)
|
From: Lachlan A. <lh...@us...> - 2003-09-23 13:30:15
|
Thanks for that, Jesse. It looks like WordType::instance is
definitely in there. Any luck with
make check
cd test
g++ -g -O2 -Wall -fno-rtti -fno-exceptions -o testnet testnet.o
-L/opt/htdig/lib/zlib/lib ../htnet/.libs/libhtnet.a
../htcommon/.libs/libcommon.a ../htword/.libs/libhtword.a
../db/.libs/libhtdb.a ../htlib/.libs/libht.a
../htword/.libs/libhtword.a -lz
make check
?
Cheers,
Lachlan
On Tue, 23 Sep 2003 23:18, Jesse op den Brouw wrote:
> [msql@chaos htdig-3.2.0b4-20030914]$ nm */.libs/*.a | grep
> WordType\$instance _8WordType$instance | |undef |data |
> _8WordType$instance | |undef |data |
> _8WordType$instance |1073741828|extern|data |$DATA$
> [msql@chaos htdig-3.2.0b4-20030914]$
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Lachlan A. <lh...@us...> - 2003-09-23 13:25:07
|
On Sun, 21 Sep 2003 07:55, Neal Richter wrote:
> I've got a fix for it.. a couple lines of code in the section that
> builds the linked list of search results...

That sounds great. If it checks the search results, I take it that it
doesn't purge the pages from the database itself. What is the patch?

Cheers,
Lachlan

--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)