|
From: Neal R. <ne...@ri...> - 2003-10-02 22:58:03
|
Hey all,
I've got a question for all of you about how the htdig 'indexer'
should function.
htdig.cc
337 List *list = docs.URLs();
338 retriever.Initial(*list);
339 delete list;
340
341 // Add start_url to the initial list of the retriever.
342 // Don't check a URL twice!
343 // Beware order is important, if this bugs you could change
344 // previous line retriever.Initial(*list, 0) to Initial(*list,1)
345 retriever.Initial(config->Find("start_url"), 1);
Note lines 337-339. This code loads the entire list of documents
currently in the index and feeds this to the retriever object for
retrieval and processing.
The effect of this is that we may be visiting and keeping webpages
that we would never find by following links, and we will keep
revisiting a website even if we remove it from the 'start_url' in
htdig.conf.
The workaround is to use 'htdig -i'. This is a disadvantage as we will
revisit and index pages even if they haven't changed since the last run of
htdig.
Here's the Fix:
1) At the start of Htdig, after we've opened the DBs we 'walk' the docDB
and mark EVERY document as Reference_obsolete. I wrote code to do this..
very short; a rough sketch follows the list below.
2) Comment out htdig.cc 337-339
3) When the indexer fires up and spiders a site, documents that are in
the tree and marked as Reference_obsolete are re-marked as
Reference_normal.
4) When htpurge is run, the obsoleted docs are flushed: documents that
weren't revisited (because no link to them was found) are removed.
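For the curious, step 1 might look roughly like the sketch below. This is
not the actual patch, just an illustration; it assumes the DocumentDB and
DocumentRef accessors (DocIDs(), operator[], Add(), DocState()) behave the
way the 3.2 headers suggest, so treat any name that differs as an
assumption.

// Sketch only: mark every document currently in the docDB as obsolete,
// so anything the spider doesn't re-reach stays obsolete and gets
// flushed by htpurge.  API names are assumptions, not a verified patch.
static void MarkAllObsolete(DocumentDB &docs)
{
    List *ids = docs.DocIDs();               // IDs of every stored document
    ids->Start_Get();
    IntObject *id;
    while ((id = (IntObject *) ids->Get_Next()))
    {
        DocumentRef *ref = docs[id->Value()];
        if (!ref)
            continue;
        ref->DocState(Reference_obsolete);   // re-marked Reference_normal when revisited
        docs.Add(*ref);                      // write the changed state back
        delete ref;
    }
    delete ids;
}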
This fix addresses two flaws:
1) Changing 'start_url' and removing a starting URL: the documents are
still in the index after the next run of htdig (unless you use -i).
2) Pages that still exist on a webserver at a given URL, but are no longer
linked to by any other pages on the site, remain in the index.
I've tested this fix and it works.
Eh?
Thanks.
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Jessica B. <jes...@ya...> - 2003-10-03 00:49:17
|
--- Neal Richter <ne...@ri...> wrote:
> Hey all,
> I've got a question for all of you about how the htdig 'indexer'
> should function.
> I've tested this fix and it works.
>
> Eh?

I felt like I was sharing a beer with you at the pub, and you just got
done "schematicizing" the problem and fix on a napkin-coaster and ended
it with, "Eh?"

Sounds like a good fix to a problem that I think (subconsciously) I knew
existed.

How about this one -- does your patch help with the check_unique_md5
problem? Even when I use the "-i" option (or without it), if the
start_url's MD5 hash-sig matches the one from my previous index, htdig
just says that it detected an MD5 duplicate and exits. Deleting
db.md5hash.db seems to do the trick. But would it be sacrilege to remove
db.md5hash.db before a refresh?

-Jes
|
|
From: Lachlan A. <lh...@us...> - 2003-10-03 13:10:03
|
Greetings Neal,

I'm not sure that I understand this. If a page 'X' is linked only by
a page 'Y' which isn't changed since the previous dig, do we parse
the unchanged page 'Y'? If so, why not run htdig -i? If not, how
do we know that page 'X' should still be in the database?

I'd be inclined not to fix this until after we've released the next
"archive point", whether that be 3.2.0b5 or 3.2.0rc1...

Cheers,
Lachlan

On Fri, 3 Oct 2003 08:56, Neal Richter wrote:
> The workaround is to use 'htdig -i'. This is a disadvantage as we
> will revisit and index pages even if they haven't changed since the
> last run of htdig.
>
> Here's the Fix:
>
> 1) At the start of Htdig, after we've opened the DBs we 'walk' the
> docDB and mark EVERY document as Reference_obsolete. I wrote code
> to do this.. very short.
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Gabriele B. <bar...@in...> - 2003-10-03 14:42:43
|
Hi guys,
well ... I really like your idea, Neal (I had a similar one for
ht://Check, but I have never had the time to realise it!).
However, I agree with Lachlan. I'd prefer to wait until we release this
*blessed* 3.2.0b5 version, hopefully soon.
Any other opinions?
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Neal R. <ne...@ri...> - 2003-10-03 17:46:45
|
On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> Greetings Neal,
>
> I'm not sure that I understand this. If a page 'X' is linked only by
> a page 'Y' which isn't changed since the previous dig, do we parse
> the unchanged page 'Y'? If so, why not run htdig -i? If not, how
> do we know that page 'X' should still be in the database?

X does not change, but Y does.. it no longer has a link to X.

If the website is big enough, htdig -i is wasteful of network bandwidth.

The logical error as I see it is that we revisit the list of documents
currently in the index, rather than starting from the beginning and
spidering... then removing all the documents we didn't find links for.

> I'd be inclined not to fix this until after we've released the next
> "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
>
> Cheers,
> Lachlan
>
> On Fri, 3 Oct 2003 08:56, Neal Richter wrote:
> > The workaround is to use 'htdig -i'. This is a disadvantage as we
> > will revisit and index pages even if they haven't changed since the
> > last run of htdig.
> >
> > Here's the Fix:
> >
> > 1) At the start of Htdig, after we've opened the DBs we 'walk' the
> > docDB and mark EVERY document as Reference_obsolete. I wrote code
> > to do this.. very short.
>
> --
> lh...@us...
> ht://Dig developer DownUnder (http://www.htdig.org)

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Gilles D. <gr...@sc...> - 2003-10-03 18:13:12
|
According to Neal Richter:
> On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> > I'm not sure that I understand this. If a page 'X' is linked only by
> > a page 'Y' which isn't changed since the previous dig, do we parse
> > the unchanged page 'Y'? If so, why not run htdig -i? If not, how
> > do we know that page 'X' should still be in the database?
>
> X does not change, but Y does.. it no longer has a link to X.
>
> If the website is big enough, htdig -i is wasteful of network bandwidth.
>
> The logical error as I see it is that we revisit the list of documents
> currently in the index, rather than starting from the beginning and
> spidering... then removing all the documents we didn't find links for.

But if we need to re-spider everything, don't we need to re-index all
documents, whether they've changed or not? If so, then we need to do
htdig -i all the time. If we don't reparse every document, we need some
other means to re-validate every document to which an unchanged document
has links.

I think you misinterpreted what Lachlan suggested, i.e. the case where Y
does NOT change. If Y is the only document with a link to X, and Y does
not change, it will still have the link to X, so X is still "valid".
However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
then how will it find the link to X to validate X's presence in the db?

> > > I'd be inclined not to fix this until after we've released the next
> > > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...

I'd be inclined to agree. If it comes down to the possibility of
losing valid documents in the db vs. keeping invalid ones, I'd prefer
the latter behaviour. Until we can find a way to ensure all currently
linked documents remain in the db, without having to reparse them all,
then I think the current behaviour is the best compromise. If you
want to reparse everything to ensure a clean db with accurate linkages,
that's what -i is for.

A somewhat related problem/limitation in update digs is that the
backlink count and link depth from start_url may not get properly
updated for documents that aren't reparsed. If these matter to you,
periodic full digs may be needed to restore the accuracy of these
fields.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
|
From: Neal R. <ne...@ri...> - 2003-10-04 01:02:22
|
> But if we need to re-spider everything, don't we need to re-index all
> documents, whether they've changed or not? If so, then we need to do
> htdig -i all the time. If we don't reparse every document, we need some
> other means to re-validate every document to which an unchanged document
> has links.
Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
server is kind enough to give us the timestamp on the document in the header.
If the timestamps are the same we don't bother to download it.
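In other words, the update-dig decision boils down to something like the
following standalone sketch of the logic only; the real check lives in the
retriever/transport code, and both parameter names here are made up.

#include <ctime>

// Sketch: decide whether to re-download a document during an update dig
// with head_before_get=TRUE.  'stored_mod' is the modification time kept
// in the docDB; 'remote_mod' is parsed from the Last-Modified header of
// the HEAD response.  Both names are hypothetical.
bool NeedsRefetch(time_t stored_mod, time_t remote_mod)
{
    if (remote_mod == (time_t) -1)    // no usable Last-Modified from the server
        return true;                  // play it safe: download and reparse
    return remote_mod > stored_mod;   // fetch only if the server copy is newer
}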
> I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> does NOT change. If Y is the only document with a link to X, and Y does
> not change, it will still have the link to X, so X is still "valid".
> However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> then how will it find the link to X to validate X's presence in the db?
Changing Y is the point! I think my original description was unclear.
Bug #1
1) Website contains page X. There is at least one page that contains a
link to X.
2) Remove all links to X in the website, but don't delete it. Run htdig
without the -i option.
3) Do a search and notice that page X is still returned, even though it
technically isn't in the 'website' anymore... it is orphaned on the
webserver.
Bug #2
1) Make start_url contain two separate websites & set up filters
accordingly (a hypothetical config sketch follows these steps).
2) Run htdig -i.... all is OK.
3) Remove one of the websites from start_url.
4) Rerun htdig without -i.
5) Do a search and note that the removed website's pages are still
returned!
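For illustration only, the htdig.conf in question might look roughly like
this; the host names are made up and limit_urls_to_start_url is just the
usual default.

# Full dig (htdig -i) with two sites in start_url:
start_url:               http://www.site-one.example/ http://www.site-two.example/
limit_urls_to_start_url: ${start_url}

# Later, site-two is dropped and htdig is rerun *without* -i:
#   start_url:           http://www.site-one.example/
# ...yet site-two's pages remain in the db and still turn up in searches.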
> > > I'd be inclined not to fix this until after we've released the next
> > > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
>
> I'd be inclined to agree. If it comes down to the possibility of
> losing valid documents in the db vs. keeping invalid ones, I'd prefer
> the latter behaviour. Until we can find a way to ensure all currently
> linked documents remain in the db, without having to reparse them all,
> then I think the current behaviour is the best compromise. If you
> want to reparse everything to ensure a clean db with accurate linkages,
> that's what -i is for.
If you change all the pages that link to a page (without deleting the
linked-to page itself), the HTTP headers of those changed pages will
change and htdig re-downloads and reparses them.. thus giving correct
behavior.
The fix accomplishes this. There is no danger of 'losing valid
documents'. The datestamp in the HTTP header, with the proper logic,
will guarantee proper behavior. If a page changes, it's re-downloaded
and reparsed and its links are examined for changes. Orphaned pages are
never revisited, and are purged after the spider is done.
I've spent hours inside a debugger examining how the spider does
things... I will continue to look for efficiency gains.
This bug is minor, and a decent workaround exists... so I agree with
waiting to commit the fix.
I'll sit on it and come up with an actual test case at the appropriate
time to demonstrate the bug. It's just plain inefficient the way we
currently do it: we revisit pages that don't need it and carry cruft in
the database that is deadweight.
However I would strongly recommend we enable head_before_get by default.
We're basically wasting bandwidth like drunken sailors with it off!!!
Thanks.
Jessica: I'm heading to the Pub here in Bozeman, MT. I'll draw some
stuff on napkins for ya!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Lachlan A. <lh...@us...> - 2003-10-05 08:08:14
|
Greetings Neal,
On Sat, 4 Oct 2003 11:00, Neal Richter wrote:
> If the timestamps are the same we don't bother to download it.
>
> > I think you misinterpreted what Lachlan suggested, i.e. the case
> > where Y does NOT change. If Y is the only document with a link
> > to X, and Y does not change, it will still have the link to X, so
> > X is still "valid". However, if Y didn't change, and htdig
> > (without -i) doesn't reindex Y, then how will it find the link to
> > X to validate X's presence in the db?
>
> Changing Y is the point!
Agreed, changing Y is what triggers the current bug. However, I
believe that a simple fix of the current bug will introduce a *new*
bug for the more common case that Y *doesn't* change. Reread
Gilles's scenario and try to answer his question. I'd explain it
more clearly, but I don't have a napkin handy :)
If we get around to implementing Google's link analysis, as Geoff
suggested, then we may be able to fix the problem properly. It seems
that any fix will have to look at all links *to* a page, and then
mark as "obsolete" those *links* where (a) the link-from page ("Y")
is changed and (b) it no longer contains the link. After the dig,
all pages must be checked (in the database), and those with no links
which are not obsolete can themselves be marked as obsolete.
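To make that concrete, the bookkeeping might look roughly like the pure
sketch below; none of these types exist in the current tree, and the
start_url pages would need special-casing since they legitimately have no
incoming links.

#include <map>
#include <set>
#include <string>

// Pure sketch of the per-link bookkeeping described above.
struct LinkTable
{
    // target URL -> set of source URLs that currently link to it
    std::map<std::string, std::set<std::string> > backlinks;

    // Called when a changed page 'from' has been reparsed; 'found' holds
    // the links actually seen in the new copy.  Links this page used to
    // carry but no longer does are dropped (i.e. marked obsolete).
    void Reparsed(const std::string &from, const std::set<std::string> &found)
    {
        std::map<std::string, std::set<std::string> >::iterator t;
        for (t = backlinks.begin(); t != backlinks.end(); ++t)
            if (found.count(t->first) == 0)
                t->second.erase(from);          // the link went away
        std::set<std::string>::const_iterator u;
        for (u = found.begin(); u != found.end(); ++u)
            backlinks[*u].insert(from);         // (re)record surviving links
    }

    // After the dig: a page with no remaining non-obsolete incoming links
    // can itself be marked obsolete (start_url pages excepted).
    bool IsOrphan(const std::string &url) const
    {
        std::map<std::string, std::set<std::string> >::const_iterator t =
            backlinks.find(url);
        return t == backlinks.end() || t->second.empty();
    }
};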
> However I would strongly recommend we enable head_before_get by
> default. We're basically wasting bandwidth like drunken sailors
> with it off!!!
Good suggestion. If we want some code bloat, we could have an "auto"
mode, which would use head_before_get unless -i is specified, but
not when -i is specified (since we'll always have to do the "get"
anyway)...
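For what it's worth, the -i override could be as small as the following;
the placement in htdig.cc's option handling, the 'initial' flag name and
the Configuration::Add() call are all assumptions rather than a patch.

// Sketch: with head_before_get defaulting to true, force a plain GET
// when -i is given, since every document will be fetched anyway.
// 'initial' stands for whatever flag htdig.cc sets while parsing -i.
if (initial)
    config->Add("head_before_get", "false");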
Cheers,
Lachlan
--
lh...@us...
ht://Dig developer DownUnder (http://www.htdig.org)
|
|
From: Gabriele B. <bar...@in...> - 2003-10-05 09:29:16
|
>If we get around to implementing Google's link analysis, as Geoff
>suggested, then we may be able to fix the problem properly. It seems
>that any fix will have to look at all links *to* a page, and then
>mark as "obsolete" those *links* where (a) the link-from page ("Y")
>is changed and (b) it no longer contains the link. After the dig,
>all pages must be checked (in the database), and those with no links
>which are not obsolete can themselves be marked as obsolete.
Yep ... that's exactly what I wrote ... Sorry Lachlan.
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Gabriele B. <bar...@in...> - 2003-10-05 09:29:13
|
Ciao guys,
> Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
>server is kind enough to give us the timestamp on the document in the header.
>If the timestamps are the same we don't bother to download it.
Yep, you are right. I remember that was one of the reasons why I wrote the
code for the 'HEAD' method (also to avoid downloading an entire document
whose content-type is not parsable).
> > I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> > does NOT change. If Y is the only document with a link to X, and Y does
> > not change, it will still have the link to X, so X is still "valid".
> > However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> > then how will it find the link to X to validate X's presence in the db?
I must admit I am not very comfortable with the incremental indexing code.
Anyway, when I was thinking of the same procedure for ht://Check (not yet
done, as I said) I came up with this (I will try to stay on a logical level):
1) As you said, mark all the documents as, let's say, 'Reference_obsolete'
2) Read the start URL and mark all the URLs in the start URL to be
retrieved (adding them to the index of documents if needed)
3) Loop until there are no URLs to be retrieved
4) For every URL, through a pre-emptive HEAD call, find out whether it has
changed:
        a) not changed: get all the URLs it links to and mark them
"to be retrieved" or something like that
        b) changed: download it again and mark all the new links as "to
be retrieved"
5) Purge all the obsolete URLs
This approach would solve your second "flaw", Neal (I guess so).
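In code-shaped form, the loop Gabriele describes might look like the sketch
below. Every helper is a hypothetical stand-in (stubbed so the sketch
compiles); note that StoredLinks() in step 4a is exactly the per-document
outgoing-link list which, as Gilles points out below, htdig does not
currently keep in its database.

#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real retriever/docDB calls.
static bool ChangedSinceLastDig(const std::string &) { return false; }  // HEAD check
static std::vector<std::string> StoredLinks(const std::string &)        // links kept in the db
{ return std::vector<std::string>(); }
static std::vector<std::string> DownloadAndParse(const std::string &)   // GET + parse
{ return std::vector<std::string>(); }
static void MarkNormal(const std::string &) { }   // clear Reference_obsolete
static void PurgeObsolete() { }                   // step 5: what htpurge does

void UpdateDig(const std::vector<std::string> &start_urls)
{
    // Step 1 (mark everything obsolete) is assumed to have run already.
    std::deque<std::string> queue(start_urls.begin(), start_urls.end());
    std::set<std::string>   seen(start_urls.begin(), start_urls.end());

    while (!queue.empty())                            // step 3
    {
        std::string url = queue.front();
        queue.pop_front();
        MarkNormal(url);                              // still reachable

        // Step 4: pre-emptive HEAD decides whether to refetch.
        std::vector<std::string> links = ChangedSinceLastDig(url)
            ? DownloadAndParse(url)                   // 4b: changed
            : StoredLinks(url);                       // 4a: unchanged, reuse old links

        for (size_t i = 0; i < links.size(); ++i)
            if (seen.insert(links[i]).second)         // don't queue a URL twice
                queue.push_back(links[i]);
    }
    PurgeObsolete();                                  // step 5
}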
Ciao ciao
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Melbourne, Victoria, Australia
bar...@in... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno
|
|
From: Gilles D. <gr...@sc...> - 2003-10-07 18:06:13
|
According to Gabriele Bartolini:
> >    Nope, if head_before_get=TRUE we use the HEAD request and the HTTP
> > server is kind enough to give us the timestamp on the document in the header.
> > If the timestamps are the same we don't bother to download it.
>
> Yep, you are right. I remember that was one of the reasons why I wrote the
> code for the 'HEAD' method (also to avoid downloading an entire document
> whose content-type is not parsable).

I think this only became an issue because of persistent connections.
Correct me if I'm wrong, but I think htdig's behaviour in the past
(i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET,
and upon seeing the headers, if it decided it didn't need to refetch the
file, it would simply close the connection right away and not read the
stream of data for the file. No wasted bandwidth, but maybe it caused
some unnecessary overhead on the server, which probably started serving
up each file (including running CGI scripts if that's what made the page)
before realising the connection was closed.

Now, head_before_get=TRUE would add a bit of bandwidth for htdig -i,
because it would get the headers twice, but it's obviously a big
advantage for update digs, especially when using persistent connections.
Closing the connection whenever htdig decides not to fetch a file would
kill any speed advantage you'd otherwise gain from persistent
connections, not to mention the extra load on the server.

> > > I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> > > does NOT change. If Y is the only document with a link to X, and Y does
> > > not change, it will still have the link to X, so X is still "valid".
> > > However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> > > then how will it find the link to X to validate X's presence in the db?
>
> I must admit I am not very comfortable with the incremental indexing code.
> Anyway, when I was thinking of the same procedure for ht://Check (not yet
> done, as I said) I came up with this (I will try to stay on a logical level):
>
> 1) As you said, mark all the documents as, let's say, 'Reference_obsolete'
> 2) Read the start URL and mark all the URLs in the start URL to be
> retrieved (adding them to the index of documents if needed)
> 3) Loop until there are no URLs to be retrieved
>
> 4) For every URL, through a pre-emptive HEAD call, find out whether it has
> changed:
>         a) not changed: get all the URLs it links to and mark them
> "to be retrieved" or something like that
>         b) changed: download it again and mark all the new links as "to
> be retrieved"
>
> 5) Purge all the obsolete URLs
>
> This approach would solve your second "flaw", Neal (I guess so).

The critical part of the above, which I was trying to explain before, is
point 4 (a). If a document hasn't changed, htdig would need somehow to
keep track of every link that document had to others, so that it could
keep traversing the hierarchy of links as it crawls its way through
to every "active" page on the site. That would require additional
information in the database that htdig doesn't keep track of right now.
Right now, the only way to do a complete crawl is to reparse every
document.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
|
From: Neal R. <ne...@ri...> - 2003-10-08 19:16:14
|
> I think this only became an issue because of persistent connections.
> Correct me if I'm wrong, but I think htdig's behaviour in the past
> (i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET,
> and upon seeing the headers, if it decided it didn't need to refetch the
> file, it would simply close the connection right away and not read the
> stream of data for the file. No wasted bandwidth, but maybe it caused
> some unnecessary overhead on the server, which probably started serving
> up each file (including running CGI scripts if that's what made the page)
> before realising the connection was closed.

True, but we can override the current setting if '-i' is given to force
head_before_get=false.

> The critical part of the above, which I was trying to explain before, is
> point 4 (a). If a document hasn't changed, htdig would need somehow to
> keep track of every link that document had to others, so that it could
> keep traversing the hierarchy of links as it crawls its way through
> to every "active" page on the site. That would require additional
> information in the database that htdig doesn't keep track of right now.
> Right now, the only way to do a complete crawl is to reparse every
> document.

Yep, this is true. On the plus side, if we do keep and maintain that
list, I've got a stack of research papers talking about what can be done
with that list to make searching better. It opens up a world of
possibilities for improving relevance ranking, learning relationships
between pages, etc..

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Neal R. <ne...@ri...> - 2003-10-05 18:43:04
|
On Fri, 3 Oct 2003, Gilles Detillieux wrote:
> I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> does NOT change. If Y is the only document with a link to X, and Y does
> not change, it will still have the link to X, so X is still "valid".
> However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> then how will it find the link to X to validate X's presence in the db?

Doh! Yep, now I see it and yep, it's a problem! We're going to need to
implement the logic Gabriele/Lachlan discussed to cure this.

On Sun, 5 Oct 2003, Lachlan Andrew wrote:
> > However I would strongly recommend we enable head_before_get by
> > default. We're basically wasting bandwidth like drunken sailors
> > with it off!!!
>
> Good suggestion. If we want some code bloat, we could have an "auto"
> mode, which would use head_before_get unless -i is specified, but
> not when -i is specified (since we'll always have to do the "get"
> anyway)...

This would be a two-line change.. we simply make 'true' the default and
manually change it to false with a line of code that executes when '-i'
is given! Post 3.2 we could remove the config verb altogether.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|