
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 strip-www canon. rule causes failed crawl of netarkivet.dk - ID: 1222360
Last Update: Comment added ( karl-ia )

See
http://groups.yahoo.com/group/archive-crawler/message/1961

Crawl 'archive.org' and see how things work across
redirect to 'www.archive.org'. This works fine. We
have code to handle this case.

What we don't have is code to handle the inverse, going
from 'www.netarkivet.dk' to 'netarkivet.dk'.

TO REPRODUCE:

Crawl using the seed 'www.netarkivet.dk'. Here is what the
crawl log looks like:

2005-06-17T04:49:32.368Z 1 77 dns:www.netarkivet.dk P http://www.netarkivet.dk/ text/dns #001 20050617044931942+18 - -
2005-06-17T04:49:33.375Z 302 301 http://www.netarkivet.dk/robots.txt P http://www.netarkivet.dk/ text/html #001 20050617044932942+402 ESVJLA2FRUB6UDKGUQWRMONKMVJQR325 -
2005-06-17T04:49:35.804Z 302 291 http://www.netarkivet.dk/ - - text/html #001 20050617044935387+402 FJZLU5E256LIJ5NFHPXYMTWVX7E5CS3L 3t


... and then we crawl no more.

The site wants to redirect us to the domain minus the
'www', but we're ruling the pages as already seen.
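
A minimal sketch of why the redirect target gets dropped, assuming a strip-www rule that simply removes a leading "www." and an already-seen set keyed on the canonical form. The class and method names are hypothetical stand-ins, not Heritrix's actual classes, and the redirect target URL 'http://netarkivet.dk/' is assumed from the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not Heritrix code.
final class StripWwwCollision {
    // Assumed strip-www rule: drop a leading "www." from the host.
    static String canonicalize(String url) {
        return url.replaceFirst("^http://www\\.", "http://");
    }

    // Records the canonical form; returns true only if it was not seen before.
    static boolean firstSeen(Set<String> alreadySeen, String url) {
        return alreadySeen.add(canonicalize(url));
    }

    public static void main(String[] args) {
        Set<String> alreadySeen = new HashSet<>();
        // The seed is recorded under its canonical form, http://netarkivet.dk/.
        System.out.println(firstSeen(alreadySeen, "http://www.netarkivet.dk/")); // true
        // The 302 target canonicalizes to the same key, so it looks already seen.
        System.out.println(firstSeen(alreadySeen, "http://netarkivet.dk/"));     // false
    }
}
```

Both URLs collapse to the same key, so the second one is never scheduled and the crawl stops after the 302.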


Submitted: Michael Stack ( stack-sf ) - 2005-06-17 04:52
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Karl Thiessen
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 7 )

Date: 2007-03-14 00:55
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-446 -- please add further
comments at that location.


Date: 2005-07-18 19:45
Sender: karl-ia

Logged In: YES
user_id=1269624

Test in harness has been passing for a week, moved to
regression harness.

Closing bug.


Date: 2005-06-24 01:33
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Fix introduced a bug where we'd loop forever on a
(perverse) redirect where the redirect was sending us back
to the page we were redirected from.

E.g.
http://bridalelegance.com/images/buttons3/tuxedos-off.gif
has a 'Content-Location' of itself. We'd redirect, then
redirect, then redirect... to this URL over and over again.

Be careful w/ that forcefetch.


Fix bug introduced by fix to '[ 1222360 ] strip-www canon. rule
causes failed crawl of netarkivet.dk'
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
If redirect and we're redirecting to ourselves, don't forcefetch,
else we'll loop endlessly.
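
The guard described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the actual AbstractFrontier code: `shouldForceFetch` and its parameters are hypothetical names, and the strip-www rule is reduced to removing a leading "www.".

```java
// Hypothetical illustration only; not Heritrix code.
final class ForceFetchGuard {
    // Assumed strip-www rule: drop a leading "www." from the host.
    static String canonicalize(String url) {
        return url.replaceFirst("^http://www\\.", "http://");
    }

    // uri is the redirect target, via is the URL that sent us there.
    static boolean shouldForceFetch(String uri, String via, boolean isRedirect) {
        if (!isRedirect) {
            return false;
        }
        if (uri.equals(via)) {
            // Redirecting to ourselves: force-fetching here would loop endlessly.
            return false;
        }
        // Force-fetch only when both sides collapse to the same canonical form.
        return canonicalize(uri).equals(canonicalize(via));
    }

    public static void main(String[] args) {
        String gif = "http://bridalelegance.com/images/buttons3/tuxedos-off.gif";
        System.out.println(shouldForceFetch(gif, gif, true)); // false
        System.out.println(shouldForceFetch(
                "http://netarkivet.dk/", "http://www.netarkivet.dk/", true)); // true
    }
}
```

The raw-string comparison comes before the canonical one, so a URL that points back at itself is never force-fetched while the www-to-bare case still is.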



Date: 2005-06-17 16:25
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Gordon:

Thanks for adding reference to history (I should have reused
the original issue).

Yes, latest changes equate to the parenthetical 'or force-fetch'
from that issue.

Yes, there may still be gaps. Does 'Chakrabarti/Mining the
Web' have a canonicalization section? I don't remember (Do
you have this book? I haven't seen it in a while). Was
thinking a review of alexa rules might help too.


Date: 2005-06-17 15:57
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

It would be better if this issue was closed as a duplicate
of "[ 1078094 ] www-strip canonicalization unintended
exclusion of redirect"-- and these reports/comments/fixes
documented there:
http://sourceforge.net/tracker/index.php?func=detail&aid=1078094&group_id=73833&atid=539099

The 'Summary' there perfectly fits this problem; this issue
is best understood in the context of previous similar work;
and I *think* the effect of the latest changes is
essentially equivalent to the parenthetical of my first fix
suggestion in that issue: if the 'via' and 'uri'
canonicalize to the same thing, set force-fetch.

I had reservations about the suggestion then as not covering
all cases... it may still have gaps now.

In particular, we still have a canonicalization problem when
A and B canonicalize to the same thing, but A succeeds while
B fails. If B is encountered first, variant A will never be
crawled.

This could also happen in an (as yet contrived but plausible)
2-hop redirect:

- crawler encounters http://host.com/
- redirect is to http://host.com/index.html
- redirect is to http://www.host.com
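
The 2-hop chain above can be traced with a small simulation. It is only a sketch under assumptions: the canonicalizer strips a leading "www." and treats an empty path as "/", and the force-fetch rule is the one discussed in this thread (redirect target and via canonicalize to the same thing, and are not the identical URL). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Heritrix code.
final class TwoHopRedirectGap {
    // Assumed rules: strip leading "www." and treat an empty path as "/".
    static String canonicalize(String url) {
        String stripped = url.replaceFirst("^http://www\\.", "http://");
        return stripped.matches("^http://[^/]+$") ? stripped + "/" : stripped;
    }

    // Walks a redirect chain and returns the URLs that actually get fetched.
    static List<String> fetched(String... chain) {
        Set<String> alreadySeen = new HashSet<>();
        List<String> result = new ArrayList<>();
        String via = null;
        for (String uri : chain) {
            boolean forceFetch = via != null
                    && !uri.equals(via)
                    && canonicalize(uri).equals(canonicalize(via));
            boolean isNew = alreadySeen.add(canonicalize(uri));
            if (!isNew && !forceFetch) {
                break; // ruled already seen; the chain ends here
            }
            result.add(uri);
            via = uri;
        }
        return result;
    }

    public static void main(String[] args) {
        // The third hop is compared against index.html, not against the first
        // hop, so no force-fetch is set and http://www.host.com is dropped.
        System.out.println(fetched(
                "http://host.com/",
                "http://host.com/index.html",
                "http://www.host.com"));
        // prints [http://host.com/, http://host.com/index.html]
    }
}
```

Because the rule only looks one hop back, the final target is compared with the intermediate URL and never gets the force-fetch flag.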

We should review the literature (Chakrabarti/Mining the Web)
and other crawlers to see if there's a more comprehensive
fix to our canonicalization issues.



Date: 2005-06-17 05:11
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Reopening and assigning to Karl for verification of fix.


Date: 2005-06-17 05:08
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed fix. Below is commit message.

Fix for '[ 1222360 ] strip-www canon. rule causes failed crawl of
netarkivet.dk'.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
Replaced conditionalCanonicalization with method canonicalization
that takes a CandidateURI. New method sets forceFetch if
canonicalization of a via == current URL AND this URL is result of
redirect. Previously, for this special circumstance, we'd return the
non-canonicalized version of the URL. This non-canonicalized URL
would pass through the already seen check and the URL would be
crawled. This worked for case where non-canonicalized form did not
equal the via -- e.g. if site was trying to redirect us from DOMAIN
to www.DOMAIN (via == canonicalized form). The technique wouldn't
work where we were being redirected from www.DOMAIN to DOMAIN
(via != canonicalized form && non-canonicalized form == via).
* src/java/org/archive/crawler/frontier/WorkQueueFrontier.java
Refactor schedule. Call canonicalize. Then switch off forceFetch
flag (Canonicalize has changed. It may set forcefetch).
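
One way to read the difference between the old and new techniques in the commit message above, as a sketch only. The method names are hypothetical, the strip-www rule is reduced to removing a leading "www.", and the already-seen set is assumed to hold the canonical form of the via once it has been fetched. The sketch follows this commit as described and so has no guard for a URL that redirects to itself.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not Heritrix code.
final class OldVersusNewSeenCheck {
    // Assumed strip-www rule: drop a leading "www." from the host.
    static String canonicalize(String url) {
        return url.replaceFirst("^http://www\\.", "http://");
    }

    // State after the via has been fetched: its canonical form is recorded.
    static Set<String> seenAfterFetching(String via) {
        Set<String> seen = new HashSet<>();
        seen.add(canonicalize(via));
        return seen;
    }

    // Old technique: in the special case, test the raw URL against the set.
    static boolean oldWouldCrawl(Set<String> seen, String uri, String via) {
        boolean special = canonicalize(uri).equals(canonicalize(via));
        return seen.add(special ? uri : canonicalize(uri));
    }

    // New technique: always use the canonical key, but set force-fetch.
    static boolean newWouldCrawl(Set<String> seen, String uri, String via) {
        boolean forceFetch = canonicalize(uri).equals(canonicalize(via));
        boolean isNew = seen.add(canonicalize(uri));
        return isNew || forceFetch;
    }

    public static void main(String[] args) {
        String bare = "http://netarkivet.dk/";
        String www = "http://www.netarkivet.dk/";
        // DOMAIN -> www.DOMAIN: the raw target differs from the stored key.
        System.out.println(oldWouldCrawl(seenAfterFetching(bare), www, bare)); // true
        // www.DOMAIN -> DOMAIN: the raw target equals the stored key.
        System.out.println(oldWouldCrawl(seenAfterFetching(www), bare, www));  // false
        System.out.println(newWouldCrawl(seenAfterFetching(www), bare, www));  // true
    }
}
```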



Changes ( 8 )

Field              Old Value          Date               By
artifact_group_id  None               2005-09-23 18:02   gojomo
status_id          Open               2005-07-18 19:45   karl-ia
close_date         2005-06-17 05:08   2005-07-18 19:45   karl-ia
assigned_to        nobody             2005-06-17 05:11   stack-sf
status_id          Closed             2005-06-17 05:11   stack-sf
status_id          Open               2005-06-17 05:08   stack-sf
close_date         -                  2005-06-17 05:08   stack-sf
resolution_id      None               2005-06-17 05:08   stack-sf