
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 www-strip canonicalization unintended exclusion of redirect - ID: 1078094
Last Update: Comment added ( karl-ia )

If you try a host crawl of, for example,
http://yahoo.com, you will be redirected to
http://www.yahoo.com. This should be added as a seed
and then crawled.

However, with 1.2 & up canonicalization, the
www.yahoo.com is rejected as alreadyIncluded, since it
canonicalizes to yahoo.com. As a result, the crawl ends
as soon as dns/robots/root yahoo.com are retrieved,
never getting the intended content.
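
A minimal sketch of the collision described above, for illustration only. The class, the regex, and the set-based `AlreadyIncluded` are hypothetical stand-ins, not Heritrix's actual canonicalization rule or alreadyIncluded structure; the sketch only assumes that alreadyIncluded is keyed by the canonicalized URL.

```java
import java.util.HashSet;
import java.util.Set;

class WwwStripCollision {
    // Hypothetical stand-in for the www-strip rule:
    // drop a leading "www." from the host part of the URL.
    static String canonicalize(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    // Hypothetical stand-in for alreadyIncluded:
    // an insert-only set keyed by the canonical form.
    static final class AlreadyIncluded {
        private final Set<String> seen = new HashSet<>();

        // Returns true if the URL is new and should be queued.
        boolean offer(String url) {
            return seen.add(canonicalize(url));
        }
    }

    public static void main(String[] args) {
        AlreadyIncluded already = new AlreadyIncluded();
        // The seed is accepted.
        System.out.println(already.offer("http://yahoo.com/"));
        // The redirect target canonicalizes to the same key, so it is rejected.
        System.out.println(already.offer("http://www.yahoo.com/"));
    }
}
```

The second `offer` returns false because both URLs share the key `http://yahoo.com/`, which is why the crawl stops after the seed.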

Some sort of exception or special handling of such
situations is necessary. Some ideas:

- compare canonicalized version against via; don't
canonicalize if same (or force-fetch if same) (only
handles some cases)
- always force-fetch seeds (only resolves for case
where URL is seed)
- clear alreadyIncluded of URLs in certain situations,
such as when they redirect to a canonical-equivalent
version of themselves (violates preferred insert-only
behavior of alreadyIncluded; only handles some cases)
- disable www canonicalization

Should consider whether other standard
canonicalizations have same risk.


Gordon Mohr ( gojomo ) - 2004-12-03 02:45

Priority: 7

Status: Closed

Resolution: Fixed

Assigned to: Michael Stack

Frontier

None

Public


Comments ( 8 )

Date: 2007-03-14 00:18
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-301 -- please add further
comments at that location.


Date: 2005-03-17 05:22
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Fixed. Closing. Below is commit. In implementation, did one
thing different from rules listed above: If seed, do not
canonicalize. Added back www-stripping to default set of
canonicalization rules.

Fix for '[ 1078094 ] www-strip canonicalization unintended
exclusion of redirect'
* src/conf/profiles/default/order.xml
    Add back the www-stripping rule as part of default
    canonicalization set.
* src/java/org/archive/crawler/datamodel/CandidateURI.java
    Minor formatting.
    (isLocation): Added.
* src/java/org/archive/crawler/extractor/ExtractorHTTP.java
    Formatting. Removed duplicated code.
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
    (conditionalCanonicalize): Added. If a seed, or a redirect
    where the canonicalization of the current uri equates to the
    canonicalization of the via, then do not canonicalize.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
    Call new conditionalCanonicalize. Also added doc. on new
    behavior.
* src/java/org/archive/crawler/postprocessor/Postselector.java
    Formatting. Removed useless javadoc.



Date: 2005-03-17 01:36
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Talked w/ Igor, Gordon and Brad. Discussed whether it'd be ok
if the test for this condition was not part of the
Canonicalization system but was instead run upfront, in the
code that decides whether or not Canonicalization should be
run. Boys were fine w/ that. Then asked their opinion on what
the test should consist of. Came up with:

If url came from a redirect and
   this url's canonicalization equals the via's canonicalization
then
   pass url, not its canonicalization, to alreadyseen for Q'ing.
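
The test above could be sketched roughly as follows. This is an illustration under assumptions, not the committed code: `alreadySeenKey`, the `viaIsRedirect` flag, and the regex-based `canonicalize` are hypothetical names made up for the example, and the already-seen structure is modelled as a plain set.

```java
import java.util.HashSet;
import java.util.Set;

class ConditionalCanonicalizeSketch {
    // Hypothetical stand-in for the configured canonicalization rules
    // (here just www-stripping).
    static String canonicalize(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    // Decides upfront which key is handed to the already-seen structure.
    // If the url came from a redirect and canonicalizes to the same
    // string as its via, the raw url is used instead of the canonical form.
    static String alreadySeenKey(String url, String via, boolean viaIsRedirect) {
        if (viaIsRedirect && via != null
                && canonicalize(url).equals(canonicalize(via))) {
            return url;
        }
        return canonicalize(url);
    }

    public static void main(String[] args) {
        Set<String> alreadySeen = new HashSet<>();
        // Seed, no via.
        alreadySeen.add(alreadySeenKey("http://yahoo.com/", null, false));
        // Redirect target whose canonical form equals the via's canonical form.
        boolean queued = alreadySeen.add(
                alreadySeenKey("http://www.yahoo.com/", "http://yahoo.com/", true));
        System.out.println(queued);
    }
}
```

With this check the redirect target is queued under its raw form, while an ordinary (non-redirect) link to the www host still collapses to the canonical key.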


Date: 2005-02-11 22:51
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Would like to be able to use www-strip without triggering
this. Also believe other canonicalizations could be affected
-- seems common to redirect someone to a similar URL (e.g.
with a session-id added).

Needs investigation & resolution.
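
To illustrate the concern about other canonicalizations, here is a rough sketch of a check for whether a redirect target collapses onto its via under a given rule list. Everything in it is assumed for the example: the rule lambdas, the `jsessionid` pattern, and the `redirectCollides` helper are hypothetical and do not correspond to Heritrix classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class RedirectCollisionCheck {
    // Hypothetical rules, each a plain string rewrite.
    static final UnaryOperator<String> STRIP_WWW =
            u -> u.replaceFirst("^(https?://)www\\.", "$1");
    static final UnaryOperator<String> STRIP_SESSION_ID =
            u -> u.replaceFirst("(?i);jsessionid=[0-9a-z]+", "");

    // Applies the rules in order.
    static String canonicalize(String url, List<UnaryOperator<String>> rules) {
        String result = url;
        for (UnaryOperator<String> rule : rules) {
            result = rule.apply(result);
        }
        return result;
    }

    // True when via and target differ as written but share a canonical form,
    // i.e. the redirect target would be rejected as already included.
    static boolean redirectCollides(String via, String target,
                                    List<UnaryOperator<String>> rules) {
        return !via.equals(target)
                && canonicalize(via, rules).equals(canonicalize(target, rules));
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> rules = Arrays.asList(STRIP_WWW, STRIP_SESSION_ID);
        System.out.println(redirectCollides(
                "http://example.com/page",
                "http://example.com/page;jsessionid=ABC123", rules));
        System.out.println(redirectCollides(
                "http://yahoo.com/", "http://www.yahoo.com/", rules));
    }
}
```

Both calls report a collision, which suggests the problem is a property of any rule that maps a redirect target back onto its source, not of www-stripping alone.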


Date: 2005-01-21 11:27
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

Suggested fix: Have the WWW canonicalizer bypass URIs marked
as seeds (this could be configurable).

A more comprehensive fix might have it look at the via
string and the parent URI: if the via string is a redirect
and the current URI and the parent URI would become
identical after canonicalization, none should be performed
(in fact maybe this should always be the rule, regardless of
the via).


Date: 2004-12-11 00:47
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

For now, removed the www-strip rule from the default set.
Lowered the priority.


Date: 2004-12-03 17:59
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Good find.

I'd suggest making users be explicit about their seed
listings (they must list both yahoo.com and www.yahoo.com)
and not having the crawler 'magically' add to the seed list,
i.e. the original solution proposed for '[ 1069105 ] make
auto seed add on redirect optional (if happens at all)'.
That would make this item less of an issue, though we'd have
to better spell out the implications of the www
canonicalization rule when trying to crawl both
www.yahoo.com and yahoo.com. It would also make the
crawler's behavior more explicit and more predictable.


Attached File

No Files Currently Attached

Changes ( 8 )

Field          Old Value  Date              By
status_id      Open       2005-03-17 05:22  stack-sf
close_date     -          2005-03-17 05:22  stack-sf
resolution_id  None       2005-03-17 05:22  stack-sf
assigned_to    nobody     2005-03-02 19:22  gojomo
priority       5          2005-02-11 22:51  gojomo
priority       6          2004-12-11 00:47  stack-sf
priority       7          2004-12-03 22:51  gojomo
priority       9          2004-12-03 02:45  gojomo