If you try a host crawl of, for example,
http://yahoo.com, you will be redirected to
http://www.yahoo.com. The redirect target should be
added as a seed and then crawled.
However, with canonicalization in 1.2 and up,
www.yahoo.com is rejected as alreadyIncluded, since it
canonicalizes to yahoo.com. As a result, the crawl ends
as soon as the dns, robots, and root URLs for yahoo.com are retrieved,
never getting the intended content.
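
The failure can be reproduced in a few lines. This is only a
sketch, not the crawler's own code: canonicalize() and the
AlreadyIncluded class are hypothetical stand-ins that assume a
www-stripping rule and an insert-only set keyed by canonical form.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Hypothetical www-stripping rule (ports and userinfo ignored for brevity)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path or "/"
    # Drop the fragment; keep scheme, stripped host, path, and query.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


class AlreadyIncluded:
    """Assumed insert-only set of canonical URLs."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True if the URL is new and may be scheduled."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


already_included = AlreadyIncluded()

# The seed is accepted and fetched; the server answers with a redirect.
seed_scheduled = already_included.add_if_new("http://yahoo.com")

# The redirect target canonicalizes to the same key, so it is rejected
# and the crawl has nothing left to do.
redirect_scheduled = already_included.add_if_new("http://www.yahoo.com")
```

Here seed_scheduled is True and redirect_scheduled is False, which
is the situation described above.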
Some sort of exception or special handling of such
situations is necessary. Some ideas:
- compare canonicalized version against via; don't
canonicalize if same (or force-fetch if same) (only
handles some cases)
- always force-fetch seeds (only resolves for case
where URL is seed)
- clear alreadyIncluded of URLs in certain situations,
such as when they redirect to a canonical-equivalent
version of themselves. (violates preferred insert-only
behavior of alreadyIncluded; only handles some cases)
- disable www canonicalization
Should consider whether the other standard
canonicalizations have the same risk.
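
As a rough illustration of the first idea (compare the
canonicalized version against via, and force-fetch if same),
here is a minimal sketch. All names are hypothetical, and the
"forced" set is an added assumption to keep a www/non-www
redirect loop from being force-fetched forever.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Hypothetical www-stripping rule (ports and userinfo ignored for brevity)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))


def should_schedule(url, via, seen, forced):
    """Decide whether url, discovered from via, should be fetched.

    seen   -- insert-only set of canonical keys (never cleared here)
    forced -- raw URLs already force-fetched once (assumed loop guard)
    """
    key = canonicalize(url)
    redirects_to_equivalent = (
        via is not None and url != via and canonicalize(via) == key
    )
    if redirects_to_equivalent and url not in forced:
        # Same canonical form as the referring URL: bypass the
        # alreadyIncluded check once instead of rejecting.
        forced.add(url)
        return True
    if key in seen:
        return False
    seen.add(key)
    return True


seen, forced = set(), set()
first = should_schedule("http://yahoo.com", None, seen, forced)
second = should_schedule("http://www.yahoo.com", "http://yahoo.com", seen, forced)
third = should_schedule("http://www.yahoo.com", "http://yahoo.com", seen, forced)
```

With this sketch the seed and its redirect target are both
scheduled (first and second are True), while a repeat of the same
redirect is rejected (third is False). As noted in the list, this
only handles some cases: it relies on via being the URL that
issued the redirect.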
Michael Stack
Frontier
None
Public
Date: 2007-03-14 00:18
Date: 2005-03-17 05:22
Date: 2005-03-17 05:22
Date: 2005-03-17 01:36
Date: 2005-02-11 22:51
Date: 2005-01-21 11:27
Date: 2004-12-11 00:47
Date: 2004-12-03 17:59
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-03-17 05:22 | stack-sf |
| close_date | - | 2005-03-17 05:22 | stack-sf |
| resolution_id | None | 2005-03-17 05:22 | stack-sf |
| assigned_to | nobody | 2005-03-02 19:22 | gojomo |
| priority | 5 | 2005-02-11 22:51 | gojomo |
| priority | 6 | 2004-12-11 00:47 | stack-sf |
| priority | 7 | 2004-12-03 22:51 | gojomo |
| priority | 9 | 2004-12-03 02:45 | gojomo |