Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 Non-canonical seed URLs need better reporting - ID: 1217290
Last Update: Comment added ( karl-ia )

Example seed list (from AU crawl):

abareonlineshop.com
www.abareonlineshop.com
cinemedia.net
www.cinemedia.net
reframingthefuture.net
www.reframingthefuture.net
climateaustralia.org
www.climateaustralia.org

The seed report will include canonical versions with
status HTTP-200, but will report that the non-canonical
versions have not been processed.

Assigning to myself until I get a test in the harness.


Submitted: Karl Thiessen ( karl-ia ) - 2005-06-08 23:24
Priority: 6
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: Usability/UI
Group: 1.6.0
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 00:53
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-429 -- please add further
comments at that location.


Date: 2005-09-14 01:51
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I think the problem here is deeper than just reporting; for
example, in the case of the cinemedia.net/www.cinemedia.net
URIs above, the following occurs:

(1) cinemedia.net gets scheduled and counted as alreadyIncluded
(2) www.cinemedia.net does not get scheduled because it
canonicalizes (via the StripWWW rule) to cinemedia.net
(3) as a result, www.cinemedia.net always shows in seeds
report as 'not processed' (reporting problem)
(4) cinemedia.net is an unfetchable URI, so it shows as a
'failed, will retry' indefinitely
(5) HOWEVER, www.cinemedia.net is fetchable, and returns a
redirect to the real site, www.acmi.net.au. So, due to the
order in which they were tried and the canonicalization, a
valid URI/site is never tried/found. (serious coverage problem)
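
Steps (1) through (3) can be illustrated with a minimal sketch. This is not Heritrix code: the class SeedDedupSketch and its methods are hypothetical, and the only assumption carried over from the description is that URIs are de-duplicated on their canonical form, so whichever variant arrives first wins.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; names do not come from the Heritrix source.
final class SeedDedupSketch {

    // Unconditional www-stripping, standing in for the old StripWWW behaviour.
    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Returns the seeds that actually get scheduled, in input order.
    // A seed is skipped when its canonical form was already seen.
    static List<String> schedule(List<String> seeds) {
        Set<String> alreadyIncluded = new HashSet<>();
        List<String> scheduled = new ArrayList<>();
        for (String seed : seeds) {
            if (alreadyIncluded.add(stripWww(seed))) {
                scheduled.add(seed);
            }
        }
        return scheduled;
    }
}
```

With the seed order from the example list, only cinemedia.net is scheduled; www.cinemedia.net is dropped even though it is the one that would have been fetchable.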

A hack workaround for just these top 'slash page' URIs is to
only apply the StripWWW canonicalization when there is some
'path' component after the third slash. Thus,
www.cinemedia.net would not be canonicalized, but
www.cinemedia.net/home.html would be. For slash page seeds,
this would create a small risk of getting duplicate content,
but eliminate the problem where the whole site in its
alternate (www-full, www-less) form can be missed -- a
worthwhile tradeoff.
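
A rough sketch of that rule, for illustration only: the class name StripWwwSketch, the regular expression, and the restriction to http/https are my assumptions, not the contents of StripWWWRule.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "strip www only when something follows the
// third slash"; not the actual Heritrix rule.
final class StripWwwSketch {

    // group 1: scheme, group 2: host without "www.",
    // group 3: a path with at least one character after "/", or a query.
    private static final Pattern WWW_WITH_PATH = Pattern.compile(
            "^(https?://)www\\.([^/?#]+)(/.+|\\?.*)$",
            Pattern.CASE_INSENSITIVE);

    static String canonicalize(String uri) {
        Matcher m = WWW_WITH_PATH.matcher(uri);
        if (!m.matches()) {
            // Pure-hostname ('slash page') URIs and www-less URIs are left alone.
            return uri;
        }
        return m.group(1) + m.group(2) + m.group(3);
    }
}
```

Under this sketch, http://www.cinemedia.net/ stays as it is and so remains distinct from http://cinemedia.net/, while http://www.cinemedia.net/home.html is still folded into its www-less form.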

I've made this change to StripWWW, but canonicalization
problems causing us to miss content are still a problem
needing investigation, for example as part of new issue:

[ 1290579 ] canonicalization losing docs: make
content&result sensitive
http://sourceforge.net/tracker/index.php?func=detail&aid=1290579&group_id=73833&atid=539102

In the meantime, here is the commit comment for the StripWWW workaround:

Sorta-fix for [ 1217290 ] Non-canonical seed URLs need
better reporting
* StripWWWRule.java
only perform stripping if URI has some path/query
component (content after third slash); leave pure-hostname
('slash page') URIs uncanonicalized -- so that we risk
getting duplicate slash page content rather than missing
entire site only available through one or the other hostname

Assigning to Karl for verification/closing.




Date: 2005-09-14 01:26
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Re: the previous comment (apparently from Mike Schwartz)

Internally, the seed redirects do have their 'via' set to
the original seed. However, they are also marked as seeds
themselves -- this being necessary in some cases to
dynamically expand the seed-derived scope to include the new
hostname. If I understand the comment correctly, it is the
way this confuses the seed report that's a concern, rather
than the lack of necessary 'via' info. If the underlying
desire is to track all URIs crawled back to the seed from
whence they were discovered, another facility under
consideration would allow inheritable properties to flow
from any URI to the URIs discovered from it. See the
discussion in the following issue for details:

[ 1289245 ] n-hops-off decide rule / focus plus N hops scoping
http://sourceforge.net/tracker/index.php?func=detail&aid=1289245&group_id=73833&atid=539102



Date: 2005-06-15 15:35
Sender: nobody

Logged In: NO

I'd like to suggest that redirect (HTTP 301 status code) URLs
from seed URLs be assigned a "via" of the seed they were
redirected from. That way a site that redirects at the top-level
seed will show up as being crawled from the original seed,
rather than the redirected seed showing up as a brand new
seed. Among other things this will make it easier to bucket
crawled URLs according to the original seeds via which they
were injected into the crawl.
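
To make the bucketing idea concrete, here is a small sketch. It assumes each crawled URL records the URL it was discovered or redirected from as its "via"; the class ViaChainSketch and its methods are made up for this example and are not part of the crawler.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of following "via" links back to a seed.
final class ViaChainSketch {

    // Maps a URL to the URL it was reached from.
    private final Map<String, String> via = new HashMap<>();

    void record(String uri, String viaUri) {
        via.put(uri, viaUri);
    }

    // Walks the via chain until a URL with no recorded via is reached.
    // The 'seen' set stops the walk if the chain ever loops.
    String originalSeed(String uri) {
        Set<String> seen = new HashSet<>();
        String current = uri;
        while (via.containsKey(current) && seen.add(current)) {
            current = via.get(current);
        }
        return current;
    }
}
```

If a seed's redirect target carries the seed as its via, every URL found under the target resolves back to the original seed and can be bucketed with it.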

For more details on why I'd like this, please see the message
I posted to the archive-crawler@yahoogroups.com mailing
list on 10 June 2005, Subject: "[archive-crawler] mapping
between crawled and seed URLs".


Date: 2005-06-09 00:14
Sender: karl-ia

Logged In: YES
user_id=1269624

Test is in the harness under this bug# (1217290). Note that
I am changing the category of this bug from uri to UI; it's
a Web-UI only bug -- the seeds-report.txt on disk simply
reports NOTCRAWLED, with no explanation given.

Assigning to Gordon for a fix.


Attached File

No Files Currently Attached

Changes ( 6 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:14  stack-sf
close_date         -          2005-12-02 17:14  stack-sf
artifact_group_id  None       2005-09-23 18:29  gojomo
assigned_to        gojomo     2005-09-14 01:51  gojomo
category_id        uri        2005-06-09 00:14  karl-ia
assigned_to        karl-ia    2005-06-09 00:14  karl-ia