Heritrix: Internet Archive Web Crawler

Tracker: Bugs

max-trans-hops=0 generates -63 in crawl.log - ID: 1090564
Last Update: Comment added ( karl-ia )

I'd expect that setting max-trans-hops to zero would
allow me to crawl a site without ever leaving it to
follow redirects or to pick up embeds. Currently,
setting it to zero makes the crawler fail on the first
item fetched with a -63 (it looks like the DNS lookup is
considered out of scope).


Submitted: Michael Stack ( stack-sf ) - 2004-12-23 20:51
Priority: 6
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: Usability/UI
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:19
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-315 -- please add further
comments at that location.


Date: 2005-09-22 00:19
Sender: gojomo (Project Admin)

That would be intuitive behavior. It runs into two problems:

(1) DNS URIs don't give out any 'host' to be used for
classic scoping. UURI.getHost() returns nothing for
DNS URIs by design; UURI.getReferencedHost() is necessary to
get either the 'contacted' host (as in HTTP) or the 'referenced'
host (as in DNS URIs). Classic scoping should be using
getReferencedHost() rather than getHost().

(2) max-trans-hops is being enforced as a cap on the consecutive
non-link hops that end a hop path, even when the URI is otherwise in
scope. That's probably wrong; max-trans-hops is meant to be
the max hops at which a URI can get a transitive ACCEPT, not
a threshold beyond which a URI gets a hard REJECT. (Much
easier to express in decide-rule terms.)
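
To make problem (1) concrete, here is a minimal sketch of the difference
between a 'contacted' host and a 'referenced' host. It is not Heritrix code:
it uses java.net.URI in place of UURI, the class and method names are
hypothetical, and it assumes DNS URIs take the form dns:hostname.

```java
import java.net.URI;

// Hypothetical stand-in for the UURI behavior described above; not Heritrix code.
final class ReferencedHostSketch {

    // Host that would actually be contacted. For an opaque URI such as
    // "dns:www.example.com", java.net.URI reports no host, so this is null.
    static String contactedHost(URI uri) {
        return uri.getHost();
    }

    // Host the URI refers to: for dns: URIs (assumed form "dns:hostname") the
    // scheme-specific part, otherwise the same as the contacted host.
    static String referencedHost(URI uri) {
        if ("dns".equalsIgnoreCase(uri.getScheme())) {
            return uri.getSchemeSpecificPart();
        }
        return uri.getHost();
    }

    // Same-host test based on the referenced host, so that the DNS lookup
    // for a seed's host counts as belonging to that host.
    static boolean isSameHost(URI seed, URI candidate) {
        String a = referencedHost(seed);
        String b = referencedHost(candidate);
        return a != null && a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        URI seed = URI.create("http://www.example.com/index.html");
        URI dns = URI.create("dns:www.example.com");
        System.out.println("contacted host of dns URI:  " + contactedHost(dns));
        System.out.println("referenced host of dns URI: " + referencedHost(dns));
        System.out.println("same host as seed:          " + isSameHost(seed, dns));
    }
}
```

A same-host test built on the contacted host alone has nothing to compare
for the DNS URI, which is one way the seed's DNS lookup can end up out of
scope.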

I've made these changes to effect the intuitive behavior.
There could be a side effect: in-focus junk reached via
long non-link (trans) hop chains, which used to be ruled
out, will now be ruled in. But that's better than
ruling out otherwise in-focus material simply because its
hop path ended with too many non-link hops.
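
The change in how max-trans-hops is applied, and the side effect just
mentioned, can be sketched roughly as follows. This is an illustration under
stated assumptions, not the code in ClassicScope.java: the class and method
names are hypothetical, the hop path is assumed to be a string with one
character per hop where 'L' marks a link hop and any other character a
non-link hop, and "in focus" is reduced to a boolean.

```java
// Hypothetical illustration of the two readings of max-trans-hops; not Heritrix code.
final class TransHopsSketch {

    // Number of consecutive non-link hops at the end of the hop path.
    // Assumption: 'L' is a link hop, every other character is a non-link hop.
    static int transHops(String hopPath) {
        int count = 0;
        for (int i = hopPath.length() - 1; i >= 0 && hopPath.charAt(i) != 'L'; i--) {
            count++;
        }
        return count;
    }

    // Reading before the fix: exceeding max-trans-hops is a hard REJECT,
    // even for a URI that is otherwise in focus.
    static boolean acceptedBefore(boolean inFocus, String hopPath, int maxTransHops) {
        int t = transHops(hopPath);
        if (t > maxTransHops) {
            return false;
        }
        return inFocus || t > 0;
    }

    // Reading after the fix: max-trans-hops only limits which out-of-focus
    // URIs can earn a transitive ACCEPT; in-focus URIs are always accepted.
    static boolean acceptedAfter(boolean inFocus, String hopPath, int maxTransHops) {
        int t = transHops(hopPath);
        boolean transitiveAccept = t > 0 && t <= maxTransHops;
        return inFocus || transitiveAccept;
    }

    public static void main(String[] args) {
        // In-focus URI reached by one non-link hop, with max-trans-hops=0.
        System.out.println("before: " + acceptedBefore(true, "P", 0));
        System.out.println("after:  " + acceptedAfter(true, "P", 0));
        // Side effect: in-focus URI at the end of a long non-link chain.
        System.out.println("before: " + acceptedBefore(true, "LRRR", 1));
        System.out.println("after:  " + acceptedAfter(true, "LRRR", 1));
    }
}
```

Under the second reading, an out-of-focus URI one non-link hop away is still
refused when max-trans-hops is 0, which is the behavior the report asks for.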

Commit comment:

Fix for [ 1090564 ] max-trans-hops=0 generates -63 in crawl.log
* CrawlScope.java
    use getReferencedHost() for isSameHost() test
* ClassicScope.java
    don't enforce max-trans-hops as part of exceedsMaxHops
* UURI.java
    use getReferencedHost() in cacheHostBasename()

After the change, a crawl with max-trans-hops=0 has the expected
behavior.

Closing as fixed.


Changes ( 6 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 18:01  gojomo
status_id          Open       2005-09-22 23:46  gojomo
close_date         -          2005-09-22 23:46  gojomo
resolution_id      None       2005-09-22 00:19  gojomo
priority           5          2005-09-21 23:32  gojomo
assigned_to        nobody     2005-09-21 23:32  gojomo