
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 [uuri] '$' in path gets scheduled, spawns queueing error - ID: 1065413
Last Update: Comment added ( karl-ia )

In heritrix_out.log:


11/11/2004 14:32:42 -0800 WARNING org.archive.util.DevUtils warnHandle Failed to get class key: Illegal domain label: $library CrawlURI(dns:$library)
org.apache.commons.httpclient.URIException: Illegal domain label: $library
    at org.archive.crawler.datamodel.UURIFactory.checkDomainlabel(UURIFactory.java:480)
    at org.archive.crawler.datamodel.UURIFactory.fixup(UURIFactory.java:377)
    at org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:254)
    at org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:244)
    at org.archive.crawler.datamodel.UURIFactory.getInstance(UURIFactory.java:213)
    at org.archive.crawler.datamodel.CrawlURI.calculateClassKey(CrawlURI.java:396)




Here is what was in the recovery log that was responsible:

Fs http://fdab.gsfc.nasa.gov/live/$library/carpenter_russell.jpg
...
F+ dns:$library LLLLLXLLLXELLLXRLLLEP http://$library/carpenter_russell.jpg
Fe dns:$library

We should be failing way earlier on stuff like '$library'.


Submitted: Michael Stack ( stack-sf ) - 2004-11-12 19:59
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: uri
Group: 1.6.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:18
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-291 -- please add further
comments at that location.


Date: 2005-09-21 23:00
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Problems with illegal DNS URIs mitigated. Commit comment:

Fix for [ 1065413 ] [uuri] '$' in path gets scheduled, spawns queueing error
* FetchDNS.java
  when DNS URI gives no workable hostname/IP, mark URI as unfetchable (-7)
  rather than throwing NPE
* HostnameQueueAssignmentPolicy.java, SurtAuthorityQueueAssignmentPolicy.java
  downgrade incidence of URI getting a default class key, because a usual
  key cannot be calculated, to an INFO log event (rather than WARNING)

Closing as fixed.


Date: 2005-09-21 22:47
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

In reproducing (by starting with seeds which include: (1) an
HTTP URI with a '$' in one path-segment; (2) a "dns:$foo"
URI) I find that...

- there's no inherent problem with '$' in the HTTP URI --
it's attempted fine
- there's no problem with abs-path relative (eg '/$foo/bar')
or fully relative (eg '$foo/bar') URIs; they're
derelativized and attempted properly
- there are some problems with URIs of the form "dns:$foo".
Because UURI parsing doesn't enforce any internal structure
for schemes other than 'http' (such as 'dns'), the bad URI
will be accepted (from seeds or other references) and be
scheduled. This then causes (1) a warning when a proper
hostname/surt-authority based class key cannot be made; (2)
an NPE inside FetchDNS's attempted dotted IPv4 match, because
UURI.getReferencedHost() returned null.

We could add knowledge of what a valid DNS URI looks like to
UURI's parsing -- but it's not strictly necessary, and
accepting an invalid DNS URI as a UURI is a recoverable
problem, on its own.
So I've spawned a separate RFE for UURI to validate DNS
URIs. See:
[ 1298220 ] UURI could reject illegal DNS URIs
http://sourceforge.net/tracker/index.php?func=detail&aid=1298220&group_id=73833&atid=539102
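
As a rough illustration of what "knowledge of what a valid DNS URI looks like" could mean, here is a minimal standalone sketch. It is not Heritrix code: the class DnsUriCheck, the method hostOf and the label rule (letters, digits and interior hyphens, at most 63 characters per label) are assumptions made for this example only.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Heritrix: checks the host part of a "dns:" URI. */
final class DnsUriCheck {
    // Assumed label rule: starts and ends with a letter or digit,
    // hyphens allowed in between, 1 to 63 characters in total.
    private static final Pattern LABEL =
        Pattern.compile("[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    /** Returns the host of a "dns:" URI, or null if it does not look like a hostname. */
    static String hostOf(String dnsUri) {
        if (dnsUri == null || !dnsUri.regionMatches(true, 0, "dns:", 0, 4)) {
            return null;
        }
        String host = dnsUri.substring(4);
        if (host.isEmpty() || host.length() > 253) {
            return null;
        }
        // The -1 limit keeps empty labels, so "a..b" and a trailing dot are rejected.
        for (String label : host.split("\\.", -1)) {
            if (!LABEL.matcher(label).matches()) {
                return null;
            }
        }
        return host;
    }
}
```

Under these assumptions, hostOf("dns:$library") returns null, so such a URI could be rejected before it is ever scheduled, while hostOf("dns:fdab.gsfc.nasa.gov") returns the hostname unchanged.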

Failing to get a regular class key is also not a serious
problem; that's why a default fallback class key exists.
Probably this WARNING-level problem should be demoted to
INFO so it doesn't clutter the Alert box.
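
The fallback-plus-demotion idea can be sketched as below, using java.util.logging. This is an illustration only: ClassKeySketch, classKeyFor and the DEFAULT_CLASS_KEY value are invented names, and the real queue assignment policies compute their keys differently.

```java
import java.util.Locale;
import java.util.logging.Logger;

/** Hypothetical sketch, not the Heritrix queue assignment policy. */
final class ClassKeySketch {
    private static final Logger LOGGER =
        Logger.getLogger(ClassKeySketch.class.getName());

    // Assumed placeholder value for the fallback key.
    static final String DEFAULT_CLASS_KEY = "default";

    /** Returns a host-based key, or the default key when no host is available. */
    static String classKeyFor(String host) {
        if (host == null || host.isEmpty()) {
            // Logged at INFO rather than WARNING: the fallback is a normal, recoverable path.
            LOGGER.info("No usable class key; falling back to " + DEFAULT_CLASS_KEY);
            return DEFAULT_CLASS_KEY;
        }
        return host.toLowerCase(Locale.ROOT);
    }
}
```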

The NPE is a serious problem, as it gets handled by the
catchall handler. We recover OK, but the exception shouldn't
have to escape the whole normal processing chain; FetchDNS
shouldn't be throwing NPEs for unfetchable URIs; it should
be marking them as problematic and letting normal processing
continue.
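
The shape of that guard might look roughly like the following. It is a simplified sketch rather than the actual FetchDNS change: DnsFetchSketch, Outcome and classify are made-up names, and the dotted-quad pattern is a loose stand-in for whatever IPv4 match FetchDNS performs.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of a null-safe host check, not the real FetchDNS. */
final class DnsFetchSketch {
    /** Possible results; UNFETCHABLE corresponds to the -7 status in the commit comment. */
    enum Outcome { UNFETCHABLE, LITERAL_IP, NEEDS_LOOKUP }

    // Loose dotted-quad shape check; it does not validate that each octet is <= 255.
    private static final Pattern DOTTED_QUAD =
        Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    /** Classifies the referenced host without throwing when it is missing. */
    static Outcome classify(String referencedHost) {
        if (referencedHost == null || referencedHost.isEmpty()) {
            // Mark as problematic and let normal processing continue,
            // instead of letting the regex match throw an NPE.
            return Outcome.UNFETCHABLE;
        }
        if (DOTTED_QUAD.matcher(referencedHost).matches()) {
            return Outcome.LITERAL_IP;
        }
        return Outcome.NEEDS_LOOKUP;
    }
}
```

The point of the design is that a null host is handled as an ordinary outcome inside the processor, so nothing reaches the catchall handler.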





Attached File

No Files Currently Attached

Changes ( 6 )

Field              Old Value                       Date              By
artifact_group_id  None                            2005-09-23 18:01  gojomo
status_id          Open                            2005-09-21 23:00  gojomo
resolution_id      None                            2005-09-21 23:00  gojomo
close_date         -                               2005-09-21 23:00  gojomo
assigned_to        stack-sf                        2005-09-21 22:47  gojomo
summary            [uuri] '$' in path does us in.  2004-12-02 21:23  gojomo