Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 [UURI] Catch bad-encoding earlier - ID: 1002144
Last Update: Comment added ( karl-ia )

The parent URI class of UURI can tell that the URI below
is improperly encoded, but our UURI fixup code lets it
through because it judges the URI to be already encoded
(escaped). There must be a test in the parent class that
we can use to check the escaping, so that we fail this
URI before it gets to PathDepthFilter and httpclient.

07/31/2004 15:31:00 -0700 SEVERE
org.archive.crawler.filter.PathDepthFilter innerAccepts
Failed getpath for
http://club-scar.com/zboard4/B.Kaga%3A%B%E8%C1%BE%BC%F6
07/31/2004 15:31:05 -0700 SEVERE
org.archive.crawler.filter.PathDepthFilter innerAccepts
Failed getpath for
http://club-scar.com/zboard4/B.Kaga%3A%B%E8%C1%BE%BC%F6
07/31/2004 15:31:05 -0700 SEVERE
org.archive.crawler.prefetch.PreconditionEnforcer
considerRobotsPreconditions Failed get of path for
CrawlURI(http://club-scar.com/zboard4/B.Kaga%3A%B%E8%C1%BE%BC%F6)
java.lang.IllegalArgumentException: Invalid uri
'http://club-scar.com/zboard4/B.Kaga%3A%B%E8%C1%BE%BC%F6':
incomplete trailing escape pattern
at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java(Compiled Code))
at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java(Inlined Compiled Code))
at org.archive.httpclient.HttpRecorderGetMethod.<init>(HttpRecorderGetMethod.java(Inlined Compiled Code))
at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java(Compiled Code))
at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java(Compiled Code))
at org.archive.crawler.framework.ToeThread.run(ToeThread.java(Compiled Code))
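
The kind of early test asked for above could look roughly like the sketch below. It only scans for a '%' that is not followed by two hex digits, which is the condition both failing URIs in this report share. EscapeCheck and firstBadEscape are hypothetical names invented for illustration; they are not part of UURI, Heritrix, or commons-httpclient, and this is not necessarily the test the parent class applies.

```java
// Hypothetical helper for illustration only; not Heritrix or httpclient code.
final class EscapeCheck {
    private EscapeCheck() {}

    /**
     * Returns the index of the first '%' that is not followed by two
     * hex digits, or -1 if every escape in the string is well formed.
     */
    static int firstBadEscape(String uri) {
        for (int i = 0; i < uri.length(); i++) {
            if (uri.charAt(i) != '%') {
                continue;
            }
            // A valid escape needs two more characters, both hex digits.
            if (i + 2 >= uri.length()
                    || !isHex(uri.charAt(i + 1))
                    || !isHex(uri.charAt(i + 2))) {
                return i;
            }
            i += 2; // skip the two hex digits just checked
        }
        return -1;
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    public static void main(String[] args) {
        String uri = "http://club-scar.com/zboard4/B.Kaga%3A%B%E8%C1%BE%BC%F6";
        int bad = firstBadEscape(uri);
        System.out.println(bad < 0
                ? "escapes look well formed"
                : "bad escape starts at index " + bad);
    }
}
```

A fixup step could call such a check after deciding a URI is "already escaped" and reject the URI if the result is not -1, rather than passing it on to PathDepthFilter.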


Submitted: Michael Stack ( stack-sf ) - 2004-08-02 18:12

Priority: 5

Status: Closed

Resolution: Duplicate

Assigned: Michael Stack

Category: Extraction

Group: None

Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:15
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-213 -- please add further
comments at that location.


Date: 2004-10-13 23:15
Sender: stack-sf (Project Admin)

Closing as a duplicate of

[ 1036680 ] PathDepthFilter innerAccepts SEVERE log: "Failed
getPath..."

https://sourceforge.net/tracker/index.php?func=detail&aid=1036680&group_id=73833&atid=539099




Date: 2004-08-05 18:19
Sender: stack-sf (Project Admin)

Here is another example, from ukgov. The output below was
in heritrix_out; it should have been caught by UURI and
logged in uri-errors.

08/03/2004 10:42:37 -0700 SEVERE
org.archive.crawler.prefetch.PreconditionEnforcer
considerRobotsPreconditions Failed get of path for
CrawlURI(http://www.army.mod.uk/img/101regtrav/REME%20lightning%2050%.jpg)
java.lang.IllegalArgumentException: Invalid uri
'http://www.army.mod.uk/img/101regtrav/REME%20lightning%2050%.jpg':
incomplete trailing escape pattern
at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java(Compiled Code))
at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java(Inlined Compiled Code))
at org.archive.httpclient.HttpRecorderGetMethod.<init>(HttpRecorderGetMethod.java(Inlined Compiled Code))
at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java(Compiled Code))
at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java(Compiled Code))
at org.archive.crawler.framework.ToeThread.run(ToeThread.java(Compiled Code))
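
One way to turn a failure like this into a single uri-errors entry instead of a SEVERE stack trace is to attempt a strict parse up front and log the rejection reason. The sketch below uses the JDK's java.net.URI purely as a stand-in strict parser; it is not the parent class of UURI, and it rejects some inputs (unescaped spaces, for example) that a crawler would probably want to fix up rather than drop. EarlyUriReject, parseOrLog, and the logger name are assumptions made for illustration and do not reflect how Heritrix wires its uri-errors log.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.logging.Logger;

// Hypothetical sketch; class, method, and logger names are illustrative only.
final class EarlyUriReject {
    // Assumption: a dedicated logger standing in for the uri-errors log.
    private static final Logger URI_ERRORS = Logger.getLogger("uri-errors");

    private EarlyUriReject() {}

    /** Returns the parsed URI, or null after logging why it was rejected. */
    static URI parseOrLog(String candidate) {
        try {
            return new URI(candidate);
        } catch (URISyntaxException e) {
            // One short line per bad URI, with no stack trace.
            URI_ERRORS.warning(e.getReason() + " at index " + e.getIndex()
                    + ": " + e.getInput());
            return null;
        }
    }

    public static void main(String[] args) {
        String uri =
            "http://www.army.mod.uk/img/101regtrav/REME%20lightning%2050%.jpg";
        System.out.println(parseOrLog(uri) == null ? "rejected" : "accepted");
    }
}
```

Rejecting at this point would keep the malformed URI from ever reaching PreconditionEnforcer or FetchHTTP, which is where the traces above were raised.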



Attached File

No Files Currently Attached

Changes ( 3 )

Field          Old Value  Date              By
status_id      Open       2004-10-13 23:15  stack-sf
resolution_id  None       2004-10-13 23:15  stack-sf
close_date     -          2004-10-13 23:15  stack-sf