Heritrix: Internet Archive Web Crawler

Tracker: Bugs

URIException in deserialization, post CrawlURI slimming - ID: 1212377
Last Update: Comment added ( karl-ia )

From Christian Kohlschütter on the discussion list:
==================================
With the [recent CrawlURI serialization slimming] changes applied, I get
URIExceptions (usually hidden behind BDB's RuntimeExceptionWrapper), with
messages like
- "Relative URI but no base: :http:/www.imagesjournal.com/issue02/reviews/mannoirs.htm"
- "Invalid URL encoding" (happens if URI's last character is '%').

Here is a partial stacktrace from my own "NewFrontier" implementation (it has
the same problems as the BdbFrontier, but also shows the exception's cause):

Caused by: org.apache.commons.httpclient.URIException: Invalid URL encoding
    at org.apache.commons.httpclient.URI.decode(URI.java:1768)
    at org.apache.commons.httpclient.URI.decode(URI.java:1724)
    at org.apache.commons.httpclient.URI.getURI(URI.java:3743)
    at org.archive.crawler.datamodel.CandidateURI.writeObject(CandidateURI.java:517)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:890)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1333)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1284)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1073)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:291)
    at de.kohlschuetter.collections.queues.BucketQueue.enqueue(BucketQueue.java:221)
    at org.archive.crawler.frontier.NewWorkQueue.insertItem(NewWorkQueue.java:53)
    at org.archive.crawler.frontier.WorkQueue.insert(WorkQueue.java:352)
    at org.archive.crawler.frontier.WorkQueue.enqueue(WorkQueue.java:122)
    ... 12 more

I am not sure how/why "bad URIs" reached that point, but without your patch, I
do not get any exceptions.
-========================================
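
The "Invalid URL encoding" case reported above is triggered when a '%' is not followed by two hex digits, most visibly when '%' is the URI's last character. The following is a minimal sketch of that check in plain Java. It is not Heritrix or HttpClient code: the class and method names are hypothetical, and rewriting a stray '%' as "%25" is only one possible repair policy, assumed here for illustration.

```java
// Illustrative sketch only; not taken from Heritrix or commons-httpclient.
final class PercentEscapeCheck {

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    /** True if the '%' at index i is followed by two hex digits. */
    private static boolean isEscapeAt(String s, int i) {
        return i + 2 < s.length()
            && isHexDigit(s.charAt(i + 1))
            && isHexDigit(s.charAt(i + 2));
    }

    /** True if every '%' in s starts a complete %XX escape. */
    static boolean hasValidEscapes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '%' && !isEscapeAt(s, i)) {
                return false;
            }
        }
        return true;
    }

    /** Assumed repair policy: rewrite each stray '%' as "%25". */
    static String escapeStrayPercents(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && !isEscapeAt(s, i)) {
                out.append("%25");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A string that passes `hasValidEscapes` can be percent-decoded without hitting a truncated escape, which is the condition the stack trace above fails on during `URI.decode`.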


Submitted: Gordon Mohr ( gojomo ) - 2005-06-01 02:43
Priority: 7
Status: Closed
Resolution: Duplicate
Assigned: Gordon Mohr
Category: Frontier
Group: 1.6.0
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 00:53
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-423 -- please add further
comments at that location.


Date: 2005-08-09 01:38
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Closing as duplicate (esp wrt trailing '%' issue) of...

[ 1213095 ] UURI handling of inconsistent escaping makes
broken instance
https://sourceforge.net/tracker/index.php?func=detail&aid=1213095&group_id=73833&atid=539099


Date: 2005-08-04 20:44
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Fix in progress for

[ 1242747 ] over-escaping (of '%', etc) compared to browsers

should also address this.


Date: 2005-07-12 01:13
Sender: karl-ia

Logged In: YES
user_id=1269624

Test in harness under this bug ID, currently failing for
URLs ending in unescaped percent sign.

Assigning to Gordon for further investigation.


Date: 2005-06-02 20:23
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Workaround for HTTPClient issue committed; comment:

Workaround for HTTPClient bug #35148
<http://issues.apache.org/bugzilla/show_bug.cgi?id=35148>
* UURI.java
    override parseUriReference() to prevent URIs which begin with a colon
    from being interpreted as absolute URIs with a zero-length scheme
    (treating them as relative URIs instead matches IE/Firefox browser
    behavior)

--
Separately, committed a change making CandidateURI deserialization more
robust against other potential errors of this type. Comment:

Fallback fix for [ 1212377 ] URIException in deserialization, post
CrawlURI slimming
* CandidateURI.java
    Make deserialization more robust against problems creating UURI,
    including last-ditch fallback to synthesized 'invalid:' scheme so that
    dequeueing/processing isn't broken by runtimeexceptions
* heritrix.properties
    Add pseudo-scheme 'invalid' as supported UURI scheme
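
The fallback idea in that comment can be illustrated with a small sketch: if a stored string cannot be parsed back into a URI, wrap it under a synthesized 'invalid:' scheme instead of letting the exception escape. This is not the CandidateURI code. It uses java.net.URI as a stand-in for UURI, and the class name, method name, and the final "invalid:unparseable" placeholder are assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch only; java.net.URI stands in for Heritrix's UURI.
final class InvalidSchemeFallback {

    /**
     * Parses a stored URI string. On failure, returns a URI under the
     * synthesized "invalid" scheme rather than throwing, so that a caller
     * dequeueing items is not interrupted by one bad entry.
     */
    static URI parseOrInvalid(String stored) {
        try {
            return new URI(stored);
        } catch (URISyntaxException e) {
            try {
                // The multi-argument constructor quotes illegal characters
                // in the scheme-specific part, including a stray '%'.
                return new URI("invalid", stored, null);
            } catch (URISyntaxException e2) {
                // Hypothetical last resort if even the wrapped form fails.
                return URI.create("invalid:unparseable");
            }
        }
    }
}
```

Both problem strings from this report (one beginning with ':' and one ending in '%') take the fallback path here and come back with the scheme "invalid", while well-formed URIs are returned unchanged. Downstream code can then recognize and skip the 'invalid' scheme, which is why the scheme also has to be registered as supported.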

--
Believed fixed from my end; assigning to Karl for
verification if appropriate.






Date: 2005-06-01 21:39
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Making this issue strictly for the problem on
deserialization, as evidenced by any (relative) URI which
begins with a ':'.

See this HTTPClient bug for related discussion:

http://issues.apache.org/bugzilla/show_bug.cgi?id=35148

Spawning separate bug for problem on serialization.



Attached File

No Files Currently Attached

Changes ( 7 )

Field | Old Value | Date | By
artifact_group_id | None | 2005-09-23 18:01 | gojomo
close_date | - | 2005-08-09 01:38 | gojomo
status_id | Open | 2005-08-09 01:38 | gojomo
resolution_id | None | 2005-08-09 01:38 | gojomo
assigned_to | karl-ia | 2005-07-12 01:13 | karl-ia
assigned_to | gojomo | 2005-06-02 20:23 | gojomo
summary | URIExceptions in (de)serialization, post CrawlURI slimming | 2005-06-01 21:39 | gojomo