From Christian Kohlschütter on the discussion list:
==================================
With the [recent CrawlURI serialization slimming] changes applied, I get URIExceptions (usually hidden behind BDB's RuntimeExceptionWrapper), with messages like
- "Relative URI but no base: :http:/www.imagesjournal.com/issue02/reviews/mannoirs.htm"
- "Invalid URL encoding" (happens if URI's last character is '%').
Here is a partial stacktrace from my own "NewFrontier" implementation (it has the same problems as the BdbFrontier, but also shows the exception's cause):
Caused by: org.apache.commons.httpclient.URIException: Invalid URL encoding
    at org.apache.commons.httpclient.URI.decode(URI.java:1768)
    at org.apache.commons.httpclient.URI.decode(URI.java:1724)
    at org.apache.commons.httpclient.URI.getURI(URI.java:3743)
    at org.archive.crawler.datamodel.CandidateURI.writeObject(CandidateURI.java:517)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:890)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1333)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1284)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1073)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:291)
    at de.kohlschuetter.collections.queues.BucketQueue.enqueue(BucketQueue.java:221)
    at org.archive.crawler.frontier.NewWorkQueue.insertItem(NewWorkQueue.java:53)
    at org.archive.crawler.frontier.WorkQueue.insert(WorkQueue.java:352)
    at org.archive.crawler.frontier.WorkQueue.enqueue(WorkQueue.java:122)
    ... 12 more
I am not sure how/why "bad URIs" reached that point, but without your patch, I do not get any exceptions.
-========================================
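The "Invalid URL encoding" case reported above (a URI whose last character is '%') can be illustrated with a minimal sketch. Note the assumptions: this uses the JDK's `java.net.URLDecoder` as a stand-in, not the commons-httpclient `URI.decode` call from the stack trace, so the exception type differs (`IllegalArgumentException` rather than `URIException`). The class name `TrailingPercentDemo`, the helper `isDecodable`, and the `example.com` URLs are hypothetical and exist only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Heritrix or commons-httpclient.
class TrailingPercentDemo {

    /** Returns true if the JDK's URLDecoder can percent-decode the string. */
    public static boolean isDecodable(String s) {
        try {
            URLDecoder.decode(s, "UTF-8");
            return true;
        } catch (IllegalArgumentException e) {
            // Raised for malformed escapes, e.g. a '%' with no hex digits after it.
            return false;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new AssertionError("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // A complete escape sequence decodes without error.
        System.out.println(isDecodable("http://example.com/a%20b")); // true
        // A dangling '%' at the end of the string cannot be decoded.
        System.out.println(isDecodable("http://example.com/a%"));    // false
    }
}
```

A check of this kind, applied before a URI is serialized, would be one way to see which URIs trigger the failure; whether that matches the eventual fix for this issue is not stated in the report.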
Gordon Mohr
Frontier
1.6.0
Public
Date: 2007-03-14 00:53
Date: 2005-08-09 01:38
Date: 2005-08-04 20:44
Date: 2005-07-12 01:13
Date: 2005-06-02 20:23
Date: 2005-06-01 21:39
| Field | Old Value | Date | By |
|---|---|---|---|
| artifact_group_id | None | 2005-09-23 18:01 | gojomo |
| close_date | - | 2005-08-09 01:38 | gojomo |
| status_id | Open | 2005-08-09 01:38 | gojomo |
| resolution_id | None | 2005-08-09 01:38 | gojomo |
| assigned_to | karl-ia | 2005-07-12 01:13 | karl-ia |
| assigned_to | gojomo | 2005-06-02 20:23 | gojomo |
| summary | URIExceptions in (de)serialization, post CrawlURI slimming | 2005-06-01 21:39 | gojomo |