
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 UURIFactory.validateEscaping() -> IllegalArgumentException - ID: 1046696
Last Update: Comment added ( karl-ia )

Title: Problem occured processing
'http://ompo.dent.umich.edu/contact.html'
Time: Oct. 13, 2004 23:39:56 GMT
Level: SEVERE
Message:

Problem java.lang.IllegalArgumentException: Parameter
may not be null occured when trying to process
'http://ompo.dent.umich.edu/contact.html' at step
ABOUT_TO_BEGIN_PROCESSOR

UURIFactory.validateEscaping() is using HTTPClient's
EncodingUtil.getAsciiBytes(), which is throwing an
unhandled IllegalArgumentException on a number of pages.

Example alert:

Associated Throwable:
java.lang.IllegalArgumentException: Parameter may not be null

Message:
Parameter may not be null

Stacktrace:
java.lang.IllegalArgumentException: Parameter may not be null
    at org.apache.commons.httpclient.util.EncodingUtil.getAsciiBytes(EncodingUtil.java:232)
    at org.archive.crawler.datamodel.UURIFactory.validateEscaping(UURIFactory.java:530)
    at org.archive.crawler.datamodel.UURIFactory.fixup(UURIFactory.java:508)
    at org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:307)
    at org.archive.crawler.datamodel.UURIFactory.getInstance(UURIFactory.java:266)
    at org.archive.crawler.postprocessor.Postselector.handleLinkCollection(Postselector.java:349)
    at org.archive.crawler.postprocessor.Postselector.innerProcess(Postselector.java:172)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:251)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:127)

Other URIs showing same problem stack:

http://ompo.dent.umich.edu/dept.html
http://ompo.dent.umich.edu/research.html
http://ompo.dent.umich.edu/services.html
http://ompo.dent.umich.edu/contact.html
http://www.ci.sterling-heights.mi.us/bin/site/templates/splash.asp
http://www.casl.umd.umich.edu/hum/hum-intern/placements.html

Nothing immediately leaps out as being
challenging/atypical about the page content.
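
The failure mode in the stack trace can be illustrated in isolation. The following is only a sketch: getAsciiBytes() here is a local stand-in that mimics the null rejection seen in the trace, and hasValidEscaping() is a hypothetical simplification, not Heritrix's actual validateEscaping(). It assumes the problem is a URI component (for example a missing query) reaching the check as null.

```java
import java.nio.charset.StandardCharsets;

public class EscapingCheckSketch {

    // Stand-in for HttpClient's EncodingUtil.getAsciiBytes(): rejects null
    // with the same message that appears in the reported stack trace.
    static byte[] getAsciiBytes(String data) {
        if (data == null) {
            throw new IllegalArgumentException("Parameter may not be null");
        }
        return data.getBytes(StandardCharsets.US_ASCII);
    }

    static boolean isHex(byte b) {
        return Character.digit((char) b, 16) != -1;
    }

    // Hypothetical null-safe check: returns true when the component is
    // absent, or when every '%' is followed by two hex digits.
    static boolean hasValidEscaping(String component) {
        if (component == null) {
            return true; // nothing to validate; avoids the exception above
        }
        byte[] bytes = getAsciiBytes(component);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] != '%') {
                continue;
            }
            // A '%' needs two more bytes after it, both hex digits.
            if (i + 2 >= bytes.length
                    || !isHex(bytes[i + 1]) || !isHex(bytes[i + 2])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasValidEscaping(null));
        System.out.println(hasValidEscaping("a%20b"));
        System.out.println(hasValidEscaping("100%zz"));
    }
}
```

Without the null guard, the first call in main() would fail in getAsciiBytes() in the same way as the reported alert.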



Submitted: Gordon Mohr ( gojomo ) - 2004-10-14 00:04
Priority: 6
Status: Closed
Resolution: Fixed
Assigned: Nobody/Anonymous
Category: Extraction
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-261 -- please add further
comments at that location.


Date: 2004-10-14 01:21
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Fixed. The commit message is below. Closing.

Fix for "[ 1046696 ] UURIFactory.validateEscaping() ->
IllegalArgumentException". Also fix it so that if a port has
leading zeros, as in '00080', they get stripped off (was causing
problem '[ 1046657 ] READY queue is BUSY (Was:
NoSuchElementException dequeuing)').
* src/java/org/archive/crawler/datamodel/UURIFactory.java
  (checkPort): Have it return authority in case it has to
  rewrite it to strip leading zeros.
  (validateEscaping): Check for null argument.
* src/java/org/archive/crawler/datamodel/UURIFactoryTest.java
  Added test for stripping of zeros from front of port number.
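
The port rewrite described in the commit message can be sketched roughly as follows. This is not the actual checkPort() code; stripPortLeadingZeros() is a hypothetical helper, and it assumes the port is whatever follows the last ':' in the authority when that suffix is all digits.

```java
public class PortFixupSketch {

    // Hypothetical helper: returns the authority with leading zeros removed
    // from its port, or the authority unchanged if it has no numeric port.
    static String stripPortLeadingZeros(String authority) {
        if (authority == null) {
            return null;
        }
        int colon = authority.lastIndexOf(':');
        if (colon < 0 || colon == authority.length() - 1) {
            return authority; // no port present
        }
        String port = authority.substring(colon + 1);
        for (int i = 0; i < port.length(); i++) {
            char c = port.charAt(i);
            if (c < '0' || c > '9') {
                return authority; // not a port, e.g. the tail of "[::1]"
            }
        }
        // Skip leading zeros but always keep at least one digit.
        int start = 0;
        while (start < port.length() - 1 && port.charAt(start) == '0') {
            start++;
        }
        return authority.substring(0, colon + 1) + port.substring(start);
    }

    public static void main(String[] args) {
        System.out.println(stripPortLeadingZeros("www.example.com:00080"));
    }
}
```

Returning the rewritten authority, rather than only validating it, matches the commit's note that checkPort now returns the authority so the caller can use the normalized form.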



Attached File

No Files Currently Attached

Changes ( 3 )

Field          Old Value  Date              By
status_id      Open       2004-10-14 01:21  stack-sf
resolution_id  None       2004-10-14 01:21  stack-sf
close_date     -          2004-10-14 01:21  stack-sf