Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 NPE in FetchDNS, caused by UURI - ID: 1187973
Last Update: Comment added ( karl-ia )

(for post-1.4.0)

Bad URIs with trailing garbage can cause FetchDNS to
throw a NullPointerException:

Problem java.lang.NullPointerException occured when
trying to process 'dns:www.treasurequest.com.%0A%3CBR'
at step PROCESSING in DNS

Associated Throwable: java.lang.NullPointerException

Stacktrace:
java.lang.NullPointerException
    at java.util.regex.Matcher.getTextLength(Matcher.java:1127)
    at java.util.regex.Matcher.reset(Matcher.java:284)
    at java.util.regex.Matcher.<init>(Matcher.java:205)
    at java.util.regex.Pattern.matcher(Pattern.java:879)
    at org.archive.crawler.fetcher.FetchDNS.innerProcess(FetchDNS.java:114)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:282)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:151)


The reason is that for the URI
'dns:www.treasurequest.com.%0A%3CBR',
UURI.getReferencedHost() returns null.

FetchDNS should therefore either check whether null is
returned, or UURI.getReferencedHost() should throw a
URIException in that case.

However, getReferencedHost's javadoc says nothing about
'null' return values, so the latter option should be
preferred, in my opinion.
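
The first option (a null check in the caller) can be sketched as follows. This is a minimal, self-contained illustration, not the Heritrix code: the class `DnsHostGuard`, the `Outcome` enum, and the dotted-quad pattern are all hypothetical stand-ins, since the actual expression at FetchDNS.java:114 is not shown in this report. It only demonstrates that `Pattern.matcher(null)` throws the NullPointerException seen in the stack trace, and how a guard avoids it.

```java
import java.util.regex.Pattern;

class DnsHostGuard {
    /** Hypothetical outcomes; Heritrix uses its own status codes. */
    enum Outcome { UNFETCHABLE, IP_LITERAL, NEEDS_LOOKUP }

    // Assumed pattern for illustration only: a dotted-quad IPv4 literal.
    private static final Pattern DOTTED_QUAD =
        Pattern.compile("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");

    /** Unguarded match: a null host reproduces the NPE inside Matcher. */
    static boolean matchUnguarded(String host) {
        return DOTTED_QUAD.matcher(host).matches();
    }

    /** Guarded variant: a missing host is reported instead of thrown. */
    static Outcome classify(String host) {
        if (host == null) {
            // Stand-in for a null return from getReferencedHost().
            return Outcome.UNFETCHABLE;
        }
        return DOTTED_QUAD.matcher(host).matches()
            ? Outcome.IP_LITERAL
            : Outcome.NEEDS_LOOKUP;
    }
}
```

With this guard, a null host yields `Outcome.UNFETCHABLE` rather than an exception propagating out of the processor.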


Submitted: Christian Kohlschütter ( ck-heritrix ) - 2005-04-22 10:39
Priority: 5
Status: Closed
Resolution: Out of Date
Assigned: Nobody/Anonymous
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:22
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-397 -- please add further
comments at that location.


Date: 2005-09-22 19:41
Sender: gojomo (Project Admin)

There's a new RFE for UURI to ensure DNS URIs are legal at
construction time:

[ 1298220 ] UURI could reject illegal DNS URIs
http://sourceforge.net/tracker/index.php?func=detail&aid=1298220&group_id=73833&atid=539102

In the meantime, a fix for another bug has made FetchDNS
better handle a null getReferencedHost() return value from a
UURI. See:

[ 1065413 ] [uuri] '$' in path gets scheduled, spawns
queueing error
https://sourceforge.net/tracker/?func=detail&atid=539099&aid=1065413&group_id=73833

This is preferred somewhat over having
UURI.getReferencedHost() throw an exception, because
ideally, once something is instantiated as a U(sable)URI,
it should not be subject to parse/format errors on follow-up
accesses.


I've confirmed that this other FetchDNS fix also addresses
this NPE, replacing it with a more-sensible marking of the
DNS URI as being unfetchable.

So, closing as out-of-date (because fixed elsewhere).
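
The construction-time rejection proposed in RFE 1298220 could look roughly like the sketch below. It is an illustration under stated assumptions, not UURI's actual logic: the class `DnsUriCheck`, the method `hostOf`, and the hostname rule (dot-separated labels of letters, digits, and hyphens, with an optional trailing dot) are all hypothetical.

```java
import java.util.regex.Pattern;

class DnsUriCheck {
    // Assumed rule: one label of 1-63 letters, digits, or inner hyphens.
    private static final String LABEL =
        "[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?";

    // Dot-separated labels, optionally ending in a root dot.
    private static final Pattern HOSTNAME =
        Pattern.compile(LABEL + "(?:\\." + LABEL + ")*\\.?");

    /** Returns the host of a dns: URI, or throws if it is not legal. */
    static String hostOf(String uri) {
        if (uri == null || !uri.regionMatches(true, 0, "dns:", 0, 4)) {
            throw new IllegalArgumentException("not a dns: URI: " + uri);
        }
        String host = uri.substring(4);
        if (!HOSTNAME.matcher(host).matches()) {
            // Escapes such as %0A are not hostname characters.
            throw new IllegalArgumentException("illegal host in DNS URI: " + uri);
        }
        return host;
    }
}
```

Under these assumptions, 'dns:www.treasurequest.com.%0A%3CBR' is rejected when the URI is built, so later accessors never need to report a parse failure.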


Date: 2005-04-22 16:19
Sender: stack-sf (Project Admin)

dns UURIs always return null for path (their 'host' is the
path component). Need to fix javadoc.

dns:www.treasurequest.com.%0A%3CBR looks like it should fail
a UURI parse -- this would cause the UURI to fail before it
got near FetchDNS.

Thanks for reporting, Christian.


Attached File

No Files Currently Attached

Changes ( 4 )

Field              Old Value   Date               By
artifact_group_id  None        2005-09-23 18:01   gojomo
status_id          Open        2005-09-22 19:41   gojomo
resolution_id      None        2005-09-22 19:41   gojomo
close_date         -           2005-09-22 19:41   gojomo