Heritrix: Internet Archive Web Crawler

Tracker: Bugs

Change in ExtractorHTML triggers NullPointerExceptions - ID: 1123859
Last Update: Comment added ( karl-ia )

Hello,

I just noticed that the latest modification of
ExtractorHTML in CVS HEAD causes alerts that are
probably unwarranted, for example:

> Problem java.lang.NullPointerException occured when trying to process 'http://www.nthposition.com/author.php?authid=281' at step ABOUT_TO_BEGIN_PROCESSOR in ExtractorHTML:
>
> Associated Throwable: java.lang.NullPointerException
>
> Stacktrace:
> java.lang.NullPointerException
> at org.archive.crawler.extractor.ExtractorHTML.innerProcess(ExtractorHTML.java:352)
> at org.archive.crawler.framework.Processor.process(Processor.java:102)
> at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
> at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)

The cause is that curi.getHttpRecorder() may return null
(its javadoc says so), so an assignment like
cs = curi.getHttpRecorder().getReplayCharSequence()
(ExtractorHTML, line 352) can throw a
NullPointerException. This exception is no longer caught
directly, so it propagates too far up the call stack and
triggers the reported alert.

A patch fixing this behaviour is attached.


Christian Kohlschütter


Submitted: Christian Kohlschütter ( ck-heritrix ) - 2005-02-16 11:13
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Nobody/Anonymous
Category: Extraction
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:21
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-363 -- please add further
comments at that location.


Date: 2005-02-21 13:42
Sender: ck-heritrix

Logged In: YES
user_id=1220421

The offending line (marked *** below) still appears to be
in CVS HEAD, ahead of the null check. As a result, the
null check that follows it is never reached when the
HttpRecorder is null.

*** cs = curi.getHttpRecorder().getReplayCharSequence();

HttpRecorder hr = curi.getHttpRecorder();
if (hr == null) {
    throw new IOException("Why is recorder null here?");
}
cs = hr.getReplayCharSequence();
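
For illustration, the difference between the chained call and the guarded version in the snippet above can be reproduced in a small self-contained sketch. The Recorder and Uri classes here are hypothetical stand-ins for HttpRecorder and CrawlURI, not Heritrix code; only the null check itself mirrors the snippet.

```java
import java.io.IOException;

public class NullRecorderGuard {
    // Hypothetical stand-in for HttpRecorder.
    static class Recorder {
        CharSequence getReplayCharSequence() { return "<html></html>"; }
    }

    // Hypothetical stand-in for CrawlURI; the recorder may be null.
    static class Uri {
        private final Recorder recorder;
        Uri(Recorder recorder) { this.recorder = recorder; }
        Recorder getHttpRecorder() { return recorder; }
    }

    // Chained call: throws NullPointerException when the recorder is null.
    static CharSequence unguarded(Uri curi) {
        return curi.getHttpRecorder().getReplayCharSequence();
    }

    // Guarded call: reports the missing recorder as an IOException instead.
    static CharSequence guarded(Uri curi) throws IOException {
        Recorder hr = curi.getHttpRecorder();
        if (hr == null) {
            throw new IOException("Why is recorder null here?");
        }
        return hr.getReplayCharSequence();
    }

    public static void main(String[] args) {
        Uri noRecorder = new Uri(null);
        try {
            unguarded(noRecorder);
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }
        try {
            guarded(noRecorder);
        } catch (IOException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

If the chained call runs first, as in the line marked *** above, the NullPointerException is thrown before the guard can turn the condition into an IOException.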



Date: 2005-02-16 16:21
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Thank you for the patch, Christian.

Closing.

Below is the commit.

Fix for '[ 1123859 ] Change in ExtractorHTML triggers NullPointerExceptions'
Contributed by Christian Kohlschütter
* src/java/org/archive/crawler/extractor/ExtractorHTML.java
Change from Exception to IOException revealed a squashed NPE that happened
if no ReplayCharSequence available.
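
The effect described in the commit note, where narrowing a handler from Exception to IOException exposes a previously swallowed NPE, can be sketched as follows. This is a minimal illustration and not the ExtractorHTML code: the extract method and both handler methods are hypothetical names chosen for the example.

```java
import java.io.IOException;

public class CatchNarrowing {
    // Hypothetical stand-in for the extraction step; a null input triggers an NPE.
    static int extract(CharSequence cs) throws IOException {
        return cs.length();
    }

    // Broad handler: the NPE is swallowed together with any I/O failure.
    static String broadCatch(CharSequence cs) {
        try {
            return "length " + extract(cs);
        } catch (Exception e) {
            return "handled " + e.getClass().getSimpleName();
        }
    }

    // Narrow handler: only IOException is handled, so the NPE propagates.
    static String narrowCatch(CharSequence cs) {
        try {
            return "length " + extract(cs);
        } catch (IOException e) {
            return "handled " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(broadCatch(null));
        try {
            System.out.println(narrowCatch(null));
        } catch (NullPointerException e) {
            System.out.println("escaped to caller: NullPointerException");
        }
    }
}
```

With the broad handler the null case is silently absorbed. With the narrow handler the same input reaches the caller as an unchecked exception, which is consistent with the alert reported in this bug.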


Attached File ( 1 )

Filename                  Description
extractor-html-npe.patch  Bugfix

Changes ( 5 )

Field          Old Value                         Date              By
status_id      Closed                            2005-02-21 13:43  ck-heritrix
status_id      Open                              2005-02-16 16:21  stack-sf
resolution_id  None                              2005-02-16 16:21  stack-sf
close_date     -                                 2005-02-16 16:21  stack-sf
File Added     120255: extractor-html-npe.patch  2005-02-16 11:14  ck-heritrix