Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 ExtractorJS NPE doing speculative extraction - ID: 1392104
Last Update: Comment added ( karl-ia )

Reported by Bjarne up on the list:


Dec 25, 2005 9:18:01 PM org.archive.crawler.extractor.Extractor innerProcess
WARNING: ExtractorJS: NullPointerException
java.lang.NullPointerException
    at org.archive.net.UURIFactory.create(UURIFactory.java:336)
    at org.archive.net.UURIFactory.getInstance(UURIFactory.java:285)
    at org.archive.crawler.datamodel.CrawlURI.createAndAddLinkRelativeToVia(CrawlURI.java:1183)
    at org.archive.crawler.extractor.ExtractorJS.considerStrings(ExtractorJS.java:152)
    at org.archive.crawler.extractor.ExtractorJS.extract(ExtractorJS.java:118)
    at org.archive.crawler.extractor.Extractor.innerProcess(Extractor.java:67)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:306)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:153)

Any ideas ?

Here is what I wrote back:

We're speculating there's a link at this point in the
JavaScript. Looks like we're passing a null 'base'
into UURIFactory (see
http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#336).
We should add a null check in UURIFactory, and probably in
ExtractorJS too, since it's in speculative mode (I opened an
issue). I suppose you have no idea how to reproduce,
since we're not logging the page we found the NPE on?
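
The kind of null check suggested above might look roughly like the following. This is only a sketch: it uses the JDK's java.net.URI as a stand-in for Heritrix's UURI, and the class and method names (SpeculativeLinkResolver, resolve) are hypothetical, not part of the Heritrix code base. The idea is that a speculative extractor treats a missing base or a malformed candidate as "no link" rather than letting an exception escape.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical helper; java.net.URI stands in for Heritrix's UURI. */
public class SpeculativeLinkResolver {

    /**
     * Resolves a speculatively found string against a base URI.
     * Returns an empty Optional when the base is missing or the
     * candidate is not a syntactically valid URI reference.
     */
    public static Optional<URI> resolve(URI base, String candidate) {
        if (base == null || candidate == null) {
            // No base to resolve against: skip the candidate quietly.
            return Optional.empty();
        }
        try {
            return Optional.of(base.resolve(candidate.trim()));
        } catch (IllegalArgumentException e) {
            // URI.resolve(String) rejects malformed references this way.
            return Optional.empty();
        }
    }
}
```

With this shape, a caller in speculative mode can simply ignore empty results instead of logging a NullPointerException for every guess that cannot be resolved.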


Submitted by: Michael Stack ( stack-sf ) - 2005-12-28 17:06
Priority: 6
Status: Closed
Resolution: Fixed
Assigned to: Karl Thiessen
Category: Extraction
Group: 1.8.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:04
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-531 -- please add further
comments at that location.


Date: 2006-05-05 00:17
Sender: karl-ia

A test with this bug number is now in the harness; bug verified
and verified fixed.

Closing.


Date: 2006-01-02 23:55
Sender: gojomo (Project Admin)

ExtractorJS (and the convenience method of CrawlURI,
createAndAddLinkRelativeToVia()) were assuming a 'via' would
always be available. It's not, for seeds or sometimes other
URIs added to a crawl. (Bjarne has confirmed that in their
case, they're adding extra JS URIs as seeds that were found
during QA.)

Fixed by making createAndAddLinkRelativeToVia() fall back to
using the configured base (or the URI itself) when no 'via' is
available.

Fix for [ 1392104 ] ExtractorJS NPE doing speculative extraction
* CrawlURI.java
in createAndAddLinkRelativeToVia(), fall back to
alternate base URI if no 'via' available
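
The fallback described in this fix can be illustrated as below. This is not the actual CrawlURI code: it is a minimal sketch using java.net.URI in place of UURI, with hypothetical names (ViaFallback, chooseBase, resolveRelative), and it assumes the fallback order is 'via' first, then the configured base, then the URI itself.

```java
import java.net.URI;

/** Hypothetical illustration of the 'via' fallback; not Heritrix code. */
public class ViaFallback {

    /**
     * Picks the URI to resolve relative links against: the 'via' when
     * present, otherwise the configured base, otherwise the URI itself.
     */
    public static URI chooseBase(URI via, URI configuredBase, URI self) {
        if (via != null) {
            return via;
        }
        if (configuredBase != null) {
            return configuredBase;
        }
        return self; // seeds have no 'via', so resolve against themselves
    }

    /** Resolves a relative link using the base chosen above. */
    public static URI resolveRelative(URI via, URI configuredBase,
                                      URI self, String link) {
        return chooseBase(via, configuredBase, self).resolve(link);
    }
}
```

For a seed, both 'via' and the configured base may be null, so the link ends up resolved against the seed's own URI instead of triggering the NPE.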

Assigning to Karl for creation of a regression test. Simply
trying to crawl as a seed any JS file with detectable URIs
inside should trigger the NPE pre-fix. Two example live web
URIs that trigger the problem are:

http://www.dr.dk/drdkGlobal/scripts/NetTV.js

http://www.dr.dk/drdkGlobal/scripts/generelpopup.js
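
A regression test along these lines could be approximated outside the crawler as follows. Everything here is an assumption made for illustration: the class name, the regular expression (a rough stand-in for whatever ExtractorJS actually matches), and the use of java.net.URI instead of UURI. It treats a JS file as a seed with no 'via' and checks that speculative strings resolve against the seed itself without throwing.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a seed-JS regression check. */
public class SeedJsRegressionSketch {

    // Assumed heuristic: quoted strings with no whitespace that
    // contain at least one dot or slash are treated as possible links.
    private static final Pattern QUOTED =
        Pattern.compile("[\"']([^\"'\\s]*[./][^\"'\\s]*)[\"']");

    /** Extracts candidate links from JS fetched as a seed (no 'via'). */
    public static List<URI> extractAsSeed(URI seed, String js) {
        List<URI> links = new ArrayList<>();
        Matcher m = QUOTED.matcher(js);
        while (m.find()) {
            try {
                // With no 'via', resolve against the seed URI itself.
                links.add(seed.resolve(m.group(1)));
            } catch (IllegalArgumentException e) {
                // Speculative guess was not a valid URI reference; skip it.
            }
        }
        return links;
    }
}
```

A real regression test would run the crawler against a local copy of such a file; this sketch only captures the condition being checked.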


Attached File

No Files Currently Attached

Changes ( 7 )

Field              Old Value  Date              By
status_id          Open       2006-05-05 00:17  karl-ia
close_date         -          2006-05-05 00:17  karl-ia
resolution_id      None       2006-01-02 23:55  gojomo
assigned_to        gojomo     2006-01-02 23:55  gojomo
priority           5          2006-01-02 23:46  gojomo
assigned_to        nobody     2006-01-02 23:46  gojomo
artifact_group_id  None       2006-01-02 23:46  gojomo