Can you fix this error? It occurs repeatedly, so I cannot leave the robot running unattended.
The robot jams with the following exception:
jobo.JoBoSwing
java.lang.NullPointerException
        at net.matuschek.html.HtmlDocument.addLink(HtmlDocument.java:298)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:153)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.extractLinks(HtmlDocument.java:196)
        at net.matuschek.html.HtmlDocument.getLinks(HtmlDocument.java:93)
        at net.matuschek.spider.WebRobot.retrieveURL(WebRobot.java:902)
        at net.matuschek.spider.WebRobot.walkTree(WebRobot.java:742)
        at net.matuschek.spider.WebRobot.work(WebRobot.java:720)
        at net.matuschek.spider.WebRobot.run(WebRobot.java:709)
        at java.lang.Thread.run(Unknown Source)
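I don't have the HtmlDocument source in front of me, but the trace suggests addLink() (HtmlDocument.java:298) dereferences something that can be null while extractLinks() walks the page recursively — a plausible candidate is an anchor element whose href attribute is missing (e.g. <a name="..."> targets). A minimal defensive sketch, under that assumption (the class and method below are hypothetical stand-ins, not JoBo's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the failing addLink() path: skip anchors that
// carry no href instead of dereferencing null.
public class LinkExtractor {
    private final List<String> links = new ArrayList<>();

    public void addLink(String href) {
        if (href == null || href.isEmpty()) {
            return; // e.g. <a name="section1"> has no href; ignore it
        }
        links.add(href.trim());
    }

    public List<String> getLinks() {
        return links;
    }
}
```

If the real cause is indeed an href-less anchor, a one-line null check at HtmlDocument.java:298 would let the robot continue instead of aborting the whole page.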
I was using the start URL
http://www.suomi.fi/suomi/julkishallinnon_organisaatiot_ja_virastot/aakkosellinen_organisaatiohakemisto/
and these settings:
<?xml version="1.0"?>
<!DOCTYPE JOBO SYSTEM "jobo.dtd">
<JoBo>
<Robot>
<!-- <AgentName>JoBo (http://www.matuschek.net/jobo.html)</AgentName> -->
<StartReferer>http://www.matuschek.net/jobo.html</StartReferer>
<IgnoreRobotsTxt>true</IgnoreRobotsTxt>
<SleepTime>0</SleepTime>
<MaxDepth>5</MaxDepth>
<WalkToOtherHosts>true</WalkToOtherHosts>
<Bandwidth>0</Bandwidth>
<!-- <MaxDocumentAge>30</MaxDocumentAge> -->
<AllowWholeHost>true</AllowWholeHost>
<AllowWholeDomain>true</AllowWholeDomain>
<AllowCaching>true</AllowCaching>
<FlexibleHostCheck>true</FlexibleHostCheck>
<!-- Proxy configuration
<Proxy>proxy.myprovider.com:80</Proxy> -->
<!-- robot is allowed to visit these URLs more than once -->
<!-- (useful for forms with different parameter sets) -->
<VisitMany>http://www.matuschek.net</VisitMany>
<!-- form handler -->
<FormHandler url="http://www.matuschek.net/cgi-bin/test-cgi">
<FormField name="i" value="1"/>
<FormField name="j" value="2"/>
<FormField name="k" value="3"/>
</FormHandler>
</Robot>
<DownloadRuleSet>
<DownloadRule allow="true" mimeType="*/*"/>
</DownloadRuleSet>
<URLCheck>
<RegExpRule allow="true" pattern="." />
<RegExpRule allow="false" pattern="\.ppt$" />
<RegExpRule allow="false" pattern="\.pdf$" />
<RegExpRule allow="false" pattern="\.doc$" />
<RegExpRule allow="false" pattern="\.psd$" />
<RegExpRule allow="false" pattern="\.ps$" />
<RegExpRule allow="false" pattern="\.bmp$" />
<RegExpRule allow="false" pattern="\.zip$" />
<RegExpRule allow="false" pattern="\.tar$" />
<RegExpRule allow="false" pattern="\.sty$" />
<RegExpRule allow="false" pattern="\.exe$" />
<RegExpRule allow="false" pattern="\.com$" />
<RegExpRule allow="false" pattern="\.gz$" />
<RegExpRule allow="false" pattern="\.avi$" />
<RegExpRule allow="false" pattern="\.xls$" />
<RegExpRule allow="false" pattern="\.dat$" />
<RegExpRule allow="false" pattern="\.png$" />
<RegExpRule allow="false" pattern="\.gif$" />
<RegExpRule allow="false" pattern="\.jpg$" />
<RegExpRule allow="false" pattern="\.jpeg$" />
</URLCheck>
<LocalizeLinks>false</LocalizeLinks>
<StoreCGI>true</StoreCGI>
</JoBo>
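One side note on the URLCheck section, separate from the crash: if JoBo evaluates these patterns with plain java.util.regex matching (an assumption on my part — I haven't checked its RegExpRule implementation), the `$`-anchored rules are case-sensitive and only match when the extension ends the URL, so uppercase extensions and URLs with query strings slip through. A small sketch of that behavior:

```java
import java.util.regex.Pattern;

// Demonstrates how a $-anchored, case-sensitive rule like the
// <RegExpRule allow="false" pattern="\.pdf$"/> entry above behaves,
// assuming plain java.util.regex semantics.
public class UrlCheckDemo {
    static final Pattern PDF_RULE = Pattern.compile("\\.pdf$");

    static boolean blocked(String url) {
        return PDF_RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(blocked("http://x/doc.pdf"));     // blocked
        System.out.println(blocked("http://x/DOC.PDF"));     // not blocked: case
        System.out.println(blocked("http://x/doc.pdf?v=1")); // not blocked: query string
    }
}
```

If that matters for your crawl, patterns like `(?i)\.pdf($|\?)` would cover both gaps.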