
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 all DNS attempts fail -6 - ID: 1149470
Last Update: Comment added ( karl-ia )

I have code running which is embedding Heritrix using
Heritrix 1.2.0. I was running this about a week ago
with no problems, urls were getting crawled as
expected. In the past day or so, I notice that no
crawls are occurring and every url in the seed is
queued for a period of time, then the crawler ends. I
don't see any error conditions in the logs. If I
create a job using the same order.xml through the gui
and start heritrix outside of my applications, the urls
get crawled as expected.

I am embedding Heritrix in another application by the
following source code:
try {
    Heritrix.initialize();
    System.out.println(Heritrix.getHeritrixHome());
    System.out.println("Status: " + Heritrix.launch(
        "C:\\heritrix-1.2.0\\src\\conf\\profiles\\Simple\\order.xml", true));
    // Poll every 30 seconds; the loop also keeps the embedding thread alive.
    while (true) {
        Thread.sleep(30000);
        System.out.println("Is Crawling: " + Heritrix.jobHandler.isCrawling()
            + " Is Running: " + Heritrix.jobHandler.isRunning()
            + " Frontier Report: " + Heritrix.jobHandler.getFrontierReport());
    }
} catch (Exception e) {
    e.printStackTrace();
}

With a seed of 14 urls and the following order.xml
(note I took out my specific package name and email
address on purpose):

<?xml version="1.0" encoding="UTF-8"?>
<crawl-order
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
<meta>
<name>Simple</name>
<description>Profile: Simple crawl</description>
<operator>Admin</operator>
<organization />
<audience />
<date>20040409202922</date>
</meta>
<controller>
<string name="settings-directory">settings</string>
<string name="disk-path" />
<string name="scratch-path">scratch</string>
<string name="state-path">state</string>
<string name="logs-path">logs</string>
<string name="checkpoints-path">checkpoints</string>
<integer name="max-toe-threads">50</integer>
<long name="max-bytes-download">0</long>
<long name="max-document-download">0</long>
<long name="max-time-sec">0</long>
<newObject name="scope"
class="org.archive.crawler.scope.DomainScope">
<boolean name="enabled">true</boolean>
<string name="seedsfile">seeds.txt</string>
<integer name="max-link-hops">25</integer>
<integer name="max-trans-hops">5</integer>
<newObject name="exclude-filter"
class="org.archive.crawler.filter.OrFilter">
<boolean name="enabled">true</boolean>
<boolean name="if-matches-return">true</boolean>
<map name="filters">
<newObject name="pathdepth"

class="org.archive.crawler.filter.PathDepthFilter">
<boolean name="enabled">true</boolean>
<integer name="max-path-depth">20</integer>
<boolean
name="path-less-or-equal-return">false</boolean>
</newObject>
<newObject name="pathologicalpath"

class="org.archive.crawler.filter.PathologicalPathFilter">
<boolean name="enabled">true</boolean>
<integer name="repetitions">3</integer>
</newObject>
</map>
</newObject>
<newObject name="additionalScopeFocus"
class="org.archive.crawler.filter.FilePatternFilter">
<boolean name="enabled">true</boolean>
<boolean name="if-match-return">true</boolean>
<string name="use-default-patterns">All</string>
<string name="regexp"/>
</newObject>
<newObject name="transitiveFilter"
class="org.archive.crawler.filter.TransclusionFilter">
<boolean name="enabled">true</boolean>
<integer name="max-speculative-hops">1</integer>
<integer name="max-referral-hops">-1</integer>
<integer name="max-embed-hops">-1</integer>
</newObject>
</newObject>
<map name="http-headers">
<string name="user-agent">Mozilla/5.0
(compatible; heritrix/1.2.0
+http://www.myhost.com)</string>
<string name="from">jsleeman@myemail.com</string>
</map>
<newObject name="robots-honoring-policy"
class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
<string name="type">classic</string>
<boolean name="masquerade">false</boolean>
<text name="custom-robots"/>
<stringList name="user-agents">
</stringList>
</newObject>
<newObject name="frontier"
class="org.archive.crawler.frontier.HostQueuesFrontier">
<float name="delay-factor">5.0</float>
<integer name="max-delay-ms">1000</integer>
<integer name="min-delay-ms">50</integer>
<integer name="max-retries">30</integer>
<long name="retry-delay-seconds">900</long>
<integer
name="total-bandwidth-usage-KB-sec">0</integer>
<integer
name="max-per-host-bandwidth-usage-KB-sec">0</integer>
</newObject>
<map name="uri-canonicalization-rules">
<newObject name="Lowercase"

class="org.archive.crawler.url.canonicalize.LowercaseRule">
</newObject>
<newObject name="Userinfo"

class="org.archive.crawler.url.canonicalize.StripUserinfoRule">
</newObject>
<newObject name="WWW"

class="org.archive.crawler.url.canonicalize.StripWWWRule">
</newObject>
<newObject name="SessionIDs"

class="org.archive.crawler.url.canonicalize.StripSessionIDs">
</newObject>
<newObject name="QueryStrPrefix"

class="org.archive.crawler.url.canonicalize.FixupQueryStr">
</newObject>
</map>
<map name="pre-fetch-processors">
<newObject name="Preselector"
class="org.archive.crawler.prefetch.Preselector">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<boolean name="recheck-scope">true</boolean>
<boolean name="block-all">false</boolean>
<string name="block-by-regexp"/>
</newObject>
<newObject name="Preprocessor"
class="org.archive.crawler.prefetch.PreconditionEnforcer">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<integer
name="ip-validity-duration-seconds">21600</integer>
<integer
name="robot-validity-duration-seconds">86400</integer>
</newObject>
</map>
<map name="fetch-processors">
<newObject name="DNS"
class="org.archive.crawler.fetcher.FetchDNS">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="HTTP"
class="org.archive.crawler.fetcher.FetchHTTP">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<integer name="timeout-seconds">1200</integer>
<integer name="sotimeout-ms">20000</integer>
<long name="max-length-bytes" >0</long>
<string name="load-cookies-from-file"/>
<string name="save-cookies-to-file"/>
<string name="trust-level">open</string>
</newObject>
</map>
<map name="extract-processors">
<newObject name="ExtractorHTTP"
class="org.archive.crawler.extractor.ExtractorHTTP">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorHTML"
class="org.archive.crawler.extractor.ExtractorHTML">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorCSS"

class="org.archive.crawler.extractor.ExtractorCSS">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorJS"
class="org.archive.crawler.extractor.ExtractorJS">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorSWF"

class="org.archive.crawler.extractor.ExtractorSWF">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
</map>
<map name="write-processors">
<newObject name="Archiver" class="...myCrawler">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
</map>
<map name="post-processors">
<newObject name="Updater"
class="org.archive.crawler.postprocessor.CrawlStateUpdater">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="Postselector"

class="org.archive.crawler.postprocessor.Postselector">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<boolean
name="seed-redirects-new-seed">true</boolean>
<boolean name="override-logger">false</boolean>
<map name="scope-rejected-uri-log-filters" />
</newObject>
</map>
<map name="loggers">
<newObject name="crawl-statistics"
class="org.archive.crawler.admin.StatisticsTracker">
<integer name="interval-seconds">20</integer>
</newObject>
</map>
<string name="recover-path"/>
<newObject name="credential-store"
class="org.archive.crawler.datamodel.CredentialStore">
<map name="credentials">
</map>
</newObject>
</controller>
</crawl-order>

Stats:
-----===== STATS =====-----
Discovered: 28
Queued: 56
Finished: 0
Successfully: 0
Failed: 0
Disregarded: 0

-----===== QUEUES =====-----
Already included size: 28

Ready class queues size: 0

Snooze queues size: 14
KeyedQueue www.nasa.gov
Length: 2
Status: SNOOZED
Wakes in: 12m15s94ms
Last enqueued: dns:www.nasa.gov
Last dequeued: dns:www.nasa.gov
KeyedQueue lisar.larc.nasa.gov
Length: 2
Status: SNOOZED
Wakes in: 12m15s110ms
Last enqueued: dns:lisar.larc.nasa.gov
Last dequeued: dns:lisar.larc.nasa.gov
KeyedQueue www.edwards.af.mil
Length: 2
Status: SNOOZED
Wakes in: 12m15s110ms
Last enqueued: dns:www.edwards.af.mil
Last dequeued: dns:www.edwards.af.mil
KeyedQueue www.wpafb.af.mil
Length: 2
Status: SNOOZED
Wakes in: 12m15s110ms
Last enqueued: dns:www.wpafb.af.mil
Last dequeued: dns:www.wpafb.af.mil
KeyedQueue www.sciencephoto.com
Length: 2
Status: SNOOZED
Wakes in: 12m15s125ms
Last enqueued: dns:www.sciencephoto.com
Last dequeued: dns:www.sciencephoto.com
KeyedQueue www.nasm.si.edu
Length: 2
Status: SNOOZED
Wakes in: 12m15s141ms
Last enqueued: dns:www.nasm.si.edu
Last dequeued: dns:www.nasm.si.edu
KeyedQueue www.ars.usda.gov
Length: 2
Status: SNOOZED
Wakes in: 12m15s235ms
Last enqueued: dns:www.ars.usda.gov
Last dequeued: dns:www.ars.usda.gov
KeyedQueue www.nv.doe.gov

Output after ending:
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
beginCrawlStop Starting beginCrawlStop()...
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
beginCrawlStop Finished beginCrawlStop().
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Entered complete stop.
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Sent crawlEnded to
org.archive.crawler.admin.CrawlJobHandler@4ecfdd
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker crawlEnded
Entered crawlEnded
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\hosts-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\mimetype-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\responsecode-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\seeds-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\crawl-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\processors-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\crawl-manifest.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker crawlEnded
Leaving crawlEnded
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Sent crawlEnded to
org.archive.crawler.admin.StatisticsTracker@1359c1b
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Finished crawl.
Is Crawling: false Is Running: true Frontier Report:
Crawler not running

Just wondering if I am not embedding Heritrix correctly
or if this is a real problem. Any help would be
appreciated.

Thanks and I apologize if this is not the correct post
location.


Jennifer ( jsleeman ) - 2005-02-23 00:01

Priority: 7
Status: Closed
Resolution: Wont Fix
Assigned: Nobody/Anonymous
Category: 3rd-party libs
Group: None
Visibility: Public


Comments ( 11 )

Date: 2007-03-14 00:21
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-367 -- please add further
comments at that location.


Date: 2005-04-01 23:38
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Jennifer reports that problem was misconfigured dns server:

"Jennifer Sleeman wrote:

>Yeah, it looks like after the admin made dns changes (the box was
>pointing at the wrong dns server) and changes to nis (this was causing
>the rpc errors) and rebooted, no dns problems. I only believe this is
>true because I am seeing packets to and from that box on the requests
>made.
>..."

Closing again as 'wont fix/cant fix'.


Date: 2005-03-31 18:38
Sender: jsleeman

Logged In: YES
user_id=432885

Looks like I am having this problem again. There were some
dns changes here that are causing the symptoms to occur. I
could now send you whatever information you need to assist
in debugging.


Date: 2005-03-02 18:31
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Below are more notes from Jennifer.

Added release note on this issue.

Closing as won't/can't fix.

Tried to go back and find old logs or something but no luck.
If an example of the dns format is anywhere it would be in
the posts I made. I believe I included some system out
text. I cannot recreate it now so I cannot reproduce it for
you.

----- Original Message -----
From: stack
Sent: 3/1/2005 1:24:49 PM
To: jsleeman@redpebble.com
Cc: stack-sf@users.sourceforge.net
Subject: Re: Regards Embedding Heritrix and Snoozing


>> J.Sleeman wrote:
>>
>>> I did some poking around because it seemed odd that this stopped
>>> working after working for weeks and found out that there was a DNS
>>> problem occurring at the time when I was having these issues. They
>>> fired the guy running our network here and someone came in and
>>> cleaned things up. There are no problems with the regular expression
>>> matcher now. If someone else is having a problem you may want to
>>> suggest this as a cause. I could provide you with more details as to
>>> the cause if needed.
>>
>> If you had an instance of what the bad dns was looking like when it
>> failed the regex match, that'd for sure help.
>> Thanks Jennifer.
>> St.Ack


Date: 2005-03-02 06:58
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

This could be the same issue others have reported with
non-English Windows XP and/or XP SP2:

http://groups.yahoo.com/group/archive-crawler/message/1621

A workaround is to use the new FetchDNS expert setting
'accept-non-dns-resolves' to allow native/local name lookups
separate from DNSJava.
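
To see whether the platform's own resolver can find the seed hosts at all, independent of DNSJava and of Heritrix, a standalone check along the following lines may help. This is only a sketch: the class name `NativeLookup` and the default host list are made up for illustration, and it does not show how FetchDNS implements 'accept-non-dns-resolves'. It only exercises the standard `java.net.InetAddress` lookup that the JVM performs natively.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class NativeLookup {

    /**
     * Resolves a host through the JVM's built-in resolver.
     * Returns the textual IP address, or null if the lookup fails.
     */
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hosts taken from the seed list in this report; pass others as arguments.
        String[] hosts = args.length > 0
            ? args
            : new String[] {"www.nasa.gov", "www.loc.gov", "www.nasm.si.edu"};
        for (String host : hosts) {
            String ip = resolve(host);
            System.out.println(host + " -> " + (ip == null ? "UNRESOLVED" : ip));
        }
    }
}
```

If every host prints UNRESOLVED here as well, the problem is likely in the machine's DNS configuration rather than in the crawler.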



Date: 2005-02-25 04:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

This is interesting Jennifer. Can you supply an example of
an incorrectly formatted DNSName so I can see what we
were failing on? Thank you.


Date: 2005-02-23 16:36
Sender: jsleeman

Logged In: YES
user_id=432885

OK found the problem:
The DNSName is not matching the regular expression, which
causes the ip address to be set to null. Below I added
code to get the InetAddress based on host name and the
crawler works properly. It is crawling without error. Hope
this helps. I don't know a lot about how your code works, but
it appears that the format of the dnsName is not as expected.

System.out.println("dnname= " + dnsName);
Matcher matcher = DNSJavaUtil.IPV4_QUADS.matcher(dnsName);
// if it's an ip no need to do a lookup
System.out.println("matcher=== " + matcher);
System.out.println("matcher=== " + matcher.matches());

if (matcher != null /*&& matcher.matches()*/) {
    // Ideally this branch would never be reached: no CrawlURI
    // would be created for numerical IPs
    logger.warning("unnecessary DNS CrawlURI created: " + curi);
    try {
        targetServer.getHost().setIP(InetAddress.getByName(dnsName),
            CrawlHost.IP_NEVER_EXPIRES);

        /*targetServer.getHost().setIP(
            InetAddress.getByAddress(dnsName,
                new byte[] {
                    (byte)(new Integer(matcher.group(1)).intValue()),
                    (byte)(new Integer(matcher.group(2)).intValue()),
                    (byte)(new Integer(matcher.group(3)).intValue()),
                    (byte)(new Integer(matcher.group(4)).intValue())}),
            CrawlHost.IP_NEVER_EXPIRES); // Never expire numeric IPs*/

    } catch (UnknownHostException e) {
        // This should never happen as a dns lookup is not made
        e.printStackTrace();
    }
    curi.setFetchStatus(S_DNS_SUCCESS);

    // No further lookup necessary
    return;
}
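
A note on the `matcher != null` test in the code above: `Pattern.matcher` always returns a Matcher object, so with `matches()` commented out the branch is taken for every name, and every host ends up resolved through `InetAddress.getByName`. The sketch below illustrates the difference. It assumes a stand-in dotted-quad pattern, since the actual `DNSJavaUtil.IPV4_QUADS` expression is not shown in this report and may differ; the class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuadCheck {

    // Assumed stand-in for DNSJavaUtil.IPV4_QUADS; the real pattern may differ.
    static final Pattern IPV4_QUADS = Pattern.compile(
        "([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})");

    /** True only when the whole name is a dotted-quad literal. */
    static boolean isDottedQuad(String name) {
        return IPV4_QUADS.matcher(name).matches();
    }

    /** Builds an address from a dotted quad without performing any lookup. */
    static InetAddress fromQuad(String name) throws UnknownHostException {
        Matcher m = IPV4_QUADS.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a dotted quad: " + name);
        }
        byte[] octets = new byte[4];
        for (int i = 0; i < 4; i++) {
            // Like the commented-out original, this does not range-check octets.
            octets[i] = (byte) Integer.parseInt(m.group(i + 1));
        }
        return InetAddress.getByAddress(name, octets);
    }

    public static void main(String[] args) throws Exception {
        for (String name : new String[] {"www.nasa.gov", "192.0.2.10"}) {
            Matcher m = IPV4_QUADS.matcher(name);
            System.out.println(name + ": matcher != null is " + (m != null)
                + ", matches() is " + m.matches());
        }
    }
}
```

A host name such as www.nasa.gov failing this pattern is expected and is not by itself a sign of a malformed name; it only means the name is not a numeric IP and needs a real lookup.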


Date: 2005-02-23 15:37
Sender: jsleeman

Logged In: YES
user_id=432885

More Info:

I did finally find the definition of -6, and it appears the ip
address is null in the ToeThread class:
***************toethread
curi.getServer()CrawlServer(www.edwards.af.mil)
getHost() CrawlHost<www.edwards.af.mil(ip:null)>
getIP() null




Date: 2005-02-23 15:19
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942


The -6 says prerequisites are not being fetched -- the dns
or robots fetch is failing (status codes are listed in an
appendix of the user manual). Sounds suspiciously like the issues I
pasted in above, in particular the one where the heritrix
jar is not being placed first in the CLASSPATH.


Date: 2005-02-23 15:04
Sender: jsleeman

Logged In: YES
user_id=432885

Thanks for the response; however, neither of these two issues
is related. I am keeping the thread going with the while
loop and my classpath correctly includes all libraries
from the heritrix lib directory. Here is what is reported
when it completes:

Crawl Name: Simple
Crawl Status: Finished
Duration Time: 15m15s344ms
Total Seeds Crawled: 0
Total Seeds not Crawled: 14
Total Hosts Crawled: -1
Total Documents Crawled: 28
Processed docs/sec: 0.0
Bandwidth in Kbytes/sec: 0
Total Raw Data Size in Bytes: 0 (0 B)

http://landsat.gsfc.nasa.gov/main/images.html
-6 NOTCRAWLED
http://lisar.larc.nasa.gov/
-6 NOTCRAWLED
http://photolibrary.usap.gov/
-6 NOTCRAWLED
http://www.africaguide.com/library.htm
-6 NOTCRAWLED
http://www.aoa.gov/press/multimed/photos/multimed_photos.asp
-6 NOTCRAWLED
http://www.ars.usda.gov/is/graphics/photos/
-6 NOTCRAWLED
http://www.edwards.af.mil/gallery/index.html
-6 NOTCRAWLED
http://www.loc.gov/rr/print/catalog.html
-6 NOTCRAWLED
http://www.nasa.gov/multimedia/imagegallery/index.html
-6 NOTCRAWLED
http://www.nasm.si.edu/collections/imagery.cfm
-6 NOTCRAWLED
http://www.nv.doe.gov/news&pubs/photos&films/photolib.htm
-6 NOTCRAWLED
http://www.photolib.noaa.gov/collections.html
-6 NOTCRAWLED
http://www.sciencephoto.com/
-6 NOTCRAWLED
http://www.wpafb.af.mil/museum/
-6 NOTCRAWLED


I tried to figure out what this resp code is but I have had
no luck. Any feedback would be great. Thanks.


Date: 2005-02-23 00:42
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Make sure the problem is not this:
http://crawler.archive.org/faq.html#windowsstart nor this:
http://groups.yahoo.com/group/archive-crawler/message/1276
St.Ack



Attached File

No Files Currently Attached

Changes ( 9 )

Field          Old Value                                 Date              By
close_date     2005-03-02 18:31                          2005-04-01 23:38  stack-sf
status_id      Open                                      2005-04-01 23:38  stack-sf
status_id      Closed                                    2005-03-31 18:38  jsleeman
close_date     -                                         2005-03-02 18:31  stack-sf
status_id      Open                                      2005-03-02 18:31  stack-sf
resolution_id  None                                      2005-03-02 18:31  stack-sf
category_id    None                                      2005-03-02 18:31  stack-sf
summary        Embedding Heritrix and Snoozing Frontier  2005-03-02 06:58  gojomo
priority       5                                         2005-02-25 04:16  stack-sf