I am embedding Heritrix 1.2.0 in my own application. About a week
ago this ran with no problems and URLs were crawled as expected.
In the past day or so, nothing is crawled: every URL in the seed
list sits queued for a period of time, then the crawler ends. I
don't see any error conditions in the logs. If I create a job
from the same order.xml through the GUI and start Heritrix outside
of my application, the URLs are crawled as expected.
I embed Heritrix in the application with the following source
code:
try {
    Heritrix.initialize();
    System.out.println(Heritrix.getHeritrixHome());
    // Launch the crawl described by the Simple profile's order file.
    System.out.println("Status: " +
        Heritrix.launch("C:\\heritrix-1.2.0\\src\\conf\\profiles\\Simple\\order.xml", true));
    // Print the crawl state and frontier report every 30 seconds.
    while (true) {
        Thread.sleep(30000);
        System.out.println("Is Crawling: " + Heritrix.jobHandler.isCrawling()
            + " Is Running: " + Heritrix.jobHandler.isRunning()
            + " Frontier Report: " + Heritrix.jobHandler.getFrontierReport());
    }
} catch (Exception e) {
    e.printStackTrace();
}
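The status loop in my code never exits, so it keeps printing after the crawl has ended. Below is a minimal sketch of a bounded polling loop instead. `CrawlPoller`, `waitUntilStopped`, and the `BooleanSupplier` argument are names made up for illustration and are not part of Heritrix; in my application the supplier would wrap `Heritrix.jobHandler.isCrawling()`. It also assumes a Java version with lambdas, which is newer than what Heritrix 1.2.0 targets.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a "still crawling?" check until it reports false. */
class CrawlPoller {
    /**
     * Checks isCrawling up to maxPolls times, sleeping intervalMs between checks.
     * Returns the 1-based number of the check that first saw the crawl stopped,
     * or -1 if it was still crawling after maxPolls checks or the thread was interrupted.
     */
    static int waitUntilStopped(BooleanSupplier isCrawling, long intervalMs, int maxPolls) {
        for (int poll = 1; poll <= maxPolls; poll++) {
            if (!isCrawling.getAsBoolean()) {
                return poll;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for Heritrix.jobHandler.isCrawling(): true twice, then false.
        int[] calls = {0};
        int polls = waitUntilStopped(() -> ++calls[0] <= 2, 10, 100);
        System.out.println("Stopped after check number " + polls);
    }
}
```

With a cap on the number of checks, the embedding application can give up and report a stuck crawl rather than looping forever.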
The seed list has 14 URLs, and the order.xml is as follows
(note that I took out my specific package name and email
address on purpose):
<?xml version="1.0" encoding="UTF-8"?>
<crawl-order
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
<meta>
<name>Simple</name>
<description>Profile: Simple crawl</description>
<operator>Admin</operator>
<organization />
<audience />
<date>20040409202922</date>
</meta>
<controller>
<string name="settings-directory">settings</string>
<string name="disk-path" />
<string name="scratch-path">scratch</string>
<string name="state-path">state</string>
<string name="logs-path">logs</string>
<string name="checkpoints-path">checkpoints</string>
<integer name="max-toe-threads">50</integer>
<long name="max-bytes-download">0</long>
<long name="max-document-download">0</long>
<long name="max-time-sec">0</long>
<newObject name="scope"
class="org.archive.crawler.scope.DomainScope">
<boolean name="enabled">true</boolean>
<string name="seedsfile">seeds.txt</string>
<integer name="max-link-hops">25</integer>
<integer name="max-trans-hops">5</integer>
<newObject name="exclude-filter"
class="org.archive.crawler.filter.OrFilter">
<boolean name="enabled">true</boolean>
<boolean name="if-matches-return">true</boolean>
<map name="filters">
<newObject name="pathdepth"
class="org.archive.crawler.filter.PathDepthFilter">
<boolean name="enabled">true</boolean>
<integer name="max-path-depth">20</integer>
<boolean
name="path-less-or-equal-return">false</boolean>
</newObject>
<newObject name="pathologicalpath"
class="org.archive.crawler.filter.PathologicalPathFilter">
<boolean name="enabled">true</boolean>
<integer name="repetitions">3</integer>
</newObject>
</map>
</newObject>
<newObject name="additionalScopeFocus"
class="org.archive.crawler.filter.FilePatternFilter">
<boolean name="enabled">true</boolean>
<boolean name="if-match-return">true</boolean>
<string name="use-default-patterns">All</string>
<string name="regexp"/>
</newObject>
<newObject name="transitiveFilter"
class="org.archive.crawler.filter.TransclusionFilter">
<boolean name="enabled">true</boolean>
<integer name="max-speculative-hops">1</integer>
<integer name="max-referral-hops">-1</integer>
<integer name="max-embed-hops">-1</integer>
</newObject>
</newObject>
<map name="http-headers">
<string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.2.0 +http://www.myhost.com)</string>
<string name="from">jsleeman@myemail.com</string>
</map>
<newObject name="robots-honoring-policy"
class="org.archive.crawler.datamodel.RobotsHonoringPolicy">
<string name="type">classic</string>
<boolean name="masquerade">false</boolean>
<text name="custom-robots"/>
<stringList name="user-agents">
</stringList>
</newObject>
<newObject name="frontier"
class="org.archive.crawler.frontier.HostQueuesFrontier">
<float name="delay-factor">5.0</float>
<integer name="max-delay-ms">1000</integer>
<integer name="min-delay-ms">50</integer>
<integer name="max-retries">30</integer>
<long name="retry-delay-seconds">900</long>
<integer
name="total-bandwidth-usage-KB-sec">0</integer>
<integer
name="max-per-host-bandwidth-usage-KB-sec">0</integer>
</newObject>
<map name="uri-canonicalization-rules">
<newObject name="Lowercase"
class="org.archive.crawler.url.canonicalize.LowercaseRule">
</newObject>
<newObject name="Userinfo"
class="org.archive.crawler.url.canonicalize.StripUserinfoRule">
</newObject>
<newObject name="WWW"
class="org.archive.crawler.url.canonicalize.StripWWWRule">
</newObject>
<newObject name="SessionIDs"
class="org.archive.crawler.url.canonicalize.StripSessionIDs">
</newObject>
<newObject name="QueryStrPrefix"
class="org.archive.crawler.url.canonicalize.FixupQueryStr">
</newObject>
</map>
<map name="pre-fetch-processors">
<newObject name="Preselector"
class="org.archive.crawler.prefetch.Preselector">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<boolean name="recheck-scope">true</boolean>
<boolean name="block-all">false</boolean>
<string name="block-by-regexp"/>
</newObject>
<newObject name="Preprocessor"
class="org.archive.crawler.prefetch.PreconditionEnforcer">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<integer
name="ip-validity-duration-seconds">21600</integer>
<integer
name="robot-validity-duration-seconds">86400</integer>
</newObject>
</map>
<map name="fetch-processors">
<newObject name="DNS"
class="org.archive.crawler.fetcher.FetchDNS">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="HTTP"
class="org.archive.crawler.fetcher.FetchHTTP">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<integer name="timeout-seconds">1200</integer>
<integer name="sotimeout-ms">20000</integer>
<long name="max-length-bytes" >0</long>
<string name="load-cookies-from-file"/>
<string name="save-cookies-to-file"/>
<string name="trust-level">open</string>
</newObject>
</map>
<map name="extract-processors">
<newObject name="ExtractorHTTP"
class="org.archive.crawler.extractor.ExtractorHTTP">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorHTML"
class="org.archive.crawler.extractor.ExtractorHTML">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorCSS"
class="org.archive.crawler.extractor.ExtractorCSS">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorJS"
class="org.archive.crawler.extractor.ExtractorJS">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="ExtractorSWF"
class="org.archive.crawler.extractor.ExtractorSWF">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
</map>
<map name="write-processors">
<newObject name="Archiver" class="...myCrawler">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
</map>
<map name="post-processors">
<newObject name="Updater"
class="org.archive.crawler.postprocessor.CrawlStateUpdater">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
</newObject>
<newObject name="Postselector"
class="org.archive.crawler.postprocessor.Postselector">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<boolean
name="seed-redirects-new-seed">true</boolean>
<boolean name="override-logger">false</boolean>
<map name="scope-rejected-uri-log-filters" />
</newObject>
</map>
<map name="loggers">
<newObject name="crawl-statistics"
class="org.archive.crawler.admin.StatisticsTracker">
<integer name="interval-seconds">20</integer>
</newObject>
</map>
<string name="recover-path"/>
<newObject name="credential-store"
class="org.archive.crawler.datamodel.CredentialStore">
<map name="credentials">
</map>
</newObject>
</controller>
</crawl-order>
Stats:
-----===== STATS =====-----
Discovered: 28
Queued: 56
Finished: 0
Successfully: 0
Failed: 0
Disregarded: 0
-----===== QUEUES =====-----
Already included size: 28
Ready class queues size: 0
Snooze queues size: 14
KeyedQueue www.nasa.gov
Length: 2
Status: SNOOZED
Wakes in: 12m15s94ms
Last enqueued: dns:www.nasa.gov
Last dequeued: dns:www.nasa.gov
KeyedQueue lisar.larc.nasa.gov
Length: 2
Status: SNOOZED
Wakes in: 12m15s110ms
Last enqueued: dns:lisar.larc.nasa.gov
Last dequeued: dns:lisar.larc.nasa.gov
KeyedQueue www.edwards.af.mil
Length: 2
Status: SNOOZED
Wakes in: 12m15s110ms
Last enqueued: dns:www.edwards.af.mil
Last dequeued: dns:www.edwards.af.mil
KeyedQueue www.wpafb.af.mil
Length: 2
Status: SNOOZED
Wakes in: 12m15s110ms
Last enqueued: dns:www.wpafb.af.mil
Last dequeued: dns:www.wpafb.af.mil
KeyedQueue www.sciencephoto.com
Length: 2
Status: SNOOZED
Wakes in: 12m15s125ms
Last enqueued: dns:www.sciencephoto.com
Last dequeued: dns:www.sciencephoto.com
KeyedQueue www.nasm.si.edu
Length: 2
Status: SNOOZED
Wakes in: 12m15s141ms
Last enqueued: dns:www.nasm.si.edu
Last dequeued: dns:www.nasm.si.edu
KeyedQueue www.ars.usda.gov
Length: 2
Status: SNOOZED
Wakes in: 12m15s235ms
Last enqueued: dns:www.ars.usda.gov
Last dequeued: dns:www.ars.usda.gov
KeyedQueue www.nv.doe.gov
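To put the "Wakes in" values in context: the order.xml sets retry-delay-seconds to 900. Assuming (I have not confirmed this in the Heritrix source) that a queue is snoozed for exactly that long after a failed fetch, the remaining wake time tells how long ago the snooze began. The sketch below does that arithmetic. `WakeTime` and its methods are hypothetical helpers, and the parser only handles the minutes/seconds/milliseconds form shown in the report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading "Wakes in:" values from the frontier report. */
class WakeTime {
    // Matches strings such as "12m15s94ms"; each part is optional.
    private static final Pattern WAKES =
        Pattern.compile("(?:(\\d+)m(?!s))?(?:(\\d+)s)?(?:(\\d+)ms)?");

    /** Converts a value like "12m15s94ms" to milliseconds. */
    static long toMillis(String wakesIn) {
        String text = wakesIn.trim();
        Matcher m = WAKES.matcher(text);
        if (text.isEmpty() || !m.matches()) {
            throw new IllegalArgumentException("Unrecognized duration: " + wakesIn);
        }
        long minutes = m.group(1) == null ? 0 : Long.parseLong(m.group(1));
        long seconds = m.group(2) == null ? 0 : Long.parseLong(m.group(2));
        long millis  = m.group(3) == null ? 0 : Long.parseLong(m.group(3));
        return minutes * 60_000 + seconds * 1_000 + millis;
    }

    /** Time already spent snoozed, assuming the snooze lasts retryDelaySeconds in total. */
    static long elapsedSinceSnoozeMillis(long retryDelaySeconds, String wakesIn) {
        return retryDelaySeconds * 1_000 - toMillis(wakesIn);
    }

    public static void main(String[] args) {
        // Values taken from the www.nasa.gov queue and the frontier settings.
        long elapsed = elapsedSinceSnoozeMillis(900, "12m15s94ms");
        System.out.println(elapsed + " ms since the snooze began"); // prints 164906 ms ...
    }
}
```

For the first queue, 900 s minus 12m15s94ms leaves about 165 seconds, so under that assumption all the queues were snoozed roughly 2m45s before this report was taken.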
Output after ending:
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
beginCrawlStop Starting beginCrawlStop()...
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
beginCrawlStop Finished beginCrawlStop().
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Entered complete stop.
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Sent crawlEnded to
org.archive.crawler.admin.CrawlJobHandler@4ecfdd
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker crawlEnded
Entered crawlEnded
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\hosts-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\mimetype-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\responsecode-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\seeds-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\crawl-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\processors-report.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker writeReport
C:\\heritrix-1.2.0\src\conf\profiles\Simple\crawl-manifest.txt
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.admin.StatisticsTracker crawlEnded
Leaving crawlEnded
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Sent crawlEnded to
org.archive.crawler.admin.StatisticsTracker@1359c1b
02/22/2005 23:59:09 +0000 INFO
org.archive.crawler.framework.CrawlController
completeStop Finished crawl.
Is Crawling: false Is Running: true Frontier Report:
Crawler not running
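Since every queue in the report stopped at its dns: entry, one check I can run from inside the same JVM as my application is whether the seed hosts resolve at all. This is only a sketch: `SeedDnsCheck` is a made-up class, the seed URLs are guessed from the host names in the report, and it uses the JDK's `InetAddress`. As far as I know Heritrix's FetchDNS goes through the dnsjava library rather than `InetAddress`, so a pass here narrows the problem down rather than ruling DNS out.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/** Hypothetical diagnostic: checks whether seed hosts resolve from this JVM. */
class SeedDnsCheck {
    /** Extracts the host name from a seed URL. */
    static String hostOf(String seedUrl) {
        return URI.create(seedUrl).getHost();
    }

    /** Returns true if the JDK resolver can turn the host into an address. */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed seed URLs, built from hosts that appear in the frontier report.
        String[] seeds = {"http://www.nasa.gov/", "http://www.nasm.si.edu/"};
        for (String seed : seeds) {
            String host = hostOf(seed);
            System.out.println(host + " resolves: " + resolves(host));
        }
    }
}
```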
I am wondering whether I am embedding Heritrix incorrectly
or whether this is a real problem. Any help would be
appreciated.
Thanks, and I apologize if this is not the correct place to
post.
| Field | Old Value | Date | By |
|---|---|---|---|
| close_date | 2005-03-02 18:31 | 2005-04-01 23:38 | stack-sf |
| status_id | Open | 2005-04-01 23:38 | stack-sf |
| status_id | Closed | 2005-03-31 18:38 | jsleeman |
| close_date | - | 2005-03-02 18:31 | stack-sf |
| status_id | Open | 2005-03-02 18:31 | stack-sf |
| resolution_id | None | 2005-03-02 18:31 | stack-sf |
| category_id | None | 2005-03-02 18:31 | stack-sf |
| summary | Embedding Heritrix and Snoozing Frontier | 2005-03-02 06:58 | gojomo |
| priority | 5 | 2005-02-25 04:16 | stack-sf |