I have a seed list of nearly 1024 entries. Not
surprisingly, Heritrix behaves a little oddly with that
many seeds. First, crawls with either 0.6.0 or the
latest CVS build fail almost immediately because too
many files are opened, after which neither socket
operations nor file logging can proceed.
A typical exception:
java.io.FileNotFoundException: /crawl/heritrix/heritrix-0.6.0/jobs/crs-20040427190708335/disk/scratch/bphc.hrsa.gov.ff0 (Too many open files)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
    at org.archive.io.FlipFileOutputStream.<init>(FlipFileOutputStream.java:69)
    at org.archive.io.DiskBackedByteQueue.initializeStreams(DiskBackedByteQueue.java:67)
    at org.archive.util.DiskQueue.<init>(DiskQueue.java:100)
    at org.archive.util.DiskBackedQueue.<init>(DiskBackedQueue.java:59)
    at org.archive.crawler.basic.KeyedQueue.<init>(KeyedQueue.java:76)
    at org.archive.crawler.basic.Frontier.keyedQueueFor(Frontier.java:927)
    at org.archive.crawler.basic.Frontier.scheduleForRetry(Frontier.java:1333)
    at org.archive.crawler.basic.Frontier.finished(Frontier.java:676)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:200)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:124)
You can get past that by allowing a larger number of
open files for the process (which requires running
Heritrix with root privilege when the new value exceeds
the hard limit), as in:
# (ulimit -n 4096; JAVA_OPTS=-Xmx320m bin/heritrix -p 9876)
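Before reaching for root, it may be worth checking whether the soft limit can simply be raised up to the existing hard limit, which an unprivileged user is allowed to do. The following is a minimal sketch, assuming bash; the function names show_nofile_limits and raise_nofile are hypothetical helpers, not part of Heritrix.

```shell
# Hypothetical helpers (not part of Heritrix); assumes bash.

# Print the soft and hard open-file limits of the current shell.
show_nofile_limits() {
    echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
}

# Raise the soft open-file limit to the requested value, capped at the
# hard limit. Never lowers a soft limit that is already high enough.
raise_nofile() {
    local want="$1" soft hard
    soft=$(ulimit -Sn)
    hard=$(ulimit -Hn)
    # Nothing to do if the soft limit already covers the request.
    if [ "$soft" = "unlimited" ]; then return 0; fi
    if [ "$soft" -ge "$want" ]; then return 0; fi
    # Cap the request at the hard limit; going past it needs root.
    if [ "$hard" != "unlimited" ] && [ "$want" -gt "$hard" ]; then
        want="$hard"
    fi
    ulimit -Sn "$want"
}

show_nofile_limits
```

Calling raise_nofile 4096 in a subshell and then starting Heritrix from that same subshell, as in the command above, keeps the changed limit scoped to the crawler process. If the hard limit itself is below 4096, the soft limit only goes as far as the hard limit and root is still needed for more.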
Michael Stack
Disk I/O
None
Public
Date: 2007-03-14 00:10
Date: 2004-04-29 18:29
Date: 2004-04-28 19:33
Date: 2004-04-28 19:21
Date: 2004-04-28 15:27
| Filename | Description |
|---|---|
| lsof_output.txt | Lsof output for crawler with hundreds of seeds and 200 threads. |
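A listing like the attached one can be summarized to see what kind of descriptors dominate (regular files versus sockets). This is a rough sketch that assumes lsof's default column layout, with a header row and TYPE in the fifth column; the function name lsof_type_counts is hypothetical, and the actual attachment may be laid out differently.

```shell
# Hypothetical helper: count open descriptors per TYPE in lsof output.
# Assumes the default lsof layout:
#   COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# Reads the listing on stdin, skips the header row, and prints
# "<count> <type>" lines with the most common type first.
lsof_type_counts() {
    awk 'NR > 1 { count[$5]++ }
         END { for (t in count) print count[t], t }' | sort -rn
}

# Usage (file name taken from the attachment table):
#   lsof_type_counts < lsof_output.txt
```

A large REG count under the job's disk/scratch directory would point at the per-queue disk files seen in the stack trace, while a large socket count would point at the fetch threads.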