Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 Too many open files - ID: 943768
Last Update: Comment added ( karl-ia )

I happen to have a seed list of nearly 1024 entries.
Not totally surprisingly, Heritrix behaves a little
oddly with that many seeds. First, crawls with either
0.6.0 or the latest CVS build fail because too many
files are opened almost immediately, and then neither
socket operations nor file logging is able to proceed.
A typical exception:

java.io.FileNotFoundException: /crawl/heritrix/heritrix-0.6.0/jobs/crs-20040427190708335/disk/scratch/bphc.hrsa.gov.ff0 (Too many open files)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
    at org.archive.io.FlipFileOutputStream.<init>(FlipFileOutputStream.java:69)
    at org.archive.io.DiskBackedByteQueue.initializeStreams(DiskBackedByteQueue.java:67)
    at org.archive.util.DiskQueue.<init>(DiskQueue.java:100)
    at org.archive.util.DiskBackedQueue.<init>(DiskBackedQueue.java:59)
    at org.archive.crawler.basic.KeyedQueue.<init>(KeyedQueue.java:76)
    at org.archive.crawler.basic.Frontier.keyedQueueFor(Frontier.java:927)
    at org.archive.crawler.basic.Frontier.scheduleForRetry(Frontier.java:1333)
    at org.archive.crawler.basic.Frontier.finished(Frontier.java:676)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:200)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:124)

You can get past that by allowing a larger number of
open files for the process (which requires running
Heritrix with root privilege), as in:
# (ulimit -n 4096; JAVA_OPTS=-Xmx320 bin/heritrix -p 9876)
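
Before raising the limit, it can help to see how close the process actually is to it. The following is a minimal sketch, not part of Heritrix: the class name FdLimitCheck is hypothetical, and it assumes a HotSpot-style JVM on a Unix-like system that exposes com.sun.management.UnixOperatingSystemMXBean (newer than the JVMs Heritrix 0.6.0 ran on). On other JVMs the method simply returns null.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of Heritrix: reports how many file
// descriptors this JVM has open and the limit it is running under.
public class FdLimitCheck {

    /**
     * Returns {openCount, maxCount}, or null when the JVM does not
     * expose Unix descriptor counts (for example on Windows).
     */
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("Descriptor counts are not available on this JVM");
        } else {
            System.out.println("open=" + counts[0] + " max=" + counts[1]);
        }
    }
}
```

Run under the same shell as the crawler, the "max" value should reflect whatever ulimit -n was in effect when the JVM started.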


Submitted: Michael Stack ( stack-sf ) - 2004-04-28 15:13
Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: Disk I/O
Group: None
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 00:10
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-126 -- please add further
comments at that location.


Date: 2004-04-29 18:29
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Fixed the leaking FD issue: we were calling delete without
first closing the file backing the Frontier KeyedQueue.
Also removed the frozen key queue because it's a feature not yet
implemented. FD usage is now a base cost of about 100, plus two FDs
or so per thread. Added a note to the FAQ on too many open files,
with Igor's and Andy Boyko's suggestions for how to deal with it.
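
The close-before-delete pattern described above can be sketched as follows. This is an illustration only, not the actual Heritrix change: CloseBeforeDelete and BackingFile are hypothetical names standing in for a disk-backed queue file. On Unix-like systems, deleting a file that is still open only removes its directory entry; the descriptor stays allocated until the stream is closed, so the stream is closed first.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: names are hypothetical, not Heritrix classes.
public class CloseBeforeDelete {

    /** Stand-in for a file that backs an on-disk queue. */
    static final class BackingFile {
        private final File file;
        private FileOutputStream out;

        BackingFile(File file) throws IOException {
            this.file = file;
            this.out = new FileOutputStream(file);
        }

        void write(byte[] data) throws IOException {
            out.write(data);
        }

        /**
         * Closes the stream first, which releases the descriptor,
         * and only then deletes the file. Returns true if the
         * file was deleted.
         */
        boolean discard() throws IOException {
            if (out != null) {
                out.close();
                out = null;
            }
            return file.delete();
        }
    }

    /** Creates a temp file, writes to it, then discards it safely. */
    public static boolean demo() {
        try {
            File f = File.createTempFile("queue", ".ff0");
            BackingFile backing = new BackingFile(f);
            backing.write(new byte[] {1, 2, 3});
            return backing.discard() && !f.exists();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted cleanly: " + demo());
    }
}
```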

Closing.


Date: 2004-04-28 19:33
Sender: stack-sf (Project Admin)

Just attached lsof output. Here's some accounting of its
content.

+ The JVM, its native '.so's and jars, and the 20 or so
heritrix jars and 2 webapps account for about 60 open
descriptors.
+ Log files, their locks and ui webserver listening socket
account for about another 20 open descriptors.
+ The rest of the descriptors are descriptors for files that
back queues (There is '.ff0', '.ff1', '.frozen.ff0', and
'.frozen.ff1'). Most are marked 'deleted' in the lsof
listing. Let me try and figure why we still have reference.
Also, the 'frozen.ff?' are for a feature not yet
implemented so will turn that off for now.
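
The accounting above could be reproduced with a small tally over the lsof listing. This is a rough sketch under stated assumptions: LsofTally is a hypothetical helper, the input is one lsof output line per string, the file name is the last column, and unlinked-but-open files carry a trailing "(deleted)" marker as lsof prints on Linux. The lsof header line, if present, is counted under "other".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: buckets lsof lines by queue-file suffix and
// counts how many entries are marked "(deleted)".
public class LsofTally {

    // Longer suffixes come first so ".frozen.ff0" is not counted as ".ff0".
    static final String[] SUFFIXES = {
        ".frozen.ff0", ".frozen.ff1", ".ff0", ".ff1"
    };
    static final String DELETED = "(deleted)";

    public static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String suffix : SUFFIXES) {
            counts.put(suffix, 0);
        }
        counts.put("deleted", 0);
        counts.put("other", 0);

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Strip the trailing marker so the suffix check sees the file name.
            boolean deleted = line.endsWith(DELETED);
            String name = deleted
                    ? line.substring(0, line.length() - DELETED.length()).trim()
                    : line;
            if (deleted) {
                counts.merge("deleted", 1, Integer::sum);
            }
            String bucket = "other";
            for (String suffix : SUFFIXES) {
                if (name.endsWith(suffix)) {
                    bucket = suffix;
                    break;
                }
            }
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }
}
```

Feeding it the lines of the attached listing would give per-suffix counts plus a "deleted" total, which makes it easy to see whether queue backing files dominate the open descriptors.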


Date: 2004-04-28 19:21
Sender: stack-sf (Project Admin)

Attached lsof output.




Date: 2004-04-28 15:27
Sender: stack-sf (Project Admin)

http://sourceforge.net/tracker/index.php?func=detail&aid=943770&group_id=73833&atid=539099


Attached File ( 1 )

Filename: lsof_output.txt
Description: Lsof output for crawler w/ hundreds of seeds and 200 threads.

Changes ( 6 )

Field          Old Value               Date              By
status_id      Open                    2004-04-29 18:29  stack-sf
resolution_id  None                    2004-04-29 18:29  stack-sf
close_date     -                       2004-04-29 18:29  stack-sf
File Added     85391: lsof_output.txt  2004-04-28 19:23  stack-sf
assigned_to    nobody                  2004-04-28 15:27  stack-sf
category_id    None                    2004-04-28 15:27  stack-sf