
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 ConcurrentModificationExceptions - ID: 943781
Last Update: Comment added ( karl-ia )

I happen to have a seed list of nearly 1024 entries.
Not totally surprisingly, Heritrix behaves a little
oddly with that many seeds. First, crawls with either
0.6.0 or the latest CVS build fail because too many
files are opened almost immediately, after which neither
socket operations nor file logging can proceed.
A typical exception:

.....

Next up, using the current CVS build, a surprising
number (roughly 70) of
java.util.ConcurrentModificationExceptions occurred in
the first moments of the crawl (and then intermittently
throughout), all with the same stack trace. An example:

20040427194255925 -5 39804 #48 http://eia.doe.gov/ 124 text/html 3t
java.util.ConcurrentModificationException
    at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:448)
    at java.util.AbstractList$Itr.next(AbstractList.java:419)
    at org.archive.crawler.scope.HostScope.focusAccepts(HostScope.java:120)
    at org.archive.crawler.framework.CrawlScope.innerAccepts(CrawlScope.java:198)
    at org.archive.crawler.framework.Filter.accepts(Filter.java:94)
    at org.archive.crawler.basic.Postselector.schedule(Postselector.java:200)
    at org.archive.crawler.basic.Postselector.handleLinkCollection(Postselector.java:262)
    at org.archive.crawler.basic.Postselector.innerProcess(Postselector.java:112)
    at org.archive.crawler.framework.Processor.process(Processor.java:106)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:205)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:135)

From the code, it looks like the CrawlScope class
hands out an iterator on the scope's seeds list; that
iteration needs to synchronize on the list (per the
note in
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Collections.html#synchronizedCollection(java.util.Collection)
), which I guess is going to take some refactoring.
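
To illustrate the kind of change I mean, here is a minimal sketch, not the actual Heritrix code. SeedScopeSketch, addSeed and the String host list are hypothetical stand-ins for the scope and its seeds list; only the method name focusAccepts is borrowed from the stack trace. It assumes the seeds list is wrapped with Collections.synchronizedList, in which case individual calls are guarded by the wrapper but iteration has to hold the list's lock explicitly:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class SeedScopeSketch {
    // Hypothetical stand-in for the scope's seeds list.
    private final List<String> seeds =
            Collections.synchronizedList(new ArrayList<String>());

    public void addSeed(String host) {
        seeds.add(host); // single calls are synchronized by the wrapper
    }

    /** Returns true if candidateHost equals one of the seed hosts. */
    public boolean focusAccepts(String candidateHost) {
        // The wrapper does not guard iteration, so hold the list's own
        // lock for the whole traversal, as the Collections javadoc requires.
        synchronized (seeds) {
            for (Iterator<String> it = seeds.iterator(); it.hasNext();) {
                if (it.next().equals(candidateHost)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        final SeedScopeSketch scope = new SeedScopeSketch();
        for (int i = 0; i < 1024; i++) {
            scope.addSeed("host" + i + ".example");
        }
        // One thread keeps adding seeds while this thread iterates.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++) {
                    scope.addSeed("late" + i + ".example");
                }
            }
        });
        writer.start();
        boolean found = false;
        for (int i = 0; i < 1000; i++) {
            found |= scope.focusAccepts("host1023.example");
        }
        writer.join();
        System.out.println("accepted: " + found);
    }
}
```

Without the synchronized block, the concurrent addSeed calls can invalidate the iterator mid-traversal, which is the failure the stack trace shows. The cost is that every scope check holds the lock for a full pass over the seeds, so handing each caller a snapshot copy of the list may be the better trade-off if seeds change rarely.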

Should it be relevant, the few changes made to the
default configuration for this crawl, other than adding
a pile of seeds, were:
- HostScope
- max-link-hops 1
- total-bandwidth-usage-KB-sec 500
Otherwise, the crawl for this large seed list seems to
be proceeding apace.


Michael Stack ( stack-sf ) - 2004-04-28 15:29

7

Closed

None

Michael Stack

None

None

Public


Comments ( 2 )

Date: 2007-03-14 00:10
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-128 -- please add further
comments at that location.


Date: 2004-04-28 19:13
Sender: stack-sf (Project Admin)

Closing. This issue was double-entered:
https://sourceforge.net/tracker/?func=detail&aid=943770&group_id=73833&atid=539099


Attached File

No Files Currently Attached

Changes ( 3 )

Field        Old Value  Date              By
status_id    Open       2004-04-28 19:13  stack-sf
close_date   -          2004-04-28 19:13  stack-sf
assigned_to  nobody     2004-04-28 15:30  stack-sf