Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 ConcurrentModificationExceptions - ID: 943770
Last Update: Comment added ( karl-ia )

I happen to have a seed list of nearly 1024 entries.
Not totally surprisingly, Heritrix behaves a little
oddly with that many seeds. First, crawls with either
0.6.0 or the latest CVS build fail because too many
files are opened almost immediately, after which neither
socket operations nor file logging can proceed.
A typical exception:

.....

Next, using the current CVS build, a surprising
number (about 70) of
java.util.ConcurrentModificationExceptions occurred in
the first moments of the crawl (and then intermittently
throughout), all with the same stack trace. An example:

20040427194255925 -5 39804 #48 http://eia.doe.gov/ 124 text/html 3t
java.util.ConcurrentModificationException
    at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:448)
    at java.util.AbstractList$Itr.next(AbstractList.java:419)
    at org.archive.crawler.scope.HostScope.focusAccepts(HostScope.java:120)
    at org.archive.crawler.framework.CrawlScope.innerAccepts(CrawlScope.java:198)
    at org.archive.crawler.framework.Filter.accepts(Filter.java:94)
    at org.archive.crawler.basic.Postselector.schedule(Postselector.java:200)
    at org.archive.crawler.basic.Postselector.handleLinkCollection(Postselector.java:262)
    at org.archive.crawler.basic.Postselector.innerProcess(Postselector.java:112)
    at org.archive.crawler.framework.Processor.process(Processor.java:106)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:205)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:135)

Looking at the code, it appears that the CrawlScope class
hands out an iterator on the scope's seeds list; that
iteration needs to synchronize on the list (per the
note in
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Collections.html#synchronizedCollection(java.util.Collection)),
which I guess is going to take some refactoring.
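
To illustrate the point about iterators, here is a minimal sketch, not
Heritrix code. The class and method names (SeedIterationSketch,
iterateWhileAdding, containsSeed) are hypothetical, and it uses generics
and for-each loops, which are newer than the 1.4.2 JDK linked above. The
first method triggers the same fail-fast check seen in the stack trace,
deterministically and in a single thread; the second shows the pattern the
Collections javadoc asks for, where the caller holds the synchronized
list's monitor for the whole traversal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

public class SeedIterationSketch {

    // Adds to the list while iterating over it. On a non-empty ArrayList
    // the next call to Iterator.next() notices the structural change and
    // throws ConcurrentModificationException, which is reported as true.
    public static boolean iterateWhileAdding(List<String> seeds) {
        try {
            for (String seed : seeds) {
                seeds.add(seed + "#copy"); // structural change mid-iteration
            }
            return false; // only reached when the list was empty
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Assumes synchronizedSeeds came from Collections.synchronizedList.
    // Single calls such as add() are already guarded by the wrapper, but a
    // traversal spans many calls, so the caller locks the list itself.
    public static boolean containsSeed(List<String> synchronizedSeeds,
                                       String candidate) {
        synchronized (synchronizedSeeds) {
            for (String seed : synchronizedSeeds) {
                if (seed.equals(candidate)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> plain = new ArrayList<String>();
        plain.add("http://eia.doe.gov/");
        System.out.println("CME seen: " + iterateWhileAdding(plain));

        List<String> guarded =
                Collections.synchronizedList(new ArrayList<String>());
        guarded.add("http://eia.doe.gov/");
        System.out.println("seed found: "
                + containsSeed(guarded, "http://eia.doe.gov/"));
    }
}
```

The lock only helps if every thread that iterates takes it; a single
unguarded iterator handed out elsewhere can still fail the same way.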

Should it be relevant, the few changes made to the
default configuration for this crawl, other than adding
a pile of seeds, were:
- HostScope
- max-link-hops 1
- total-bandwidth-usage-KB-sec 500
Otherwise, the crawl for this large seed list seems to
be proceeding apace.


Submitted: Michael Stack ( stack-sf ) - 2004-04-28 15:14
Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: General
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:10
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-127 -- please add further
comments at that location.


Date: 2004-04-30 01:01
Sender: stack-sf (Project Admin)

Fixed. Each iteration over the seed list, and each add to it,
now handles the necessary synchronization.
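
One way such a fix could be structured, as a sketch only and not the
actual Heritrix change: wrap the list in a small class so that adding and
iterating both take the same lock and callers never see a raw iterator.
The name GuardedSeedList and its methods are hypothetical, and the code
assumes Java 8 or later for Predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class GuardedSeedList {
    // The list doubles as the lock; it is never exposed to callers.
    private final List<String> seeds = new ArrayList<String>();

    // Adds take the same lock that traversals hold.
    public void add(String seed) {
        synchronized (seeds) {
            seeds.add(seed);
        }
    }

    // Runs the whole traversal under the lock, so no add can interleave.
    // The predicate executes while the lock is held and should be quick.
    public boolean anyMatch(Predicate<String> test) {
        synchronized (seeds) {
            for (String seed : seeds) {
                if (test.test(seed)) {
                    return true;
                }
            }
            return false;
        }
    }

    public int size() {
        synchronized (seeds) {
            return seeds.size();
        }
    }
}
```

The trade-off is that long traversals block adds for their duration,
which matters more as the seed list grows.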

Note that seeds are always cached in memory in the current
default frontier implementation. I will open a separate issue
for making this caching optional, so that we can handle the
case where there are millions of seeds.

Closing.


Date: 2004-04-28 15:25
Sender: stack-sf (Project Admin)

See also
http://sourceforge.net/tracker/index.php?func=detail&aid=943768&group_id=73833&atid=539099


Changes ( 5 )

Field          Old Value  Date              By
status_id      Open       2004-04-30 01:01  stack-sf
resolution_id  None       2004-04-30 01:01  stack-sf
close_date     -          2004-04-30 01:01  stack-sf
category_id    None       2004-04-28 15:25  stack-sf
assigned_to    nobody     2004-04-28 15:25  stack-sf