
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 Threads contend for scratch files (after kill/readFully/Gap) - ID: 1052570
Last Update: Comment added ( karl-ia )

In 1.0.x, the thread-killing process creates serious
problems: it assumes threads are always at their
serial-number position in the toes list, which isn't
true after any kills. Also, the setSize() method, which
creates new threads if the current size is less than
the newly set size, assumes it can start numbering the
new threads just above the size of the current toes list.
This can lead to (the last few) numbers being reused,
which is disastrous for the use of
recording-scratch-storage (the tt## files) -- more than
one thread then tries to use the same files.
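
The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, not the actual 1.0.x ToePool code: the class NaivePoolSketch, its methods, the zero-based numbering, and the "tt" + number naming are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the flawed behavior; not Heritrix code.
class NaivePoolSketch {
    // Each entry is the serial number of one live thread.
    final List<Integer> toes = new ArrayList<>();

    // Flawed: numbers each new thread with the current list size.
    void setSize(int newSize) {
        while (toes.size() < newSize) {
            toes.add(toes.size());
        }
    }

    // Flawed: treats the serial number as a list position.
    // List.remove(int) removes by index, which is exactly the mistake.
    void kill(int serialNumber) {
        toes.remove(serialNumber);
    }

    // Assumed scratch-file naming: one prefix per serial number.
    static String scratchName(int serialNumber) {
        return "tt" + serialNumber;
    }

    boolean hasDuplicateNumbers() {
        return toes.stream().distinct().count() != toes.size();
    }

    public static void main(String[] args) {
        NaivePoolSketch pool = new NaivePoolSketch();
        pool.setSize(20); // serial numbers 0..19
        pool.kill(1);     // removes the thread at position 1 (serial 1)
        pool.kill(1);     // removes whatever now sits at position 1 (serial 2)
        pool.setSize(20); // tops the pool back up from 18 to 20 threads
        System.out.println(pool.toes);
        System.out.println("duplicates: " + pool.hasDuplicateNumbers());
    }
}
```

In this model the top-up hands out 18 and 19 a second time, so two pairs of threads would share the same scratch-file prefix.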

The result is error stacks in heritrix_out like the
following two examples:

10/21/2004 13:19:43 -0700 WARNING
org.archive.util.DevUtils warnHandle Gap between
expected and actual: 681404
#198
http://www.dau.mil/conferences/presentations/2003/presentations/T1-FiscalLaw-RexBragaw.pdf
(0 attempts)
XXRELLXLLL
Current processor: Archiver
ACTIVE for 10s760ms
Where: ABOUT_TO_BEGIN_PROCESSOR



java.lang.Throwable: Gap between expected and actual:
681404
#198
http://www.dau.mil/conferences/presentations/2003/presentations/T1-FiscalLaw-RexBragaw.pdf
(0 attempts)
XXRELLXLLL
Current processor: Archiver
ACTIVE for 10s760ms
Where: ABOUT_TO_BEGIN_PROCESSOR



    at org.archive.io.arc.ARCWriter.write(ARCWriter.java(Compiled Code))
    at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java(Compiled Code))
    at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java(Compiled Code))
    at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java(Compiled Code))



---- or -----

10/21/2004 15:29:01 -0700 SEVERE
org.archive.io.ReplayCharSequenceFactory$ByteReplayCharSequence
loadBuffer
raFile.seekraFile.readFully(wraparoundBuffer,0,65536)
raFile.length()0
#197
https://www.caps.navsea.navy.mil/caps/caps.nsf/f1bf587d9f0a49c7852568a100578c56/d3c34484c393a5d98525679f0053
LLLRLLRLRELLL
Current processor: ExtractorHTML
ACTIVE for 24s144ms
Where: ABOUT_TO_BEGIN_PROCESSOR

java.io.EOFException
    at java.io.RandomAccessFile.readFully(RandomAccessFile.java(Compiled Code))
    at org.archive.io.ReplayCharSequenceFactory$ByteReplayCharSequence.loadBuffer(ReplayCharSequenceFactory.java(Co
    at org.archive.io.ReplayCharSequenceFactory$ByteReplayCharSequence.recenterBuffer(ReplayCharSequenceFactory.jav
    at org.archive.io.ReplayCharSequenceFactory$ByteReplayCharSequence.faultCharAt(ReplayCharSequenceFactory.java(C
    at org.archive.io.ReplayCharSequenceFactory$ByteReplayCharSequence.charAt(ReplayCharSequenceFactory.java(Compil
    at java.util.regex.Pattern$Ctype.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Curly.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$GroupTail.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$GroupTail.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Curly.match0(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Curly.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$GroupHead.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Branch.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$GroupHead.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$GroupHead.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Branch.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Branch.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$GroupHead.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$SingleA.match(Pattern.java(Compiled Code))
    at java.util.regex.Pattern$Start.match(Pattern.java(Compiled Code))
    at java.util.regex.Matcher.find(Matcher.java(Inlined Compiled Code))
    at java.util.regex.Matcher.find(Matcher.java(Inlined Compiled Code))
    at org.archive.crawler.extractor.ExtractorHTML.extract(ExtractorHTML.java(Compiled Code))
    at org.archive.crawler.extractor.ExtractorHTML.innerProcess(ExtractorHTML.java(Compiled Code))
    at org.archive.crawler.framework.Processor.process(Processor.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java(Compiled Code))

---


Submitted: Gordon Mohr ( gojomo ) - 2004-10-23 00:50
Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: Disk I/O
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-269 -- please add further
comments at that location.


Date: 2004-10-28 00:03
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

HEAD didn't have the same problem, but just to be sure,
changed the method of serial-number assignment to make
duplicate numbers impossible. Commit comment:

Fix for [ 1033701 ] incorrect number of total active threads
* ToePool.java
Improve accuracy of tallies of total, active threads
Fix for [ 1052570 ] Threads contend for scratch files (after
kill/readFully/Gap)
* ToeThread.java
Internalize assignment of unique serial number, so
numbers can't overlap.

Remaining issue would be to ensure future 'Gap' errors are
more prominent -- creating separate bug.
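
The report does not show how ToeThread.java internalizes the numbering. One common way to make duplicates impossible is to have the thread class draw its own number from a shared atomic counter; the sketch below assumes that approach, and the class name NumberedThreadSketch and its members are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the actual ToeThread implementation.
class NumberedThreadSketch extends Thread {
    // Shared by all instances; each value is handed out exactly once.
    private static final AtomicInteger NEXT_SERIAL = new AtomicInteger();

    private final int serialNumber;

    NumberedThreadSketch(Runnable work) {
        super(work);
        // The thread numbers itself, so no caller can pass in a reused value.
        this.serialNumber = NEXT_SERIAL.getAndIncrement();
        setName("ToeThread #" + serialNumber);
    }

    int getSerialNumber() {
        return serialNumber;
    }

    public static void main(String[] args) {
        NumberedThreadSketch a = new NumberedThreadSketch(() -> { });
        NumberedThreadSketch b = new NumberedThreadSketch(() -> { });
        System.out.println(a.getName() + " / " + b.getName());
    }
}
```

Because the counter lives inside the class, the pool no longer needs to track which numbers are free.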


Date: 2004-10-23 02:44
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Fixed in 1.0.x. Commit comment:

Fix for [ 1052570 ] Threads contend for scratch files (after
kill/readFully/Gap)
* ToePool.java
Kill threads by matching number, not position. Assign
new numbers by increasing counter. Remove risky 'replace'
option.
* webapps/admin/reports/threads.jsp
Remove 'replace' option.

Need to evaluate whether any version of this problem exists in
HEAD. Also, (1) the 'gap' problem, when detected, should be
more prominent than heritrix_out output, because it may
indicate ARC corruption; (2) thread scratch files should be
opened exclusively and fail early, on record, rather than on
playback.
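
Point (2) could look roughly like the sketch below, which uses the java.nio.file API (Java 7 and later, so newer than the JVMs Heritrix 1.0.x ran on). The class ScratchFileSketch, the method openExclusively, and the ".scratch" suffix are assumptions for illustration; only the "tt" prefix comes from this report.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of "open exclusively, fail early"; not Heritrix code.
class ScratchFileSketch {
    // CREATE_NEW makes the open fail with FileAlreadyExistsException if
    // another thread (or a stale earlier run) already owns this file.
    static OutputStream openExclusively(Path scratchDir, int serialNumber)
            throws IOException {
        Path file = scratchDir.resolve("tt" + serialNumber + ".scratch");
        return Files.newOutputStream(file,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scratch-demo");
        try (OutputStream first = openExclusively(dir, 19)) {
            try (OutputStream second = openExclusively(dir, 19)) {
                System.out.println("unexpected: second open succeeded");
            } catch (FileAlreadyExistsException e) {
                System.out.println("contention detected at record time: " + e.getFile());
            }
        }
    }
}
```

One consequence of this design is that scratch files must be deleted when their thread ends, otherwise a leftover file would block a legitimate new owner.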

Demoting to 7


Date: 2004-10-23 01:12
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Can easily reproduce by starting a crawl with, say, 20 threads.
While it's running, repeatedly kill "thread #1" from the threads
report. In fact, whichever thread is in position 1 will keep
getting killed.

Then edit the configuration to change the number of threads. (Or
even just submit the config without changes; that still triggers
ToePool.setSize().) Within moments, the new threads
created to get back to 20, which reuse the high numbers of
the original threads, will start clashing with existing
threads over scratch files, generating exceptions like the
above.

Trying a fix where:
- the 'replace' option is banished, as it would risk the
same sort of contention
- ToePool.startNewThread() uses a private, monotonically
increasing serial number to number threads, so no two can
get the same number
- ToePool.killThread() probes through the toe list for a
match, rather than assuming position == serialNumber

An effect will be that after kills and config changes, the
serial numbers will range higher than the total active
threads, but that's sensible.
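
The last two points of the fix can be sketched as follows. This is an illustration under assumptions, not the committed ToePool.java: the classes ToeSketch and PoolSketch are hypothetical, and only the method names startNewThread, killThread, and setSize are taken from the comment above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a worker thread; only its number matters here.
class ToeSketch {
    final int serialNumber;
    ToeSketch(int serialNumber) { this.serialNumber = serialNumber; }
}

// Hypothetical sketch of the described fix; not the real ToePool.
class PoolSketch {
    private final List<ToeSketch> toes = new ArrayList<>();
    private int nextSerialNumber = 0; // private and only ever increases

    synchronized ToeSketch startNewThread() {
        ToeSketch toe = new ToeSketch(nextSerialNumber++);
        toes.add(toe);
        return toe;
    }

    synchronized void setSize(int newSize) {
        while (toes.size() < newSize) {
            startNewThread();
        }
    }

    // Probes the list for a matching number instead of indexing by it.
    synchronized boolean killThread(int serialNumber) {
        for (Iterator<ToeSketch> it = toes.iterator(); it.hasNext();) {
            if (it.next().serialNumber == serialNumber) {
                it.remove();
                return true;
            }
        }
        return false; // no live thread has that number
    }

    synchronized List<Integer> serialNumbers() {
        List<Integer> numbers = new ArrayList<>();
        for (ToeSketch toe : toes) {
            numbers.add(toe.serialNumber);
        }
        return numbers;
    }

    public static void main(String[] args) {
        PoolSketch pool = new PoolSketch();
        pool.setSize(20);
        pool.killThread(1);
        pool.killThread(2);
        pool.setSize(20); // replacements get fresh numbers, never reused ones
        System.out.println(pool.serialNumbers());
    }
}
```

As the comment notes, the numbers then run past the thread count: after two kills and a top-up, 20 threads carry numbers up to 21.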



Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-10-28 00:03  gojomo
resolution_id  None       2004-10-28 00:03  gojomo
close_date     -          2004-10-28 00:03  gojomo
priority       9          2004-10-23 02:44  gojomo