Heritrix: Internet Archive Web Crawler

Tracker: Bugs

ARCWriter 'Gap' errors should be more prominent - ID: 1055789
Last Update: Comment added ( karl-ia )

This is a spinoff of: [ 1052570 ] Threads contend for scratch files
(after kill/readFully/Gap)

That problem is fixed, but the 'Gap' error (from ARCWriter.write())
that led to its detection is very serious in general and should
appear more prominently in the crawl UI.
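
For readers unfamiliar with the error, here is a minimal sketch of the kind of check that could produce it. It assumes the 'Gap' is the difference between the number of bytes a record was expected to contain and the number actually copied; the class and method names (GapCheckSketch, copyExactly) are hypothetical and are not taken from ARCWriter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration only; not the real ARCWriter.write() code.
class GapCheckSketch {
    /**
     * Copies expectedLength bytes from in to out. If the input ends early,
     * throws an IOException reporting how many bytes are missing.
     */
    static void copyExactly(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buffer = new byte[4096];
        long copied = 0;
        while (copied < expectedLength) {
            int want = (int) Math.min(buffer.length, expectedLength - copied);
            int read = in.read(buffer, 0, want);
            if (read < 0) {
                break; // input exhausted before the expected length was reached
            }
            out.write(buffer, 0, read);
            copied += read;
        }
        long gap = expectedLength - copied;
        if (gap != 0) {
            // Fail loudly instead of padding the record to the expected size.
            throw new IOException("Gap between expected and actual: " + gap);
        }
    }
}
```

Throwing here, rather than silently filling the gap, is what lets a caller surface the problem to the operator.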


Submitted: Gordon Mohr ( gojomo ) - 2004-10-28 00:21
Priority: 7
Status: Closed
Resolution: Fixed
Assigned to: Michael Stack

None

None

Public


Comments ( 4 )

Date: 2007-03-14 00:17
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-273 -- please add further
comments at that location.


Date: 2005-02-17 21:27
Sender: stack-sf (Project Admin)

The UI now shows an alert like the one quoted below, after the commit
message.

Also, on error, the ARC is closed and given a '.invalid' suffix.
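
As an illustration of the '.invalid' marking, the following is a rough sketch of closing a suspect file and renaming it. It is not the committed Heritrix code: the class InvalidSuffixSketch and the method closeAndMarkInvalid are hypothetical names, and the use of java.nio.file is an assumption made for brevity.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative, not from the Heritrix source.
class InvalidSuffixSketch {
    static final String INVALID_SUFFIX = ".invalid";

    /**
     * Closes the stream backing arcFile, then renames the file by appending
     * '.invalid' so later tooling can tell it apart from good ARCs.
     * Returns the new path.
     */
    static Path closeAndMarkInvalid(Closeable out, Path arcFile) throws IOException {
        try {
            out.close();
        } catch (IOException ignored) {
            // The file is already suspect; still try to mark it below.
        }
        Path target = arcFile.resolveSibling(arcFile.getFileName() + INVALID_SUFFIX);
        return Files.move(arcFile, target);
    }
}
```

With a file named like the one in the alert, ending in '.arc.gz.open', this sketch would produce a name ending in '.arc.gz.open.invalid'. Whether the real implementation keeps or drops the '.open' part is not stated in this thread.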

Closing.

Commit message follows:

Fix for '[ 1055789 ] ARCWriter 'Gap' errors should be more prominent'.
Also added the closing and marking of ARCs with which we had problems
writing, by appending a '.invalid' suffix.
* src/articles/releasenotes.xml
Added release note on new '.invalid' suffix.
* src/java/org/archive/crawler/writer/ARCWriterProcessor.java
(write): Added new method that wraps the getting and returning of
items from the ARC writer pool. Added here a call to the new method
invalidateARCWriter, called when we have an IOException writing an
ARC record.
* src/java/org/archive/io/RecordingInputStream.java
Minor formatting.
* src/java/org/archive/io/arc/ARCWriter.java
Added readbuffer data member in place of the making of a new buffer
per call to ARCWriter#write. Removed checking for whitespace in hostip
and in contenttype. The MimetypeUtils.truncate already checks for this.
* src/java/org/archive/io/arc/ARCWriterPool.java
(invalidateARCWriter): Added.
* src/java/org/archive/io/arc/ARCWriterPoolTest.java
(testInvalidate): Added test of invalidation of ARC writers.
* src/java/org/archive/io/arc/ARCWriterTest.java
(testCheckForWhiteSpace): Removed. Method tested was removed.
(testGapError): Added test of the 'Gap' error.
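
The borrow/return wrapper described for ARCWriterProcessor could look roughly like the sketch below. Everything in it is an assumption for illustration: WriterPoolSketch, RecordWriter, Pool, borrow, giveBack and invalidate are hypothetical names, and the toy pool stands in for the real ARCWriterPool, whose API is not shown in this thread.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical sketch; the real ARCWriterPool API may differ.
class WriterPoolSketch {
    interface RecordWriter {
        void write(byte[] data) throws IOException;
    }

    /** Toy pool that hands out idle writers or creates new ones on demand. */
    static class Pool {
        private final Deque<RecordWriter> idle = new ArrayDeque<>();
        private final Supplier<RecordWriter> factory;
        int invalidated = 0;

        Pool(Supplier<RecordWriter> factory) {
            this.factory = factory;
        }

        synchronized RecordWriter borrow() {
            RecordWriter w = idle.poll();
            return w != null ? w : factory.get();
        }

        synchronized void giveBack(RecordWriter w) {
            idle.push(w);
        }

        /** Drops the writer so it is never handed out again. */
        synchronized void invalidate(RecordWriter w) {
            invalidated++;
        }

        synchronized int idleCount() {
            return idle.size();
        }
    }

    /** Wraps borrow/return; a failed write invalidates the writer instead. */
    static void write(Pool pool, byte[] data) throws IOException {
        RecordWriter writer = pool.borrow();
        boolean ok = false;
        try {
            writer.write(data);
            ok = true;
        } finally {
            if (ok) {
                pool.giveBack(writer);
            } else {
                pool.invalidate(writer);
            }
        }
    }
}
```

Keeping the decision in one wrapper means no caller can accidentally return a writer whose file is already suspect.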


Title: Failed write of ARC Record: CrawlURI(http://crawler.archive.org/)
Time: Feb. 17, 2005 20:17:15 GMT
Level: SEVERE
Message:

Gap between expected and actual: 1
#5 http://crawler.archive.org/ (2 attempts)

Current processor: Archiver
ACTIVE for 52s498ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 52309ms
writing arc
/usr2/workspace/heritrix/jobs/DEFAULT2-20050217201217823/arcs/IAH-20050217201357-00000-debord.arc.gz.open

Associated Throwable: java.io.IOException: Gap between expected and actual: 1
#5 http://crawler.archive.org/ (2 attempts)

Current processor: Archiver
ACTIVE for 52s498ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 52309ms
writing arc
/usr2/workspace/heritrix/jobs/DEFAULT2-20050217201217823/arcs/IAH-20050217201357-00000-debord.arc.gz.open

Message:
Gap between expected and actual: 1
#5 http://crawler.archive.org/ (2 attempts)

Current processor: Archiver
ACTIVE for 52s498ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 52309ms
writing arc
/usr2/workspace/heritrix/jobs/DEFAULT2-20050217201217823/arcs/IAH-20050217201357-00000-debord.arc.gz.open

Stacktrace:
java.io.IOException: Gap between expected and actual: 1
#5 http://crawler.archive.org/ (2 attempts)

Current processor: Archiver
ACTIVE for 52s498ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 52309ms
writing arc
/usr2/workspace/heritrix/jobs/DEFAULT2-20050217201217823/arcs/IAH-20050217201357-00000-debord.arc.gz.open
at org.archive.io.arc.ARCWriter.write(ARCWriter.java:652)
at org.archive.crawler.writer.ARCWriterProcessor.write(ARCWriterProcessor.java:400)
at org.archive.crawler.writer.ARCWriterProcessor.writeHttp(ARCWriterProcessor.java:359)
at org.archive.crawler.writer.ARCWriterProcessor.innerProcess(ARCWriterProcessor.java:334)
at org.archive.crawler.framework.Processor.process(Processor.java:102)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)


Date: 2005-02-09 19:39
Sender: stack-sf (Project Admin)

Upped priority.


Date: 2005-02-09 19:38
Sender: stack-sf (Project Admin)

Propose that, as well as raising an alert, we stop writing
padding and close out the current ARC with a '.bad' or
'.needsfixing' suffix.


Changes ( 5 )

Field          Old Value  Date              By
status_id      Open       2005-02-17 21:27  stack-sf
resolution_id  None       2005-02-17 21:27  stack-sf
assigned_to    nobody     2005-02-17 21:27  stack-sf
close_date     -          2005-02-17 21:27  stack-sf
priority       5          2005-02-09 19:39  stack-sf