Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 "Illegal response body offset" in ReplayCharSequenceFactory - ID: 1155641
Last Update: Comment added ( karl-ia )

Probably as a side effect of setting http-recorder on
CrawlURIs even if they had little or no content, I've
seen the following alert on a recent crawl:



Problem java.lang.IllegalArgumentException: Illegal
response body offset of 253 whereas size is only 0
occured when trying to process
'http://news.xinhuanet.com/english/2005-03/03/content_2642635.htm'
at step ABOUT_TO_BEGIN_PROCESSOR in ExtractorHTML


Associated Throwable:
java.lang.IllegalArgumentException: Illegal response
body offset of 253 whereas size is only 0

Message:
Illegal response body offset of 253 whereas size is
only 0

Stacktrace:
java.lang.IllegalArgumentException: Illegal response body offset of 253 whereas size is only 0
    at org.archive.io.ReplayCharSequenceFactory.checkParameters(ReplayCharSequenceFactory.java:234)
    at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:131)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:435)
    at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:294)
    at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:318)
    at org.archive.crawler.extractor.ExtractorHTML.innerProcess(ExtractorHTML.java:355)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)



--
Looking back into local-errors.log, I see:

2005-03-03T07:25:28.909Z -5 - http://news.xinhuanet.com/english/2005-03/03/content_2642635.htm LRL http://news.google.com/nwshp?hl=en&gl=us text/html #032 - - err=java.lang.IllegalArgumentException
java.io.IOException: Socket timed out after 1200000ms: Read timed out
    at org.archive.io.RecordingInputStream.readFullyOrUntil(RecordingInputStream.java:216)
    at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:336)
    at org.archive.crawler.framework.Processor.process(Processor.java:102)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)

--
We should reevaluate whether setting the HttpRecorder early is
warranted -- are later steps counting on the presence of a
recorder to indicate that a specific amount of content is
available? Perhaps Extractors and others asking for a
ReplayCharSequence need to be more choosy, or the recorder
needs cleanup in certain early-exit situations (because
ideally, getReplayCharSequence() would always return
something, even if only an empty or limited sequence).
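
The last idea above (getReplayCharSequence() always returning something) might look roughly like the sketch below. This is only an illustration, not Heritrix code: the class ReplayBounds and both of its methods are hypothetical names, and only the exception message format is taken from the alert quoted earlier. strictCheck mirrors the behaviour that raised the alert; lenientBody returns an empty sequence when less content was recorded than the body offset implies.

```java
// Hypothetical sketch; ReplayBounds and its methods are not part of Heritrix.
public class ReplayBounds {

    // Strict variant: rejects a body offset beyond the recorded size,
    // using the same message format as the alert in this report.
    public static void strictCheck(long size, long responseBodyStart) {
        if (responseBodyStart > size) {
            throw new IllegalArgumentException(
                "Illegal response body offset of " + responseBodyStart
                + " whereas size is only " + size);
        }
    }

    // Lenient variant: always returns a sequence, which is empty when
    // nothing was recorded at or after the body offset.
    public static CharSequence lenientBody(String recorded, int responseBodyStart) {
        if (responseBodyStart >= recorded.length()) {
            return "";
        }
        return recorded.substring(responseBodyStart);
    }

    public static void main(String[] args) {
        // Nothing recorded, but the header length says the body starts at 253.
        System.out.println(lenientBody("", 253).length());
        // Seven header characters followed by the body.
        System.out.println(lenientBody("HDR\r\n\r\nbody", 7));
    }
}
```

The trade-off is that a lenient accessor hides the failed fetch from callers, so it would likely need to be combined with a status check rather than replace one.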


Gordon Mohr ( gojomo ) - 2005-03-03 07:35

Priority: 7
Status: Closed
Resolution: Fixed
Assigned to: Michael Stack
Category: None
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:21
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-371 -- please add further
comments at that location.


Date: 2005-03-04 03:35
Sender: stack-sf (Project Admin)

Closing. Handled as part of this commit.

Fix for '[ 1153927 ] npe in ExtractorHTML#innerProcess' and for
'[ 1155641 ] "Illegal response body offset" in ReplayCharSequenceFactory'.
Refactoring of FetchHTTP so that we always close the HttpRecorder, even on
timeout. Of note, if a timeout or some other IOException occurs while
downloading content, we don't write the record into ARCWriter because the
status will be < 0. Extractors used to run in this condition but this commit
changes that; each now tests for success status before running.
* src/java/org/archive/crawler/datamodel/CrawlURI.java
  Formatting.
  (getHttpRecorder): More detail in javadoc.
* src/java/org/archive/crawler/extractor/ExtractorCSS.java
* src/java/org/archive/crawler/extractor/ExtractorJS.java
* src/java/org/archive/crawler/extractor/ExtractorSWF.java
* src/java/org/archive/crawler/extractor/ExtractorUniversal.java
  (innerProcess): Use new isHtmlTransactionContentToProcess utility
  method from Processor -- it tests if links have already been extracted,
  if it's an HTML transaction, if the crawluri has a non-failure status
  message and if content length is non-null.
* src/java/org/archive/crawler/extractor/ExtractorDOC.java
* src/java/org/archive/crawler/extractor/ExtractorHTML.java
* src/java/org/archive/crawler/extractor/ExtractorPDF.java
  (innerProcess): Use new isHtmlTransactionContentToProcess utility
  method from Processor -- it tests if links have already been extracted,
  if it's an HTML transaction, if the crawluri has a non-failure status
  message and if content length is non-null. Also use new
  isExpectedMimeType utility method from Processor.
* src/java/org/archive/crawler/extractor/ExtractorHTTP.java
  Added test for non-zero status.
* src/java/org/archive/crawler/fetcher/FetchDNS.java
  Formatting.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
  Refactoring so it is easier to follow what's going on.
  Moved into the content-read finally block the setting of more crawluri
  properties, including end-of-fetch time and encoding.
  (doAbort): Added. Compounds abort operations -- close of recorder
  and adding of annotations.
  (close): Added recorder close.
* src/java/org/archive/crawler/framework/Processor.java
  Formatting. Added utility methods.
  (isContentToProcess, isHtmlTransactionContentToProcess,
  isExpectedMimeType): Added.
* src/java/org/archive/crawler/frontier/BdbFrontier.java
  Upper-casing.
* src/java/org/archive/io/RecordingInputStream.java
  Put log string construction inside of a test if loggable.
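
The two behavioural changes described in the commit message above (closing the recorder even when the read fails, and having extractors check for a successful fetch before running) can be illustrated roughly as below. This is a self-contained sketch, not the committed code: Recorder, Uri, fetch, and this isContentToProcess are simplified stand-ins invented for the example, the negative status value is arbitrary, and the exact conditions tested by the real Processor utility methods may differ.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration; these are not the Heritrix classes.
public class FetchGuardSketch {

    public static class Recorder {
        public boolean closed = false;
        public void close() { closed = true; }
    }

    public static class Uri {
        public int fetchStatus = 0;          // negative means the fetch failed
        public long contentLength = 0;
        public boolean linksExtracted = false;
    }

    // Reads content and closes the recorder in a finally block, so the
    // recorder is closed whether or not the read times out.
    public static void fetch(Uri uri, Recorder recorder, boolean simulateTimeout) {
        try {
            if (simulateTimeout) {
                throw new IOException("Read timed out");
            }
            uri.fetchStatus = 200;
            uri.contentLength = 1024;
        } catch (IOException e) {
            // Any negative value; real Heritrix status codes are not reproduced here.
            uri.fetchStatus = -1;
        } finally {
            recorder.close();
        }
    }

    // Guard an extractor would call before asking for a replay sequence.
    public static boolean isContentToProcess(Uri uri) {
        return !uri.linksExtracted
            && uri.fetchStatus > 0
            && uri.contentLength > 0;
    }

    public static void main(String[] args) {
        Uri uri = new Uri();
        Recorder recorder = new Recorder();
        fetch(uri, recorder, true);
        System.out.println("closed=" + recorder.closed
            + " process=" + isContentToProcess(uri));
    }
}
```

With a guard like this in front of each extractor, a URI whose fetch timed out never reaches the replay-sequence call that produced the alert in this report.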



Changes ( 3 )

Field          Old Value  Date              By
status_id      Open       2005-03-04 03:35  stack-sf
resolution_id  None       2005-03-04 03:35  stack-sf
close_date     -          2005-03-04 03:35  stack-sf