Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 ReplayCharSequenceFactory: Unexpected response body offset - ID: 1209665
Last Update: Comment added ( karl-ia )

Hi,

Last night I got a Heritrix alert that I honestly do not
understand.

As far as I can see, it has something to do with
ReplayCharSequenceFactory's implementation-specific
treatment of a response's offset to its message body
(variable "responseBodyStart"). Is this somehow related
to bugs #1155641 and #922080?

Perhaps you can explain what's going wrong.

All the best,
Christian


Title: Problem occured processing
'http://www.schoenen-dunk.de/basketball/bbl/statistiken.php?action=results&file=ligasued0405.l98&endtab=30&st=30&tabtype=4'
Time: May 27, 2005 00:10:38 GMT
Level: SEVERE
Message:

Problem java.lang.IllegalArgumentException: Unexpected
response body offset of 76829. The way this class
works, it assumes the HTTP headers are in buffer: 65536
occured when trying to process
'http://www.schoenen-dunk.de/basketball/bbl/statistiken.php?action=results&file=ligasued0405.l98&endtab=30&st=30&tabtype=4'
at step PROCESSING in ExtractorHTML


Associated Throwable:
java.lang.IllegalArgumentException: Unexpected response
body offset of 76829. The way this class works, it
assumes the HTTP headers are in buffer: 65536

Message:
Unexpected response body offset of 76829. The way
this class works, it assumes the HTTP headers are in
buffer: 65536

Stacktrace:
java.lang.IllegalArgumentException: Unexpected response
body offset of 76829. The way this class works, it
assumes the HTTP headers are in buffer: 65536
    at org.archive.io.ReplayCharSequenceFactory.checkParameters(ReplayCharSequenceFactory.java:209)
    at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:124)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:416)
    at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:291)
    at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:299)
    at org.archive.crawler.extractor.ExtractorHTML.innerProcess(ExtractorHTML.java:432)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:283)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)


Submitted: Christian Kohlschütter ( ck-heritrix ) - 2005-05-27 07:36
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:53
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-422 -- please add further
comments at that location.


Date: 2005-11-08 01:18
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

None of the URLs referenced above currently reproduces the
problem. So, instead of throwing an exception when the
buffer is exceeded, we now just log a WARNING.

Setting the recorder-in buffer to a tiny value (64 bytes)
caused most URIs to trigger this warning, but they
otherwise appeared to be processed normally.

So I'm considering this fixed. At some future date we may
want to remove the WARNING too, but it should occur so
rarely with default settings (only when headers are >64K)
that leaving it in for now should be harmless, and it gives
us a chance to look at some more of these extreme cases.

Fix for [ 1209665 ] ReplayCharSequenceFactory: Unexpected
response body offset
* ReplayCharSequenceFactory.java
change checkParameters to only WARN on
headers-beyond-initial-buffer; testing suggests this
overflow is harmless
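
For illustration only, the behavioural change described above can be sketched as follows. This is not the Heritrix source: the class name OffsetCheck, the method name and signature, and the message wording are all assumptions. The sketch only shows an offset beyond the in-memory buffer being logged at WARNING level instead of raising an IllegalArgumentException.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for the check in
// ReplayCharSequenceFactory.checkParameters(); the names and the signature
// are assumptions, not the real Heritrix code.
final class OffsetCheck {
    private static final Logger LOGGER =
            Logger.getLogger(OffsetCheck.class.getName());

    /**
     * Returns true if the response body starts inside the in-memory buffer.
     * If it starts beyond the buffer, a WARNING is logged and false is
     * returned, so the caller can carry on instead of failing.
     */
    static boolean bodyStartsInBuffer(int bufferLength, long responseBodyStart) {
        if (responseBodyStart < 0) {
            // A negative offset is still treated as a programming error.
            throw new IllegalArgumentException(
                    "Negative response body offset: " + responseBodyStart);
        }
        if (responseBodyStart > bufferLength) {
            // Previously this case threw; now it is only reported.
            LOGGER.warning("Unexpected response body offset of "
                    + responseBodyStart + "; in-memory buffer holds "
                    + bufferLength + " bytes");
            return false;
        }
        return true;
    }
}
```

With the values from the original report, OffsetCheck.bodyStartsInBuffer(65536, 76829) logs a warning and returns false rather than aborting the extractor.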

Assigning to Karl for verification/close.


Date: 2005-09-20 01:16
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

This error suggests that the CRLFCRLF end-of-headers wasn't
found within the in-memory buffer of the 'recorded' HTTP
response. (A couple of the URLs that triggered the error
don't look like they have more than 64K of headers now, but
it's possible they did at the time of the errors.
Alternatively, perhaps the headers weren't ended properly
(CRCR or LFLF instead of CRLFCRLF) and we're not adapting.)
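
To make the "not adapting" possibility concrete, here is a minimal sketch of an end-of-headers scan that accepts CRLFCRLF as well as the malformed LFLF and CRCR terminators. HeaderEnd and findBodyStart are hypothetical names, not Heritrix code, and nothing here claims the crawler locates the header boundary this way.

```java
// Hypothetical helper, not part of Heritrix: locates the first byte of the
// message body in a recorded HTTP response.
final class HeaderEnd {
    /**
     * Scans the first 'length' bytes of buf for a blank line and returns
     * the index of the first body byte, or -1 if no terminator was found.
     */
    static int findBodyStart(byte[] buf, int length) {
        for (int i = 0; i < length; i++) {
            if (buf[i] == '\n') {
                // LF LF: bare line feeds.
                if (i + 1 < length && buf[i + 1] == '\n') {
                    return i + 2;
                }
                // LF CR LF: the tail of a proper CR LF CR LF.
                if (i + 2 < length && buf[i + 1] == '\r' && buf[i + 2] == '\n') {
                    return i + 3;
                }
            } else if (buf[i] == '\r') {
                // CR CR: bare carriage returns.
                if (i + 1 < length && buf[i + 1] == '\r') {
                    return i + 2;
                }
            }
        }
        return -1;
    }
}
```

A result of -1 for a full 65536-byte buffer would correspond to the situation described above, where the terminator was not found within the in-memory buffer.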

Looking at ReplayCharSequenceFactory.checkParameters(), I
don't know why we require the headers to be within the
in-memory buffer. I don't immediately see anywhere that this
is a hard-and-fast requirement, and even if it is, it seems
we should be able to accept larger headers, and just skip
past them into the disk backing, without much problem.
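
As a rough sketch of what "skipping past the headers into the disk backing" could look like: the class SpillingReplay below is hypothetical, and it assumes a layout in which the first recorded bytes sit in an in-memory array and all later bytes are in a backing file starting at file position 0. The real recorder's layout may differ.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration, not Heritrix code. Assumed layout: the first
// prefix.length recorded bytes are in memory, every later byte is in the
// backing file, beginning at file position 0.
final class SpillingReplay {
    private final byte[] prefix;
    private final RandomAccessFile backing;
    private final long bodyStart;

    SpillingReplay(byte[] prefix, RandomAccessFile backing, long bodyStart) {
        this.prefix = prefix;
        this.backing = backing;
        this.bodyStart = bodyStart;
    }

    /**
     * Returns body byte i as a value in 0..255, or -1 past the end of the
     * recording. The body may begin in memory or in the backing file.
     */
    int bodyByteAt(long i) throws IOException {
        long absolute = bodyStart + i;
        if (absolute < prefix.length) {
            return prefix[(int) absolute] & 0xff;
        }
        // Headers longer than the buffer simply shift the file offset.
        backing.seek(absolute - prefix.length);
        return backing.read();
    }
}
```

In this arrangement a responseBodyStart larger than the buffer needs no special case: the offset is translated into a position in the backing file.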

(Also looking at
ReplayCharSequenceFactory.checkParameters(), the limit of
Integer.MAX_VALUE on file sizes is a limit I'd like to
remove, because we use longs throughout much of our IO code
to accommodate >2GB resources. I suspect the limit is here
because the CharBuffer used for multibyte encodings only
accepts int lengths/indexes.)
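
The int restriction mentioned above is visible in the standard API: CharBuffer.allocate takes an int capacity and CharSequence.length() returns an int. A minimal sketch of the kind of guard this forces on code that tracks sizes as longs follows; LengthLimit and its methods are hypothetical names, not the Heritrix check itself.

```java
import java.nio.CharBuffer;

// Hypothetical illustration of why a long-sized resource has to be
// rejected (or handled differently) before it can back a CharBuffer.
final class LengthLimit {
    /** Narrows a long length to int, refusing values that would overflow. */
    static int toCharBufferCapacity(long length) {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "Length " + length + " does not fit in an int");
        }
        return (int) length;
    }

    /** Allocates a CharBuffer for a size that is tracked as a long. */
    static CharBuffer allocate(long length) {
        // CharBuffer.allocate only accepts an int capacity.
        return CharBuffer.allocate(toCharBufferCapacity(length));
    }
}
```

Removing the limit would therefore mean more than deleting the check; anything exposed as a single CharBuffer or CharSequence stays bounded by int indexes.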



Date: 2005-06-13 00:36
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Just got a bunch of these same exceptions. 3 came together.

Title: Problem occured processing
'http://www.racing-index.com/sponsored/index.php?section=serve&id=36&keyword=services&output=js'
Time: Jun. 13, 2005 00:37:39 GMT
Level: SEVERE
Message:

Problem java.lang.IllegalArgumentException: Unexpected
response body offset of 108615. The way this class works,
it assumes the HTTP headers are in buffer: 65536 occured
when trying to process
'http://www.racing-index.com/sponsored/index.php?section=serve&id=36&keyword=services&output=js'
at step ABOUT_TO_BEGIN_PROCESSOR in ExtractorJS


Associated Throwable: java.lang.IllegalArgumentException:
Unexpected response body offset of 108615. The way this
class works, it assumes the HTTP headers are in buffer: 65536

Message:
Unexpected response body offset of 108615. The way this
class works, it assumes the HTTP headers are in buffer: 65536

Stacktrace:
java.lang.IllegalArgumentException: Unexpected response body
offset of 108615. The way this class works, it assumes the
HTTP headers are in buffer: 65536
    at org.archive.io.ReplayCharSequenceFactory.checkParameters(ReplayCharSequenceFactory.java:209)
    at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:124)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:412)
    at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:293)
    at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:299)
    at org.archive.crawler.extractor.ExtractorJS.innerProcess(ExtractorJS.java:103)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:286)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)


Title: Problem occured processing
'http://www.racing-index.com/sponsored/index.php?section=serve&id=36&keyword=news&output=js'
Time: Jun. 13, 2005 00:37:23 GMT
Level: SEVERE
Message:

Problem java.lang.IllegalArgumentException: Unexpected
response body offset of 74353. The way this class works, it
assumes the HTTP headers are in buffer: 65536 occured when
trying to process
'http://www.racing-index.com/sponsored/index.php?section=serve&id=36&keyword=news&output=js'
at step ABOUT_TO_BEGIN_PROCESSOR in ExtractorJS


Associated Throwable: java.lang.IllegalArgumentException:
Unexpected response body offset of 74353. The way this
class works, it assumes the HTTP headers are in buffer: 65536

Message:
Unexpected response body offset of 74353. The way this
class works, it assumes the HTTP headers are in buffer: 65536

Stacktrace:
java.lang.IllegalArgumentException: Unexpected response body
offset of 74353. The way this class works, it assumes the
HTTP headers are in buffer: 65536
    at org.archive.io.ReplayCharSequenceFactory.checkParameters(ReplayCharSequenceFactory.java:209)
    at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:124)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:412)
    at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:293)
    at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:299)
    at org.archive.crawler.extractor.ExtractorJS.innerProcess(ExtractorJS.java:103)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:286)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)



Title: Problem occured processing
'http://www.racing-index.com/sponsored/index.php?section=serve&id=36&keyword=systems&output=js'
Time: Jun. 13, 2005 00:34:39 GMT
Level: SEVERE
Message:

Problem java.lang.IllegalArgumentException: Unexpected
response body offset of 69375. The way this class works, it
assumes the HTTP headers are in buffer: 65536 occured when
trying to process
'http://www.racing-index.com/sponsored/index.php?section=serve&id=36&keyword=systems&output=js'
at step ABOUT_TO_BEGIN_PROCESSOR in ExtractorJS


Associated Throwable: java.lang.IllegalArgumentException:
Unexpected response body offset of 69375. The way this
class works, it assumes the HTTP headers are in buffer: 65536

Message:
Unexpected response body offset of 69375. The way this
class works, it assumes the HTTP headers are in buffer: 65536

Stacktrace:
java.lang.IllegalArgumentException: Unexpected response body
offset of 69375. The way this class works, it assumes the
HTTP headers are in buffer: 65536
    at org.archive.io.ReplayCharSequenceFactory.checkParameters(ReplayCharSequenceFactory.java:209)
    at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:124)
    at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:412)
    at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:293)
    at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:299)
    at org.archive.crawler.extractor.ExtractorJS.innerProcess(ExtractorJS.java:103)
    at org.archive.crawler.framework.Processor.process(Processor.java:103)
    at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:286)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)




Attached File

No Files Currently Attached

Changes ( 7 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:14  stack-sf
close_date         -          2005-12-02 17:14  stack-sf
assigned_to        gojomo     2005-11-08 01:18  gojomo
artifact_group_id  None       2005-11-02 20:03  gojomo
assigned_to        nobody     2005-11-02 19:20  gojomo
priority           6          2005-11-02 19:20  gojomo
priority           5          2005-09-20 01:16  gojomo