Probably as a side effect of setting http-recorder on
CrawlURIs even if they had little or no content, I've
seen the following alert on a recent crawl:
Problem java.lang.IllegalArgumentException: Illegal
response body offset of 253 whereas size is only 0
occured when trying to process
'http://news.xinhuanet.com/english/2005-03/03/content_2642635.htm'
at step ABOUT_TO_BEGIN_PROCESSOR in ExtractorHTML
Associated Throwable:
java.lang.IllegalArgumentException: Illegal response
body offset of 253 whereas size is only 0
Message:
Illegal response body offset of 253 whereas size is
only 0
Stacktrace:
java.lang.IllegalArgumentException: Illegal response body offset of 253 whereas size is only 0
at org.archive.io.ReplayCharSequenceFactory.checkParameters(ReplayCharSequenceFactory.java:234)
at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:131)
at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:435)
at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:294)
at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:318)
at org.archive.crawler.extractor.ExtractorHTML.innerProcess(ExtractorHTML.java:355)
at org.archive.crawler.framework.Processor.process(Processor.java:102)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)
--
Looking back into local-errors.log, I see:
2005-03-03T07:25:28.909Z -5 - http://news.xinhuanet.com/english/2005-03/03/content_2642635.htm LRL http://news.google.com/nwshp?hl=en&gl=us text/html #032 - - err=java.lang.IllegalArgumentException
java.io.IOException: Socket timed out after 1200000ms: Read timed out
at org.archive.io.RecordingInputStream.readFullyOrUntil(RecordingInputStream.java:216)
at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:336)
at org.archive.crawler.framework.Processor.process(Processor.java:102)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:273)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:143)
--
We should reevaluate whether setting the HttpRecorder early is
warranted: are later steps counting on the presence of a
recorder to indicate that a specific amount of content is
available? Perhaps Extractors and others asking for a
ReplayCharSequence need to be choosier, or the recorder
needs cleanup in certain early-exit situations. Ideally,
getReplayCharSequence() would always return something,
even if only an empty or limited sequence.
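
To illustrate the "always return something" idea, here is a minimal sketch. It is not Heritrix code: ReplayGuard and bodyOrEmpty are hypothetical names, and a plain byte array stands in for the recorder's buffer. When the claimed body offset lies at or beyond the bytes actually recorded (as with offset 253 against size 0 above), it hands back an empty sequence instead of throwing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Heritrix. */
final class ReplayGuard {
    private ReplayGuard() {}

    /**
     * Decodes the response body that starts at bodyOffset, or returns an
     * empty sequence when nothing was recorded or the offset is out of range
     * (for example, after a read timeout that left the recording empty).
     */
    static CharSequence bodyOrEmpty(byte[] recorded, int bodyOffset, Charset charset) {
        if (recorded == null || bodyOffset < 0 || bodyOffset >= recorded.length) {
            return "";
        }
        return new String(recorded, bodyOffset, recorded.length - bodyOffset, charset);
    }

    public static void main(String[] args) {
        // The failing case from this report: offset 253, nothing recorded.
        CharSequence empty = bodyOrEmpty(new byte[0], 253, StandardCharsets.UTF_8);
        System.out.println(empty.length());
    }
}
```

The alternative raised above, making each Extractor check that content was actually fetched before asking for a sequence, would avoid the call entirely; the guard here only keeps a single missed check from surfacing as an alert.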
Michael Stack
Date: 2007-03-14 00:21
Date: 2005-03-04 03:35 Logged In: YES |
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-03-04 03:35 | stack-sf |
| resolution_id | None | 2005-03-04 03:35 | stack-sf |
| close_date | - | 2005-03-04 03:35 | stack-sf |