Karl was running jobs looking at issue where we OOME if
multiple jobs are run in series. Concurrently he was
reviewing arc validation tools.
He found the following problem where an ARC had a
record that had nothing but the ARC Record meta data
line -- and it was not the last record in the ARC.
He was crawling against infiniteurl so records are
small and things are running faster than usual.
Here are the records at the end of the ARC. A bunch
got written in a flurry of activity -- all at the same
moment:
20050513213319 127.0.0.1
http://14-31-50-738.0.localhost:8888/infiniteurl/infinity/0-14-31-56-117.ht
ml
text/html - LC2HUXXNPG5UB43RSDDB6WISSMHCFFW3 80593 1767
bad-arc
WARNING Premature EOF before end-of-record:
{archive-version=1.1, ip-address=127.0.0.1,
archive-offset=81290,
url=http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-89
3.html,
archive-date=20050513213319, content-type=text/html,
archive-length=1767}
20050513213319 127.0.0.1
http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht
ml
text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 81290 1767
bad-arc
20050513213319 127.0.0.1
http://1.localhost:8888/infiniteurl/infinity/7-14-31-35-13.html
text/html - GI6CLSVDJEIANAEB7FS4J37OSXIOXPXY 81404 1753
bad-arc
20050513213319 127.0.0.1
http://14-32-11-415.0.localhost:8888/infiniteurl/infinity/1-14-32-15-690.ht
ml
text/html - S5TXPU5DLDNJ546WSFAKVRMGV23C6HBL 82095 1767
bad-arc
20050513213319 127.0.0.1
http://14-33-03-160.0.localhost:8888/infiniteurl/infinity/9.html
text/html - ROFJQPYHC35DSBP5ULDOBHOE3SS7HY35 82793 1754
bad-arc
20050513213319 127.0.0.1
http://14-31-40-393.0.localhost:8888/infiniteurl/infinity/4-14-31-48-864.ht
ml
text/html - 3BBTYOAVFVHNJF4KGBNO7EUFSTMSOCLJ 83484 1767
bad-arc
20050513213319 127.0.0.1
http://14-31-30-333.0.localhost:8888/infiniteurl/infinity/0-14-31-35-563.ht
ml
text/html - ZA73PPXZT6IY5C27QSDOQ3RI5ZQHZW7N 84181 1767
bad-arc
20050513213319 127.0.0.1
http://14-32-01-60.1.localhost:8888/infiniteurl/infinity/4-14-32-06-221.htm
l
text/html - 2WMTZF7EKMM6CTTF33TIP4CC4QGSMV5G 84877 1766
bad-arc
20050513213319 127.0.0.1
http://14-32-52-924.1.localhost:8888/infiniteurl/infinity/2-14-32-54-731.ht
ml
text/html - 7IDQKMFTLYIOLVVGJ5Y2YDYNYCGGGGAM 85575 1767
bad-arc
The second record in the above has a warning about its
inconsistent length.
Here is how things look when we zcat:
....
<a
href="/infiniteurl/infinity/7-14-33-19-222.html">7</a>,
<a
href="/infiniteurl/infinity/8-14-33-19-222.html">8</a>,
<a
href="/infiniteurl/infinity/9-14-33-19-222.html">9</a>
</body>
</html>
http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht
ml
127.0.0.1 20050513213319 text/html 1767
http://1.localhost:8888/infiniteurl/infinity/7-14-31-35-13.html
127.0.0.1 20050513213319 text/html 1753
HTTP/1.1 200 OK^M
Date: Fri, 13 May 2005 21:33:19 GMT^M
Server: Jetty/4.2.19 (Linux/2.4.27 i386 java/1.5.0_01)^M
Connection: close^M
Content-Type: text/html;charset=ISO-8859-1^M
^M
...
Karl had the old logs around and we found following in
Heritrix out:
Followup: lines 533-542 of
labcrawl04:/1/webcrawl-test/1/webcrawl-test/2-heritrix/heritrix-1.5.0-20050
5122111/heritrix_out.log:
<<<
java.lang.OutOfMemoryError: Java heap space
#28
http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht
ml
(0 attempts)
LLLLLL
http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/1.html
Current processor: Archiver
ACTIVE for 2s441ms
Where: ABOUT_TO_BEGIN_PROCESSOR for 2430ms
java.lang.OutOfMemoryError: Java heap space
>>>>>>
Exception in thread "StartNextJob"
java.lang.OutOfMemoryError: Java heap space
From Karl, "So it looks like this record was aborted
just after the header line was written."
The writing of such incomplete records should never
happen, not even in OOME situations.
This issue is about adding to arc writer the handling
of OOMEs (At least, the file should be marked invalid).
Below are original notes from Karl
Karl Thiessen <karl@archive.org> writes:
>> In my testing, I found a bad ARC record created with
a header line but no data at all. Shell excerpt
fillows, with commentary after:
>>
>> === begin shell excerpt ==
>> [1] 10:39:24 (pts/5)
karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1026$ sh
$HERITRIX_HOME/bin/arcreader -s bad-arc.gz > /dev/null
>> Exception in thread "main" java.io.IOException:
Premature EOF before end-of-record:
{archive-version=1.1, ip-address=127.0.0.1,
archive-offset=81290,
url=http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-89
3.html,
archive-date=20050513213319, content-type=text/html,
archive-length=1767}
>> at
org.archive.io.arc.ARCRecord.read(ARCRecord.java:416)
>> at
org.archive.io.arc.ARCRecord.skip(ARCRecord.java:450)
>> at
org.archive.io.arc.ARCRecord.skip(ARCRecord.java:474)
>> at
org.archive.io.arc.ARCRecord.close(ARCRecord.java:352)
>> at
org.archive.io.arc.ARCReader.outputARCRecordCdx(ARCReader.java:989)
>> at
org.archive.io.arc.ARCReader.cdxOutput(ARCReader.java:918)
>> at
org.archive.io.arc.ARCReader.output(ARCReader.java:841)
>> at
org.archive.io.arc.ARCReader.main(ARCReader.java:1169)
>>
>> [1] 10:39:32 (pts/5)
karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1027$
/home/brad/projects/web/validate-arc-files bad-arc.gz
>> EVENT 1116610793 labcrawl04.archive.org bad-arc.gz
ShortRecord coff 81290
http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht
ml
(http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.h
tml
127.0.0.1 20050513213319 text/html 1767) want(1768) got(0)
>>
>> [1] 10:39:53 (pts/5)
karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1028$ ls
-l IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz
>> lrwxrwxrwx 1 karl webcrawl 180 May 20
10:35
IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz
->
/1/webcrawl-test/1-crawldata/test-results/heritrix-1.5.0-200505122111-20050
513T141347/t1123230/default-20050513212748787/arcs/IAH-20050513213215-00004
-labcrawl04.archive.org.arc.gz
>>
>> [0] 10:40:06 (pts/5)
karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1029$
md5sum
IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz
bad-arc.gz
>> 571e41cc1c0f90e7a2f9203372ad662a
IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz
>> 571e41cc1c0f90e7a2f9203372ad662a bad-arc.gz
>>
>> [0] 10:40:44 (pts/5)
karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1030$
zgrep -A 10 -E
'^http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.
html'
bad-arc.gz
>>
http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht
ml
127.0.0.1 20050513213319 text/html 1767
>>
http://1.localhost:8888/infiniteurl/infinity/7-14-31-35-13.html
127.0.0.1 20050513213319 text/html 1753
>> HTTP/1.1 200 OK
>> Date: Fri, 13 May 2005 21:33:19 GMT
>> Server: Jetty/4.2.19 (Linux/2.4.27 i386 java/1.5.0_01)
>> Connection: close
>> Content-Type: text/html;charset=ISO-8859-1
>>
>>
>>
>>
>> [0] 10:45:16 (pts/5)
karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1031$
>>
>> === end shell excerpt ===
>>
>>
>> The thing to notice is the two lines beginning
http:// right on top of
>> one another in the zgrep output. The top one is a
perfectly fine
>> header line, with presumably 1767 bytes opf data to
follow, but the
>> data isn't there. It is emphatically *not* the last
record in the arc
>> file; the beginning of the next record is clearly
visible. Doesn't
>> look liek this was tthe result of a job termination
or crawler kill.
>>
>> Any ideas what would cause this? Could it be part
of the pattern that
>> leads to our high-prority bug, [ 1116456 ] ARCWriter
length is wrong
>> (But coherent gzip record) ?
>>
>> Thanks for any input,
>> --K.
Michael Stack
Disk I/O
1.6.0
Public
|
Date: 2007-03-14 00:52
|
|
Date: 2005-09-20 00:19 Logged In: YES |
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-12-02 17:14 | stack-sf |
| close_date | - | 2005-12-02 17:14 | stack-sf |
| artifact_group_id | None | 2005-09-23 18:29 | gojomo |
Copyright © 2010 Geeknet, Inc. All rights reserved. Terms of Use