Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 ARCWriter length is wrong (But coherent gzip record) - ID: 1116456
Last Update: Comment added ( karl-ia )

Igor found ARC records whose declared length is wrong even
though each record sits inside a coherent gzip member.
The bad ARCs are listed below. Investigating; setting to
highest priority.

narapeth001.log:20050204080345 EVENT 1107504280
narapeth001.archive.org
/0/NARA-PEOT-2004-20041110010753-01820-crawling006.archive.org.arc.gz
ShortRecord coff 38701466
http://hyperion.gsfc.nasa.gov/Personnel/people/Newman/solveii/pics/newman_pics/misc_DC8_photos/FCAS_NMASS_inlet.JPG
(http://hyperion.gsfc.nasa.gov/Personnel/people/Newman/solveii/pics/newman_pics/misc_DC8_photos/FCAS_NMASS_inlet.JPG
128.183.102.165 20041110012422 image/jpeg 1790358)
want(1790359) got(731530)
narapeth001.log:20050204080429 EVENT 1107504324
narapeth001.archive.org
/0/NARA-PEOT-2004-20041110012423-00000-crawling006.archive.org.arc.gz
ShortRecord coff 197879
http://techreports.jpl.nasa.gov/2000/00-2361.pdf
(http://techreports.jpl.nasa.gov/2000/00-2361.pdf
137.78.111.199 20041110012426 application/pdf 1594491)
want(1594492) got(703130)
narapeth001.log:20050204080430 EVENT 1107504324
narapeth001.archive.org
/0/NARA-PEOT-2004-20041110012423-00001-crawling006.archive.org.arc.gz
ShortRecord coff 490741
http://trfic.jpl.nasa.gov/jers/scenes/1512810615196146.gif
(http://trfic.jpl.nasa.gov/jers/scenes/1512810615196146.gif
137.78.28.36 20041110012426 image/gif 619942)
want(619943) got(576227)
narapeth001.log:20050204080430 EVENT 1107504325
narapeth001.archive.org
/0/NARA-PEOT-2004-20041110012423-00002-crawling006.archive.org.arc.gz
ShortRecord coff 851738
http://www.dfrc.nasa.gov/DTRS/1978/PDF/H-916.pdf
(http://www.dfrc.nasa.gov/DTRS/1978/PDF/H-916.pdf
130.134.83.80 20041110012426 application/pdf 1772577)
want(1772578) got(51639)
narapeth001.log:20050204134522 EVENT 1107524776
narapeth001.archive.org
/1/NARA-PEOT-2004-20041104170359-04230-crawling006.archive.org.arc.gz
ShortRecord coff 3456225
http://lambda.gsfc.nasa.gov/data/iras/skymaps_and_related/faint_source_survey_plates/data/006.tar
(http://lambda.gsfc.nasa.gov/data/iras/skymaps_and_related/faint_source_survey_plates/data/006.tar
128.183.175.60 20041104164321 application/x-tar 503081272)
want(503081273) got(250862522)
narapeth001.log:20050204134627 EVENT 1107524842
narapeth001.archive.org
/1/NARA-PEOT-2004-20041104170917-04233-crawling006.archive.org.arc.gz
ShortRecord coff 10152785
http://microwave.msfc.nasa.gov/ims/data/camex4/SMART-R/c4gsmart_2001.270_daily.tar
(http://microwave.msfc.nasa.gov/ims/data/camex4/SMART-R/c4gsmart_2001.270_daily.tar
198.122.199.239 20041104163353 application/x-tar 734003555)
want(734003556) got(314317059)
narapeth001.log:20050204134803 EVENT 1107524937
narapeth001.archive.org
/1/NARA-PEOT-2004-20041104171723-04237-crawling006.archive.org.arc.gz
ShortRecord coff 64108858
http://ecco.jpl.nasa.gov/datasets/kf040o/kf040o_1995/n10day_28_37/Vave_08_08.06480_08880_240.cdf
(http://ecco.jpl.nasa.gov/datasets/kf040o/kf040o_1995/n10day_28_37/Vave_08_08.06480_08880_240.cdf
128.149.33.210 20041104171748 application/x-netcdf 148384047)
want(148384048) got(104060722)
narapeth001.log:20050204134831 EVENT 1107524966
narapeth001.archive.org
/1/NARA-PEOT-2004-20041104172125-04239-crawling006.archive.org.arc.gz
ShortRecord coff 52591402
http://mars3.jpl.nasa.gov/gallery/video/movies/Pathfinder320.mov
(http://mars3.jpl.nasa.gov/gallery/video/movies/Pathfinder320.mov
137.78.99.32 20041104172247 video/quicktime 6616472)
want(6616473) got(1721895)
narapeth002-bu.log:20050204131537 EVENT 1107522992
narapeth002-bu.archive.org
/2/NARA-PEOT-2004-20041102144946-03474-crawling004.archive.org.arc.gz
ShortRecord coff 153747
http://edcsgs16.cr.usgs.gov/landdaac/1KM/orbits/14/1997/12/07/ao14120797040013.l1b
(http://edcsgs16.cr.usgs.gov/landdaac/1KM/orbits/14/1997/12/07/ao14120797040013.l1b
152.61.128.25 20041102144053 application/octet-stream 194649983)
want(194649984) got(180339948)
narapeth002-bu.log:20050204131608 EVENT 1107523023
narapeth002-bu.archive.org
/2/NARA-PEOT-2004-20041102145222-03476-crawling004.archive.org.arc.gz
ShortRecord coff 73447521
http://goes.ngdc.noaa.gov/data/full/xrays/1981/X0238102.FIT
(http://goes.ngdc.noaa.gov/data/full/xrays/1981/X0238102.FIT
140.172.186.61 20041102145543 text/plain 6062674)
want(6062675) got(4488709)
narapeth002-bu.log:20050204131616 EVENT 1107523030
narapeth002-bu.archive.org
/2/NARA-PEOT-2004-20041102145435-03477-crawling004.archive.org.arc.gz
ShortRecord coff 46412333
http://www2.rma.usda.gov/FTP/Supplements/1997/97s35ytd.zip
(http://www2.rma.usda.gov/FTP/Supplements/1997/97s35ytd.zip
165.221.71.231 20041102145528 application/zip 19075241)
want(19075242) got(15490579)
narapeth002-bu.log:20050204131619 EVENT 1107523033
narapeth002-bu.archive.org
/2/NARA-PEOT-2004-20041102145514-03478-crawling004.archive.org.arc.gz
ShortRecord coff 8885167
http://nerslweb.cr.usgs.gov/M/1976NPRA010/61_76/SEGY/L17215.SGY
(http://nerslweb.cr.usgs.gov/M/1976NPRA010/61_76/SEGY/L17215.SGY
137.227.241.80 20041102145309 application/octet-stream 39465648)
want(39465649) got(12514164)
narapeth002-bu.log:20050204131621 EVENT 1107523036
narapeth002-bu.archive.org
/2/NARA-PEOT-2004-20041102145630-03479-crawling004.archive.org.arc.gz
ShortRecord coff 8886478
http://unicorn.csc.noaa.gov/docs/czic/HT168.C2_D45_1981/4317.pdf
(http://unicorn.csc.noaa.gov/docs/czic/HT168.C2_D45_1981/4317.pdf
216.26.244.92 20041102145418 application/pdf 16614077)
want(16614078) got(911441)
narapeth006-bu.log:20050204173756 EVENT 1107538731
narapeth006-bu.archive.org
/3/NARA-PEOT-2004-20041112120602-00069-crawling009.archive.org.arc.gz
ShortRecord coff 42773021
http://sfbpc.fws.gov/transmittaljb.pdf
(http://sfbpc.fws.gov/transmittaljb.pdf
164.159.171.13 20041112121700 application/pdf 157672)
want(157673) got(157672)
narapeth006-bu.log:20050204173812 EVENT 1107538747
narapeth006-bu.archive.org
/3/NARA-PEOT-2004-20041112121038-00071-crawling009.archive.org.arc.gz
ShortRecord coff 34507613
http://www.txed.uscourts.gov/navbar.swf
(http://www.txed.uscourts.gov/navbar.swf
207.41.16.103 20041112121659 application/octet-stream 39011)
want(39012) got(39011)


Submitted: Michael Stack ( stack-sf ) - 2005-02-04 20:34
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: Disk I/O
Group: 1.6.0
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 00:20
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-352 -- please add further
comments at that location.


Date: 2005-05-17 21:25
Sender: karl-ia

Assigning to me until I can come up with a coherent set of
tests for valid ARC files. Working with Brad on this issue.


Date: 2005-02-17 22:43
Sender: stack-sf (Project Admin)

Here is the list of bad ARCs from Brad:
ia00365:/tmp/brad/nara-validate/arc.log.

There are 1500 bad ARCs logged there.

I took a sample. I downloaded the last 20 arcs.

I ran gzip -t against them. All were coherent gzip files.
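
For reference, a check along the lines of gzip -t can be sketched in Python. This is a minimal sketch, not the tool used above: the function name count_gzip_members is hypothetical, and it assumes the whole file fits in memory. It walks a .arc.gz one gzip member at a time, so it also reports how many members (records) the file holds.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Decompress every gzip member in `data` and return how many there were.

    Raises zlib.error if a member is corrupt (bad header or CRC) and
    ValueError if the final member ends before its trailer.
    """
    members = 0
    while data:
        d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
        d.decompress(data)                # output discarded; zlib checks the CRC
        if not d.eof:
            raise ValueError(f"member {members + 1} is truncated")
        members += 1
        data = d.unused_data              # bytes that follow this member, if any
    return members


# Two concatenated members, the way a .arc.gz holds one record per member.
sample = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
print(count_gzip_members(sample))
```

A file that passes this check is only known to be a coherent gzip stream; the record lengths inside it still have to be checked separately.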

I found an ARC that ARCReader doesn't think is bad:

NARA-PEOT-2004-20041020100446-02056-crawling007.archive.org.arc.gz
NARA-PEOT-2004-20041020100446-02056-crawling007.archive.org.arc.gz

whereas Brad's test complains:

20050204200712 EVENT 1107547677
narapeth008.archive.org
/1/NARA-PEOT-2004-20041020132045-02110-crawling007.archive.org.arc.gz
LongRecord coff 19454644
http://woodrow.mpls.frb.fed.us/research/sr/sr10.ps
(http://woodrow.mpls.frb.fed.us/research/sr/sr10.ps
192.245.119.194 20041020132332 application/postscript 2935382)
want(2935383) got(22659710)

and

20050204200005 EVENT 1107547250
narapeth008.archive.org
/1/NARA-PEOT-2004-20041020100446-02056-crawling007.archive.org.arc.gz
Fixable Metaline at coff 36458278
(http://whitehousedrugpolicy.gov/publications/asp/subtopics.asp?language=All&topic=Regions%20&sub=State%20and%20Local&sort=date)
(http://whitehousedrugpolicy.gov/publications/asp/subtopics.asp?language=All&topic=Regions
&sub=State%20and%20Local&sort=date 198.77.71.95
20041020100706 text/html 370135)

Need to look into this more; confirm that the above are real
issues and fix ARCReader so it finds them too.
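
The two kinds of complaint above (a wrong length, and a metaline broken by a space in the URL) can be illustrated with a small Python sketch. It is not ARCReader's or Brad's code; parse_metaline and check_record are hypothetical names. It assumes the metaline is "URL IP DATE MIME LENGTH", that only the URL can contain a stray space, and that want is the declared length plus one for the trailing newline, which is what every log entry in this report shows.

```python
from typing import NamedTuple, Tuple


class Metaline(NamedTuple):
    url: str
    ip: str
    date: str
    mime: str
    length: int


def parse_metaline(line: str) -> Metaline:
    """Split from the right so a URL containing a space still gives 5 fields.

    Assumption: the IP, date, MIME type and length never contain spaces.
    """
    url, ip, date, mime, length = line.rstrip("\n").rsplit(" ", 4)
    return Metaline(url, ip, date, mime, int(length))


def check_record(record: bytes) -> Tuple[str, int, int]:
    """Compare the declared length with the bytes that follow the metaline.

    `record` is one decompressed gzip member. Returns (verdict, want, got).
    """
    head, sep, body = record.partition(b"\n")
    if not sep:
        raise ValueError("no newline after the metaline")
    meta = parse_metaline(head.decode("utf-8", errors="replace"))
    want = meta.length + 1  # declared length plus the trailing newline
    got = len(body)
    if got == want:
        return "ok", want, got
    return ("ShortRecord" if got < want else "LongRecord"), want, got


# Hypothetical record: the URL contains a space, the body is 5 bytes long.
record = b"http://example.org/a b 192.0.2.1 20041020100706 text/html 5\nhello\n"
print(check_record(record))
```

Splitting from the right is one way to treat a metaline like the one at coff 36458278 as fixable rather than fatal.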


Date: 2005-02-09 19:34
Sender: stack-sf (Project Admin)

'[ 1055789 ] ARCWriter 'Gap' errors should be more
prominent' is related: we keep writing ARCs even after
hitting a case where the declared record length is greater
than the amount actually read, i.e. we are allowing 'bad'
ARCs to be written.

Brad and Igor turned up 77 bad NARA ARCs in total. We need to
look at these to confirm that they are all bad in the same
way (that the last record is corrupt) before we can close
this issue.
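
One way a writer could refuse to produce such records is sketched below in Python. This is not Heritrix's ARCWriter; copy_exact and ShortReadError are hypothetical names, and the sketch assumes a seekable, uncompressed destination so that a partial record can be rolled back.

```python
import io


class ShortReadError(IOError):
    """The source ended before the declared number of bytes arrived."""


def copy_exact(src, dst, length, chunk=64 * 1024):
    """Copy exactly `length` bytes from src to dst.

    If src runs dry early, everything written for this record is removed
    from dst and ShortReadError is raised, so no short record is left behind.
    """
    start = dst.tell()
    remaining = length
    while remaining:
        buf = src.read(min(chunk, remaining))
        if not buf:
            dst.seek(start)
            dst.truncate()  # roll back the partial record
            raise ShortReadError(
                f"declared {length} bytes, source ended after {length - remaining}"
            )
        dst.write(buf)
        remaining -= len(buf)
    return length


# Hypothetical usage: the second copy declares 6 bytes but only 3 arrive.
archive = io.BytesIO()
copy_exact(io.BytesIO(b"abcdef"), archive, 6)
try:
    copy_exact(io.BytesIO(b"abc"), archive, 6)
except ShortReadError as err:
    print("rejected:", err)
```

Failing loudly here trades a lost record for an archive whose declared lengths can be trusted.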


Date: 2005-02-04 23:16
Sender: stack-sf (Project Admin)

I went through all the ARCs above. All but the last two fail
gzip -t: the last record is corrupt, probably because of an
OOME (OutOfMemoryError) or a killed crawler.

The last two work as ARC files. The last record is missing
a newline. av and arcreader can deal with records like this.

Lowering priority.

Igor and Brad are turning up more files to look at.
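
The split described in this comment shows up in the want/got figures: the last two entries are short by exactly one byte (the missing newline), the rest by far more. A minimal Python sketch of that triage follows; the triage function and its verdict strings are hypothetical, and it assumes the log text has the "coff N ... want(N) got(N)" shape shown in the report.

```python
import re

# One validator entry: record kind, compressed offset, then want/got counts.
ENTRY = re.compile(
    r"(ShortRecord|LongRecord) coff (\d+).*?want\((\d+)\)\s+got\((\d+)\)",
    re.DOTALL,
)


def triage(log_text):
    """Return (coff, missing_bytes, verdict) for each entry in the log text."""
    results = []
    for _kind, coff, want, got in ENTRY.findall(log_text):
        missing = int(want) - int(got)
        if missing == 1:
            verdict = "missing trailing newline"
        elif missing > 1:
            verdict = "truncated"
        elif missing < 0:
            verdict = "longer than declared"
        else:
            verdict = "lengths agree"
        results.append((int(coff), missing, verdict))
    return results


# Two entries copied from the report above.
sample = """ShortRecord coff 34507613
want(39012) got(39011)
ShortRecord coff 851738
want(1772578) got(51639)
"""
for row in triage(sample):
    print(row)
```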


Changes ( 8 )

Field              Old Value  Date              By
close_date         -          2005-12-02 17:14  stack-sf
status_id          Open       2005-12-02 17:14  stack-sf
artifact_group_id  None       2005-09-23 18:29  gojomo
assigned_to        stack-sf   2005-05-17 21:25  karl-ia
category_id        None       2005-05-04 17:40  karl-ia
assigned_to        nobody     2005-03-02 19:28  gojomo
priority           5          2005-02-10 01:20  stack-sf
priority           9          2005-02-04 23:16  stack-sf