Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

Hi Henrik,

At Fri, 21 Dec 2012 10:52:24 +0000,
Henrik Ranthin wrote:
> 
> Thanks for the quick reply!
> The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).
> 
> I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

Yes, they should be. Sorry for the distraction, but badly gzipped WARC
files are often the problem.
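
In case it is useful for ruling this out, here is a rough sketch. It
assumes "badly gzipped" means the file was compressed as one single gzip
stream instead of one gzip member per record; that is only one possible
reading. The helper name count_gzip_members is made up for illustration,
and it expects the whole file in memory, so it is only meant for a quick
check on a small file.

```python
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in a byte string."""
    count = 0
    while data:
        # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Whatever follows the end of this member is the next member (if any).
        data = decomp.unused_data
    return count


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(count_gzip_members(f.read()))
```

A record-compressed file with many records should give a count well
above 1; a count of 1 for a multi-record file would suggest the whole
file was gzipped in one go.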
> 
> I’ve also tried to compress the file using the scripts from the warc-tools project: 
> warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> However, I still get the same result.
> 
> From the log it seems like Wayback is treating the name of the compressed warc file as a URL:
> 
> Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

I’m just guessing here, but the WARC files I see don’t start with
filedesc://... records; only ARC files do. Is this an ARC file that
was named with .warc.gz rather than .arc.gz?
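
A quick way to check, as a rough sketch: ARC files open with a
filedesc:// version record, while WARC records begin with a WARC/
version line, so looking at the first decompressed bytes should be
enough. The function name sniff_archive_format is hypothetical, not
part of Wayback or warc-tools.

```python
import gzip


def sniff_archive_format(path):
    """Return 'arc', 'warc', or 'unknown' from the first bytes of the file."""
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"

    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        head = f.read(11)  # len(b"filedesc://") == 11

    if head.startswith(b"filedesc://"):
        return "arc"
    if head.startswith(b"WARC/"):
        return "warc"
    return "unknown"


if __name__ == "__main__":
    import sys

    print(sniff_archive_format(sys.argv[1]))
```

If this reports "arc" for a file named .warc.gz, the file name is the
likely culprit rather than the compression.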

best, Erik