Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

Hi!

Nope, it is a WARC file. I have attached a sample WARC file. Maybe you can have a quick look at it and see if anything looks strange?

Regards, Henrik

-----Original Message-----
From: Erik Hetzner [mailto:eri...@uc...] 
Sent: den 21 december 2012 22:43
To: Henrik Ranthin
Cc: arc...@li...
Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

Hi Henrik,

At Fri, 21 Dec 2012 10:52:24 +0000,
Henrik Ranthin wrote:
> 
> Thanks for the quick reply!
> The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).
> 
> I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

Yes, they should be. Sorry for the distraction, but badly gzipped WARC files are often the problem.
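
One common form of "badly gzipped" is a WARC that was compressed as a single gzip stream instead of one gzip member per record, which breaks seeking to a record by offset. The sketch below counts the gzip members in a file so the two cases can be told apart. It is only an illustration: the function name count_gzip_members is made up for this example, and it reads the whole file into memory, so it is only suitable for small samples.

```python
import zlib


def count_gzip_members(path):
    """Count concatenated gzip members in a file (hypothetical helper).

    A record-at-a-time compressed WARC should have roughly one member
    per record; a file gzipped in one go has exactly one member.
    """
    with open(path, "rb") as f:
        data = f.read()  # fine for a small sample, not for multi-GB files

    count = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Whatever follows the finished member is the next member, if any.
        data = d.unused_data
    return count
```

A result of 1 for a file that holds many records would suggest the whole file was compressed as one stream.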
> 
> I’ve also tried to compress the file using the scripts from the warc-tools project:
> warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> However, I still get the same result.
> 
> From the log it seems like Wayback is treating the name of the compressed warc file as a URL:
> 
> Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

I’m just guessing here, but the WARC files I see don’t start with filedesc://... records; only ARC files do. Is this an ARC file that was named with .warc.gz rather than .arc.gz?
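
A quick way to check is to look at the first bytes of the (decompressed) file: an ARC file begins with a filedesc:// line, while a WARC file begins with a WARC/ version line. A minimal sketch, where the function name sniff_archive_format is invented for this example:

```python
import gzip


def sniff_archive_format(path):
    """Return 'arc', 'warc', or 'unknown' from the file's first bytes.

    Hypothetical helper: it ignores the file extension entirely and
    only inspects the start of the first record.
    """
    # Detect gzip by its two magic bytes rather than by the file name.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"

    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        head = f.read(16)

    if head.startswith(b"filedesc://"):
        return "arc"
    if head.startswith(b"WARC/"):
        return "warc"
    return "unknown"
```

If this reports 'arc' for a file named .warc.gz, the file is probably an ARC file with the wrong extension.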

best, Erik