Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback


Thanks for the quick reply!
The WARC files I’ve used were created (and compressed) by the Heritrix web crawler (version 3.1.1).

I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

I’ve also tried to compress the file using the scripts from the warc-tools project: 
warc2warc.py -Z my_archive.warc > my_archive.warc.gz
However, I still get the same result.
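
One way to check how a .warc.gz file was compressed is to count the gzip members it contains. Below is a minimal Python sketch of that check, assuming the whole file fits in memory; count_gzip_members is a hypothetical helper written for illustration and is not part of Wayback, Heritrix or warc-tools.

```python
import zlib


def count_gzip_members(data):
    """Count the concatenated gzip members in a byte string.

    Hypothetical diagnostic helper: a file compressed one record at a
    time holds one gzip member per record, while a file compressed in
    one pass holds a single member.
    """
    count = 0
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Bytes after the end of this member belong to the next one.
        data = decompressor.unused_data
    return count


if __name__ == "__main__":
    import sys

    # Usage: python count_members.py my_archive.warc.gz
    with open(sys.argv[1], "rb") as handle:
        print(count_gzip_members(handle.read()))
```

A count of 1 for a file holding many records would suggest the file was compressed as a whole rather than record by record.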

From the log it seems like Wayback is treating the name of the compressed warc file as a URL:

Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

Regards, Henrik

-----Original Message-----
From: Erik Hetzner [mailto:eri...@uc...] 
Sent: 20 December 2012 18:04
To: Henrik Ranthin
Cc: arc...@li...
Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

At Thu, 20 Dec 2012 11:57:48 +0000,
Henrik Ranthin wrote:
> 
> Hi!
> 
> I can't get Wayback to work with compressed warc files. Is it possible to make it work or is it required to uncompress the files?
> (I've tried uncompressing the warc.gz files and then everything works 
> as intended)

Hi Henrik,

How are your WARC files compressed? Do you compress them yourself, or are they compressed using a special (W)ARC tool? (You can’t compress them yourself using a normal gzip tool.)
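
To illustrate the difference: a compressed (W)ARC file is normally written as a series of gzip members, one per record, concatenated into a single file, whereas a plain gzip run over the whole file produces one member. Here is a rough Python sketch of the per-record layout. The compress_records helper and the placeholder records are made up for illustration; this is not how Heritrix or warc-tools are implemented, and it does not parse real WARC record boundaries.

```python
import gzip
import zlib


def compress_records(records):
    """Compress each record as its own gzip member and concatenate them.

    Hypothetical helper: 'records' is an iterable of byte strings, each
    assumed to hold one complete record.
    """
    return b"".join(gzip.compress(record) for record in records)


# Placeholder records, not valid WARC content.
records = [b"placeholder record one\r\n\r\n", b"placeholder record two\r\n\r\n"]
blob = compress_records(records)

# The concatenation is still a valid gzip file as a whole.
assert gzip.decompress(blob) == b"".join(records)

# A reader can also decompress a single record starting at its offset.
# Here the second record starts right after the first gzip member.
offset = len(gzip.compress(records[0]))
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
assert decompressor.decompress(blob[offset:]) == records[1]
```

The second check is the point of the layout: an index can store the offset of each record, and a reader can start decompressing there without reading the rest of the file.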

For more information, see, e.g., 

http://sourceforge.net/mailarchive/message.php?msg_id=28183532

best, Erik