From: Henrik R. <Hen...@ap...> - 2012-12-21 10:52:36
Thanks for the quick reply!

The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1). I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

I’ve also tried to compress the file using the scripts from the warc-tools project:

    warc2warc.py -Z my_archive.warc > my_archive.warc.gz

However, I still get the same result. From the log it seems like Wayback is treating the name of the compressed warc file as a URL:

Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

Regards,
Henrik

-----Original Message-----
From: Erik Hetzner [mailto:eri...@uc...]
Sent: 20 December 2012 18:04
To: Henrik Ranthin
Cc: arc...@li...
Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

At Thu, 20 Dec 2012 11:57:48 +0000,
Henrik Ranthin wrote:
>
> Hi!
>
> I can't get Wayback to work with compressed warc files. Is it possible to make it work or is it required to uncompress the files?
> (I've tried uncompressing the warc.gz files and then everything works
> as intended)

Hi Henrik,

How are your WARC files compressed? Do you compress them yourself, or are they compressed using a special (W)ARC tool? (You can’t compress them yourself using a normal gzip tool.)

For more information, see, e.g.,

http://sourceforge.net/mailarchive/message.php?msg_id=28183532

best,
Erik
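
A note on the "normal gzip tool" remark above: as far as I understand it, WARC-aware tools gzip each record separately, so a .warc.gz file is a concatenation of gzip members, while running plain gzip over a whole file produces a single member. The following is a minimal Python sketch for checking which kind of file you have. The helper names count_gzip_members and gzip_per_record are made up for this illustration and are not part of Wayback, Heritrix, or warc-tools. It reads the whole input into memory, so it is only meant for small test files, and a multi-member result does not by itself prove that Wayback will accept the file.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members concatenated in `data` (hypothetical helper)."""
    members = 0
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes left over after a finished member start the next member.
        data = decomp.unused_data
    return members


def gzip_per_record(records) -> bytes:
    """Compress each record as its own gzip member and concatenate them.

    Illustration only: the records here are arbitrary byte strings, not
    validated WARC records.
    """
    return b"".join(gzip.compress(record) for record in records)


if __name__ == "__main__":
    # Placeholder payloads standing in for two WARC records.
    records = [b"first record\r\n\r\n", b"second record\r\n\r\n"]

    whole_file = gzip.compress(b"".join(records))  # like plain `gzip file`
    per_record = gzip_per_record(records)          # one member per record

    print("plain gzip members:", count_gzip_members(whole_file))
    print("per-record members:", count_gzip_members(per_record))
```

To inspect a real file, one could pass open("my_archive.warc.gz", "rb").read() to count_gzip_members. A count of 1 for an archive that holds many records would suggest it was compressed as a whole rather than per record.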