From: Erik H. <eri...@uc...> - 2012-12-21 21:43:22
Hi Henrik,

At Fri, 21 Dec 2012 10:52:24 +0000,
Henrik Ranthin wrote:
>
> Thanks for the quick reply!
> The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).
>
> I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

Yes, they should be. Sorry for the distraction, but badly gzipped WARC
files are often the problem.

>
> I’ve also tried to compress the file using the scripts from the warc-tools project:
> warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> However, I still get the same result.
>
> From the log it seems like Wayback is treating the name of the compressed warc file as an URL:
>
> Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

I’m just guessing here, but the WARC files I see don’t start with
filedesc://... records; only the ARC files. Is this an ARC file that
was named with .warc.gz rather than .arc.gz?

best, Erik
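One quick way to test the ARC-versus-WARC guess is to look at the first bytes of the (decompressed) file: WARC records begin with a "WARC/" version line, while ARC files begin with a "filedesc://" version block. The following is only a minimal sketch; the helper name sniff_archive_format is hypothetical and is not part of Wayback, Heritrix, or warc-tools.

```python
import gzip


def sniff_archive_format(path):
    """Guess 'warc', 'arc', or 'unknown' from the first bytes of a file.

    Hypothetical helper for illustration only. It checks just the very
    first record, so it says nothing about whether the rest of the file
    is well formed or correctly gzipped per record.
    """
    # Detect gzip by its two-byte magic number rather than by file name,
    # since the file extension is exactly what is in doubt here.
    with open(path, "rb") as f:
        magic = f.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open

    with opener(path, "rb") as f:
        head = f.read(16)

    if head.startswith(b"WARC/"):
        return "warc"
    if head.startswith(b"filedesc://"):
        return "arc"
    return "unknown"
```

For example, sniff_archive_format("my_archive.warc.gz") returning "arc" would suggest the file is really an ARC file carrying a .warc.gz name.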