From: Nicholas C. <ni...@kb...> - 2012-12-28 20:37:58
The archive itself seems fine. Have you tried without ~ in the filename?

#
# Summary of 'C:\Java\workspace\jwat-tools\WEB-20121130122928187-00000-26202~192.168.24.4~8443.warc.gz'
#
GZip.isValid: true
GZip.Entries: 1119
GZip.Errors: 0
GZip.Warnings: 0
Warc.isValid: true
Warc.Records: 1119
Warc.Errors: 0
Warc.Warnings: 0
#
# Job summary
#
GZip files: 0
+ Arc: 0
+ Warc: 1
Arc files: 0
Warc files: 0
Errors: 0
Warnings: 0
RuntimeErr: 0
Skipped: 0

Validation took 5413 ms.

Best
Nicholas

> -----Original Message-----
> From: Henrik Ranthin [mailto:Hen...@ap...]
> Sent: 28 December 2012 16:04
> To: Erik Hetzner
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback
>
> Hi!
>
> Nope, it is a warc file. I attached a sample warc file. Maybe you can
> have a quick look at it and see if there is something strange?
>
> Regards, Henrik
>
> -----Original Message-----
> From: Erik Hetzner [mailto:eri...@uc...]
> Sent: 21 December 2012 22:43
> To: Henrik Ranthin
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback
>
> Hi Henrik,
>
> At Fri, 21 Dec 2012 10:52:24 +0000,
> Henrik Ranthin wrote:
> >
> > Thanks for the quick reply!
> > The warc files I’ve used are created (and compressed) by the Heritrix
> > web crawler (version 3.1.1).
> >
> > I thought the output from Heritrix should be compatible with Wayback.
> > Maybe I’m missing some setting?
>
> Yes, they should be. Sorry for the distraction, but badly gzipped WARC
> files are often the problem.
>
> >
> > I’ve also tried to compress the file using the scripts from the
> > warc-tools project:
> > warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> > However, I still get the same result.
> >
> > From the log it seems like Wayback is treating the name of the
> > compressed warc file as a URL:
> >
> > Dec 21, 2012 10:57:27 AM
> > org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> > WARNING: FAILED
> > canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
>
> I’m just guessing here, but the WARC files I see don’t start with
> filedesc://... records; only the ARC files. Is this an ARC file that
> was named with .warc.gz rather than .arc.gz?
>
> best, Erik
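
Two checks come up in this thread: whether the .warc.gz is gzipped record by record, and whether the ~ in the filename is the culprit. A minimal Python sketch of both follows. It is an illustration only, not part of JWAT, Wayback, or warc-tools; the helper names `count_gzip_members` and `tilde_free_name` are made up here, and replacing ~ with _ is just one arbitrary choice. It assumes the whole file fits in memory. For a record-at-a-time WARC the member count should match the number of records (compare GZip.Entries and Warc.Records in the summary above); a count of 1 for a multi-record file would point to whole-file compression.

```python
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in a byte string."""
    members = 0
    while data:
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated or corrupt gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data
    return members


def tilde_free_name(name: str) -> str:
    """Return the filename with every '~' replaced by '_' (hypothetical scheme)."""
    return name.replace("~", "_")
```

To try it on a real file, read it in binary mode and pass the bytes, for example `count_gzip_members(open(path, "rb").read())`, then copy the file to `tilde_free_name(path)` and index the copy to see whether the warning goes away.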