Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

The archive itself seems fine. Have you tried without ~ in the filename?

#
# Summary of 'C:\Java\workspace\jwat-tools\WEB-20121130122928187-00000-26202~192.168.24.4~8443.warc.gz'
#
    GZip.isValid: true
    GZip.Entries: 1119
     GZip.Errors: 0
   GZip.Warnings: 0
    Warc.isValid: true
    Warc.Records: 1119
     Warc.Errors: 0
   Warc.Warnings: 0
#
# Job summary
#
GZip files: 0
  +  Arc: 0
  + Warc: 1
 Arc files: 0
Warc files: 0
    Errors: 0
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
Validation took 5413 ms.
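
If you want to try the tilde-free name without touching the original file, something like the sketch below might help. The helper name `copy_without_tilde` and the choice of `-` as the replacement character are my own assumptions, not anything Wayback or Heritrix prescribes; it only copies the file under a new name so the copy can be indexed separately.

```python
import shutil
from pathlib import Path


def copy_without_tilde(src, replacement="-"):
    """Copy src next to itself under a name with every '~' replaced.

    Hypothetical helper for testing only: the replacement character
    is an arbitrary choice. Returns the path of the copy (or the
    original path if the name contains no '~').
    """
    src = Path(src)
    dst = src.with_name(src.name.replace("~", replacement))
    if dst != src:
        # copy2 keeps timestamps; the original file is left untouched
        shutil.copy2(src, dst)
    return dst
```

Pointing Wayback at the copy and re-indexing should show whether the ~ characters have anything to do with the failed canonicalize call.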

Best
Nicholas

> -----Original Message-----
> From: Henrik Ranthin [mailto:Hen...@ap...]
> Sent: 28 December 2012 16:04
> To: Erik Hetzner
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with
> Wayback
> 
> Hi!
> 
> Nope, it is a warc file. I attached a sample warc file. Maybe you can
> have a quick look at it and see if there is something strange?
> 
> Regards, Henrik
> 
> -----Original Message-----
> From: Erik Hetzner [mailto:eri...@uc...]
> Sent: 21 December 2012 22:43
> To: Henrik Ranthin
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files
> with Wayback
> 
> Hi Henrik,
> 
> At Fri, 21 Dec 2012 10:52:24 +0000,
> Henrik Ranthin wrote:
> >
> > Thanks for the quick reply!
> > The warc files I’ve used are created (and compressed) by the Heritrix
> > web crawler (version 3.1.1).
> >
> > I thought the output from Heritrix should be compatible with Wayback.
> > Maybe I’m missing some setting?
> 
> Yes, they should be. Sorry for the distraction, but badly gzipped WARC
> files are often the problem.
> >
> > I’ve also tried to compress the file using the scripts from the
> > warc-tools project:
> >
> >     warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> >
> > However, I still get the same result.
> >
> > From the log it seems like Wayback is treating the name of the
> > compressed warc file as a URL:
> >
> > Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> > WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
> 
> I’m just guessing here, but the WARC files I see don’t start with
> filedesc://... records; only the ARC files. Is this an ARC file that
> was named with .warc.gz rather than .arc.gz?
> 
> best, Erik
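
The ARC-versus-WARC question raised in the quoted message can be checked by looking at the first bytes of the (decompressed) file: an ARC file opens with a filedesc:// version block, while a WARC file opens with a WARC/ version line. Below is a minimal sketch of such a check; the function name `sniff_archive_format` is made up for illustration, and it only inspects the very first record, so it says nothing about whether the rest of the file is valid.

```python
import gzip


def sniff_archive_format(path):
    """Return 'warc', 'arc', or 'unknown' from the first bytes of a file.

    Hypothetical helper: it looks only at the start of the first
    record and handles both gzipped and uncompressed input.
    """
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # 0x1f 0x8b is the gzip magic number
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as stream:
        head = stream.read(16)
    if head.startswith(b"WARC/"):
        return "warc"
    if head.startswith(b"filedesc://"):
        return "arc"
    return "unknown"
```

Calling `sniff_archive_format("my_archive.warc.gz")` on the file in question would return "arc" if it is really an ARC file under a .warc.gz name.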