From: Nicholas C. <ni...@kb...> - 2012-12-28 20:37:58
The archive itself seems fine. Have you tried without ~ in the filename?
#
# Summary of 'C:\Java\workspace\jwat-tools\WEB-20121130122928187-00000-26202~192.168.24.4~8443.warc.gz'
#
GZip.isValid: true
GZip.Entries: 1119
GZip.Errors: 0
GZip.Warnings: 0
Warc.isValid: true
Warc.Records: 1119
Warc.Errors: 0
Warc.Warnings: 0
#
# Job summary
#
GZip files: 0
+ Arc: 0
+ Warc: 1
Arc files: 0
Warc files: 0
Errors: 0
Warnings: 0
RuntimeErr: 0
Skipped: 0
Validation took 5413 ms.
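
The equal GZip.Entries and Warc.Records counts above (1119 each) are consistent with a file that is gzipped record by record. A rough way to count gzip members independently is sketched below in Python. `count_gzip_members` is a hypothetical helper, not part of jwat-tools or Wayback, and it assumes the whole file fits in memory.

```python
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in a byte string.

    Hypothetical helper for illustration only. Raises ValueError if the
    last member is truncated.
    """
    count = 0
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Whatever follows the finished member is the start of the next one.
        data = decomp.unused_data
    return count
```

Called on the bytes of a .warc.gz file, a result equal to the number of WARC records would suggest per-record compression, while a result of 1 would suggest the file was compressed as a single stream.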
Best
Nicholas
> -----Original Message-----
> From: Henrik Ranthin [mailto:Hen...@ap...]
> Sent: 28 December 2012 16:04
> To: Erik Hetzner
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with
> Wayback
>
> Hi!
>
> Nope, it is a warc file. I attached a sample warc file. Maybe you can
> have a quick look at it and see if there is something strange?
>
> Regards, Henrik
>
> -----Original Message-----
> From: Erik Hetzner [mailto:eri...@uc...]
> Sent: 21 December 2012 22:43
> To: Henrik Ranthin
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files
> with Wayback
>
> Hi Henrik,
>
> At Fri, 21 Dec 2012 10:52:24 +0000,
> Henrik Ranthin wrote:
> >
> > Thanks for the quick reply!
> > The warc files I’ve used are created (and compressed) by the Heritrix
> > web crawler (version 3.1.1).
> >
> > I thought the output from Heritrix should be compatible with Wayback.
> > Maybe I’m missing some setting?
>
> Yes, they should be. Sorry for the distraction, but badly gzipped WARC
> files are often the problem.
> >
> > I’ve also tried to compress the file using the scripts from the
> > warc-tools project:
> >
> >   warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> >
> > However, I still get the same result.
> >
> > From the log it seems like Wayback is treating the name of the
> > compressed warc file as a URL:
> >
> > Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> > WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
>
> I’m just guessing here, but the WARC files I see don’t start with
> filedesc://... records; only ARC files do. Is this an ARC file that
> was named with .warc.gz rather than .arc.gz?
>
> best, Erik
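
Erik's question above (is this really an ARC file named .warc.gz?) can be checked by looking at the first line of the decompressed file. A minimal sketch in Python follows. `sniff_archive_type` is a hypothetical helper, not part of Wayback, Heritrix, or warc-tools. It relies only on the first line: WARC records begin with a `WARC/` version line, while ARC files begin with a `filedesc://` line.

```python
import gzip


def sniff_archive_type(path):
    """Guess 'warc', 'arc', or 'unknown' from the first line of a file.

    Hypothetical helper for illustration; works on plain or gzipped input.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as f:
        first_line = f.readline()
    if first_line.startswith(b"WARC/"):
        return "warc"
    if first_line.startswith(b"filedesc://"):
        return "arc"
    return "unknown"
```

A result of "arc" for a file named .warc.gz would point to a naming mismatch, whereas "warc" would suggest the filedesc: string in the log comes from somewhere else.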