Thread: [Archive-access-discuss] Read compressed warc.gz files with Wayback

[Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Henrik R. <Hen...@ap...> - 2012-12-20 12:10:50

Hi!

I can't get Wayback to work with compressed warc files. Is it possible to make it work or is it required to uncompress the files?
(I've tried uncompressing the warc.gz files and then everything works as intended)

I've specified a directory in BDBCollection.xml which contains a single compressed warc file.

  <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
    <property name="sourceList">
      <list>
        <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
          <property name="name" value="files1" />
          <property name="prefix" value="/local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs"/>
          <property name="recurse" value="true" />
        </bean>
      </list>
    </property>
  </bean>

From the Tomcat log:

INFO: Added WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz /local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourcestore.indexer.IndexQueueUpdater updateQueue
INFO: Queued WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz for indexing.
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
INFO: Indexing WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz from /local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourceindex.updater.IndexClient addCDX
INFO: Queued /local/wayback/base/index-data/incoming/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz for merging.
Dec 20, 2012 11:22:05 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
Dec 20, 2012 11:22:05 AM org.archive.wayback.resourceindex.updater.LocalResourceIndexUpdater handleMerged
INFO: Renamed merged file /local/wayback/base/index-data/incoming/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz to /local/wayback/base/index-data/merged/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz

Any ideas?


Thanks,
Henrik

Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Erik H. <eri...@uc...> - 2012-12-20 17:21:10

At Thu, 20 Dec 2012 11:57:48 +0000,
Henrik Ranthin wrote:
> 
> Hi!
> 
> I can't get Wayback to work with compressed warc files. Is it possible to make it work or is it required to uncompress the files?
> (I've tried uncompressing the warc.gz files and then everything works as intended)

Hi Henrik,

How are your WARC files compressed? Do you compress them yourself, or
are they compressed using a special (W)ARC tool? (You can’t compress
them yourself using a normal gzip tool.)

For more information, see, e.g., 

http://sourceforge.net/mailarchive/message.php?msg_id=28183532

best, Erik
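
To illustrate the point about compression: a .warc.gz file is normally a series of gzip members, one per record, so that a reader can seek straight to a record's offset. A plain `gzip file.warc` produces one member for the whole file instead. The following is a minimal Python sketch, not part of Wayback or any WARC tool, that counts the members in a gzip stream. The function names and the sample records are made up for illustration, and it reads everything into memory, so it only suits small files.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members concatenated in `data`."""
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated or corrupt gzip member")
        members += 1
        # Whatever follows the end of this member starts the next one.
        data = decomp.unused_data
    return members


def gzip_whole(records):
    """Compress all records as ONE gzip member, like plain `gzip file`."""
    return gzip.compress(b"".join(records))


def gzip_per_record(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(r) for r in records)


if __name__ == "__main__":
    # Hypothetical stand-ins for records; these are not valid WARC records.
    records = [b"WARC/1.0\r\nrecord one\r\n\r\n",
               b"WARC/1.0\r\nrecord two\r\n\r\n"]
    print(count_gzip_members(gzip_whole(records)))       # 1
    print(count_gzip_members(gzip_per_record(records)))  # 2
```

For a real file, something like `count_gzip_members(open(path, "rb").read())` should give a member count equal to the number of records if the file was compressed record by record.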

Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Henrik R. <Hen...@ap...> - 2012-12-21 10:52:36

Thanks for the quick reply!
The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).

I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

I’ve also tried to compress the file using the scripts from the warc-tools project:
warc2warc.py -Z my_archive.warc > my_archive.warc.gz
However, I still get the same result.

From the log, it looks like Wayback is treating the name of the compressed WARC file as a URL:

Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

Regards, Henrik

Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Erik H. <eri...@uc...> - 2012-12-21 21:43:22

Hi Henrik,

At Fri, 21 Dec 2012 10:52:24 +0000,
Henrik Ranthin wrote:
> 
> Thanks for the quick reply!
> The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).
> 
> I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

Yes, they should be. Sorry for the distraction, but badly gzipped WARC
files are often the problem.
> 
> I’ve also tried to compress the file using the scripts from the warc-tools project:
> warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> However, I still get the same result.
> 
> From the log, it looks like Wayback is treating the name of the compressed WARC file as a URL:
> 
> Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

I’m just guessing here, but the WARC files I see don’t start with
filedesc://... records; only the ARC files. Is this an ARC file that
was named with .warc.gz rather than .arc.gz?

best, Erik
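
One way to check the ARC-versus-WARC question without relying on the file extension is to decompress the start of the file and look at its first line: WARC records begin with a `WARC/` version line, while ARC files begin with a `filedesc://` record. Below is a small Python sketch under that assumption. The function name `first_record_kind` is hypothetical and not part of any existing tool.

```python
import gzip


def first_record_kind(path: str) -> str:
    """Guess the container format from the first line of a gzipped archive."""
    # gzip.open decompresses transparently; only the first line is read.
    with gzip.open(path, "rb") as fh:
        first_line = fh.readline()
    if first_line.startswith(b"WARC/"):
        return "warc"
    if first_line.startswith(b"filedesc://"):
        return "arc"
    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py some-archive.warc.gz [more files ...]
    for path in sys.argv[1:]:
        print(path, first_record_kind(path))
```

This only inspects the first record, so it says nothing about whether the rest of the file is valid.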

Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Henrik R. <Hen...@ap...> - 2012-12-28 15:05:45

Attachments: WEB-20121130122928187-00000-26202~192.168.24.4~8443.warc.gz

Hi!

Nope, it is a warc file. I attached a sample warc file. Maybe you can have a quick look at it and see if there is something strange?

Regards, Henrik

Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Nicholas C. <ni...@kb...> - 2012-12-28 20:37:58

The archive itself seems fine. Have you tried without ~ in the filename?

#
# Summary of 'C:\Java\workspace\jwat-tools\WEB-20121130122928187-00000-26202~192.168.24.4~8443.warc.gz'
#
    GZip.isValid: true
    GZip.Entries: 1119
     GZip.Errors: 0
   GZip.Warnings: 0
    Warc.isValid: true
    Warc.Records: 1119
     Warc.Errors: 0
   Warc.Warnings: 0
#
# Job summary
#
GZip files: 0
  +  Arc: 0
  + Warc: 1
 Arc files: 0
Warc files: 0
    Errors: 0
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
Validation took 5413 ms.

Best
Nicholas

Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Drazenko C. <dra...@sr...> - 2012-12-28 20:50:41

Hi,

Which Java and Wayback versions are you using?

I had the same problem when I used an old version of Heritrix (1.14.4) with a
Java version newer than 6u22. The cause is described here:
https://webarchive.jira.com/browse/HER-1865

Regards,
Drazenko


Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback

From: Drazenko C. <dra...@sr...> - 2013-01-03 11:08:09

Hi,

The Java-related bug was resolved on 2011-03-08, and Wayback 1.6.0 is older
than that. You should probably use a newer Wayback
(http://builds.archive.org:8080/maven2/org/archive/wayback/dist/) and a
Java version later than 6u22.

Regards,
Drazenko

On 2.1.2013. 13:24, Henrik Ranthin wrote:
> Hi!
>
> I switched to an old java version (6u22) and now it seems to work!
> Previously I used java 1.7.0_03.
>
> For wayback I use version 1.6.0.
>
> I saw that the JIRA issue HER-1865 was marked as fixed. Maybe the fix is just not included in the wayback version I'm using?
>
>
> Thanks for all the help!
>
> Regards, Henrik