Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback


Hi,

The bug in Java was resolved on 2011-03-08, and Wayback 1.6.0 is older than
that. You should probably use a newer Wayback
(http://builds.archive.org:8080/maven2/org/archive/wayback/dist/) and a
Java version greater than 6u22.

Regards,
Drazenko

On 2.1.2013. 13:24, Henrik Ranthin wrote:
> Hi!
>
> I switched to an old java version (6u22) and now it seems to work!
> Previously I used java 1.7.0_03.
>
> For wayback I use version 1.6.0.
>
> I saw that the JIRA issue HER-1865 was marked as fixed. Maybe the fix is just not included in the wayback version I'm using?
>
>
> Thanks for all the help!
>
> Regards, Henrik
>
> -----Original Message-----
> From: Drazenko Celjak [mailto:dra...@sr...]
> Sent: den 28 december 2012 21:32
> To: Henrik Ranthin
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback
>
> Hi,
>
> Which Java and Wayback versions do you use?
>
> I had the same problem when I used an old version of Heritrix (1.14.4) with a Java version newer than 6u22. This was the reason:
> https://webarchive.jira.com/browse/HER-1865
>
> Regards,
> Drazenko
>
>
> On 28.12.2012. 16:04, Henrik Ranthin wrote:
>> Hi!
>>
>> Nope, it is a warc file. I attached a sample warc file. Maybe you can have a quick look at it and see if there is something strange?
>>
>> Regards, Henrik
>>
>> -----Original Message-----
>> From: Erik Hetzner [mailto:eri...@uc...]
>> Sent: den 21 december 2012 22:43
>> To: Henrik Ranthin
>> Cc: arc...@li...
>> Subject: Re: [Archive-access-discuss] Read compressed warc.gz files with Wayback
>>
>> Hi Henrik,
>>
>> At Fri, 21 Dec 2012 10:52:24 +0000,
>> Henrik Ranthin wrote:
>>>
>>> Thanks for the quick reply!
>>> The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).
>>>
>>> I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?
>>
>> Yes, they should be. Sorry for the distraction, but badly gzipped WARC files are often the problem.
>>>
>>> I’ve also tried to compress the file using the scripts from the warc-tools project:
>>>
>>> warc2warc.py -Z my_archive.warc > my_archive.warc.gz
>>>
>>> However, I still get the same result.
>>>
>>> From the log, it seems that Wayback is treating the name of the compressed WARC file as a URL:
>>>
>>> Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
>>> WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
>>
>> I’m just guessing here, but the WARC files I see don’t start with filedesc://... records; only the ARC files. Is this an ARC file that was named with .warc.gz rather than .arc.gz?
>>
>> best, Erik
>>
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
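
Two of the checks discussed in this thread (whether the file really begins with a WARC record rather than an ARC filedesc:// record, and whether it is gzipped record by record) can be done outside Wayback. The following is only a rough sketch under stated assumptions: the function names sniff_archive_format and count_gzip_members are made up for illustration and are not part of Wayback, Heritrix, or warc-tools. It assumes that a WARC file begins with a "WARC/" version line and that an ARC file begins with a "filedesc://" line, and it uses only the Python standard library.

```python
import gzip
import zlib


def sniff_archive_format(path):
    """Guess 'warc', 'arc', or 'unknown' from the first line of the file.

    Hypothetical helper: looks only at the first line, nothing else.
    """
    # A gzip stream starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as fh:
        first_line = fh.readline()
    if first_line.startswith(b"WARC/"):
        return "warc"
    if first_line.startswith(b"filedesc://"):
        return "arc"
    return "unknown"


def count_gzip_members(path, chunk_size=64 * 1024):
    """Count the complete gzip members concatenated in the file.

    Hypothetical helper: a file gzipped record by record should contain
    about one member per record; a file gzipped as a whole contains one.
    A truncated final member is not counted.
    """
    count = 0
    decomp = None
    buf = b""
    with open(path, "rb") as fh:
        while True:
            if not buf:
                buf = fh.read(chunk_size)
                if not buf:
                    break
            if decomp is None:
                # 16 + MAX_WBITS tells zlib to expect a gzip header/trailer.
                decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            decomp.decompress(buf)
            if decomp.eof:
                # This member ended; leftover bytes belong to the next one.
                count += 1
                buf = decomp.unused_data
                decomp = None
            else:
                buf = b""
    return count
```

For example, calling sniff_archive_format("my_archive.warc.gz") and count_gzip_members("my_archive.warc.gz") on the file from this thread would show whether it is a WARC at all and how many gzip members it contains. Neither check exercises the Java-side gzip reading that HER-1865 describes, so a file that passes both can still fail in an affected Wayback and Java combination.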