From: Henrik R. <Hen...@ap...> - 2012-12-20 12:10:50
|
Hi!

I can't get Wayback to work with compressed warc files. Is it possible to make it work, or is it required to uncompress the files? (I've tried uncompressing the warc.gz files and then everything works as intended.)

I've specified a directory in BDBCollection.xml which contains a single compressed warc file.

<bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
  <property name="sourceList">
    <list>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="files1" />
        <property name="prefix" value="/local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs"/>
        <property name="recurse" value="true" />
      </bean>
    </list>
  </property>
</bean>

From the tomcat log:

INFO: Added WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz /local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourcestore.indexer.IndexQueueUpdater updateQueue
INFO: Queued WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz for indexing.
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
INFO: Indexing WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz from /local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourceindex.updater.IndexClient addCDX
INFO: Queued /local/wayback/base/index-data/incoming/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz for merging.
Dec 20, 2012 11:22:05 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
Dec 20, 2012 11:22:05 AM org.archive.wayback.resourceindex.updater.LocalResourceIndexUpdater handleMerged
INFO: Renamed merged file /local/wayback/base/index-data/incoming/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz to /local/wayback/base/index-data/merged/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz

Any ideas?

Thanks, Henrik |
From: Erik H. <eri...@uc...> - 2012-12-20 17:21:10
|
At Thu, 20 Dec 2012 11:57:48 +0000, Henrik Ranthin wrote:
>
> Hi!
>
> I can't get Wayback to work with compressed warc files. Is it possible to make it work or is it required to uncompress the files?
> (I've tried uncompressing the warc.gz files and then everything works as intended)

Hi Henrik,

How are your WARC files compressed? Do you compress them yourself, or are they compressed using a special (W)ARC tool? (You can’t compress them yourself using a normal gzip tool.)

For more information, see, e.g., http://sourceforge.net/mailarchive/message.php?msg_id=28183532

best, Erik |
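Background on the gzip point above: a .warc.gz file is conventionally a concatenation of gzip members, one per WARC record, so that a reader can seek to a record's offset and decompress only that record. Running plain gzip over a whole uncompressed WARC instead produces a single member. A minimal sketch for telling the two cases apart follows. The function name count_gzip_members is made up for illustration and is not part of Wayback, Heritrix, or warc-tools, and the sketch assumes the data fits in memory.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in a byte string."""
    members = 0
    while data:
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Whatever follows the end of this member belongs to the next one.
        data = decompressor.unused_data
    return members


if __name__ == "__main__":
    record_a = b"WARC/1.0\r\n(first record)"
    record_b = b"WARC/1.0\r\n(second record)"
    # One member per record, as a (W)ARC-aware writer would produce.
    print(count_gzip_members(gzip.compress(record_a) + gzip.compress(record_b)))
    # One member for the whole file, as a plain gzip run would produce.
    print(count_gzip_members(gzip.compress(record_a + record_b)))
```

A file that reports one member but holds many records was most likely compressed as a whole and is a candidate for the problem Erik describes.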
From: Henrik R. <Hen...@ap...> - 2012-12-21 10:52:36
|
Thanks for the quick reply!

The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1). I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

I’ve also tried to compress the file using the scripts from the warc-tools project:

    warc2warc.py -Z my_archive.warc > my_archive.warc.gz

However, I still get the same result. From the log it seems like Wayback is treating the name of the compressed warc file as a URL:

Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

Regards, Henrik |
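For readers wondering what record-at-a-time compression amounts to, here is a rough sketch that splits an uncompressed WARC stream into records and gzips each one as its own member. It is not how warc2warc.py or Heritrix implement it; the helper names iter_warc_records and gzip_per_record are hypothetical, and the parser only relies on the WARC/ version line, the Content-Length header, and the two CRLFs that close each record.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each record of an uncompressed WARC stream as raw bytes."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line in (b"\r\n", b"\n"):
            continue  # blank lines that terminate the previous record
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length")
        body = stream.read(length)
        if len(body) < length:
            raise ValueError("truncated record body")
        yield b"".join(header) + body + b"\r\n\r\n"


def gzip_per_record(source, destination) -> int:
    """Write every record of `source` as a separate gzip member."""
    count = 0
    for record in iter_warc_records(source):
        destination.write(gzip.compress(record))
        count += 1
    return count


if __name__ == "__main__":
    record = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    out = io.BytesIO()
    print(gzip_per_record(io.BytesIO(record * 2), out))
```

In practice a maintained tool is the safer choice; the sketch is only meant to show why the output differs from running gzip over the whole file.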
From: Erik H. <eri...@uc...> - 2012-12-21 21:43:22
|
Hi Henrik,

At Fri, 21 Dec 2012 10:52:24 +0000, Henrik Ranthin wrote:
>
> Thanks for the quick reply!
> The warc files I’ve used are created (and compressed) by the Heritrix web crawler (version 3.1.1).
>
> I thought the output from Heritrix should be compatible with Wayback. Maybe I’m missing some setting?

Yes, they should be. Sorry for the distraction, but badly gzipped WARC files are often the problem.

> I’ve also tried to compress the file using the scripts from the warc-tools project:
> warc2warc.py -Z my_archive.warc > my_archive.warc.gz
> However, I still get the same result.
>
> From the log it seems like Wayback is treating the name of the compressed warc file as an URL:
>
> Dec 21, 2012 10:57:27 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)

I’m just guessing here, but the WARC files I see don’t start with filedesc://... records; only the ARC files. Is this an ARC file that was named with .warc.gz rather than .arc.gz?

best, Erik |
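Erik's question can be checked without trusting the file extension: a WARC file begins with a WARC/ version line, while an ARC file begins with a filedesc:// line. A small sketch of that check follows. The function name sniff_archive_format is made up for illustration, and only the first line of the file is inspected.

```python
import gzip


def sniff_archive_format(path: str) -> str:
    """Return 'warc', 'arc' or 'unknown' based on the first line of the file."""
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # 0x1f 0x8b is the gzip magic number; fall back to plain reading otherwise.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as stream:
        first_line = stream.readline()
    if first_line.startswith(b"WARC/"):
        return "warc"
    if first_line.startswith(b"filedesc://"):
        return "arc"
    return "unknown"
```

Calling sniff_archive_format on the file named in the log would settle whether it is a WARC that merely triggers a filedesc warning or an ARC with the wrong extension.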
From: Henrik R. <Hen...@ap...> - 2012-12-28 15:05:45
|
Hi!

Nope, it is a warc file. I attached a sample warc file. Maybe you can have a quick look at it and see if there is something strange?

Regards, Henrik |
From: Nicholas C. <ni...@kb...> - 2012-12-28 20:37:58
|
The archive itself seems fine. Have you tried without ~ in the filename?

#
# Summary of 'C:\Java\workspace\jwat-tools\WEB-20121130122928187-00000-26202~192.168.24.4~8443.warc.gz'
#
GZip.isValid: true
GZip.Entries: 1119
GZip.Errors: 0
GZip.Warnings: 0
Warc.isValid: true
Warc.Records: 1119
Warc.Errors: 0
Warc.Warnings: 0
#
# Job summary
#
GZip files: 0
+ Arc: 0
+ Warc: 1
Arc files: 0
Warc files: 0
Errors: 0
Warnings: 0
RuntimeErr: 0
Skipped: 0
Validation took 5413 ms.

Best
Nicholas |
From: Drazenko C. <dra...@sr...> - 2012-12-28 20:50:41
|
Hi,

which Java and Wayback versions do you use?

I had the same problem when I used an old version of Heritrix (1.14.4) and Java newer than 6u22. Here was the reason: https://webarchive.jira.com/browse/HER-1865

Regards, Drazenko |
From: Drazenko C. <dra...@sr...> - 2013-01-03 11:08:09
|
Hi,

the bug in Java was resolved on 2011-03-08, and Wayback 1.6.0 is older. You should probably use a newer Wayback (http://builds.archive.org:8080/maven2/org/archive/wayback/dist/) and a Java version greater than 6u22.

Regards, Drazenko

On 2.1.2013. 13:24, Henrik Ranthin wrote:
> Hi!
>
> I switched to an old java version (6u22) and now it seems to work!
> Previously I used java 1.7.0_03.
>
> For wayback I use version 1.6.0.
>
> I saw that the JIRA issue HER-1865 was marked as fixed. Maybe the fix is just not included in the wayback version I'm using?
>
> Thanks for all the help!
>
> Regards, Henrik |