From: Henrik R. <Hen...@ap...> - 2012-12-20 12:10:50

Hi!

I can't get Wayback to work with compressed WARC files. Is it possible to make it work, or is it required to uncompress the files? (I've tried uncompressing the warc.gz files, and then everything works as intended.)

I've specified a directory in BDBCollection.xml which contains a single compressed WARC file.

<bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
  <property name="sourceList">
    <list>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="files1" />
        <property name="prefix" value="/local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs"/>
        <property name="recurse" value="true" />
      </bean>
    </list>
  </property>
</bean>

From the tomcat log:

INFO: Added WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz /local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourcestore.indexer.IndexQueueUpdater updateQueue
INFO: Queued WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz for indexing.
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
INFO: Indexing WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz from /local/heritrix/heritrix-3.1.1/jobs/testjob2/20121128091005/warcs/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz
Dec 20, 2012 11:22:04 AM org.archive.wayback.resourceindex.updater.IndexClient addCDX
INFO: Queued /local/wayback/base/index-data/incoming/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz for merging.
Dec 20, 2012 11:22:05 AM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz:WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz)
Dec 20, 2012 11:22:05 AM org.archive.wayback.resourceindex.updater.LocalResourceIndexUpdater handleMerged
INFO: Renamed merged file /local/wayback/base/index-data/incoming/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz to /local/wayback/base/index-data/merged/WEB-20121128091040702-00000-26202~192.168.24.4~8443.warc.gz

Any ideas?

Thanks,
Henrik