From: Jones, G. <gj...@lo...> - 2015-06-11 16:31:58
We are going to reindex our content using OW2.2 in SURT format with field S, and we need to cat the CDX files together. We can't do that with what we have, since the new indexes will have an additional field and we want that field populated with data.

I'm not sure how many of us really look at how (Open)Wayback indexes out of the box, but OW2.2 out of the box indexes with the S field set. The only non-default flag we are passing is '-new-canon-surt', in order to produce SURT-formatted URLs. I'm not sure how we get a value populated in the S field.

We tested indexing with:

OW2.2/java1.7

    CDX N b a m s k r M S V g
    com,gingrichproductions)/vm-shop/books/drill-here-drill-now-pay-less.html 20111103172952 http://www.gingrichproductions.com/vm-shop/books/drill-here-drill-now-pay-less.html text/html 200 AIYIQOUO447KTTHWQCQIQOZHZ32BMRHL - - - 316107693 LOC-ELECTION2012-001-20111103171522321-00000-3218~crawling214.us.archive.org~8443.warc.gz

OW2.0/java1.6 (/apps/waybacks/openwayback2.0/bin/cdx-indexer) ("out of the box")

    CDX N b a m s k r M V g
    gingrichproductions.com/vm-shop/books/drill-here-drill-now-pay-less.html 20111103172952 http://www.gingrichproductions.com/vm-shop/books/drill-here-drill-now-pay-less.html text/html 200 AIYIQOUO447KTTHWQCQIQOZHZ32BMRHL - - 316107693 LOC-ELECTION2012-001-20111103171522321-00000-3218~crawling214.us.archive.org~8443.warc.gz

I took a look at 1.8.0/6 indexes ("out of the box"):

    CDX N b a m s k r M V g

Although I see Brad had something in 2011:
http://sourceforge.net/p/archive-access/mailman/message/28558226/

When I look at the source code in OpenWayback 2.2, it looks like it tries to get a value:
https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/format/CompressedLengthCDXField.java

    public class CompressedLengthCDXField implements CDXField {

        public void apply(String field, CaptureSearchResult result)
                throws CDXFormatException {
            if (field.equals("-")) {
                return;
            }
            try {
                result.setCompressedLength(Long.parseLong(field));
            } catch (NumberFormatException e) {
                throw new CDXFormatException(e.getLocalizedMessage());
            }
        }

        public String serialize(CaptureSearchResult result) {
            long r = result.getCompressedLength();
            if (r == -1) {
                return DEFAULT_VALUE;
            }
            return String.valueOf(r);
        }
    }

I don't see anything that helps me here:
https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/format/CDXFormat.java

I will say that the 2.2 code layout makes it easy to look at the source without a lot of extra work, like saving .jars as zips and uncompressing them, so kudos to the OW2.2 dev team on this. I can't really read Java, but at least I can more easily send someone Java-smart to figure things out.

Is there something I am missing, or something we need to do, to get a value in this field?

Interestingly, though it needs more testing, OW2.2 anecdotally seems to zip through WARCs a bit quicker than 1.8.

Thanks,
Gina

Gina Jones
Web Archiving Team
Library of Congress
202-707-6604
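
For anyone comparing the two outputs above, a minimal Java sketch (not part of OpenWayback) that pairs each field letter in a CDX header with the value in the same position of a record may help show which column ends up as "-". The class name CdxLineCheck and the method pair are hypothetical, and the sketch assumes fields are separated by whitespace with no spaces inside any value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not an OpenWayback class.
class CdxLineCheck {

    /** Pairs each field letter of a CDX header with the value at the same position in a record. */
    static Map<String, String> pair(String header, String record) {
        String[] h = header.trim().split("\\s+");
        String[] v = record.trim().split("\\s+");
        // h[0] must be the literal "CDX" marker; the field letters start at index 1.
        if (!h[0].equals("CDX")) {
            throw new IllegalArgumentException("header does not start with CDX");
        }
        if (h.length - 1 != v.length) {
            throw new IllegalArgumentException(
                "header has " + (h.length - 1) + " fields, record has " + v.length);
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < v.length; i++) {
            fields.put(h[i + 1], v[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Header and record copied from the OW2.2 test output in this message.
        String header = "CDX N b a m s k r M S V g";
        String record = "com,gingrichproductions)/vm-shop/books/drill-here-drill-now-pay-less.html"
            + " 20111103172952"
            + " http://www.gingrichproductions.com/vm-shop/books/drill-here-drill-now-pay-less.html"
            + " text/html 200 AIYIQOUO447KTTHWQCQIQOZHZ32BMRHL - - - 316107693"
            + " LOC-ELECTION2012-001-20111103171522321-00000-3218~crawling214.us.archive.org~8443.warc.gz";
        Map<String, String> fields = pair(header, record);
        System.out.println("S = " + fields.get("S")); // prints "S = -"
        System.out.println("V = " + fields.get("V")); // prints "V = 316107693"
    }
}
```

Run against the OW2.2 line, this shows the S column holding "-" while V holds 316107693. The length check also flags any record whose field count does not match its header, which is the mismatch that would arise when concatenating 10-field and 11-field indexes under a single header.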