From: Bradley T. <br...@ar...> - 2011-08-15 14:56:04
Right on all counts:

Wayback 1.6.0 has an under-documented option to cdx-indexer, "-format", which specifies the fields you want produced in your index. The default value is " CDX N b a m s k r M V g" (note the leading space; it is part of the format specification Erik referenced via hyperlink).

The extra, "mystery guest" 10th field (actually field #8) is the META tag robot instructions, found within HTML resources. Values are "-" for none, or a combination of "I", "A", "F" (for No-Index, No-Archive, and No-Follow, respectively).

Starting with 1.6.0, the normal CDX index implementation, org.archive.wayback.resourceindex.CDXIndex, will handle CDX lines with either 9 or 10 columns, assuming that the extra 8th column, if 10 are present, is the robot instructions field.

In anticipation of potentially wanting to use the META tag robot instructions later, we opted to push the field into the default index format and make the tools handle either format, hoping to eliminate or reduce the future need to reindex content from scratch.

1.6.0 also includes another CDX implementation, org.archive.wayback.resourceindex.CDXFormatIndex, which allows for arbitrary index fields. It reads the first line in the file and assumes it contains the CDX header line (for example, " CDX N b a m s k r M V g").

These are somewhat advanced features and unused at the moment, so probably not of much concern. Unless, of course, you're using the 1.6.0 indexer with a 1.4.X Wayback, in which case there's a compatibility issue.

So, Kaisa, you can either:

1) strip the 8th field, as you're doing (perhaps better done with 'awk' or 'perl -ane', to ensure you strip the correct field?)
2) add the option -format " CDX N b a m s k r V g" (note the lack of "M", and again note the leading SPACE before the CDX) to the cdx-indexer tool arguments
3) upgrade your access Wayback to 1.6.X

Hope this clarifies more than confuses!

Brad

On 8/9/11 5:46 PM, Kaisa Kaunonen wrote:
> Quoting "Erik Hetzner"<eri...@uc...>:
>> At Fri, 5 Aug 2011 12:11:59 +0300,
>> Kaisa Kaunonen wrote:
>>> Hello
>>>
>>> we have a newer java installation which forced us to index arc files
>>> with Wayback 1.6.0 instead of 1.4.2
>>>
>>> The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem
>>> to understand the new CDX file.
>>>
>>> For example, there are lines 'CDX N b a m s k r M V g' here and there
>>> sprinkled around.
>>>
>>> Are these lines meaningful in some way? What if I remove them with a
>>> script. In any case they are reduced to one single line after doing
>>> sort -u newFile.cdx > sorted.cdx
>>>
>>> Does Wayback 1.6.0 TOMCAT application understand old & new CDX files
>>> out-of-the-box?
>> Hi Kaisa,
>>
>> This line should be at the beginning of the CDX file.
>>
>> http://www.archive.org/web/researcher/cdx_file_format.php
>>
>> I don’t believe that wayback 1.4 actually uses these lines, however,
>> so you can remove them.
>>
>> If they are scattered around your CDX files, this is presumably
>> because you are merging CDX files & sorting?
>>
>> best, Erik
>>
>
> Yes, that's right. A script feeds ARC files to the CDX indexer and
> those 'CDX N B a …' lines seem to be at file boundaries.
>
> There's also another slight difference between CDX produced by Wayback
> 1.4.2 and 1.6.0
>
> 1.4.2 version has
> …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 11461303 …
>
> 1.6.0 has
> …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1461303 ……
>
>
> After I changed every instance of ' - - ' to ' - ' with sed, it was
> possible to use new CDX with 1.4.2.
>
> Kaisa
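
For anyone landing on this thread later: option 1 from Brad's reply (dropping the robot instructions column by position rather than by pattern-matching ' - - ') could look roughly like the Python sketch below. It is only an illustration under a few assumptions that are not stated in the thread: fields are separated by single spaces, header lines begin with "CDX" after optional leading whitespace, and the function names are made up for this example. The position of the field is derived from the default 1.6.0 format string quoted above, not hard-coded.

```python
# Sketch: turn 10-column CDX lines (Wayback 1.6.0 default) into the
# 9-column layout, and drop the " CDX ..." header lines that end up
# scattered through merged files.

# Default 1.6.0 format string, as quoted in the message (leading space included).
FORMAT_10 = " CDX N b a m s k r M V g"

# Field letters without the "CDX" marker; "M" is the META robots column.
FIELD_LETTERS = FORMAT_10.split()[1:]
ROBOTS_INDEX = FIELD_LETTERS.index("M")  # zero-based position of field #8


def is_header(line):
    """True for CDX header lines such as ' CDX N b a m s k r M V g'."""
    return line.lstrip().startswith("CDX ")


def strip_robots_field(line):
    """Remove the robots column from a 10-column line; leave other lines alone."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) == len(FIELD_LETTERS):
        del fields[ROBOTS_INDEX]
    return " ".join(fields)


def convert(lines):
    """Yield converted data lines, skipping blank lines and header lines."""
    for line in lines:
        if not line.strip() or is_header(line):
            continue
        yield strip_robots_field(line)

# Possible use as a filter (not executed here):
#   import sys
#   sys.stdout.writelines(out + "\n" for out in convert(sys.stdin))
```

Because only lines with exactly ten fields are touched, running this over a mix of 1.4.2 and 1.6.0 output should leave the 9-column lines as they are. The result still needs to be sorted before Wayback can use it.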