From: Kaisa K. <kai...@he...> - 2011-08-09 10:46:30
Quoting "Erik Hetzner" <eri...@uc...>:

> At Fri, 5 Aug 2011 12:11:59 +0300,
> Kaisa Kaunonen wrote:
>>
>> Hello
>>
>> we have a newer java installation which forced us to index arc files
>> with Wayback 1.6.0 instead of 1.4.2
>>
>> The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem
>> to understand the new CDX file.
>>
>> For example, there are lines 'CDX N b a m s k r M V g' here and there
>> sprinkled around.
>>
>> Are these lines meaningful in some way? What if I remove them with a
>> script. In any case they are reduced to one single line after doing
>> sort -u newFile.cdx > sorted.cdx
>>
>> Does Wayback 1.6.0 TOMCAT application understand old & new CDX files
>> out-of-the-box?
>
> Hi Kaisa,
>
> This line should be at the beginning of the CDX file.
>
> http://www.archive.org/web/researcher/cdx_file_format.php
>
> I don’t believe that wayback 1.4 actually uses these lines, however,
> so you can remove them.
>
> If they are scattered around your CDX files, this is presumably
> because you are merging CDX files & sorting?
>
> best, Erik

Yes, that's right. A script feeds ARC files to the CDX indexer and those
'CDX N B a …' lines seem to be at file boundaries.

There's also another slight difference between CDX produced by
Wayback 1.4.2 and 1.6.0:

1.4.2 version has
…… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 11461303 …

1.6.0 has
…… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1461303 ……

After I changed every instance of ' - - ' to ' - ' with sed, it was
possible to use new CDX with 1.4.2.

Kaisa
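For anyone hitting the same problem, here is a minimal Python sketch of the cleanup described above: drop the stray 'CDX N b a …' header lines, collapse ' - - ' to ' - ', then sort and de-duplicate. The function name clean_cdx_lines and the sample records are made up for illustration. It assumes a header line is any line that starts with "CDX " after optional leading whitespace, and that ' - - ' never occurs legitimately elsewhere in a record. I have not checked either assumption against the Wayback sources.

```python
def clean_cdx_lines(lines):
    """Return sorted, de-duplicated CDX records with header lines removed."""
    cleaned = set()
    for line in lines:
        record = line.rstrip("\n")
        # Skip blank lines and the format header lines left at file boundaries
        # (assumption: a header starts with "CDX " after optional whitespace).
        if not record.strip() or record.lstrip().startswith("CDX "):
            continue
        # Same substitution as the sed step: collapse ' - - ' to ' - '.
        cleaned.add(record.replace(" - - ", " - "))
    return sorted(cleaned)


if __name__ == "__main__":
    # Hypothetical sample records, not taken from a real index.
    sample = [
        " CDX N b a m s k r M V g\n",
        "example.org/b 20110805000000 http://example.org/b text/html 200 BBBB - - 200 b.arc\n",
        " CDX N b a m s k r M V g\n",
        "example.org/a 20110805000000 http://example.org/a text/html 200 AAAA - - 100 a.arc\n",
    ]
    for record in clean_cdx_lines(sample):
        print(record)
```

Note that Python's sorted() compares strings by code point, which may order lines differently from a locale-aware sort on the command line.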