From: Ilya K. <il...@ar...> - 2013-06-06 19:38:01
I also wanted to provide a brief overview of the new indexing format we are using at IA for large indexes. We refer to this as "zipnum" or "ziplines", and it is basically concatenated gzip'd blocks, each with 3000 lines of cdx.

The concatenated .gz file has a corresponding sorted text index summary. The text index summary has the first url of each 3000-line block, plus a filename and offset into the full concatenated .gz file. This allows the full .gz index to be spread over multiple shards, and lends itself well to being built in Hadoop. We have tools to generate the zipnum sharded index in Hadoop, as well as standalone Java and Python tools. We are working on providing more documentation of this format, but I just wanted to give a brief overview for now.

Using this format, we have been using a similar approach of having a full zipnum cluster (updated less frequently) and smaller zipnum clusters that are updated daily or hourly and then re-merged into the full zipnum cluster.

Wayback has stable support for reading this data format (via ZipNumClusterSearchResultSource), which we have been using for over a year, and the tools to generate the format are in the ia-hadoop-tools repository. However, we definitely need to provide more documentation on using this system.

Please feel free to let us know if you have further questions in the meantime.

Ilya, Engineer IA

On 06/06/2013 12:17 AM, Colin Rosenthal wrote:
> On 06/04/2013 08:27 PM, Jones, Gina wrote:
>> -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have your content indexed with either of the two. However, if you plan to combine the indexes into one big index, they need to match.
>>
>> -The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because they do not match exactly (different number of fields).
>>
>> -The field configurations for the two versions (as we have them) are:
>>
>> 1.4.2: CDX N b h m s k r V g
>> 1.6.1: CDX N b a m s k r M V g
>>
>> For definitions of the fields, here is an old reference: http://archive.org/web/researcher/cdx_legend.php
>>
> Thank you, Gina, that is extremely interesting!
>
> Colin Rosenthal
> Netarkivet
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
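
To make the zipnum layout described in the message above a little more concrete, here is a minimal Python sketch of the idea: sorted index lines are cut into fixed-size blocks, each block is gzip'd separately and appended to one file, and a small summary records the first key, shard name, and byte offset of each block. It is not the ia-hadoop-tools or Wayback implementation. The function names, the in-memory summary tuples, the space-delimited key, and the shard name `part-00000` are all assumptions made for illustration; the real on-disk summary format may differ.

```python
import bisect
import gzip
import zlib


def build_zipnum(sorted_lines, shard_name, lines_per_block=3000):
    """Build one shard from already-sorted index lines.

    Returns (blob, summary), where blob is the concatenated gzip blocks and
    summary is a list of (first_key, shard_name, offset) tuples.
    The tuple layout is an assumption, not the real summary file format.
    """
    blob = bytearray()
    summary = []
    for start in range(0, len(sorted_lines), lines_per_block):
        block = sorted_lines[start:start + lines_per_block]
        first_key = block[0].split(" ", 1)[0]
        summary.append((first_key, shard_name, len(blob)))
        # Each block is its own gzip member, appended to the shard.
        blob += gzip.compress(("\n".join(block) + "\n").encode("utf-8"))
    return bytes(blob), summary


def read_block(blob, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; any following
    # blocks are left untouched in decomp.unused_data.
    text = decomp.decompress(blob[offset:])
    return text.decode("utf-8").splitlines()


def lookup(shards, summary, key):
    """Return all index lines whose first field equals key."""
    keys = [entry[0] for entry in summary]
    # Start one block before the first block whose first key is >= key,
    # because matches can begin at the tail of the previous block.
    start = max(bisect.bisect_left(keys, key) - 1, 0)
    hits = []
    for first_key, shard_name, offset in summary[start:]:
        if first_key > key:
            break
        for line in read_block(shards[shard_name], offset):
            if line.split(" ", 1)[0] == key:
                hits.append(line)
    return hits


if __name__ == "__main__":
    lines = [
        "com,a)/ 20130101000000",
        "com,b)/ 20130101000000",
        "com,c)/ 20120101000000",
        "com,c)/ 20130101000000",
        "com,d)/ 20130101000000",
    ]
    # Tiny blocks so the demo produces more than one block.
    blob, summary = build_zipnum(lines, "part-00000", lines_per_block=3)
    for hit in lookup({"part-00000": blob}, summary, "com,c)/"):
        print(hit)
```

Only the small summary needs to be searched to find a key, and only the blocks that can contain it are decompressed. Reading from a real shard file would presumably seek to the offset instead of slicing an in-memory blob.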