From: Nicholas C. <ni...@kb...> - 2013-06-06 21:23:44
I experimented with an alternative flat-file lookup implementation which caches the first 16 levels of binary search decisions. It could be fun to combine that with compressed blocks.

So unless a prefix spans more than 3000 lines, you only decompress one block of gzip'ed data per lookup? Is this format faster, or is it used predominantly to save disk space?

-Nicholas

> -----Original message-----
> From: Ilya Kreymer [mailto:il...@ar...]
> Sent: 6 June 2013 21:38
> To: arc...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback
> 1.x.x Indexers
>
> I also wanted to provide a brief overview about the new indexing format
> we are using at IA for large indexes.
>
> We refer to this as "zipnum" or "ziplines" and it is basically
> concatenated gzip'd blocks, each with 3000 lines of cdx.
>
> The concatenated .gz file has a corresponding sorted text index summary.
> The text index summary has the first url of each 3000 line block and a
> filename and offset to the full concatenated .gz file.
>
> This allows the full .gz index to be spread over multiple shards, and
> lends itself well to being built in Hadoop.
>
> We have tools to generate the zipnum sharded index in Hadoop as well as
> standalone Java and Python tools.
>
> We are working on providing more documentation of this format, but I
> just wanted to give a brief overview for now.
>
> Using this format, we have been using a similar approach of having a
> full zipnum cluster (updated less frequently) and smaller zipnum
> clusters that are updated daily or hourly, and then re-merged into the
> full zipnum cluster.
>
> Wayback has stable support for reading this data format (via
> ZipNumClusterSearchResultSource), which we have been using for over a
> year, and the tools to generate the format are in the ia-hadoop-tools
> repository. However, we definitely need to provide more documentation
> on using this system.
>
> Please feel free to let us know if you have further questions in the
> meantime.
>
> Ilya,
> Engineer
> IA
>
> On 06/06/2013 12:17 AM, Colin Rosenthal wrote:
> > On 06/04/2013 08:27 PM, Jones, Gina wrote:
> >> -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you
> >> have your content indexed with either of the two. However, if you plan
> >> to combine the indexes into one big index, they need to match.
> >>
> >> -The specific problem we had was with sections of an ongoing crawl.
> >> 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed
> >> with 1.6.X, so if we merge and sort, we would get the 2009 entries
> >> twice, because they do not match exactly (different number of fields).
> >>
> >> -The field configurations for the two versions (as we have them) are:
> >>
> >> 1.4.2: CDX N b h m s k r V g
> >> 1.6.1: CDX N b a m s k r M V g
> >>
> >> For definitions of the fields here is an old reference:
> >> http://archive.org/web/researcher/cdx_legend.php
> >>
> > Thank you, Gina, that is extremely interesting!
> >
> > Colin Rosenthal
> > Netarkivet
> >
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
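
To make the lookup question above concrete, here is a minimal Python sketch of the scheme as described in the quoted overview: a sorted summary of first keys pointing into concatenated gzip blocks. It is only an illustration under stated assumptions, not the Wayback or ia-hadoop-tools implementation. The names build_zipnum and lookup are hypothetical, an in-memory byte string stands in for the .gz shard file, the summary stores a block length next to the offset (the overview only mentions a filename and offset), and the block size is 3 lines instead of 3000 to keep the demo small.

```python
import bisect
import gzip
import io

BLOCK_LINES = 3  # the post describes 3000-line blocks; tiny here for the demo


def build_zipnum(sorted_lines, block_lines=BLOCK_LINES):
    """Return (blob, summary).

    blob is the concatenation of independently gzip'd blocks; summary is a
    list of (first_line_of_block, offset, length) tuples. Storing the length
    is an assumption of this sketch.
    """
    blob = io.BytesIO()
    summary = []
    for i in range(0, len(sorted_lines), block_lines):
        block = sorted_lines[i:i + block_lines]
        member = gzip.compress(("\n".join(block) + "\n").encode("utf-8"))
        summary.append((block[0], blob.tell(), len(member)))
        blob.write(member)
    return blob.getvalue(), summary


def lookup(blob, summary, prefix):
    """Return (matching_lines, number_of_blocks_decompressed)."""
    firsts = [entry[0] for entry in summary]
    # Matches may begin in the tail of the last block whose first line sorts
    # before the prefix, so start one block before the bisect position.
    start = max(bisect.bisect_left(firsts, prefix) - 1, 0)
    matches, blocks_read = [], 0
    for first, offset, length in summary[start:]:
        if first > prefix and not first.startswith(prefix):
            break  # this block starts beyond the prefix range
        raw = gzip.decompress(blob[offset:offset + length])
        blocks_read += 1
        matches.extend(
            line for line in raw.decode("utf-8").splitlines()
            if line.startswith(prefix)
        )
    return matches, blocks_read
```

In this sketch a lookup decompresses only the blocks that can contain the prefix, usually one. It reads one extra block when the matches happen to begin exactly at a block boundary, because the preceding block cannot be ruled out from the summary alone, and more when the matching lines span several blocks.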