From: Colin R. <cs...@st...> - 2013-06-04 10:17:28
On 06/03/2013 08:49 PM, Erik Hetzner wrote:
> At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
>> Dear all,
>>
>> We are planning on updating our Wayback installation and I would
>> like to poll your collective wisdom on the best approach for
>> managing the Wayback index.
>>
>> Currently, our collection is about 2.2 billion items. It is also
>> growing at a rate of approximately 350-400 million records per year.
>>
>> The obvious approach would be to use a sorted CDX file (or files) as
>> the index. I'm, however, concerned about its performance at this
>> scale. Additionally, updating a CDX-based index can be troublesome,
>> especially as we would like to update it continuously as new
>> material is ingested.
>>
>> Any relevant experience and advice you could share on this topic
>> would be greatly appreciated.
>
> Hi Kristinn,
>
> We use 4 different CDX files. One is updated every ten minutes, one
> hourly, one daily, and one monthly. We use the unix sort command to
> sort. This has worked pretty well for us. We aren't doing it in the
> most efficient manner, and we will probably switch to sorting with
> hadoop at some point, but it works pretty well.
>
> best, Erik

Hi Kristinn,

Our strategy for building CDX indexes is described at
https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication .

Essentially, we have multiple threads creating unsorted CDX files for all
new arc/warc files in the archive. These are then sorted and merged into
an intermediate index file. When the intermediate file grows larger than
100MB, it is merged with the current main index file, and when that grows
larger than 50GB we roll over to a new main index file.

We currently have about 5TB of CDX index in total. This includes 16 older
CDX files of 150GB-300GB each, built by hand-rolled scripts before we had
a functional automatic indexing workflow.

We would be fascinated to hear if anyone is using an entirely different
strategy (e.g. bdb) for a large archive.

One of our big issues at the moment is QA of our CDX files: how can we be
sure that our indexes actually cover all the files and records in the
archive?

Colin Rosenthal
IT-Developer
Netarkivet, Denmark
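For anyone curious what the sort-and-merge step above looks like in practice, here is a minimal shell sketch. All file names, the sample records, and the directory layout are illustrative stand-ins, not Netarkivet's actual configuration; the key points are forcing the C locale (so `sort`'s byte order matches the binary search Wayback performs on the index) and using `sort -m` to merge already-sorted inputs cheaply.

```shell
#!/bin/sh
# Sketch: sort a batch of unsorted cdx files, then merge the sorted
# batch into a growing intermediate index. Illustrative paths only.
set -e

# CDX indexes must be byte-ordered; the C locale guarantees that,
# regardless of the system's default collation.
export LC_ALL=C

WORK=$(mktemp -d)
UNSORTED="$WORK/unsorted"
mkdir -p "$UNSORTED"

# Stand-ins for freshly produced, unsorted cdx output from two (w)arcs.
printf 'com,example)/b 20130601 ...\ncom,example)/a 20130602 ...\n' \
    > "$UNSORTED/1.cdx"
printf 'org,example)/z 20130603 ...\n' > "$UNSORTED/2.cdx"
: > "$WORK/intermediate.cdx"   # empty intermediate index to start

# 1. Sort the whole new batch into a single sorted file.
sort "$UNSORTED"/*.cdx > "$WORK/batch.cdx"

# 2. Merge it into the intermediate index. -m only merges, so both
#    inputs must already be sorted; this is O(n) rather than a re-sort.
sort -m "$WORK/intermediate.cdx" "$WORK/batch.cdx" \
    > "$WORK/intermediate.new"
mv "$WORK/intermediate.new" "$WORK/intermediate.cdx"

head -1 "$WORK/intermediate.cdx"   # smallest key now comes first
```

The same `sort -m` call would serve for the later steps too (intermediate into main, and rollover housekeeping when the main file passes the size threshold); a size check with `wc -c` or `stat` around it is all that differs.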