From: Alex O. <AOS...@nl...> - 2015-07-14 03:03:23
Hi Gina,

I don't know how immediately helpful this is to you, but I figured I'd share our approach.

Our largest index is around 2.5 TB when stored as uncompressed CDX files. We're currently storing it in a packed binary format in RocksDB with Snappy block compression (much faster than gzip, though it doesn't compress quite as well). This gives us near-realtime incremental updates and more than sufficient performance, and it reduces the index size to around 550 GB.

Our general hardware infrastructure consists of a small number of fairly powerful servers with local consumer-level SSDs (which we use heavily for Solr indexes) and a big NFS disk array for bulk content storage. We've experimented on and off with Hadoop but generally found it more of a complication than a help at our scale.

Until recently we'd been manually sorting together CDX files stored on NFS, but as we increased the frequency of Heritrix crawling it quickly became difficult to manage. We considered building some tooling to automate CDX file management, but what we really wanted was a centralised index server that we could incrementally dump records into and later query from multiple tools. We also wanted to reduce the size of the index so that we could comfortably fit it on SSD for fast queries.

While we've heard bad things about large Wayback BDB indexes, there are a number of other key-value stores available now, and SSD storage can be a game changer for what's practical. We first experimented with LevelDB and then moved to RocksDB, which we found worked a bit better with larger indexes, particularly during the initial data load.

The source code for our index server is here:

https://github.com/nla/tinycdxserver

While we are using it in production, it's still rather experimental and lacking a lot of functionality you'd expect in an out-of-the-box application (like deletes!).

Cheers,

Alex

--
Alex Osborne
National Library of Australia

________________________________
From: Jones, Gina [gj...@lo...]
Sent: Tuesday, July 14, 2015 1:40 AM
To: arc...@li...
Subject: [Archive-access-discuss] Compressed and surted CDX files

We are looking at making a giant leap in access to our content. Uncompressed CDXs would currently be around 3.5-4 TB and continue to grow, and since we cat and sort indexes into bigger indexes for Wayback efficiency, this is somewhat of a concern to us.

I did some web searches to see if I could find any information on how to structure ourselves for a compressed world. It looks like IA, BA and Common Crawl are using compressed indexes, and from the discussions I found, we would use what is currently configurable in the CDX server to manage access. Beyond that, I don't have a clue how to create compressed indexes during the indexing process. It doesn't seem efficient to uncompress to cat/sort and then compress back up.

We just have a plain vanilla Wayback VM running Java 7. We don't have a Hadoop infrastructure for ZipNum clusters. We will be approaching 1 PB of content soon.

Any recommendations or pointers to information on how we can more efficiently index, store and serve up our content? Or possibly a volunteer to help mentor us to move in the right direction and develop best practices? Either to help us figure out what we need to do to get up and running, or to help us document requirements to submit to our information technology services if we need better infrastructure?

Thanks,
Gina

Gina Jones
Web Archiving Team
Library of Congress
202-707-6604
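
P.S. For anyone who wants to experiment, here's a minimal sketch of opening RocksDB with Snappy block compression from Java, roughly along the lines of what I described above. The key and value layout shown is purely illustrative (a SURT-ordered URL plus timestamp key, with a plain-text value standing in for a packed record); it is not the actual record format, which lives in the tinycdxserver source.

    import org.rocksdb.CompressionType;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    import java.nio.charset.StandardCharsets;

    public class CdxIndexSketch {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();

            // Snappy block compression: faster than gzip, somewhat lower ratio.
            try (Options options = new Options()
                    .setCreateIfMissing(true)
                    .setCompressionType(CompressionType.SNAPPY_COMPRESSION);
                 RocksDB db = RocksDB.open(options, "/data/cdx-index")) {

                // Illustrative key: SURT-ordered URL + timestamp, so captures of
                // the same URL sort together and range scans answer Wayback queries.
                byte[] key = "au,gov,nla)/ 20150714030000"
                        .getBytes(StandardCharsets.UTF_8);

                // Illustrative value: in practice a packed binary record
                // (WARC filename, offset, status, digest), not plain text.
                byte[] value = "example.warc.gz 12345 200 sha1:EXAMPLE"
                        .getBytes(StandardCharsets.UTF_8);

                db.put(key, value);          // incremental updates are just puts
                byte[] found = db.get(key);  // point lookup; range queries use iterators
            }
        }
    }

The key ordering does the same job the sort step does for flat CDX files: because RocksDB keeps keys sorted, an iterator seek to a SURT prefix replaces cat-and-sort of the whole index.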