From: Jones, G. <gj...@lo...> - 2015-07-13 15:41:01
We are looking at making a giant leap in access to our content. Our uncompressed CDX indexes would currently be roughly 3.5 to 4 TB and continue to grow, and since we cat and sort indexes into bigger indexes for Wayback efficiency, this is somewhat of a concern for us. I did some web searching to see if I could find information on how to move ourselves into a compressed world. It looks like IA, BA, and Common Crawl are using compressed indexes, and from the discussions I found, we would use what is currently configurable for the CDX server to manage access. Beyond that, I don't have a clue how to create compressed indexes during the indexing process; it doesn't seem efficient to uncompress to cat/sort and then compress back up.

We just have a plain vanilla Wayback VM running Java 7, and we don't have a Hadoop infrastructure for ZipNum clusters. We will be approaching 1 PB of content soon.

Any recommendations or pointers to information on how we can more efficiently index, store, and serve up our content? Or possibly a volunteer to help mentor us to move in the right direction and develop best practices? Either to help us figure out what we need to do to get up and running, or to help us document requirements to submit to our information technology services if we need better infrastructure?

Thanks,
Gina

Gina Jones
Web Archiving Team
Library of Congress
202-707-6604
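
As I understand it from the pywb and IA cdx-server documentation, building a ZipNum-style cluster does not strictly require Hadoop: the sorted CDX is cut into blocks of a few thousand lines, each block is gzipped as an independent member and appended to a part file, and a small plain-text summary index records the first key, part file, byte offset, and compressed length of each block, so the CDX server can binary-search the summary and decompress only the one block a query needs. Below is a minimal sketch of that layout under my own assumptions (the names build_zipnum and LINES_PER_BLOCK, the 3000-line block size, and the summary column order are all made up for illustration, not taken from IA's tooling).

    #!/usr/bin/env python
    # Hypothetical sketch: build a ZipNum-style compressed CDX cluster from an
    # already-sorted, uncompressed CDX file. Names, block size, and the summary
    # index layout are assumptions, not IA's actual tooling.
    import gzip
    import io

    LINES_PER_BLOCK = 3000   # commonly cited block size; adjustable

    def write_block(block, part, idx, part_name, offset):
        # Each block is an independent gzip member, so it can be decompressed
        # on its own once the summary index points a query at it.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            gz.write(''.join(block).encode('utf-8'))
        data = buf.getvalue()
        part.write(data)
        # Summary line: first CDX key in the block (urlkey + timestamp),
        # part file name, byte offset, compressed length.
        first_key = ' '.join(block[0].split(' ', 2)[:2])
        idx.write('%s\t%s\t%d\t%d\n' % (first_key, part_name, offset, len(data)))
        return offset + len(data)

    def build_zipnum(sorted_cdx_path, part_path, idx_path):
        offset = 0
        with open(sorted_cdx_path, 'rt') as cdx, \
             open(part_path, 'wb') as part, \
             open(idx_path, 'wt') as idx:
            block = []
            for line in cdx:
                block.append(line)
                if len(block) == LINES_PER_BLOCK:
                    offset = write_block(block, part, idx, part_path, offset)
                    block = []
            if block:
                write_block(block, part, idx, part_path, offset)

    if __name__ == '__main__':
        build_zipnum('all.sorted.cdx', 'cluster-part-00.cdx.gz', 'cluster.idx')

As far as I can tell, the resulting summary index stays small enough (a few GB for a multi-terabyte CDX) to keep uncompressed and binary-search directly, so the main remaining cost is producing the sorted input rather than running a Hadoop cluster.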