From: Jones, G. <gj...@lo...> - 2015-07-13 15:41:01
We are looking at making a giant leap in access to our content. Our uncompressed CDX indexes would currently be roughly 3.5 to 4 TB and continue to grow, and since we cat and sort indexes into bigger indexes for Wayback efficiency, this is somewhat of a concern for us. I did some web searching to see if I could find information on how to move ourselves into a compressed world. It looks like IA, BA, and Common Crawl are using compressed indexes, and from the discussions I found, we would use what is currently configurable for the CDX server to manage access. Beyond that, I don't have a clue how to create compressed indexes during the indexing process; it doesn't seem efficient to uncompress to cat/sort and then compress back up.

We just have a plain vanilla Wayback VM running Java 7, and we don't have a Hadoop infrastructure for ZipNum clusters. We will be approaching 1 PB of content soon.

Any recommendations or pointers to information on how we can more efficiently index, store, and serve up our content? Or possibly a volunteer to help mentor us to move in the right direction and develop best practices? Either to help us figure out what we need to do to get up and running, or to help us document requirements to submit to our information technology services if we need better infrastructure?

Thanks,
Gina

Gina Jones
Web Archiving Team
Library of Congress
202-707-6604
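
As I understand it from the pywb and IA cdx-server documentation, building a ZipNum-style cluster does not strictly require Hadoop: the sorted CDX is cut into blocks of a few thousand lines, each block is gzipped as an independent member and appended to a part file, and a small plain-text summary index records the first key, part file, byte offset, and compressed length of each block, so the CDX server can binary-search the summary and decompress only the one block a query needs. Below is a minimal sketch of that layout under my own assumptions (the names build_zipnum and LINES_PER_BLOCK, the 3000-line block size, and the summary column order are all made up for illustration, not taken from IA's tooling).

    #!/usr/bin/env python
    # Hypothetical sketch: build a ZipNum-style compressed CDX cluster from an
    # already-sorted, uncompressed CDX file. Names, block size, and the summary
    # index layout are assumptions, not IA's actual tooling.
    import gzip
    import io

    LINES_PER_BLOCK = 3000   # commonly cited block size; adjustable

    def write_block(block, part, idx, part_name, offset):
        # Each block is an independent gzip member, so it can be decompressed
        # on its own once the summary index points a query at it.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            gz.write(''.join(block).encode('utf-8'))
        data = buf.getvalue()
        part.write(data)
        # Summary line: first CDX key in the block (urlkey + timestamp),
        # part file name, byte offset, compressed length.
        first_key = ' '.join(block[0].split(' ', 2)[:2])
        idx.write('%s\t%s\t%d\t%d\n' % (first_key, part_name, offset, len(data)))
        return offset + len(data)

    def build_zipnum(sorted_cdx_path, part_path, idx_path):
        offset = 0
        with open(sorted_cdx_path, 'rt') as cdx, \
             open(part_path, 'wb') as part, \
             open(idx_path, 'wt') as idx:
            block = []
            for line in cdx:
                block.append(line)
                if len(block) == LINES_PER_BLOCK:
                    offset = write_block(block, part, idx, part_path, offset)
                    block = []
            if block:
                write_block(block, part, idx, part_path, offset)

    if __name__ == '__main__':
        build_zipnum('all.sorted.cdx', 'cluster-part-00.cdx.gz', 'cluster.idx')

As far as I can tell, the resulting summary index stays small enough (a few GB for a multi-terabyte CDX) to keep uncompressed and binary-search directly, so the main remaining cost is producing the sorted input rather than running a Hadoop cluster.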