I have around 5 billion strings, with many duplicates, and I need to count the occurrences of each unique string. Each string is less than 25 characters. So I wrote a class that stores a fixed size string of 25 characters in an array and I set up an stxxl::map with the fixed size string class as the key and an integer counter as the data. This works well except that performance quickly drops below usable levels. When there are around 5 million entries in the map, it takes 3-4 seconds to process 1 million entries. But when there are around 15 million entries in the map, it takes around an hour to process 1 million entries.
I have set up the map like this:
typedef stxxl::map<fixed_size_string, int, comp_type, DATA_NODE_BLOCK_SIZE, DATA_LEAF_BLOCK_SIZE> string_count_map_t;
string_count_map_t string_count_map(NODE_BLOCK_SIZE * 1024 * 128, LEAF_BLOCK_SIZE * 1024 * 128);
If I reduce the size of the block and leaf caches, performance drops earlier.
Is there a way that I can maintain good performance as I add more than 15 million entries to my map?
Why don't you just sort (stxxl::sort) the array (stxxl::vector) and then count the number of occurrences of each string?
Sorting 125 GB is an easy task and needs to transfer in total 0.5 TB from/to disk for reasonable parameters (B, M), followed by a scan for the counting … and if you use the streaming layer you can even save more I/Os.
algo/copy_and_sort_file.cpp may be a starting point, but instead of feeding the sorted stream directly into materialize you'll need to insert a stream operation that produces a stream of (string25, count) tuples.
I don't remember the I/O guarantees for the map … but unstructured access probably costs O(1) I/Os per element if you overflow the cache …
That's an excellent idea. I'll do that. Thanks.
A followup question: What is the state of the stxxl::unordered_map? I was considering that as a possible alternative to stxxl::map, but the thread I was reading about stxxl::unordered_map stopped abruptly some time in 2011. I downloaded that branch and it compiled ok.
unordered_map hasn't undergone intensive testing … and I wouldn't expect better performance for your use case
@anbe: Thank you for the information.
Log in to post a comment.
Sign up for the SourceForge newsletter:
You seem to have CSS turned off.
Please don't fill out this field.