I noticed that in a very large multi-file corpus (circa 1 million files), the prepare_corpus process was running out of memory while allocating space to store the list of file names. This made little sense to me, so I looked at what was happening:
For every filename, a new 8 KB buffer (the exact size is system dependent) was being allocated, so 1 million filenames required roughly 8 GB of buffers. This was clearly unnecessary, as the list of files was stored in a 1.5 megabyte file.
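To make the numbers concrete, the old behaviour was roughly equivalent to the sketch below. This is hypothetical code, not the actual tokenize.c source; BUFSIZE, the list-file name, and the loop structure are all placeholders. The point is simply that one full-size buffer per filename multiplies out to about 8 GB for a million names.

```c
/* Hypothetical sketch of the per-filename allocation pattern.
 * BUFSIZE stands in for the system-dependent ~8 KB buffer size. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSIZE 8192   /* assumed; system dependent in the real code */

int main(void)
{
    FILE *listf = fopen("filelist.txt", "r");   /* placeholder list file */
    char **names = NULL;
    size_t count = 0;
    char line[BUFSIZE];

    if (!listf)
        return 1;

    while (fgets(line, sizeof line, listf)) {
        /* One full-size buffer per filename: 1,000,000 names -> ~8 GB. */
        char *buf = malloc(BUFSIZE);
        if (!buf)
            return 1;
        strcpy(buf, line);

        char **grown = realloc(names, (count + 1) * sizeof *names);
        if (!grown)
            return 1;
        names = grown;
        names[count++] = buf;
    }
    fclose(listf);

    printf("loaded %zu filenames, ~%zu MB of buffers\n",
           count, count * (size_t)BUFSIZE / (1024 * 1024));
    return 0;
}
```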
Instead of allocating a new buffer for each filename, this patch has tokenize.c allocate a single buffer the same size as the file containing the list of filenames, and the filenames are loaded into that buffer.
This approach may not work if wide characters are needed in the list of filenames, but it works for me.
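The sketch below shows the general idea, again as hypothetical code rather than the patch itself (the file name and helper structure are placeholders): size one allocation from the list file, read the whole list into it, then split it in place by turning newlines into NUL terminators.

```c
/* Hypothetical sketch of the single-buffer approach:
 * size the allocation from the list file itself, then index into it. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    const char *listpath = "filelist.txt";      /* placeholder list file */
    struct stat st;
    if (stat(listpath, &st) != 0)
        return 1;

    /* One allocation the size of the whole file list (a few megabytes),
     * instead of one ~8 KB buffer per filename. */
    char *buf = malloc((size_t)st.st_size + 1);
    if (!buf)
        return 1;

    FILE *f = fopen(listpath, "r");
    if (!f)
        return 1;
    size_t len = fread(buf, 1, (size_t)st.st_size, f);
    fclose(f);
    buf[len] = '\0';

    /* Split in place: each newline becomes a NUL terminator, and we keep
     * a pointer to the start of each filename inside the big buffer. */
    size_t count = 0;
    char **names = NULL;
    for (char *p = buf; *p; ) {
        char **grown = realloc(names, (count + 1) * sizeof *names);
        if (!grown)
            return 1;
        names = grown;
        names[count++] = p;

        char *nl = strchr(p, '\n');
        if (!nl)
            break;
        *nl = '\0';
        p = nl + 1;
    }

    printf("loaded %zu filenames in one %lld-byte buffer\n",
           count, (long long)st.st_size);
    return 0;
}
```

The pointer table still grows as filenames are found, but it holds only pointers (a few bytes each), not 8 KB buffers, so a million entries costs on the order of megabytes rather than gigabytes.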
Correction: it was a 15 megabyte file, not 1.5. I can count.