I noticed that in a very large multi-file corpus (circa 1 million files), the prepare_corpus process was running out of memory while allocating space to store the list of file names. This made little sense to me, so I looked at what was happening:
For every filename, a new 8 KB buffer (the exact size is system dependent) was being allocated, so 1 million filenames required roughly 8 GB of buffers. This was clearly unnecessary, as the list of files was stored in a 1.5 megabyte file.
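To make the numbers concrete, the old behaviour was roughly equivalent to the sketch below. This is hypothetical code, not the actual tokenize.c source; BUFSIZE, the list-file name, and the loop structure are all placeholders. The point is simply that one full-size buffer per filename multiplies out to about 8 GB for a million names.

```c
/* Hypothetical sketch of the per-filename allocation pattern.
 * BUFSIZE stands in for the system-dependent ~8 KB buffer size. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSIZE 8192   /* assumed; system dependent in the real code */

int main(void)
{
    FILE *listf = fopen("filelist.txt", "r");   /* placeholder list file */
    char **names = NULL;
    size_t count = 0;
    char line[BUFSIZE];

    if (!listf)
        return 1;

    while (fgets(line, sizeof line, listf)) {
        /* One full-size buffer per filename: 1,000,000 names -> ~8 GB. */
        char *buf = malloc(BUFSIZE);
        if (!buf)
            return 1;
        strcpy(buf, line);

        char **grown = realloc(names, (count + 1) * sizeof *names);
        if (!grown)
            return 1;
        names = grown;
        names[count++] = buf;
    }
    fclose(listf);

    printf("loaded %zu filenames, ~%zu MB of buffers\n",
           count, count * (size_t)BUFSIZE / (1024 * 1024));
    return 0;
}
```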
Instead of allocating a new buffer for each filename, this patch has tokenize.c allocate a single buffer the same size as the file containing the list of filenames, and the filenames are loaded into that buffer.
This approach may not work if wide characters are needed in the list of filenames, but it works for me.
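The sketch below shows the general idea, again as hypothetical code rather than the patch itself (the file name and helper structure are placeholders): size one allocation from the list file, read the whole list into it, then split it in place by turning newlines into NUL terminators.

```c
/* Hypothetical sketch of the single-buffer approach:
 * size the allocation from the list file itself, then index into it. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    const char *listpath = "filelist.txt";      /* placeholder list file */
    struct stat st;
    if (stat(listpath, &st) != 0)
        return 1;

    /* One allocation the size of the whole file list (a few megabytes),
     * instead of one ~8 KB buffer per filename. */
    char *buf = malloc((size_t)st.st_size + 1);
    if (!buf)
        return 1;

    FILE *f = fopen(listpath, "r");
    if (!f)
        return 1;
    size_t len = fread(buf, 1, (size_t)st.st_size, f);
    fclose(f);
    buf[len] = '\0';

    /* Split in place: each newline becomes a NUL terminator, and we keep
     * a pointer to the start of each filename inside the big buffer. */
    size_t count = 0;
    char **names = NULL;
    for (char *p = buf; *p; ) {
        char **grown = realloc(names, (count + 1) * sizeof *names);
        if (!grown)
            return 1;
        names = grown;
        names[count++] = p;

        char *nl = strchr(p, '\n');
        if (!nl)
            break;
        *nl = '\0';
        p = nl + 1;
    }

    printf("loaded %zu filenames in one %lld-byte buffer\n",
           count, (long long)st.st_size);
    return 0;
}
```

The pointer table still grows as filenames are found, but it holds only pointers (a few bytes each), not 8 KB buffers, so a million entries costs on the order of megabytes rather than gigabytes.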
Correction: it was a 15 megabyte file, not 1.5. I can count.