
#2 Tokenize.c memory fix with a large multiple file corpus

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2007-10-02
Created: 2007-10-02
Private: No

I noticed that on a very large multiple-file corpus (circa 1 million files), the prepare_corpus process was running out of memory while allocating space to store the list of file names. This made little sense to me, so I looked at what was happening:

For every filename, a new 8 KB buffer (a system-dependent value) was being allocated, so 1 million filenames required roughly 8 GB of buffers. This was clearly unnecessary, as the list of files was stored in a 1.5 megabyte file.
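
For illustration only, the pattern described above looks roughly like the following sketch; the buffer-size constant, function name, and variable names are assumptions, not the actual tokenize.c code:

    #include <stdio.h>
    #include <stdlib.h>

    #define FNAME_BUF_SIZE 8192   /* assumed system-dependent size, ~8 KB */

    /* Illustrative sketch of the original behavior: a fresh fixed-size
     * buffer is allocated for every filename read from the list file,
     * so ~1 million filenames cost roughly 8 GB of buffers. */
    static char **load_filenames_one_buffer_each(FILE *listfile, long num_files)
    {
        char **names = malloc(num_files * sizeof(char *));
        if (names == NULL)
            return NULL;
        for (long i = 0; i < num_files; i++) {
            names[i] = malloc(FNAME_BUF_SIZE);   /* one 8 KB buffer per filename */
            if (names[i] == NULL
                || fgets(names[i], FNAME_BUF_SIZE, listfile) == NULL)
                break;
        }
        return names;
    }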

Instead of allocating a new buffer for each filename, this patch causes tokenize.c to allocate a single buffer the same size as the list of filenames and load the filenames into that buffer.
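
Below is a minimal sketch of that single-buffer approach, assuming a newline-separated list file; the function name and error handling are illustrative rather than the actual patch code. The whole list is read into one buffer sized from the file, and each newline is overwritten with a terminator so the entries can be used as ordinary C strings:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch of the patched approach: read the whole
     * filename list into a single buffer sized to the list file, then
     * record a pointer into it for each entry.  Identifiers are
     * hypothetical, not the actual tokenize.c code. */
    static char *load_filename_list(const char *listpath,
                                    char ***names_out, long *count_out)
    {
        FILE *f = fopen(listpath, "rb");
        if (f == NULL)
            return NULL;

        /* Size one buffer to the list file itself. */
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fseek(f, 0, SEEK_SET);

        char *buf = malloc(size + 1);
        if (buf == NULL) { fclose(f); return NULL; }
        long got = (long) fread(buf, 1, (size_t) size, f);
        fclose(f);
        buf[got] = '\0';

        /* One entry per line in the list file. */
        long count = 0;
        for (long i = 0; i < got; i++)
            if (buf[i] == '\n')
                count++;

        char **names = malloc((count ? count : 1) * sizeof(char *));
        if (names == NULL) { free(buf); return NULL; }

        long n = 0;
        char *p = buf;
        while (n < count) {
            names[n++] = p;              /* start of this filename */
            char *nl = strchr(p, '\n');
            if (nl == NULL)
                break;
            *nl = '\0';                  /* terminate the name in place */
            p = nl + 1;
        }

        *names_out = names;
        *count_out = n;
        return buf;   /* caller frees both buf and names */
    }

With this layout the total allocation stays close to the size of the list file itself, plus one pointer per filename, instead of growing by 8 KB per name.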

This approach may not work if wide characters are necessary in the list of filenames, but it does work for me.

Discussion

  • Albert Bertram - 2007-10-02

    Rather, it was a 15 megabyte file, not 1.5. I can count.
