#7 Segmentation Faults for large number of classes

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2012-01-24

Created: 2012-01-24

Creator: Marjan Van de Kauter

Private: No

I want to train lemmatization models for some highly inflected languages (Dutch, German and French), so I have a large number of possible classes (1000-2000). Training ends in a segmentation fault after about 7 iterations. I checked my training data (about 100,000 words for each language) and there is nothing wrong with it. So it seems to be a memory usage problem, even though I'm using a 64 GB machine. Has anyone encountered this problem before? I also tried both even and odd numbers of threads, but this didn't change anything. What can I do about it? Training works when I use a smaller number of training features, but this results in lower lemmatization accuracy.

Discussion

Adam Radziszewski - 2012-02-14

Confirmed. CRF++ eats up ridiculous amounts of memory when training on a dataset with a large number of decision classes.

My case: using crf_learn with default args on a training file of 1.2 million lines (77k sentences).
The columns are the following: wordform, POS ambitag (set of possible values), case ambitag (set of possible values of grammatical case), gender ambitag, number ambitag, and the tag to be chosen (decision class).
There are 925 different values of the decision class (last column, tag). The second column also has quite a number of different values: 241.
Observed behaviour: crf_learn reads the training data and, before finishing the first iteration, takes up 40 GB of RAM and is still growing. I can't go any further; it is already into my swap.

Is the underlying algorithm prone to combinatorial explosion in such cases?
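
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. It assumes a standard linear-chain CRF in which every token gets one lattice node per class and adjacent tokens are fully connected, so each token transition contributes L * L edges, and each distinct bigram feature string contributes L * L weights (each unigram string contributes L). The function name and its parameters are made up for illustration; whether crf_learn actually keeps the whole lattice in memory at once is an assumption, not something established in this thread.

```python
def crf_size_estimate(num_labels, num_tokens, num_sentences):
    """Rough size counts for a dense linear-chain CRF (illustrative only).

    Assumes one node per (token, label) pair and a fully connected
    label-to-label edge set between adjacent tokens of a sentence.
    """
    # Transitions exist only inside a sentence: (length - 1) per sentence.
    transitions = num_tokens - num_sentences
    return {
        # Weights generated by ONE distinct unigram feature string.
        "weights_per_unigram_string": num_labels,
        # Weights generated by ONE distinct bigram feature string.
        "weights_per_bigram_string": num_labels * num_labels,
        # Nodes and edges if the lattice of every sentence is held at once.
        "lattice_nodes": num_tokens * num_labels,
        "lattice_edges": transitions * num_labels * num_labels,
    }


# Figures from the comment above: 925 classes, 1.2 million lines, 77k sentences
# (treating every line as a token, which slightly overestimates).
estimate = crf_size_estimate(num_labels=925, num_tokens=1_200_000, num_sentences=77_000)
for name, value in estimate.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the edge count is on the order of 10^12, so even a few bytes per edge would far exceed 40 GB. The growth is quadratic in the number of classes rather than combinatorial, but with 925 classes that is already enough to explain the behaviour if the lattice is stored densely.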
