

#7 Segmentation Faults for large number of classes


I want to train lemmatization models for some highly inflected languages (Dutch, German and French), so I have a large number of possible classes (1000-2000). Training results in a segmentation fault after about 7 iterations. I checked my training data (about 100,000 words per language) and there is nothing wrong with it, so it seems to be a memory usage problem, even though I'm running on a machine with 64 GB of RAM. Has anyone encountered this problem before? I also tried even and odd numbers of threads, but that didn't change anything. What can I do about it? Training works when I use a smaller number of features, but that lowers the lemmatization accuracy.
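For what it's worth, the number of features CRF++ generates is controlled by the template file, so trimming templates is the usual lever for memory. A minimal sketch of a CRF++ template file (the column indices here are illustrative, not taken from the question):

```
# Each unigram template line expands into (#distinct strings x #labels) weights.
U00:%x[0,0]
U01:%x[-1,0]
U02:%x[1,0]

# The bare bigram template expands into #labels^2 weights per feature string;
# with 1000-2000 classes it is usually the first thing worth dropping.
B
```

Dropping `B` and raising the frequency cutoff (`crf_learn -f`) are the least invasive ways to shrink the model before cutting actual features.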


  • Confirmed. CRF++ eats up ridiculous amounts of memory when training on a dataset with a large number of decision classes.

    My case: running crf_learn with default args on a training file of 1.2 million lines (77k sentences).
    The columns are: wordform, POS ambitag (set of possible values), case ambitag (set of possible values of grammatical case), gender ambitag, number ambitag, and the tag to be chosen (decision class).
    There are 925 different values of the decision class (the last column, tag). The second column also has quite a number of distinct values: 241.
    Observed behaviour: crf_learn reads the training data and, before finishing the first iteration, takes up 40 GB of RAM and still counting. It can't go any further; it's already into my swap.

    Is it the case that the underlying algorithm is prone to combinatorial explosion in such a setting?
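    The suspected explosion is easy to put into numbers. A back-of-the-envelope sketch, assuming (as CRF++'s template expansion suggests, though I haven't verified it against the source) one 8-byte double per (feature string, label) pair for unigram templates and per (feature string, label, label) triple for bigram templates; the feature counts below are hypothetical, not from this report:

    ```python
    # Rough estimate of the size of a CRF's dense weight vector.
    # Assumption: 8 bytes per weight; unigram templates cost
    # (#feature strings x #labels), bigram templates cost
    # (#feature strings x #labels^2).

    def crf_weight_bytes(n_unigram_strings, n_bigram_strings, n_labels,
                         bytes_per_weight=8):
        """Approximate bytes needed for the weight vector alone."""
        unigram = n_unigram_strings * n_labels
        bigram = n_bigram_strings * n_labels * n_labels
        return (unigram + bigram) * bytes_per_weight

    # Hypothetical 500k distinct unigram feature strings, one bigram
    # feature string, and the 925 labels reported above:
    print(crf_weight_bytes(500_000, 1, 925) / 2**30)  # roughly 3.5 GiB
    ```

    On top of that, crf_learn appears to keep a gradient-sized buffer per worker thread, so actual usage is a multiple of this estimate; once a bigram template meets ~1000 labels and a rich feature set, 40 GB stops being surprising.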