I'm running into issues creating a language model with the CMU-Cam toolkit.
At its core, the problem is converting the text-based ARPA model to the
binary DMP format via sphinx_lm_convert.
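For reference, the conversion step is essentially the following (the filenames
are placeholders; -i and -o are, as far as I know, sphinx_lm_convert's
standard input/output flags):

    sphinx_lm_convert -i wiki.arpa -o wiki.lm.DMP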
With vocabularies of both 20,000 and 30,000 words, culled from a text corpus
made of every 30th line of the English-language Wikipedia dump (build steps
sketched below), I get a segfault with an error saying that the size of the
trigram segment is > 65535.
Sorry, but it seems that any decent-sized vocabulary will run up against this
limit. Are there any workarounds?
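For reproduction, this is roughly how the corpus and the ARPA model were
built. Filenames and the -top cutoff are illustrative, and the invocations
follow the usual CMU-Cam text2wfreq / wfreq2vocab / text2idngram / idngram2lm
pipeline as I understand it:

    # keep every 30th line of the plain-text Wikipedia dump
    awk 'NR % 30 == 0' enwiki.txt > corpus.txt

    # build a 30,000-word vocabulary from the corpus
    text2wfreq < corpus.txt | wfreq2vocab -top 30000 > wiki.vocab

    # count id n-grams and estimate the ARPA trigram model
    text2idngram -vocab wiki.vocab -idngram wiki.idngram < corpus.txt
    idngram2lm -vocab wiki.vocab -idngram wiki.idngram -arpa wiki.arpa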
Provide the language model you are trying to convert.