I've been having trouble getting a correct and working language model in Dutch, so after a while I tried using the standard Sphinx toolkit. That one actually worked.
I've been testing a lot of different content for the language model, so I got annoyed by manually writing and converting text into a normalized corpus.
That is why I created a pipeline. Its input is a text file with punctuation, and its output is an LM file plus a dictionary that contains only the words that occur in the LM.
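To give an idea of what the text-to-corpus step involves, here is a minimal sketch. It is not the code from the repository; the class name `CorpusSketch`, the method names, and the naive sentence splitting are all assumptions made for illustration. It lowercases the text, strips punctuation, wraps each sentence in the `<s> ... </s>` markers that the CMU language model tools expect, and collects the vocabulary that a matching dictionary would need to cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not the actual pipeline from the repository.
public class CorpusSketch {

    // Turn raw, punctuated text into corpus lines of the form "<s> words </s>".
    static List<String> toCorpusLines(String rawText) {
        List<String> lines = new ArrayList<>();
        // Naive sentence split on '.', '!' and '?'.
        // Assumption: abbreviations and decimal numbers are not handled.
        for (String sentence : rawText.split("[.!?]+")) {
            String cleaned = sentence
                    .toLowerCase(Locale.ROOT)
                    // Keep letters (including accented ones), digits,
                    // apostrophes and whitespace; replace the rest by a space.
                    .replaceAll("[^\\p{L}\\p{Nd}'\\s]", " ")
                    .trim()
                    .replaceAll("\\s+", " ");
            if (!cleaned.isEmpty()) {
                lines.add("<s> " + cleaned + " </s>");
            }
        }
        return lines;
    }

    // Collect the unique words in the corpus, skipping the sentence markers.
    // A dictionary restricted to the LM would contain exactly these words.
    static SortedSet<String> vocabulary(List<String> corpusLines) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String line : corpusLines) {
            for (String token : line.split(" ")) {
                if (!token.equals("<s>") && !token.equals("</s>")) {
                    vocab.add(token);
                }
            }
        }
        return vocab;
    }

    public static void main(String[] args) {
        List<String> corpus = toCorpusLines("Hallo, wereld! Hoe gaat het?");
        corpus.forEach(System.out::println);
        System.out.println(vocabulary(corpus));
    }
}
```

Note that this sketch also splits hyphenated words into separate tokens, which may or may not be what you want for Dutch. The pronunciations for each vocabulary word would still have to come from an existing Dutch dictionary.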
If anyone is interested in using it or giving me feedback on it, please do!
https://github.com/Hespen/Java---CMU-Sphinx---Text-to-Language-Model
If you want to run everything in Java, I would use BerkeleyLM (https://code.google.com/archive/p/berkeleylm/downloads) for LM estimation instead. It lacks good interpolation and pruning support, though.