I'm using Pocketsphinx with the latest Russian models downloaded from this site (zero_ru_cont_8k_v3), and I'm observing that the source (ARPA) model and the DMP model built from it return different results.
E.g., on the bundled decoder-test.wav sample, the ARPA model returns the expected "илья ильф евгений петров золотой телёнок".
I then converted the model to DMP:
sphinx_lm_convert -i ru.lm -o ru.lm.dmp
and used the DMP model:
pocketsphinx_continuous \
-samprate 8000 \
-lm ru.lm.dmp \
-dict ru.dic \
-hmm zero_ru.cd_cont_4000 \
-logfn /dev/null \
-remove_noise no \
-infile decoder-test.wav
The result was a bit surprising to me: "илья киев евгения петров золотой телёнок".
Is this by design, or did I do something wrong in the conversion?
The DMP format supports only 64k words in the vocabulary, and the Russian model does not fit within that limit. You need to use the ARPA format, or wait until we merge the sphinxbase-trie branch (in a couple of weeks), which adds a new binary format that allows an unlimited vocabulary.
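For anyone who wants to check a model before converting it, here is a rough sketch that reads the unigram count from the \data\ header of an ARPA file and compares it against the 64k limit mentioned above. The function names are made up for illustration, and the exact threshold of 65535 is an assumption based on "64k"; the script relies only on the standard "ngram 1=N" header line of the ARPA format.

```shell
# Print the unigram (vocabulary) count declared in the ARPA \data\ header.
arpa_unigram_count() {
    awk -F= '/^ngram 1=/ { print $2 + 0; exit }' "$1"
}

# Succeed only if the vocabulary fits the assumed DMP limit.
# 65535 is an assumption derived from the "64k words" limit.
fits_dmp() {
    count=$(arpa_unigram_count "$1")
    [ -n "$count" ] && [ "$count" -le 65535 ]
}
```

With these defined, something like `fits_dmp ru.lm || echo "vocabulary too large for DMP, keep the ARPA file"` would flag a model such as the Russian one before sphinx_lm_convert is run on it.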
Nickolay, thanks for your answer!