I am using Sphinx3 and have gotten cmuclmtk to generate the language model, but
I have not found a way to generate the .dict file. Since I am working with a
small phrase list of compound words that may or may not be real words, I need a
way to generate this file from C++ code. I know the online version of lmtool
works, but I need it on my local system.
For an English model you can use the -lts_mismatch no option to use the
internal g2p code. For other languages or phonesets, you need to implement the
g2p code yourself. Various g2p implementations can do that, such as
sequiturg2p, the g2p from flite, or an fst-based g2p.
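
For a small command list, one local alternative to lmtool can be sketched as follows: look each word up in a pronunciation table loaded from an existing dictionary, and fall back to a g2p routine only for words that are not found. This is a minimal sketch, not part of any Sphinx API. The names `PronTable`, `G2pFn` and `buildDict` are hypothetical, the g2p function is whatever implementation you plug in, and the output assumes the usual one-entry-per-line layout of a word followed by its phones.

```cpp
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; not part of any Sphinx API.
using PronTable = std::map<std::string, std::string>;          // word -> "PH PH PH"
using G2pFn = std::function<std::string(const std::string&)>;  // returns "" on failure

// Builds the text of a dictionary file: one "WORD PH PH ..." line per word.
// Words with neither a known nor a predicted pronunciation go into `missing`.
std::string buildDict(const std::vector<std::string>& words,
                      const PronTable& known,
                      const G2pFn& g2p,
                      std::vector<std::string>& missing) {
    std::ostringstream out;
    for (const std::string& w : words) {
        std::string phones;
        auto it = known.find(w);
        if (it != known.end()) {
            phones = it->second;   // pronunciation from the existing dictionary
        } else if (g2p) {
            phones = g2p(w);       // fallback: caller-supplied g2p
        }
        if (phones.empty()) {
            missing.push_back(w);  // report instead of writing a bad entry
        } else {
            out << w << ' ' << phones << '\n';
        }
    }
    return out.str();
}
```

The returned string can be written to the .dic file as-is. Duplicate words in the input are not filtered here, so the word list should already be unique.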
The option works with any of the programs; it is a decoder configuration
setting. And of course it must be -lts_mismatch yes. You can try it with
sphinx3_decode, for example, or with sphinx3_continuous. Also, I really suggest
you try pocketsphinx instead of sphinx3. You can find details about it on the
website.
Doesn't pocketsphinx also need the dic file?
I'm trying to keep the LM and dic as small as possible to increase
accuracy.
What it comes down to is that I need a C++ call that, given a list of words or
compound words, gives me an LM and a dic file.
This list will be under 100 words in size. (Basically a list of the currently
available commands, which can change on the fly.)
(from the nightly build of pocketsphinx)
-hmm ../../../model/hmm/wsj0
-lm ../../../model/lm/turtle/turtle.lm.DMP
-dict ../../../model/lm/turtle/turtle.dic
-ctl ../../../model/lm/turtle/turtle.ctl
-cepdir ../../../model/lm/turtle
-cepext .16k
-adcin TRUE
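
The "command list in, LM and dic out" step described in this post could start with a helper like the one below. It is only a sketch under stated assumptions: `LmInput` and `prepareLmInput` are hypothetical names, upper-casing the words is an assumption about the dictionary's casing, and the `<s> ... </s>` sentence markers follow the corpus layout commonly fed to cmuclmtk. It turns the phrases into corpus text plus a unique word list; running the LM tools and writing the dic file are left out.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for illustration; not part of any Sphinx API.
struct LmInput {
    std::string corpus;           // one "<s> ... </s>" line per phrase
    std::set<std::string> vocab;  // unique, upper-cased words
};

// Splits each phrase on whitespace, upper-cases the words (an assumption
// about the dictionary's casing) and collects corpus lines and vocabulary.
LmInput prepareLmInput(const std::vector<std::string>& phrases) {
    LmInput result;
    for (const std::string& phrase : phrases) {
        std::istringstream in(phrase);
        std::string word;
        std::string line;
        while (in >> word) {
            for (char& c : word) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
            result.vocab.insert(word);
            line += word + ' ';
        }
        if (!line.empty()) {
            result.corpus += "<s> " + line + "</s>\n";  // skip empty phrases
        }
    }
    return result;
}
```

Because the command list can change on the fly, the helper is cheap enough to rerun on every change; `vocab` is then the exact set of words that need dictionary entries.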
The reason for choosing pocketsphinx is not to avoid the dictionary
requirement (all decoders need a dictionary; you can't avoid that). The
reason is that pocketsphinx is supported software with frequent bugfix
releases, a documented API and well-tested performance. With sphinx3 there are
no guarantees.
I still need to generate the dictionary in either case. Any suggestions?
In testing pocketsphinx, too many words were recognized when I was just making
nonsense sounds. (The model had 206 words, most of them compound.) Sphinx3
properly ignored the sounds.
On generating the dictionary: no other suggestions, see -lts_mismatch above.
On pocketsphinx recognizing words from nonsense sounds: that can be fixed if
you provide more information about the problem.
On which program is this option used? So far I am using English only.
If I were using one of those two programs I could, but I am integrating
Sphinx3 into an OCX. I will try to trace what that option does.