I have used SphinxTrain to build acoustic models for Swedish. When using these models with a CFG (with Sphinx 4), I get very good accuracy. However, when building a simple trigram model (LexTreeLinguist + SimpleNGramModel), I get very bad accuracy. Even if I say a phrase that is very frequent in the training material, it is hard to get good recognition. I also tried the HelloNGram example that comes with Sphinx, and I get very bad accuracy there too. For example, if I say "the green one on the lower right side", it is almost impossible for it to get it right. I get results like "the green lot all middle are right side", which should not get a high language model score (some of these trigrams do not even exist in the data). This is a very simple example model that really should work most of the time when I read a sentence from the training material. Since I get very good results with a CFG, this should not be a problem with my microphone or the acoustic models. Have you noticed the same problem?
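For anyone debugging the same symptom: SimpleNGramModel loads a standard ARPA-format LM, and it is the back-off arithmetic that should make unseen trigrams like the ones above expensive. A stripped-down sketch of that lookup (an illustration of the file format and the back-off rule, not the actual Sphinx code; the tiny model below is made up):

```python
def load_arpa(lines):
    """Parse ARPA-format n-gram lines into {order: {ngram: (log10prob, backoff)}}."""
    models, order = {}, None
    for line in lines:
        line = line.strip()
        if line.endswith("-grams:"):          # section header, e.g. "\2-grams:"
            order = int(line[1])
            models[order] = {}
        elif line == "\\end\\":
            break
        elif order is not None and line:
            parts = line.split()
            prob = float(parts[0])
            ngram = tuple(parts[1:1 + order])
            backoff = float(parts[1 + order]) if len(parts) > 1 + order else 0.0
            models[order][ngram] = (prob, backoff)
    return models

def log10_prob(models, ngram):
    """Back-off lookup: the longest match wins; unseen n-grams pay back-off weights."""
    ngram = tuple(ngram)
    n = len(ngram)
    if n == 1:
        return models[1].get(ngram, (-99.0,))[0]   # very low score for true OOVs
    if ngram in models.get(n, {}):
        return models[n][ngram][0]
    backoff = models.get(n - 1, {}).get(ngram[:-1], (0.0, 0.0))[1]
    return backoff + log10_prob(models, ngram[1:])
```

If a hypothesis scores well despite containing trigrams absent from the training data, the LM is behaving as above (back-off, not zero probability), and the real lever is how the decoder weights that LM score against the acoustics.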
Thanks a lot! Those parameter settings really helped. The parameters I was using were taken from the HelloNGram example that comes with Sphinx. They should really be updated.
Yeah, those values were confusing. I've just changed them to more suitable defaults.
Well, everything depends on the config and the recording. Can you please share them?
Ok, I have put together a test set:
http://dl.getdropbox.com/u/110350/testngram.zip
I have recorded some test sentences (which are well represented in the trigrams) with two different speakers (GS & JE). Below are the results. As you can see, they do not represent very likely word sequences.
REF: the closest purple one on the far left side
JE: closest purple one on four left side
GS: that us us that one on the far next side
REF: the green one right in the middle
JE: the green one right little
GS: between one right of middle
REF: the only one left on the left
JE: you near one left colors
GS: the only was the a only left
REF: the purple one on the lower right side
JE: the purple one little right side
GS: the talking one little are right side
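To put numbers on transcripts like the ones above, a quick word-error-rate check is handy; a minimal sketch (plain token-level Levenshtein distance, in the spirit of what Sphinx's accuracy tracker reports):

```python
def wer(ref, hyp):
    """Word error rate: edit distance over tokens, divided by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                           dp[i - 1][j] + 1,                           # deletion
                           dp[i][j - 1] + 1)                           # insertion
    return dp[len(r)][len(h)] / len(r)

print(wer("the green one right in the middle", "the green one right little"))
```

Running it over all REF/hypothesis pairs gives a single comparable figure per speaker instead of eyeballing the transcripts.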
Well, I checked this. First of all, you need a much bigger wordInsertionProbability (around 0.7); the value in the demo is not correct here.
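Some context on why this parameter matters so much: each word transition adds log(wordInsertionProbability) to the path score, so a tiny value makes every extra word extremely expensive and biases the search toward hypotheses with fewer or garbled words. A rough illustration of that arithmetic (the 1e-36 figure is just an example of a far-too-small setting, not a quote from any particular config):

```python
import math

def insertion_penalty(word_insertion_prob, n_words):
    """Total log-domain penalty a path pays for containing n_words words."""
    return n_words * math.log(word_insertion_prob)

# Extra penalty an 8-word hypothesis pays over a 6-word one:
gap_tiny = insertion_penalty(1e-36, 8) - insertion_penalty(1e-36, 6)  # about -165.8
gap_ok = insertion_penalty(0.7, 8) - insertion_penalty(0.7, 6)        # about -0.71
```

With the tiny value, two extra words cost roughly 166 log units, which can easily dwarf the acoustic and LM score differences between hypotheses; at 0.7 the penalty is nearly free, so the acoustic and language models decide.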
Next, where are you from? Are you from the UK? It seems you pronounce some words a bit differently. The HUB4 acoustic model handles your speech correctly, but for WSJ there are differences. I had to fix the dictionary entry for PURPLE, for example, to make it work properly:
PURPLE P ER P AH L
PURPLE(2) P AO P EH L
I'd say you pronounce "lower" like L OW EH R as well, but that's not important. And here is the result:
RESULT: the purple one on the lower right side
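As a side note, alternate pronunciations like the PURPLE(2) entry above follow the cmudict WORD(n) convention, and adding one to a plain-text dictionary can be scripted; a small sketch (the helper below is hypothetical, not part of Sphinx):

```python
def add_variant(dict_lines, word, phones):
    """Append an alternate pronunciation using the WORD(n) dictionary convention."""
    count = sum(1 for line in dict_lines
                if line.split()[0].split("(")[0] == word)
    suffix = "(%d)" % (count + 1) if count else ""
    return dict_lines + ["%s%s %s" % (word, suffix, " ".join(phones))]

lines = ["PURPLE P ER P AH L"]
lines = add_variant(lines, "PURPLE", ["P", "AO", "P", "EH", "L"])
print(lines[-1])  # PURPLE(2) P AO P EH L
```

The decoder then treats both variants as the same word, so whichever matches the speaker's accent better wins during the search.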