Hi,
Can I know how to perform training for a word model?
What is the difference between training word models and phoneme models in SphinxTrain?
Thank you.
Before we go on, I want you to know that both Sphinx 3 and Sphinx 4 are used mainly for sub-word based speech recognition. We never tried to use them for word-based speech recognition, so we don't actually know the consequences.
In general, "word models" and "phoneme models" are just names for HMM composition schemes. When people say they use phoneme models, they actually mean "first compose a word HMM from phoneme HMMs, then use it in the Viterbi search". When people say they use word models, they mean "represent each word's HMM directly, without composition".
Enough theory. The following is the trick people have used to get a speech trainer to build whole-word models.
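To make the two schemes concrete, here is a minimal Python sketch. It is only an illustration, not Sphinx code: the function name `compose_word_states`, the state-naming convention, and the state counts (3 states per phoneme unit, 9 for the whole-word unit) are all assumptions picked for the example. The pronunciation W AH N for "one" follows cmudict.

```python
# Illustrative only: a word HMM is represented here as a flat list of
# left-to-right state names. Real trainers also store transitions and
# output densities, which are left out of this sketch.

STATES_PER_UNIT = 3  # assumption: 3 emitting states per unit


def compose_word_states(units, states_per_unit=STATES_PER_UNIT):
    """Concatenate the states of each unit into one word-level state list."""
    return [f"{unit}_{i}" for unit in units for i in range(states_per_unit)]


# "Phoneme models": the word HMM for "one" is composed from phoneme HMMs.
phoneme_based = compose_word_states(["W", "AH", "N"])

# "Word models": the word is its own single unit, with no composition.
# The state count of 9 is an arbitrary choice for this example.
whole_word = compose_word_states(["one"], states_per_unit=9)

print(phoneme_based)
print(whole_word)
```

In the first case the states W_0 ... N_2 can be shared with any other word that uses the same phonemes; in the second case every state belongs to "one" alone.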
1, First, define each word as an HMM: say "one", "two" and "three" are now HMMs.
2, In the dictionary file, put the following entries:
one one
two two
three three
.
.
.
Notice that when you do this, the first column is the final HMM and the second column is the component used to compose that final HMM.
3, Now for the "phone list", what you need to put is a list of the words, like
one
two
three
Why? Because this time, the word itself is also the sub-word unit.
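Steps 2 and 3 can be sketched as a small Python helper that builds both files from a word list. This is a rough sketch, not part of SphinxTrain: `whole_word_files` and `extra_units` are made-up names, and adding SIL to the phone list is an assumption based on the usual SphinxTrain setup where silence is listed as a unit, so check it against your own configuration.

```python
def whole_word_files(words, extra_units=()):
    """Return (dictionary_lines, phone_list_lines) for a whole-word setup.

    Each word is "pronounced" as itself, so it appears in both columns of
    the dictionary and again as a unit in the phone list.
    """
    # Drop blanks and duplicates while keeping the original order.
    vocab = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    dictionary_lines = [f"{w} {w}" for w in vocab]
    # extra_units is for non-word units such as silence (assumption: SIL).
    phone_list_lines = list(dict.fromkeys(vocab + list(extra_units)))
    return dictionary_lines, phone_list_lines


dict_lines, phone_lines = whole_word_files(
    ["one", "two", "three"], extra_units=["SIL"]
)
print("\n".join(dict_lines))
print("\n".join(phone_lines))
```

The two lists can then be written out to whatever dictionary and phone list files your training setup points at.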
I think these are the major things to take care of if you want to do a whole-word model hack. Again, prepare to get hurt, because no one has actually done this before with Sphinx or SphinxTrain.
Arthur
Something else to think about is the acoustic complexity of the word. As a first approximation you might want to look at an acoustic lexicon, like cmudict, and do a one-for-one substitution of all the phone models with models particular to the word. For example,
BAT M1 M2 M3
CANTANKEROUS M4 M5 M6 M7 M8 M9 M10 M11 M12
If you do this, the context-dependent parts of training become irrelevant, so you needn't use or define any CD phones.
Also be aware that this approach really only works well if you are doing isolated word recognition (pausing between words), because in continuous speech the articulation of a word is significantly influenced by the word preceding it.
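The one-for-one substitution could be automated along these lines. This is a hedged sketch: `word_specific_units` is a hypothetical helper, the tiny `lexicon` is hand-written for the example rather than read from cmudict, and the M1, M2, ... naming simply mirrors the entries above.

```python
def word_specific_units(lexicon, prefix="M"):
    """Give every phone slot of every word its own unit name.

    lexicon maps a word to its list of phones. The phones themselves are
    only counted: each one is replaced by a fresh unit, so no unit is
    shared between words.
    """
    counter = 0
    result = {}
    for word, phones in lexicon.items():
        units = []
        for _ in phones:
            counter += 1
            units.append(f"{prefix}{counter}")
        result[word] = units
    return result


# Hand-written example entries (assumption: not loaded from a real file).
lexicon = {"BAT": ["B", "AE", "T"], "CAT": ["K", "AE", "T"]}
units = word_specific_units(lexicon)
for word, seq in units.items():
    print(word, " ".join(seq))

# Every unit name also has to go into the phone list.
phone_list = [u for seq in units.values() for u in seq]
```

Longer words get more units this way, so the size of each word's model follows its acoustic complexity.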
...eric
Thanks a lot, both of you, for the information.
I think it will be more convenient if I just follow the process from the instructions: http://fife.speech.cs.cmu.edu/sphinxman/fr4.html
Besides, I have already spent a lot of time on training, and now is the time to get the output.
Maybe some other time I will focus on this (word models).
Thanks again :)