Hey
qq - in which file of the acoustic model do we store the MFCC mapping for a particular phoneme with its probability of occurrence?
Say I would like to recognize the word "LAST". There has to be a mapping somewhere from the MFCC feature vector to the first phoneme "L" that might be used as the starting point in the whole recognition process.
And a wider question: is there any detailed description of the acoustic model files?
Regards
Oleg
There is no such thing.
In speech recognition we evaluate the MFCC feature vector against the precomputed probability distribution of an HMM state. This distribution is estimated from the training database, based on the MFCC vectors observed for the state. So there is no mapping from MFCC to state, but there is a mapping from state to probability distribution. This mapping is described in the mdef file: each line maps a phonetic context (like L after SIL and before AH) to three probability distributions corresponding to the 3 HMM states. The probability distribution parameters are stored in the means and variances files. Every distribution is identified by an id in the mdef file, and the same id is used as the index into the means file, which is essentially a multidimensional array.
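Scoring a feature vector against a state's distribution can be sketched like this (a minimal single-Gaussian illustration; real models use mixtures of Gaussians, and the `means`/`variances` arrays here are random toy data standing in for the real model files, with the row index playing the role of the id from the mdef file):

```python
import numpy as np

# Toy stand-ins for the means/variances files: one row per HMM state.
rng = np.random.default_rng(0)
n_states, n_dims = 4, 13          # 13-dimensional MFCC, for illustration
means = rng.normal(size=(n_states, n_dims))
variances = np.full((n_states, n_dims), 0.5)

def state_log_likelihood(mfcc_vec, state_id):
    """Log-likelihood of one MFCC frame under the diagonal Gaussian
    of the given HMM state (single Gaussian for simplicity)."""
    mu = means[state_id]
    var = variances[state_id]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (mfcc_vec - mu) ** 2 / var)

frame = rng.normal(size=n_dims)   # one observed MFCC vector
scores = [state_log_likelihood(frame, s) for s in range(n_states)]
best = int(np.argmax(scores))
print(best, scores[best])
```

In a real decoder, per-state scores computed along these lines feed into the search over HMM state sequences rather than being used in isolation.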
You can find some information on the wiki; I suggest you read:
http://cmusphinx.sourceforge.net/wiki/tutorialconcepts
http://cmusphinx.sourceforge.net/wiki/acousticmodelformat
For a quick tutorial on recognition you can read chapter 4 of the HTKBook; if you are looking for a more in-depth picture, a textbook on speech recognition makes more sense. For example you can read http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
There are also a few online courses which were discussed on this forum before.
Thanks, Nickolay.
Where does the "precomputed probability distribution of HMM state" come from, and how?
Last edit: Oleg Chervonogradsky 2014-05-17
Distributions are estimated during acoustic model training.
Thanks. So, during recognition, once we receive an MFCC vector, a distribution is computed from it, and then this distribution is searched for and mapped in the HMM state mapping table (mdef)?
No. During recognition, all possible word sequences are first constructed. Then those word sequences are converted to HMM state sequences using the mdef file. Then every HMM state sequence is scored against the sequence of MFCC vectors: each state is scored against each MFCC vector to get a state-to-MFCC alignment probability, and the total probability is computed. The sequence with the maximum probability is returned as the result.
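The word-to-state expansion step can be sketched like this (a toy illustration: the pronunciation and the 3-states-per-phone topology are hard-coded here, whereas a real decoder reads them from the pronunciation dictionary and the mdef file, and also uses context-dependent phones):

```python
# Hypothetical toy data: a real system reads the phones from the
# dictionary file and the phone-to-state ids from the mdef file.
DICTIONARY = {"LAST": ["L", "AE", "S", "T"]}   # pronunciation dictionary entry
STATES_PER_PHONE = 3                            # typical HMM topology

def word_to_state_sequence(word):
    """Expand a word into its flat HMM state sequence:
    each phone becomes STATES_PER_PHONE consecutive states."""
    states = []
    for phone in DICTIONARY[word]:
        for i in range(STATES_PER_PHONE):
            states.append(f"{phone}/{i}")
    return states

print(word_to_state_sequence("LAST"))
# 12 states: L/0, L/1, L/2, AE/0, ..., T/2
```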
Okay, once we have an MFCC vector as input to recognize, it is scored against each state (which was computed during acoustic modeling). How does this scoring happen?
Is that just another computation of a distribution based on the MFCC vector we received, and then searching for the best match among the 3 HMM states?
Okay, let me put the question another way: do we have detailed documentation on the recognition process?
Yes, I linked the textbook in the answer above.
I was more interested in pocketsphinx, but thanks anyway.
Hi Nickolay,
Do we have to split the utterance by silence? If the input speech is continuous, say 20 seconds long, what shall we do? Split anywhere in the speech, or buffer the whole speech segment?
Hello James
To ask a new question, please start a new thread. Don't hijack an unrelated discussion.
Thanks, it is getting clearer now, but I still have some questions.
How does this scoring happen?
If that is too long to explain, please point me to the source code function and I'll try to dig into it myself.
Please read at least
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.7829&rep=rep1&type=pdf
It's only a few pages. I don't think it's productive to read the code without understanding the theory; the code is pretty complex. In sphinx4, for example, the search is implemented in SimpleSearchManager.
There are simpler HMM decoder implementations which might be easier to read, but they can lack critical features. You can also check http://en.wikipedia.org/wiki/Viterbi_algorithm
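A bare-bones version of that Viterbi search can be sketched as follows (a toy 2-state example with hand-picked probabilities; in a real decoder the emission scores come from the acoustic model's Gaussian distributions and the search space is vastly larger):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most probable state path. log_emit[t, s]: emission log-score of
    frame t under state s; log_trans[i, j]: log P(state j | state i)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans    # shape (prev_state, state)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: the first two frames favor state 0, the last two state 1.
log_emit = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
log_init = np.log(np.array([0.9, 0.1]))
best_path = viterbi(log_emit, log_trans, log_init)
print(best_path)   # → [0, 0, 1, 1]
```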
And in pocketsphinx, the search is implemented in components specific to each search module. For the FSG module, the search space is constructed in fsg_lextree.c. The search itself is performed in fsg_search.c, but the individual HMM states are advanced in hmm.c.
thanks, one more quick question
acoustic model generation itself has the following process then:
0. calculating the MFCC vector for each frame, and storing it
1. identification of utterances by finding frames with minimum power, aligned to the number of words specified in the transcription file
2. working on one utterance means:
- select the frames for the utterance
- in this set of frames, using the MFCC vectors, find the transition vector(s) from one phoneme of the word to another, by means of what? (is this comparing?)
- using those transition MFCC vectors, break the utterance set into phoneme-specific MFCC vector sets; for example, BIG will have 3 sets of MFCC vectors, one for each of the phonemes B, I, G. Each phoneme is still an MFCC vector set based on frames, with the window shifting by 10 ms each time
- based on this phoneme-related MFCC vector set, we calculate a probability distribution, mapping the state of a phoneme, let's say "B", to the distribution of the MFCC vector set that represents its state
- calculate a transition distribution based on those MFCC vectors used to distinguish one phoneme from another; in the example of the word BIG, that is the set identifying the transition from B to I
- store the distribution parameters by mapping them to the specified state and to the specified transition
So, we have two distributions:
- state to MFCC
- transition-of-state to MFCC
Am I right?
Training is well covered in Rabiner's HMM tutorial; it's worth reading if you are too lazy to read the whole book.
There is no such step. Training is performed on a database which is already split into utterances, and for every utterance there is a transcription. You can find an example of such a database in the tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
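For illustration, a transcription file in such a database pairs each utterance's words with its utterance id, roughly like this (the ids and text here are made up):

```
<s> hello world </s> (speaker1_001)
<s> this is a test </s> (speaker1_002)
```

Each id corresponds to an audio file listed in the fileids file, so the trainer knows which words were spoken in which recording.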
Training is an iterative process where you perform multiple iterations to find the best model. You start from an approximation of the model, then make it more and more precise.
Training operates on HMM states, which are smaller units than phonemes. Every phoneme is expanded to 3 or 5 states. Phonemes are not accounted for in training at all, only states are.
On every iteration we look for an alignment of states to frames, and this alignment is not exact but probabilistic: there is a certain probability that each frame belongs to each state.
Once such an alignment is calculated, the model parameters are updated using it.
Then the next iteration is performed. Eventually the iterations converge to an optimal model.
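The soft-alignment update described above can be illustrated with a deliberately stripped-down example (this ignores the transition structure entirely, so it reduces to an EM update for a 2-state mixture of 1-D Gaussians with fixed variance; the frames and initial means are made-up toy data):

```python
import numpy as np

frames = np.array([0.1, 0.2, 0.0, 2.9, 3.1, 3.0])   # fake 1-D "MFCC" frames
means = np.array([1.0, 2.0])                          # rough initial model
var = 1.0                                             # fixed variance

for _ in range(20):
    # E-step: soft alignment -- probability that each frame belongs to each state
    log_lik = -0.5 * (frames[:, None] - means[None, :]) ** 2 / var
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate state means using the soft alignment as weights
    means = (resp * frames[:, None]).sum(axis=0) / resp.sum(axis=0)

print(means)   # converges near the two frame clusters (~0.1 and ~3.0)
```

Each pass refines both the alignment and the model, which is the same converge-by-iteration behavior described above, just without the HMM machinery.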
This thinking is in the right direction, but unfortunately it lacks important vocabulary and important concepts. That's why it's better to read something in detail first.