I am trying to assimilate the logical steps of how speech recognition is implemented, based on what I have gathered from the following sources, and am hoping to get validation of my understanding so far.
Sources:
1. http://mi.eng.cam.ac.uk/reports/svr-ftp/auto-pdf/young_tr38.pdf
2. Fundamentals of Speech Recognition, Lawrence Rabiner
3. Spoken Language Processing, Xuedong Huang
Let's take the case of isolated word recognition:
1. Recognition starts with the creation of a phoneme sequence for each word, based on its phoneme representation in the dictionary.
2. The phoneme representation is used to form a connected string of HMMs, based on the triphone-to-HMM mapping in the acoustic model (AM).
3. The token passing algorithm is used to compute the alignment score of the observation frames against the states of the HMM.
4. HMM states correspond to senones. Each senone is represented by a GMM.
5. The state transition probabilities from the HMMs and the emission probabilities from the GMM of each senone are used to compute the token score.
6. The word with the highest token score at the end of the utterance is recognised as the spoken word.
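To make the scoring steps concrete, here is a minimal sketch of how I picture token passing for isolated words. It is only an illustration under strong assumptions, not how any real decoder is implemented: features are 1-D, every state shares the same self-loop and forward transition probabilities, and the senones and the words "up" and "down" are made up. The names `log_gmm`, `token_passing` and `recognise` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, components):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def token_passing(frames, senones, log_self, log_next):
    """Best-path log score of the frames through a left-to-right state chain."""
    n = len(senones)
    tokens = [NEG_INF] * n
    tokens[0] = log_gmm(frames[0], senones[0])  # the path must start in state 0
    for x in frames[1:]:
        new_tokens = [NEG_INF] * n
        for j in range(n):
            stay = tokens[j] + log_self
            move = tokens[j - 1] + log_next if j > 0 else NEG_INF
            best = max(stay, move)  # keep only the best token per state
            if best > NEG_INF:
                new_tokens[j] = best + log_gmm(x, senones[j])
        tokens = new_tokens
    return tokens[-1]  # token in the final state at the end of the utterance

def recognise(frames, word_models, p_self=0.6):
    """Score every word model and return the best word plus all scores."""
    log_self, log_next = math.log(p_self), math.log(1.0 - p_self)
    scores = {
        word: token_passing(frames, states, log_self, log_next)
        for word, states in word_models.items()
    }
    return max(scores, key=scores.get), scores

# Toy senones: each one is a GMM over a single feature dimension.
low = [(1.0, 0.0, 1.0)]
high = [(0.5, 5.0, 1.0), (0.5, 5.5, 1.0)]
word_models = {"up": [low, high], "down": [high, low]}

word, scores = recognise([0.1, -0.2, 4.9, 5.3], word_models)
print(word, scores)
```

In this toy case the frames rise from low to high values, so the "up" model should get the higher final token score.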
Can you please validate my understanding?
Regards,
Vamsi
Your understanding is correct.
I would reorder 4, 5 and 3 in your list though: first describe the structure of the model, then the token passing algorithm. The right sequence would be 1, 2, 4, 5, 3, 6.
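As a rough illustration of the "structure first" part (steps 1, 2 and 4), the model for one word could be assembled along these lines. This is only a sketch under assumptions: the dictionary entries, the `SIL` context padding and the senone table are toy data, and `to_triphones`, `toy_senone_table` and `build_word_hmm` are hypothetical helpers, not functions of any real toolkit. In a real acoustic model the triphone-to-senone mapping comes from the trained model, not from a rule like the one below.

```python
# Hypothetical toy pronunciation dictionary (word -> phone sequence).
DICTIONARY = {"one": ["W", "AH", "N"], "two": ["T", "UW"]}

def to_triphones(phones, sil="SIL"):
    """Expand a phone sequence into (left, centre, right) context triples."""
    padded = [sil] + list(phones) + [sil]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

def toy_senone_table(dictionary, states_per_phone=3):
    """Stand-in for the acoustic model's triphone -> senone id mapping.

    Assumption: senones are tied by centre phone and state index only.
    """
    table = {}
    for phones in dictionary.values():
        for tri in to_triphones(phones):
            table[tri] = [f"{tri[1]}_s{k}" for k in range(states_per_phone)]
    return table

def build_word_hmm(word, dictionary, triphone_to_senones):
    """Concatenate the states of each triphone HMM into one chain."""
    chain = []
    for tri in to_triphones(dictionary[word]):
        chain.extend(triphone_to_senones[tri])
    return chain

table = toy_senone_table(DICTIONARY)
chain = build_word_hmm("two", DICTIONARY, table)
print(chain)
```

Each id in the resulting chain stands for one HMM state, that is, one senone with its own GMM; token passing then runs over this chain.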
Nickolay, Thanks for clarifying!