I am trying to understand the logical steps of continuous speech recognition. Could someone please validate whether my understanding is right?
1. The spoken voice sample is converted into strings of connected words by leveraging the acoustic model and the lexicon.
2. Pattern matching against subword HMMs is used to create candidate connected word strings.
3. These connected word strings are hypotheses for the spoken input.
4. The multiple possible hypotheses are stored in a lattice structure.
5. The language model and dynamic programming are used to prune the lattice and select the most probable candidate (see the sketch below).
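To make steps 3-5 concrete, here is a toy sketch of what I have in mind (plain Java with no connection to the Sphinx4 code base; the lattice, the acoustic costs, the bigram values and the class name are all made up for illustration): each lattice edge carries a word and an acoustic cost, and dynamic programming with a bigram language model picks the cheapest path through the lattice.

```java
import java.util.*;

/** Toy sketch: a word lattice rescored with a bigram LM via dynamic
 *  programming.  NOT Sphinx4 code; all words, costs and values are invented. */
public class LatticeRescoreDemo {

    // A lattice edge: a word hypothesis spanning two lattice nodes,
    // with an acoustic cost (negative log-likelihood from the AM).
    record Edge(int from, int to, String word, double acousticCost) {}

    // Hypothetical bigram costs, -log P(word | previous); 4.0 is a back-off penalty.
    static double bigramCost(String prev, String word) {
        Map<String, Double> table = Map.of(
                "<s> this", 0.7, "<s> miss", 2.5,
                "this is", 0.4, "miss is", 2.0,
                "is speech", 0.9, "is peach", 2.8);
        return table.getOrDefault(prev + " " + word, 4.0);
    }

    public static void main(String[] args) {
        // Two competing hypotheses ("this is speech" vs "miss is peach")
        // share the middle of the lattice; nodes are in topological order.
        List<Edge> edges = List.of(
                new Edge(0, 1, "this", 10.0), new Edge(0, 1, "miss", 9.5),
                new Edge(1, 2, "is", 5.0),
                new Edge(2, 3, "speech", 12.0), new Edge(2, 3, "peach", 11.5));
        int finalNode = 3;
        double lmWeight = 8.0; // hand-picked language-model weight

        // best.get(node).get(lastWord) = cheapest cost of reaching node with that history
        Map<Integer, Map<String, Double>> best = new HashMap<>();
        Map<Integer, Map<String, String>> text = new HashMap<>(); // partial hypothesis text
        best.put(0, new HashMap<>(Map.of("<s>", 0.0)));
        text.put(0, new HashMap<>(Map.of("<s>", "")));

        for (Edge e : edges) { // edges are listed so 'from' nodes appear in order
            for (var entry : best.getOrDefault(e.from(), Map.of()).entrySet()) {
                String prev = entry.getKey();
                double cost = entry.getValue() + e.acousticCost()
                        + lmWeight * bigramCost(prev, e.word());
                best.computeIfAbsent(e.to(), k -> new HashMap<>());
                text.computeIfAbsent(e.to(), k -> new HashMap<>());
                if (cost < best.get(e.to()).getOrDefault(e.word(), Double.MAX_VALUE)) {
                    best.get(e.to()).put(e.word(), cost);
                    text.get(e.to()).put(e.word(), text.get(e.from()).get(prev) + " " + e.word());
                }
            }
        }

        // The cheapest entry at the final node is the most probable hypothesis.
        best.get(finalNode).entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .ifPresent(w -> System.out.println("best hypothesis:"
                        + text.get(finalNode).get(w.getKey()) + "  (cost " + w.getValue() + ")"));
    }
}
```

The dynamic-programming state here is (lattice node, previous word), since a bigram score depends on the word history.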
Your understanding is not 100% correct. For the correct process of speech recognition you can read any textbook, or at least our tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialconcepts
Thanks, Nickolay! I have been reading 'Fundamentals of Speech Recognition' by Rabiner/Juang/Yegnanarayana and was trying to map what I grasped from the textbook to the results I saw from running the Sphinx4 examples.
I have now read the tutorial you referred to.
While I will anyway spend more time reading the book, I wanted to know whether the LM is used even to create word-level matches. My current understanding is that it is used only during the sentence-level matching phase of recognition.
> I wanted to know whether the LM is used even to create word-level matches?

Yes, the LM is used to construct the original HMM search graph.

> My current understanding is that it is used only during the sentence-level matching phase of recognition.

There is no sentence level or word level; the first-stage search is performed with the expanded graph, which combines information from the LM, the dictionary, and the acoustic model. There could be a later lattice rescoring stage, but it is not used in Sphinx4 by default.
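To illustrate what "expanded graph" means, here is a heavily simplified sketch (toy Java; the dictionary, the LM costs and the class names are invented, and this is not how Sphinx4 actually builds its search graph): the dictionary expands each word into a chain of phone states, and the arcs that connect word exits back to word entries carry language-model costs, so all three knowledge sources end up in a single structure that the first-pass search walks.

```java
import java.util.*;

/** Toy illustration of an "expanded" search graph that combines the dictionary,
 *  the acoustic model and the language model in one structure.  NOT how Sphinx4
 *  builds its search graph; the dictionary, LM costs and names are invented. */
public class ExpandedGraphDemo {

    record Arc(int from, int to, String label, double cost) {}

    public static void main(String[] args) {
        // Dictionary: word -> phone sequence (lexicon information).
        Map<String, List<String>> dict = Map.of(
                "one", List.of("W", "AH", "N"),
                "two", List.of("T", "UW"));

        // Hypothetical bigram LM costs, -log P(w2 | w1); values are made up.
        Map<String, Double> lm = Map.of(
                "one two", 0.5, "two one", 0.7, "one one", 2.0, "two two", 1.9);

        List<Arc> graph = new ArrayList<>();
        Map<String, Integer> wordEntry = new HashMap<>(); // first state of each word
        Map<String, Integer> wordExit = new HashMap<>();  // last state of each word
        int nextState = 0;

        // Expand every word into a chain of states, one per phone.  In a real
        // system each phone would itself be a trained multi-state HMM, and the
        // arc costs would be the HMM transition probabilities.
        for (var e : dict.entrySet()) {
            int prev = -1;
            for (String phone : e.getValue()) {
                int s = nextState++;
                if (prev < 0) wordEntry.put(e.getKey(), s);
                else graph.add(new Arc(prev, s, e.getKey() + ":" + phone, 0.0));
                prev = s;
            }
            wordExit.put(e.getKey(), prev);
        }

        // Connect every word exit to every word entry, weighted by the LM.
        for (String w1 : dict.keySet())
            for (String w2 : dict.keySet())
                graph.add(new Arc(wordExit.get(w1), wordEntry.get(w2),
                        w1 + "->" + w2, lm.getOrDefault(w1 + " " + w2, 5.0)));

        graph.forEach(System.out::println);
    }
}
```

During decoding, the acoustic model's emission probabilities score every state against the incoming feature frames, which is why there is no separate "word level" or "sentence level" pass.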
I have been reading up on the Token Passing algorithm by S.J. Young to understand how the information from the AM, LM, and dictionary is used by the decoder to recognize the spoken text. I have tried to integrate what I read there with what I know about acoustic modelling and language modelling.
Here are a few further questions I have.
1. Am I right in understanding that the token cost is the 'minimal cost of aligning the test audio with the dynamically generated SentenceHMM'?
2. Is the token cost (for intra-word transitions) a function of the state transition probabilities and the emission probabilities of the HMMs trained during acoustic modelling?
3. Are language model probabilities used to compute the token cost for inter-word transitions?
> 1. Am I right in understanding that the token cost is the 'minimal cost of aligning the test audio with the dynamically generated SentenceHMM'?

The token cost is not the minimal cost; it is the cost of one of the possible alignments. The best token's cost is the minimal cost.

> 2. Is the token cost (for intra-word transitions) a function of the state transition probabilities and the emission probabilities of the HMMs trained during acoustic modelling?

The token cost also contains the language model probability.

> 3. Are language model probabilities used to compute the token cost for inter-word transitions?

Yes.
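To tie the three answers together, here is a heavily simplified token-passing sketch (toy Java; the states, costs, "frames" and the class name are all invented and unrelated to Sphinx4's internals). Within a word, a token's cost grows by transition and emission costs from the acoustic model; when it crosses a word boundary, a language-model cost is added; and every token is the cost of one particular alignment, so the recognizer's answer is simply the history of the cheapest token at the end.

```java
import java.util.*;

/** Toy token passing.  Nothing here corresponds to real Sphinx4 classes; the
 *  states, costs and "frames" are all invented.  Each token carries the cost
 *  of ONE particular alignment of the audio against the search graph; the
 *  cheapest token at the end gives the recognized text. */
public class TokenPassingDemo {

    record Token(double cost, String history) {}

    // Fake -log emission cost of a state for a frame (would come from the AM).
    static double emission(String state, int frame) {
        return Math.abs(state.hashCode() % 5) + frame % 3 + 0.5;
    }

    // Fake -log LM cost of following word w1 with word w2 (would come from the LM).
    static double lmCost(String w1, String w2) {
        return w1.equals(w2) ? 2.0 : 0.5;
    }

    public static void main(String[] args) {
        // Two "words", each modelled by a linear chain of two HMM states.
        Map<String, List<String>> words = Map.of(
                "one", List.of("one.s0", "one.s1"),
                "two", List.of("two.s0", "two.s1"));
        double selfLoopCost = 0.3, forwardCost = 0.7; // fake -log transition probs

        // Tokens currently sitting in each state; start at every word entry.
        Map<String, Token> active = new HashMap<>();
        for (String w : words.keySet())
            active.put(words.get(w).get(0), new Token(0.0, w));

        for (int t = 0; t < 6; t++) { // six fake audio frames
            Map<String, Token> next = new HashMap<>();
            for (var e : active.entrySet()) {
                String state = e.getKey();
                Token tok = e.getValue();
                String word = state.substring(0, state.indexOf('.'));
                List<String> chain = words.get(word);
                int idx = chain.indexOf(state);

                // Intra-word moves: cost grows by transition + emission (AM only).
                pass(next, state, new Token(tok.cost() + selfLoopCost + emission(state, t), tok.history()));
                if (idx + 1 < chain.size()) {
                    String s2 = chain.get(idx + 1);
                    pass(next, s2, new Token(tok.cost() + forwardCost + emission(s2, t), tok.history()));
                } else {
                    // Inter-word move: leaving the last state of a word adds an LM cost.
                    for (String w2 : words.keySet()) {
                        String entry = words.get(w2).get(0);
                        double c = tok.cost() + forwardCost + lmCost(word, w2) + emission(entry, t);
                        pass(next, entry, new Token(c, tok.history() + " " + w2));
                    }
                }
            }
            active = next;
        }

        // The best token is the one with minimal cost over all surviving states.
        active.values().stream().min(Comparator.comparingDouble(Token::cost))
                .ifPresent(tok -> System.out.println("best: \"" + tok.history() + "\"  cost " + tok.cost()));
    }

    // Keep only the cheapest token per state (Viterbi recombination).
    static void pass(Map<String, Token> states, String state, Token tok) {
        states.merge(state, tok, (a, b) -> a.cost() <= b.cost() ? a : b);
    }
}
```

The pass() helper keeps only the cheapest token per state, which is the Viterbi recombination that keeps the search tractable.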
Thank you, Nickolay! Your response has strengthened my understanding.