I am trying to understand exactly which HMM operations are used to generate the AM score. Given that the three operations possible with an HMM are evaluation (computing the likelihood that an observation sequence was generated by the HMM), classification (predicting the hidden state sequence that generated a given observation sequence), and training (estimating the HMM parameters that best match the observation sequences), is evaluation alone used to generate the AM score, since the goal is to find the string of HMMs that makes up an utterance?
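To make sure we mean the same thing by evaluation, here is a minimal sketch (Python, with a toy discrete HMM whose parameters are made up) of the forward-algorithm likelihood computation I have in mind:

    import numpy as np

    # Toy discrete HMM, made-up parameters: 2 hidden states, 3 observation symbols.
    A = np.array([[0.7, 0.3],       # state transition probabilities
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1],  # per-state emission probabilities
                  [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])       # initial state distribution

    def forward_likelihood(obs):
        """Evaluation: P(observation sequence | HMM), summed over all state paths."""
        alpha = pi * B[:, obs[0]]            # initialize with the first observation
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]    # propagate, then absorb the next observation
        return alpha.sum()

    print(forward_likelihood([0, 1, 2]))     # likelihood of symbol sequence 0, 1, 2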
If yes, what is the basis for the multiple combinations of concatenated HMMs evaluated during scoring with the Token Passing algorithm? The way I see it, every word in the dictionary is mapped to a finite set of pronunciations, and each pronunciation yields one triphone HMM string (see the sketch below). Is it that each of these triphone HMM strings is evaluated to generate the AM score along a path?
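For illustration, a hypothetical sketch of the word -> pronunciations -> triphones expansion I am describing; the lexicon entries and the silence padding at word edges are assumptions, and real decoders handle cross-word triphone contexts more carefully:

    # Hypothetical lexicon: word -> list of pronunciations (phoneme sequences).
    LEXICON = {
        "speech": [["S", "P", "IY", "CH"]],
        "read":   [["R", "IY", "D"], ["R", "EH", "D"]],  # two pronunciations
    }

    def to_triphones(phones):
        """Expand a phoneme sequence into context-dependent triphones L-C+R."""
        padded = ["SIL"] + phones + ["SIL"]  # assume silence context at word edges
        return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
                for i in range(1, len(padded) - 1)]

    for pron in LEXICON["read"]:
        print(to_triphones(pron))
    # ['SIL-R+IY', 'R-IY+D', 'IY-D+SIL']
    # ['SIL-R+EH', 'R-EH+D', 'EH-D+SIL']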
I have deliberately left out the role of the LM score in this question, as I am focusing only on the AM scoring aspects.
Your question is quite fundamental, and a full answer is too long for this forum.
In short, you are partly right: likelihood computation is a core part of a speech recognition decoder.
The next part of your question is a little more complex. In practice you do not evaluate HMMs separately for each word or pronunciation. You build a search graph, either in advance or dynamically, in both cases using the language model to create the word-level search space and the lexicon to expand it down to the phoneme level. If you scored each possible word sequence separately, the cost would grow exponentially. Also, in practice you usually use the Viterbi algorithm to evaluate the score of each surviving path and cut paths early if their score is too low (the beam search heuristic), as in the toy sketch below.
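To make the beam idea concrete, here is a toy sketch in Python (log domain, made-up model parameters, and an arbitrarily chosen beam width); a real decoder applies the same pruning over a much larger graph of triphone HMM states:

    import numpy as np

    # Toy model with made-up numbers, in the log domain.
    logA = np.log([[0.7, 0.3],
                   [0.4, 0.6]])
    logB = np.log([[0.5, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])
    log_pi = np.log([0.6, 0.4])
    BEAM = 5.0  # prune states scoring more than BEAM below the current best

    def viterbi_beam(obs):
        """Best single-path log score, dropping hypotheses that fall outside the beam."""
        score = log_pi + logB[:, obs[0]]
        for o in obs[1:]:
            # For each destination state, keep the best predecessor, add emission score.
            score = (score[:, None] + logA).max(axis=0) + logB[:, o]
            score[score < score.max() - BEAM] = -np.inf  # beam pruning
        return score.max()

    print(viterbi_beam([0, 1, 2]))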
I'd suggest checking "Efficient Algorithms for Speech Recognition" by Ravishankar, or the somewhat more recent "Speech and Language Processing" by Jurafsky and Martin.
Thank you Arseniy. That helped a lot!