Hi,
I was just reading through the Viterbi source file (hmm.c in libpocketsphinx) and got a little confused when I reached the implementation for 3-state HMMs in "hmm_vit_eval_3st_lr":
If I am interpreting the code correctly, the score update for a state "i" is done by taking the maximum, over the predecessors (which in our case are "i" itself and "i-1"), of the sum predecessor score + transition probability + senone score of the predecessor.
But shouldn't the Viterbi update only take the maximum over predecessor score + transition probability, and then add the senone score of the current state "i", instead of taking it into the maximization?
What am I missing here? It would be nice to get some clarification.
From the code it is a proper sum of transition score + acoustic score for the current state. I'm not sure where you got the idea about the "predecessor state".
You can explain more if this is still unclear.
Well, my point is this:
Let's consider the update for s2 and take the case where we cannot skip states (i.e. t2 is never better than t0 or t1).
Then, at the beginning, the helper variables are created as sums of the score + senone score;
then we come to the update phase:
Now we see that the hmm_score for state 2 gets assigned the value s2, which in turn was created as the maximum of t0 and t1.
These values were in turn calculated from the helper variables plus the transition probability, which gives the following update formula (by substituting in the definitions from the code):
hmm_score(2) = s2
             = max[ t0 ; t1 ]
             = max[ s2 + tprob(2,2) ; s1 + tprob(1,2) ]
             = max[ hmm_score(2) + senscore(2) + tprob(2,2) ; hmm_score(1) + senscore(1) + tprob(1,2) ].
And this is the point I meant: the senone scores are taken into the maximization, which is not the case in the textbook Viterbi update formula.
This is how it should be. I am not sure what you mean by "viterbi update formula". Here https://en.wikipedia.org/wiki/Viterbi_algorithm you can see in the formula for V_{t,k} there is a term P(y_t | k), which is the score of the acoustic observation, i.e. the senone score.

Exactly, and you can also see that this term P(y_t | k) does not depend on the previous state x in the formula over which the maximum is taken; it only depends on the state k.
But in the formula in the code we use different senone scores that depend on the previous state x, namely senscore(1) and senscore(2).
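For reference, the two groupings being discussed can be written side by side (log domain; the second line is my paraphrase of what the code computes, not a formula taken from the source):

standard "target-state" update (as on Wikipedia, emission term outside the max):
    V(t,k) = senscore_t(k) + max_x [ V(t-1,x) + tprob(x,k) ]

what hmm_vit_eval_3st_lr effectively computes ("source-state" scheme, the predecessor's senone score inside the max):
    V(t,k) = max_x [ V(t-1,x) + senscore_t(x) + tprob(x,k) ]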
Thanks, I see now
It is not quite clear why it was implemented this way. Overall, there could be many variants: you could have distributions on arcs, on outgoing states, or on incoming states. The exact choice is not strongly justified.
I asked the same question on the mailing list
https://sourceforge.net/p/cmusphinx/mailman/message/35080544/
Ok, it seems I understand the reasons now.
Basically, in speech recognition you have implementation issues: there are many, many HMMs which you need to connect in different ways, and it is more efficient to assign the HMM distributions to graph edges rather than graph nodes, and to store the tokens in the nodes. This transform is equivalent to the standard HMM, but it disconnects the distributions from the nodes, so you can glue HMMs together at the nodes without any recomputation of scores. This is what happens in pocketsphinx: you have 4 nodes and 3 arcs between them. The last node is called non-emitting and is useful for the transition between phones, since you can simply pass the non-emitting node's outgoing score on as the incoming score of the next HMM.
With a distribution assigned to every state, as in the conventional HMM model, the Viterbi step between phones would require recalculations.
You have the same thing in the WFST framework: there you also evaluate distributions on the arcs of the WFST, not on the nodes, and in the nodes you keep the score and perform the maximization.
Ok, to put it more simply: we need non-emitting states to connect models together. But because the non-emitting state is the last state, we can't use the "target state" acoustic score scheme, since the non-emitting state is also a target; we have to use the "source state" acoustic score scheme instead. This is not the HMM as described in basic tutorials, but it works.