Can someone please post the formula for computing the acoustic score?
This is the score we get in the output of a force-aligned file, as below:
SFrm  EFrm  SegAScr  Phone
   0     2  -125782  SIL
   3    10   -37688  a SIL k b
  11    17   -89629  k a o i
  18    27   -34527  o k l i
  28    36   -41078  l o aa i
  37    50   -26881  aa l SIL e
  51    53   -61455  SIL
Total score: -417040
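For what it's worth, the reported total is just the sum of the per-phone SegAScr values, which is easy to verify:

```python
# Per-phone acoustic scores (SegAScr) from the alignment output above.
seg_scores = [-125782, -37688, -89629, -34527, -41078, -26881, -61455]

# The reported "Total score" is simply their sum.
total = sum(seg_scores)
print(total)  # -417040
```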
Adding to the above question: how is variance flooring done when aligning?
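As far as I understand, variance flooring just clamps each Gaussian variance component to a minimum value so that no dimension's variance collapses toward zero and blows up the likelihood. A minimal sketch of that idea (the floor value here is illustrative, not necessarily Sphinx's default):

```python
def floor_variances(variances, floor=1e-4):
    """Clamp every variance component of one Gaussian to a minimum floor.

    `variances` is a list of per-dimension variances; the floor value
    is illustrative, not necessarily what sphinx3 uses by default.
    """
    return [max(v, floor) for v in variances]

# Example: a near-zero variance gets raised to the floor.
print(floor_variances([0.5, 1e-9, 2.0]))  # [0.5, 0.0001, 2.0]
```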
Hello Jigar,
Please create a separate topic for each question. As for your first question, what did you do to get that output? Which command did you run?
I am asking because, depending on the command-line parameters, the scores may be normalized by the best senone score in each frame.
Last edit: Jigar 2013-04-08
I ran the following command:
sphinx3_align
-hmm forceAlignsphinx3_noCMN_s1000_g16.cd_cont_1000/
-dict marathiAgmark1500.dic
-fdict 850spkr.filler
-ctl docs/fileid.txt
-insent phone.insent
-cepdir features/
-phsegdir phonesegdir/
-phlabdir phonelabdir/
-stsegdir statesegdir/
-wdsegdir aligndir/
-outsent phone.outsent
-cmn none
-unit_area no
-round_filters no
Ok. I think in sphinx3_align all the scores are normalized by the best senone score in each frame.
As you might have noticed, the state-level scores add up to give the phone-level scores, and the phone-level scores in turn add up to give the word-level scores.
As I understand it, the state-level score in each frame comes from the GMM likelihood and the transition matrix probability:
stateScore = log(GMM probability) + log(transition matrix probability) - score_of_best_senone
Here the logarithm base is given by the -logbase parameter (default 1.0003).
The formula for the GMM probability can be found in any speech textbook.
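A rough sketch of that per-frame computation, assuming a diagonal-covariance GMM and working in natural log (sphinx3 instead stores integer scores in the -logbase domain, i.e. a natural-log value divided by log(1.0003); all names below are illustrative, not sphinx3 internals):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Natural-log likelihood of feature vector x under a
    diagonal-covariance GMM (one weight/mean/variance per component)."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            # Per-dimension diagonal Gaussian log-density.
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_probs.append(ll)
    # Log-sum-exp over mixture components, stabilized by the max.
    m = max(log_probs)
    return m + math.log(sum(math.exp(p - m) for p in log_probs))

def state_score(x, weights, means, variances, log_trans_prob, best_senone_score):
    """Per-frame state score, normalized by the best senone score in the frame."""
    return (gmm_log_likelihood(x, weights, means, variances)
            + log_trans_prob - best_senone_score)

# To convert a natural-log value into sphinx3's -logbase domain:
# score_logbase = natural_log_score / math.log(1.0003)
```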
Last edit: dovark 2013-04-09
I am writing code for forced alignment.
I computed the forward probabilities, which in the log domain amount to the sum log(a) + log(alpha(t-1)) + log(b), and I am using the Viterbi algorithm to get the state segmentation. However, there are some problems with the alignment.
I observed that while computing the log-likelihood log(b), some scaling has to be done with respect to the features, as mentioned in
http://www.speech.cs.cmu.edu/sphinxman/FAQ.html#18
Can you please elaborate on this procedure?
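My understanding of the scaling the FAQ refers to: subtract the best score in each frame from all states before moving to the next frame, so the log scores stay bounded instead of drifting toward -inf (this is also consistent with the per-frame best-senone normalization discussed above). A minimal log-domain Viterbi sketch under that assumption (all names illustrative):

```python
NEG_INF = float("-inf")

def viterbi_align(log_b, log_a):
    """Left-to-right Viterbi in the log domain with per-frame scaling.

    log_b[t][s]: emission log-likelihood of state s at frame t.
    log_a[s][s2]: transition log-probability from state s to s2.
    Returns the best state sequence (the forced alignment path).
    """
    T, S = len(log_b), len(log_b[0])
    alpha = [NEG_INF] * S
    alpha[0] = log_b[0][0]          # alignment must start in state 0
    backptr = []
    for t in range(1, T):
        new = [NEG_INF] * S
        bp = [0] * S
        for s2 in range(S):
            for s in range(S):
                score = alpha[s] + log_a[s][s2] + log_b[t][s2]
                if score > new[s2]:
                    new[s2], bp[s2] = score, s
        best = max(new)                   # per-frame scaling: subtract the
        alpha = [v - best for v in new]   # best score so values stay bounded
        backptr.append(bp)
    # Backtrace from the final state (forced alignment must end there).
    path = [S - 1]
    for bp in reversed(backptr):
        path.append(bp[path[-1]])
    path.reverse()
    return path
```

Since the same constant is subtracted from every state in a frame, the argmax (and hence the alignment) is unchanged; only the absolute scores shift.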
I'm not sure that "Hypothesis Combination" is relevant to your problem. It is a post-decoding stage, used when you want to combine more than one possible hypothesis.
Perhaps others can tell more about where the hypothesis-combination code is and what it does.