Can someone please post the formula for computing the acoustic score?
This is the score we get in the output of a force-aligned file, as below:
SFrm  EFrm  SegAScr  Phone
   0     2  -125782  SIL
   3    10   -37688  a SIL k b
  11    17   -89629  k a o i
  18    27   -34527  o k l i
  28    36   -41078  l o aa i
  37    50   -26881  aa l SIL e
  51    53   -61455  SIL
Total score: -417040
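For what it's worth, the reported total is just the sum of the per-phone SegAScr values, which is easy to verify:

```python
# Per-phone acoustic scores (SegAScr) from the alignment output above.
seg_scores = [-125782, -37688, -89629, -34527, -41078, -26881, -61455]

# The reported "Total score" is simply their sum.
total = sum(seg_scores)
print(total)  # -417040
```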
Adding to the above question: how is variance flooring done when aligning?
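As far as I understand, variance flooring just clamps each Gaussian variance component to a minimum value so that no dimension's variance collapses toward zero and blows up the likelihood. A minimal sketch of that idea (the floor value here is illustrative, not necessarily Sphinx's default):

```python
def floor_variances(variances, floor=1e-4):
    """Clamp every variance component of one Gaussian to a minimum floor.

    `variances` is a list of per-dimension variances; the floor value
    is illustrative, not necessarily what sphinx3 uses by default.
    """
    return [max(v, floor) for v in variances]

# Example: a near-zero variance gets raised to the floor.
print(floor_variances([0.5, 1e-9, 2.0]))  # [0.5, 0.0001, 2.0]
```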
Hello Jigar,
Please create a separate topic for each question. As for your first question, what did you do to get that output? Which command did you run?
I am asking because, depending on the command-line parameters, the scores may be normalized by the best senone score in each frame.
Last edit: Jigar 2013-04-08
I ran the following command:
sphinx3_align
-hmm forceAlignsphinx3_noCMN_s1000_g16.cd_cont_1000/
-dict marathiAgmark1500.dic
-fdict 850spkr.filler
-ctl docs/fileid.txt
-insent phone.insent
-cepdir features/
-phsegdir phonesegdir/
-phlabdir phonelabdir/
-stsegdir statesegdir/
-wdsegdir aligndir/
-outsent phone.outsent
-cmn none
-unit_area no
-round_filters no
Ok. I think in sphinx3_align all the scores are normalized by the best senone score in each frame.
As you might have noticed, the state-level scores add up to give the phone-level scores, and the phone-level scores in turn add up to give the word-level scores.
As I understand it, the state-level score in each frame comes from the GMM likelihood and the transition matrix probability:
stateScore = log(GMM probability) + log(transition matrix probability) - score_of_best_senone
Here the logarithm base is given by the -logbase parameter (default 1.0003).
The formula for the GMM probability can be found in any speech textbook.
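A rough sketch of that per-frame computation, assuming a diagonal-covariance GMM and working in natural log (sphinx3 instead stores integer scores in the -logbase domain, i.e. a natural-log value divided by log(1.0003); all names below are illustrative, not sphinx3 internals):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Natural-log likelihood of feature vector x under a
    diagonal-covariance GMM (one weight/mean/variance per component)."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            # Per-dimension diagonal Gaussian log-density.
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_probs.append(ll)
    # Log-sum-exp over mixture components, stabilized by the max.
    m = max(log_probs)
    return m + math.log(sum(math.exp(p - m) for p in log_probs))

def state_score(x, weights, means, variances, log_trans_prob, best_senone_score):
    """Per-frame state score, normalized by the best senone score in the frame."""
    return (gmm_log_likelihood(x, weights, means, variances)
            + log_trans_prob - best_senone_score)

# To convert a natural-log value into sphinx3's -logbase domain:
# score_logbase = natural_log_score / math.log(1.0003)
```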
Last edit: dovark 2013-04-09
I am writing code for forced alignment.
I computed the forward probabilities, which in the log domain amount to the sum log(a) + log(alpha(t-1)) + log(b), and I am using the Viterbi algorithm to get the state segmentation. However, there are some problems with the alignment.
I observed that while computing the log-likelihood log(b), some scaling has to be done with respect to the features, as mentioned in
http://www.speech.cs.cmu.edu/sphinxman/FAQ.html#18
Can you please elaborate on this procedure?
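My understanding of the scaling the FAQ refers to: subtract the best score in each frame from all states before moving to the next frame, so the log scores stay bounded instead of drifting toward -inf (this is also consistent with the per-frame best-senone normalization discussed above). A minimal log-domain Viterbi sketch under that assumption (all names illustrative):

```python
NEG_INF = float("-inf")

def viterbi_align(log_b, log_a):
    """Left-to-right Viterbi in the log domain with per-frame scaling.

    log_b[t][s]: emission log-likelihood of state s at frame t.
    log_a[s][s2]: transition log-probability from state s to s2.
    Returns the best state sequence (the forced alignment path).
    """
    T, S = len(log_b), len(log_b[0])
    alpha = [NEG_INF] * S
    alpha[0] = log_b[0][0]          # alignment must start in state 0
    backptr = []
    for t in range(1, T):
        new = [NEG_INF] * S
        bp = [0] * S
        for s2 in range(S):
            for s in range(S):
                score = alpha[s] + log_a[s][s2] + log_b[t][s2]
                if score > new[s2]:
                    new[s2], bp[s2] = score, s
        best = max(new)                   # per-frame scaling: subtract the
        alpha = [v - best for v in new]   # best score so values stay bounded
        backptr.append(bp)
    # Backtrace from the final state (forced alignment must end there).
    path = [S - 1]
    for bp in reversed(backptr):
        path.append(bp[path[-1]])
    path.reverse()
    return path
```

Since the same constant is subtracted from every state in a frame, the argmax (and hence the alignment) is unchanged; only the absolute scores shift.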
I'm not sure that "Hypothesis Combination" is relevant to your problem. It is a post-decoding stage, used when you want to combine more than one possible hypothesis.
Perhaps others can tell more about where the hypothesis-combination code is and what it does.