which output to use to compute finalScore

2011-05-26
2012-09-22
  • Pranav Jawale

    Pranav Jawale - 2011-05-26

    Hi,

    I have referred to some threads on a similar topic, but I'm still confused.
    I'm using Sphinx 3.8 and got the following information in the decoder log
    file:

    Backtrace(test_000399_1_4.mfc_00000000)
    FV:test_000399_1_4.mfc_00000000>                 WORD  SFrm  EFrm AScr(UnNorm)    LMScore  AScr+LScr     AScale
    fv:test_000399_1_4.mfc_00000000>                <sil>     0    36       587490     -74100     513390     968328 
    fv:test_000399_1_4.mfc_00000000>         yavatamaalxa    37   118     -1538274     -69588   -1607862    -365679 
    fv:test_000399_1_4.mfc_00000000>           +BG_NOISE+   119   193       247046     -74100     172946    1054052 
    fv:test_000399_1_4.mfc_00000000>           raayagadxa   194   253     -2424628     -69588   -2494216   -1854501 
    fv:test_000399_1_4.mfc_00000000>           +BG_NOISE+   254   300       167921     -74100      93821     766541 
    FV:test_000399_1_4.mfc_00000000>                TOTAL                 -2960445    -361476
    
    FWDVIT: yavatamaalxa raayagadxa (test_000399_1_4.mfc_00000000)
    FWDXCT: test_000399_1_4.mfc_00000000 S 591516 T -2959402 A -2960445 L 1043 0 587490 -8101 <sil> 37 -1538274 -7600 yavatamaalxa 119 247046 -8101 +BG_NOISE+ 194 -2424628 -7600 raayagadxa 254 167921 -8101 +BG_NOISE+ 301 0 -11552 </s> 301
    

    How do I compute the final score for any of the words? Should I use the
    information after fv: and simply do

    (AScr + LMScr) + AScale

    OR

    should I take the scores after FWDXCT and add
    wa: acoustic score for the word
    and
    wl: LM score for the word
    (following the convention in
    http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_hypseg)?

    But these numbers differ from those after fv: (for example, AScr(UnNorm)
    for <sil> is 587490 in one place whereas it is -1538274 in another). Do we
    add some other scaling number to get the same scores?


    Secondly, as I figure from

    the acoustic scores are scaled values; in each frame, the acoustic scores
    of all active senones are scaled such that the best senone has a
    log-likelihood of 0

    (http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_hypseg),
    the AScrs for all the words (in hypseg) should be <= 0.

    Which of these two sets of numbers should be used if one wants to know the
    log likelihood (log P(O|W)) for a word/filler?

    Thanks.

     
  • Pranav Jawale

    Pranav Jawale - 2011-05-26

    oops, that log was created by sphinx 3.6 and not 3.8

     
  • Pranav Jawale

    Pranav Jawale - 2011-05-30

    Sorry, I realized that these two sets of numbers show the same information.
    The format for FWDXCT is such that the AScr comes first, then the LMScore,
    and then the word (I was taking the word to be first).
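
    In case it helps anyone else, here is a minimal sketch (plain Python, my
    own naming, not Sphinx code) of pulling the FWDXCT line apart according to
    that layout:

```python
def parse_fwdxct(line):
    """Split a Sphinx-3 FWDXCT hypseg line into per-word segments.

    Layout, as seen in the log above:
      uttid S <ascale> T <total> A <ascr> L <lscr>
      then repeated groups of <start_frame> <ascr> <lscr> <word>,
      and a final end-frame number.
    """
    toks = line.replace("FWDXCT:", "").split()
    assert toks[1] == "S" and toks[3] == "T" and toks[5] == "A" and toks[7] == "L"
    total_ascr = int(toks[6])
    body = toks[9:-1]  # drop the header fields and the trailing end frame
    segs = [(int(body[i]), int(body[i + 1]), int(body[i + 2]), body[i + 3])
            for i in range(0, len(body), 4)]
    # sanity check: the per-word acoustic scores sum to the utterance A value
    assert sum(s[1] for s in segs) == total_ascr
    return segs
```

    On the FWDXCT line above this gives (0, 587490, -8101, '<sil>') as the
    first segment, and the per-word AScr values do sum to the A total of
    -2960445.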

    But one question remains: should the AScale factor be added to AScr +
    LMScr (in order to get log P(O|W) + log P(W), or a monotonic function of
    it)?

    Thanks.

     
  • Nickolay V. Shmyrev

    Should the AScale factor be added to AScr + LMScr (in order to get
    log P(O|W) + log P(W), or a monotonic function of it)?

    The issue is that the acoustic score is not a probability, even at the
    training stage, because the Gaussians are not normalized. You will need to
    rescore it anyway. Why do you need to have that log P(O|W) + log P(W) at
    all?

     
  • Pranav Jawale

    Pranav Jawale - 2011-05-31

    Hi,

    I need log(P(O|W)) because I'm trying to implement _Confidence Measures in
    Speech Recognition based on Probability Distribution of Likelihoods_.

    There they talk of b(O) (eq. 3, p. 3), which is basically the state-level
    acoustic likelihood of observation vector O for a given frame. Is the
    (unnormalized) acoustic score for a word, as given in the Sphinx decoder
    log, equal to the sum of these over all frames of that word?

    The authors used the sphinx3 decoder, and they are able to find out which
    frames the individual states belong to. From that they get state-level
    acoustic log-likelihood scores, which they use in their algorithm to
    compute phone-level and word-level confidence measures. This is why I need
    to know whether the acoustic score given by Sphinx is akin to b(O) (i.e. a
    sum over the mixture Gaussians, with the proper mixture weights multiplied
    into the individual Gaussians).
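
    For reference, b(O) for a single state modeled by a diagonal-covariance
    Gaussian mixture (the textbook formula the paper works with, not Sphinx
    internals; the function names here are my own) can be sketched as:

```python
import math

def log_gaussian_diag(o, mean, var):
    # log N(o; mean, diag(var)) for one observation vector o
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def log_b(o, weights, means, variances):
    # log b(O) = log sum_m w_m * N(o; mu_m, Sigma_m), via log-sum-exp
    logs = [math.log(w) + log_gaussian_diag(o, mu, v)
            for w, mu, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))
```

    Summing log_b over the frames assigned to a state then gives the
    state-level log-likelihood the paper uses.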

    Secondly, I am not able to figure out how they got to know which frames belong
    to which states.

    For example, if the word "ON" was decoded between frames 20 and 40, /o/
    might belong to frames 20-35 and /n/ to frames 36-40.

    Suppose the HMM of /o/ has three states: state1, state2, state3. state1
    belongs to frames 20-23, state2 to frames 24-26, and so on.

    Is there a flag to be set in decoder which will also give this state level
    decomposition of frames?

    Please help. Thanks in advance for any pointers !

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-01

    P.S. I gather that we need to force-align the phone transcription of the
    decoded output with the utterance to get individual phone boundaries and
    acoustic scores (please correct me if this method is incorrect).

    But I'm not sure how to get which frames have been assigned to which
    individual states within the phones.

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-01

    (Perhaps I should start another thread for the second question.) I found
    that there is a flag -stsegdir with which state-level segmentation can be
    dumped. But the .stseg file that is created is in binary format. According
    to this decoder overview
    http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_overview.html
    there is a utility stseg-read which can be used to read this binary file.

    But I could not locate this stseg-read either in sphinx3 or sphinxTrain. Was
    it removed?

    Thanks.

     
  • Nickolay V. Shmyrev

    I'm trying to implement Confidence Measures in Speech Recognition based on
    Probability Distribution of Likelihoods

    This is not a very clean paper, and its terminology is confused, though
    the idea they present is ok. I would suggest you look for some other paper
    on the same subject. Actually, the method they are using was first
    proposed by Sheryl R. Young in "Recognition Confidence Measures: Detection
    of Misrecognitions and Out-Of-Vocabulary Words" (1994).

    You can unscale the log-likelihood score, since the scale factor per frame
    is known in the decoder. And you don't need to have a probability in each
    state; you can do the same procedure with a score which is not a
    probability but has the same properties.
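
    If I read the unscaling remark correctly (the decoder subtracts a
    per-frame scale factor from every senone score, so adding it back over a
    word's frames undoes the normalization), the idea is simply this (the
    names are made up):

```python
def unscale_word_score(scaled_frame_scores, frame_scale):
    # Undo the per-frame normalization: in each frame the decoder shifts
    # senone scores so the best one is 0; adding the recorded per-frame
    # scale factor back recovers the unscaled acoustic log-score.
    assert len(scaled_frame_scores) == len(frame_scale)
    return sum(s + k for s, k in zip(scaled_frame_scores, frame_scale))
```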

    But I could not locate this stseg-read either in sphinx3 or sphinxTrain. Was
    it removed?

    I'm also not aware of such a tool. Maybe it existed but was later lost to
    history. You can just work out the format of the stseg file by looking at
    the source code.

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-02

    Hey, thanks for the reference, I'm reading it.

    As far as stseg is concerned, I modified main_align.c (write_stseg function)
    so that it writes the information in a readable format. So I can now see state
    level segmentation / scores :)

    I just have a couple of questions regarding

    You can unscale the log-likelihood score, since the scale factor per frame
    is known in the decoder

    I'm going through the source-code (it's huge for me!) to understand the
    scaling etc.

    1. Could you please tell me how to access this "scale factor for each frame"? (If you could tell me the variable name.)

    2. I gather that I can first use sphinx3_decode and then feed the decoder
      output (as -insent) to sphinx3_align to get the various (phone/state)
      segmentations. Is this segmentation information ALSO available from the
      decoder itself, which would save calling sphinx3_align?

    Thanks.

     
  • Nickolay V. Shmyrev

    1. Could you please tell me how to access this "scale factor for each
    frame"? (If you could tell me the variable name.)

        int32 *ascale;   /**< Same as senscale but it records the senscale
                              for the whole sentence. The default size is
                              3000 frames. */
    

    in include/srch.h

    2. I gather that I can first use sphinx3_decode and then feed the decoder
      output (as -insent) to sphinx3_align to get the various (phone/state)
      segmentations. Is this segmentation information ALSO available from the
      decoder itself, which would save calling sphinx3_align?

    I would modify the sphinx3 decoder to track the phone sequence of the best
    match. It doesn't do that currently, but it should be easy to add.
    Actually, I recommend you start working with sphinx4. All those things are
    WAY EASIER to implement in sphinx4 than in sphinx3.

     
  • amytop

    amytop - 2012-02-21

    Hi Pranav,

    Have you figured out how to get the score? I also need this score for
    pronunciation evaluation, but now I have the same problem as you. How did
    you solve it? Can you tell me?

    Thank you for any help!

    amy

     
