which output to use to compute finalScore

2011-05-26
2012-09-22
  • Pranav Jawale

    Pranav Jawale - 2011-05-26

    Hi,

    I have referred to some threads on a similar topic, but I'm still confused.
    I'm using Sphinx 3.8 and got the following information in the decoder log
    file:

    Backtrace(test_000399_1_4.mfc_00000000)
    FV:test_000399_1_4.mfc_00000000>                 WORD  SFrm  EFrm AScr(UnNorm)    LMScore  AScr+LScr     AScale
    fv:test_000399_1_4.mfc_00000000>                <sil>     0    36       587490     -74100     513390     968328 
    fv:test_000399_1_4.mfc_00000000>         yavatamaalxa    37   118     -1538274     -69588   -1607862    -365679 
    fv:test_000399_1_4.mfc_00000000>           +BG_NOISE+   119   193       247046     -74100     172946    1054052 
    fv:test_000399_1_4.mfc_00000000>           raayagadxa   194   253     -2424628     -69588   -2494216   -1854501 
    fv:test_000399_1_4.mfc_00000000>           +BG_NOISE+   254   300       167921     -74100      93821     766541 
    FV:test_000399_1_4.mfc_00000000>                TOTAL                 -2960445    -361476
    
    FWDVIT: yavatamaalxa raayagadxa (test_000399_1_4.mfc_00000000)
    FWDXCT: test_000399_1_4.mfc_00000000 S 591516 T -2959402 A -2960445 L 1043 0 587490 -8101 <sil> 37 -1538274 -7600 yavatamaalxa 119 247046 -8101 +BG_NOISE+ 194 -2424628 -7600 raayagadxa 254 167921 -8101 +BG_NOISE+ 301 0 -11552 </s> 301
    

    How do I compute the final score for any of the words? Should I use the
    information after fv: and simply do

    (AScr + LMScr) + AScale

    OR

    should I take the scores after FWDXCT and add
    wa: acoustic score for the word
    and
    wl: LM score for the word
    (following the convention in
    http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_hypseg)?

    But these numbers differ from those after fv: (for example, AScr(UnNorm)
    for <sil> is 587490 in one place whereas it is -1538274 in another). Do we
    add some other scaling number to get the same scores?


    Secondly, as I figure from

    the acoustic scores are scaled values; in each frame, the acoustic scores
    of all active senones are scaled such that the best senone has a
    log-likelihood of 0

    (http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_hypseg),
    the AScrs for all the words (in hypseg) should be <= 0.

    Which of these two sets of numbers should be used if one wants to know the
    log likelihood (log P(O|W)) for a word/filler?

    Thanks.

     
  • Pranav Jawale

    Pranav Jawale - 2011-05-26

    oops, that log was created by sphinx 3.6 and not 3.8

     
  • Pranav Jawale

    Pranav Jawale - 2011-05-30

    Sorry, I realized that these two sets of numbers show the same information.
    The format for FWDXCT is such that the AScr comes first, then the LMScore,
    and then the word (I was taking the word to be first).
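
    In case it helps anyone else, here is a minimal sketch (plain Python, my
    own naming, not Sphinx code) of pulling the FWDXCT line apart according to
    that layout:

```python
def parse_fwdxct(line):
    """Split a Sphinx-3 FWDXCT hypseg line into per-word segments.

    Layout, as seen in the log above:
      uttid S <ascale> T <total> A <ascr> L <lscr>
      then repeated groups of <start_frame> <ascr> <lscr> <word>,
      and a final end-frame number.
    """
    toks = line.replace("FWDXCT:", "").split()
    assert toks[1] == "S" and toks[3] == "T" and toks[5] == "A" and toks[7] == "L"
    total_ascr = int(toks[6])
    body = toks[9:-1]  # drop the header fields and the trailing end frame
    segs = [(int(body[i]), int(body[i + 1]), int(body[i + 2]), body[i + 3])
            for i in range(0, len(body), 4)]
    # sanity check: the per-word acoustic scores sum to the utterance A value
    assert sum(s[1] for s in segs) == total_ascr
    return segs
```

    On the FWDXCT line above this gives (0, 587490, -8101, '<sil>') as the
    first segment, and the per-word AScr values do sum to the A total of
    -2960445.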

    But one question remains: should the AScale factor be added to AScr +
    LMScr (in order to get log P(O|W) + log P(W), or a monotonic function of
    it)?

    Thanks.

     
  • Nickolay V. Shmyrev

    Should the AScale factor be added to AScr + LMScr (in order to get
    log P(O|W) + log P(W), or a monotonic function of it)?

    The issue is that the acoustic score is not a probability, even at the
    training stage, because the Gaussians are not normalized. You will need to
    rescore it anyway. Why do you need to have that log P(O|W) + log P(W) at
    all?

     
  • Pranav Jawale

    Pranav Jawale - 2011-05-31

    Hi,

    I need log(P(O|W)) because I'm trying to implement _Confidence Measures in
    Speech Recognition based on Probability Distribution of Likelihoods_.

    There they talk of b(O) (eq. 3, p. 3), which is basically the state-level
    acoustic likelihood of observation vector O for a given frame. Is the
    (unnormalized) acoustic score for a word, as given in the Sphinx decoder
    log, equal to the sum of these over all frames of that word?

    The authors used the sphinx3 decoder, and they are able to find out which
    frames the individual states belong to. From that they get state-level
    acoustic log-likelihood scores, which they use in their algorithm to
    compute phone-level and word-level confidence measures. This is why I need
    to know whether the acoustic score given by Sphinx is akin to b(O) (i.e. a
    sum over the mixture Gaussians, with the proper mixture weights multiplied
    into the individual Gaussians).
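
    For reference, b(O) for a single state modeled by a diagonal-covariance
    Gaussian mixture (the textbook formula the paper works with, not Sphinx
    internals; the function names here are my own) can be sketched as:

```python
import math

def log_gaussian_diag(o, mean, var):
    # log N(o; mean, diag(var)) for one observation vector o
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def log_b(o, weights, means, variances):
    # log b(O) = log sum_m w_m * N(o; mu_m, Sigma_m), via log-sum-exp
    logs = [math.log(w) + log_gaussian_diag(o, mu, v)
            for w, mu, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))
```

    Summing log_b over the frames assigned to a state then gives the
    state-level log-likelihood the paper uses.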

    Secondly, I am not able to figure out how they got to know which frames belong
    to which states.

    For example, if the word "ON" was decoded between frames 20 and 40, /o/
    might belong to frames 20-35 and /n/ to frames 36-40.

    Suppose the HMM of /o/ has three states: state1, state2, state3. state1
    belongs to frames 20-23, state2 to frames 24-26, and so on.

    Is there a flag to be set in decoder which will also give this state level
    decomposition of frames?

    Please help. Thanks in advance for any pointers !

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-01

    P.S. I gather that we need to force-align the phone transcription of the
    decoded output with the utterance to get individual phone boundaries and
    acoustic scores (please correct me if this method is incorrect).

    But I'm not sure how to get which frames have been assigned to which
    individual states within the phones.

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-01

    (Perhaps I should start another thread for the second question.) I found
    that there is a flag -stsegdir with which state-level segmentation can be
    dumped. But the .stseg file that is created is in binary format. According
    to this decoder overview
    http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_overview.html
    there is a utility stseg-read which can be used to read this binary file.

    But I could not locate this stseg-read either in sphinx3 or sphinxTrain. Was
    it removed?

    Thanks.

     
  • Nickolay V. Shmyrev

    I'm trying to implement Confidence Measures in Speech Recognition based on
    Probability Distribution of Likelihoods

    This is not a very clean paper, and its terminology is confused, though
    the idea they present is ok. I would suggest you look for some other paper
    on the same subject. Actually, the method they are using was first
    proposed by Sheryl R. Young in "Recognition Confidence Measures: Detection
    of Misrecognitions and Out-Of-Vocabulary Words" (1994).

    You can unscale the log-likelihood score, since the scale factor per frame
    is known in the decoder. And you don't need to have a probability in each
    state; you can do the same procedure with a score which is not a
    probability but has the same properties.
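
    If I read the unscaling remark correctly (the decoder subtracts a
    per-frame scale factor from every senone score, so adding it back over a
    word's frames undoes the normalization), the idea is simply this (the
    names are made up):

```python
def unscale_word_score(scaled_frame_scores, frame_scale):
    # Undo the per-frame normalization: in each frame the decoder shifts
    # senone scores so the best one is 0; adding the recorded per-frame
    # scale factor back recovers the unscaled acoustic log-score.
    assert len(scaled_frame_scores) == len(frame_scale)
    return sum(s + k for s, k in zip(scaled_frame_scores, frame_scale))
```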

    But I could not locate this stseg-read either in sphinx3 or sphinxTrain. Was
    it removed?

    I'm also not aware of such a tool. Maybe it existed but was later lost to
    history. You can just work out the format of the stseg file by looking at
    the source code.

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-02

    Hey, thanks for the reference, I'm reading it.

    As far as stseg is concerned, I modified main_align.c (write_stseg function)
    so that it writes the information in a readable format. So I can now see state
    level segmentation / scores :)

    I just have a couple of questions regarding

    You can unscale the log-likelihood score, since the scale factor per frame
    is known in the decoder

    I'm going through the source-code (it's huge for me!) to understand the
    scaling etc.

    1. Could you please tell me how to access this "scale factor for each frame"? (If you could tell me the variable name.)

    2. I gather that I can first use sphinx3_decode and then feed the decoder
      output (as -insent) to sphinx3_align to get the various (phone/state)
      segmentations. Is this segmentation information ALSO available from the
      decoder itself, which would save calling sphinx3_align?

    Thanks.

     
  • Nickolay V. Shmyrev

    1. Could you please tell me how to access this "scale factor for each
    frame"? (If you could tell me the variable name.)

        int32 *ascale;   /**< Same as senscale but it records the senscale
                              for the whole sentence. The default size is
                              3000 frames. */
    

    in include/srch.h

    2. I gather that I can first use sphinx3_decode and then feed the decoder
      output (as -insent) to sphinx3_align to get the various (phone/state)
      segmentations. Is this segmentation information ALSO available from the
      decoder itself, which would save calling sphinx3_align?

    I would modify the sphinx3 decoder to track the phone sequence of the best
    match. It doesn't do that currently, but it should be easy to add.
    Actually, I recommend you start working with sphinx4. All those things are
    WAY EASIER to implement in sphinx4 than in sphinx3.

     
  • amytop

    amytop - 2012-02-21

    Hi Pranav,

    Have you figured out how to get the score? I also need this score for
    pronunciation evaluation, but now I have the same problem as you. How did
    you solve it? Can you tell me?

    Thank you for any help!

    amy

     
