Pocketsphinx internal scale for features?

Help
Anonymous
2012-06-15
2012-09-22
  • Anonymous

    Anonymous - 2012-06-15

    Hi,

    It seems that Pocketsphinx expects the input features to fall within a
    specific dynamic range or scale. I conducted a set of experiments with Sphinx3
    and Pocketsphinx. The features are extracted using the front-end in
    sphinxbase. Now once I get the features, I scaled them by integer multiples
    (2,4,8,16 and so on). I observed that the WER with pocketsphinx changes with
    the scaling factor, while it remains the same with Sphinx3. Moreover, if the
    scaling factor is too high (16 or 32 for example), pocketsphinx completely
    fails to decode speech while Sphinx3 still performs perfectly well.

    So going by this, is there some specific range in which Pocketsphinx expects
    its features to be? If so, how do I find that out? I went through the code but
    couldn't find anything related to this. Is this a bug? Or is it an
    undocumented design step in Pocketsphinx?

    Any help would be appreciated.

    Thanks,
    Mohit

     
  • Pranav Jawale

    Pranav Jawale - 2012-06-16

    Hi,

    The only dynamic-range constraints are those imposed by the data type
    specification (int, float, etc.).

    If you are going to scale your features, I guess you should also scale
    them before training (SphinxTrain).

     
  • Pranav Jawale

    Pranav Jawale - 2012-06-16

    But it's surprising that you are still getting good results with Sphinx3.
    I would expect the WER to be bad for both decoders with scaled features.

     
  • Anonymous

    Anonymous - 2012-06-16

    Well, yes, I did train on the scaled features. So if I train with a scale
    factor of 2, my test set is scaled by the same factor.

     
  • Pranav Jawale

    Pranav Jawale - 2012-06-16

    Can we see the decoder arguments in both the cases?

     
  • Anonymous

    Anonymous - 2012-06-16

    Also, for clarification - I first trained on the features 'as is' and tested
    on the features 'as is'. In the second scenario, I trained on the features
    scaled by 1000 and tested on the features scaled by 1000.

     
  • Anonymous

    Anonymous - 2012-06-17

    And I did play around with the SENSCR_SHIFT values earlier, but they made no
    difference to my scores. And Sphinx3 suffers from scaling, but not all the
    time. I have tried other experiments, varying the scale, and Sphinx3 maintains
    its WER. This specific example that I posted is one of the few times that
    Sphinx3 exhibits a change in WER.

     
  • Nickolay V. Shmyrev

    It seems that Pocketsphinx expects the input features to fall within a
    specific dynamic range or scale.

    Yes

    Moreover, if the scaling factor is too high (16 or 32 for example),
    pocketsphinx completely fails to decode speech while Sphinx3 still performs
    perfectly well. So going by this, is there some specific range in which
    pocketsphinx expects its features to be?

    Pocketsphinx uses only 8 bits to store the acoustic score of a Gaussian
    and 10 bits for the score of a senone. You can find the corresponding
    parts in hmm.h; in particular, there is a constant called SENSCR_SHIFT.

    Is this a bug? Or is it an undocumented design step in Pocketsphinx?

    It's a design decision for embedded applications.
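
    To illustrate the failure mode, here is a sketch (not the actual
    pocketsphinx code; the shift value and the packing function are assumed
    for illustration): when a log-domain score is right-shifted and stored in
    an 8-bit cell, the large constant offset introduced by feature scaling
    saturates the cell, so scores that should differ collapse to the same
    value.

    ```python
    SENSCR_SHIFT = 10  # assumed shift value, for illustration only


    def pack_score(raw_score):
        """Shift a (negative) log score right and clamp it to an unsigned
        8-bit cell, as a sketch of 8-bit score storage."""
        shifted = -raw_score >> SENSCR_SHIFT  # scores are negative log-probs
        return min(shifted, 255)              # saturate at the 8-bit ceiling


    print(pack_score(-50_000))     # fits: a distinct value survives
    print(pack_score(-900_000))    # scaling offset: clamped to 255
    print(pack_score(-1_800_000))  # larger offset: also 255, scores collapse
    ```

    Once every score saturates at 255, the decoder can no longer distinguish
    hypotheses, which matches the observed complete failure at large scales.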

     
  • Nickolay V. Shmyrev

    So what's the approximate relationship between acoustic scale and language
    weight? By what factor should I scale the language weight if I scale my
    features by 1000, for example?

    Never mind, the language weight should stay the same. I was wrong: the
    dynamic range of the acoustic scores doesn't change; the scores are just
    shifted by log(det|A|), where A is the transform matrix. However, the
    overflow problem still exists.

    When you scale the features, the acoustic score per frame changes by
    log(det|A|), which for a uniform scale of 1000 across all dimensions is
    feature_dimension * log(1000). The log is taken with base 1.0003, and the
    feature dimension is 39. Thus you get an additive offset to the acoustic
    score of ln(1000)/ln(1.0003) * 39, or about 900,000 per frame.
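
    That figure can be checked in plain Python, independently of the decoders
    (the base 1.0003 is the log base quoted above):

    ```python
    import math

    # Per-dimension shift from scaling a feature by 1000, in log base 1.0003:
    per_dim = math.log(1000) / math.log(1.0003)

    # Total per-frame shift for a 39-dimensional feature vector:
    offset = 39 * per_dim

    print(round(per_dim))  # roughly 23,000 per dimension
    print(round(offset))   # roughly 898,000 per frame
    ```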

    Compare this:

    FV:cen7-mmxg-b>    WORD  SFrm  EFrm  AScr(UnNorm)  LMScore  AScr+LScr   AScale
    fv:cen7-mmxg-b>   <sil>     0    21       1106828   -82125    1024703  1253845

    And this:

    FV:cen7-mmxg-b>    WORD  SFrm  EFrm  AScr(UnNorm)  LMScore   AScr+LScr     AScale
    fv:cen7-mmxg-b>   <sil>     0    21     -17757206   -82125   -17839331  -17618506

    The difference over 21 frames is about 18,860,000, which is roughly
    21 * 898,000. There are small deviations introduced by transition scores,
    but overall it matches.

    And I did play around with the SENSCR_SHIFT values earlier, but they made
    no difference to my scores. And Sphinx3 suffers from scaling, but not all
    the time. I have tried other experiments, varying the scale, and Sphinx3
    maintains its WER. This specific example that I posted is one of the few
    times that Sphinx3 exhibits a change in WER.

    You also need to change the data types, not just SENSCR_SHIFT. All score
    data types should be uint32, not uint8 and uint16. The log table used in
    acoustic score computation should also be 32-bit; currently it's only
    8-bit for the acoustic scores (see lmath_8b in the sources). Overall it's
    not a simple change.

     
