It seems that Pocketsphinx expects the input features to fall within a
specific dynamic range or scale. I conducted a set of experiments with Sphinx3
and Pocketsphinx, with the features extracted using the front-end in
sphinxbase. Once I had the features, I scaled them by integer factors
(2, 4, 8, 16, and so on). I observed that the WER with Pocketsphinx changes
with the scaling factor, while it stays the same with Sphinx3. Moreover, if
the scaling factor is too high (16 or 32, for example), Pocketsphinx
completely fails to decode speech while Sphinx3 still performs perfectly well.
Going by this, is there some specific range in which Pocketsphinx expects its
features to be? If so, how do I find that out? I went through the code but
couldn't find anything related to it. Is this a bug, or an undocumented
design decision in Pocketsphinx?
Any help would be appreciated.
Thanks,
Mohit
Can we see the decoder arguments in both cases?
Anonymous, 2012-06-16
Also, for clarification - I first trained on the features 'as is' and tested
on the features 'as is'. In the second scenario, I trained on the features
scaled by 1000 and tested on the features scaled by 1000.
Anonymous, 2012-06-17
And I did play around with the SENSCR_SHIFT values earlier, but they made no
difference to my scores. And Sphinx3 suffers from scaling, but not all the
time. I have tried other experiments, varying the scale, and Sphinx3 maintains
its WER. This specific example that I posted is one of the few times that
Sphinx3 exhibits a change in WER.
> It seems that Pocketsphinx expects the input features to fall within a
> specific dynamic range or scale.
Yes
> Moreover, if the scaling factor is too high (16 or 32 for example),
> pocketsphinx completely fails to decode speech while Sphinx3 still performs
> perfectly well. So going by this, is there some specific range in which
> pocketsphinx expects its features to be?
Pocketsphinx uses only 8 bits to store the acoustic score of a Gaussian and
10 bits for the score of a senone. You can find the corresponding parts in
hmm.h; in particular, there is a constant called SENSCR_SHIFT.
> Is this a bug? Or is it an undocumented design step in Pocketsphinx?
It's a design decision for embedded applications.
So what's the approximate relationship between acoustic scale and language
weight? By what factor should I scale the language weight if I scale my
features by 1000, for example?
Never mind, the language weight should stay the same. I was wrong: the
dynamic range of the acoustic scores doesn't change; the scores are just
shifted by log(det|A|), where A is the transform matrix. However, the
overflow problem still exists.
When you scale the features, the acoustic score per frame changes by
log(det|A|). For a uniform scale of 1000 on 39-dimensional features this is
39 * log(1000), and the log is taken with base 1.0003, so the additive factor
per frame is ln(1000)/ln(1.0003) * 39, roughly 900,000. The difference over
21 frames is about 18,480,000, i.e. 21 * 880,000. There are small differences
introduced by transition scores, but overall that is what it looks like.
> And I did play around with the SENSCR_SHIFT values earlier, but they made
> no difference to my scores. And Sphinx3 suffers from scaling, but not all
> the time. I have tried other experiments, varying the scale, and Sphinx3
> maintains its WER. This specific example that I posted is one of the few
> times that Sphinx3 exhibits a change in WER.
You also need to change the data types, not just SENSCR_SHIFT. All the score
types should be uint32, not uint8 and uint16. The log table also needs to be
32-bit in the acoustic score computation; currently it's only 8-bit for the
acoustic scores. See lmath_8b in the sources. Overall, it's not a simple
change.
Hi,
The only dynamic-range constraints are those imposed by the data type (int,
float, etc.).
If you are going to scale your features, I guess you should also scale them
before training (with SphinxTrain).
But it's surprising that you are still getting good results with Sphinx3; I
would expect the WER to be bad for both decoders with scaled features.
Well, yes, I did train on the scaled features. So if I train with a scale
factor of 2, my test set is scaled by the same factor.
https://docs.google.com/open?id=0B34iZP4k14xGUzFIbkFfT1E1alk
Here's another link.