Just sharing an interesting observation.
Building on my experiences with a couple of other recognizers (that shall
remain nameless), including one with a robust duration model and another that,
like PocketSphinx, lacks a duration model, I tried a little experiment.
Simple HMMs, which can only hold at a state or move on to the next state,
tend to model an exponentially decaying state duration, but no known model of
speech production actually has an exponentially decaying duration. Because of
that, you typically get better accuracy by either tweaking the state transition
probabilities in a rather arbitrary (hackish) fashion, or simply deliberately
mismatching the framerate between the training data and the actual recognition
task. The second approach is much easier to tune, since it's scriptable.
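A minimal sketch of what "scriptable" could look like, assuming the mismatch is introduced through the front end's `-frate` option of `pocketsphinx_batch` (I haven't spelled out here how the rate was actually changed, so treat that as an assumption). The model directory, control file, and audio directory names are hypothetical placeholders, and the grammar and dictionary options are left out. The script only builds the command lines; it doesn't run anything.

```python
import shlex


def sweep_commands(hmm_dir, ctl_file, audio_dir, frame_rates):
    """Build one pocketsphinx_batch command line per candidate frame rate."""
    commands = []
    for rate in frame_rates:
        commands.append([
            "pocketsphinx_batch",
            "-hmm", hmm_dir,        # acoustic model directory (hypothetical path)
            "-ctl", ctl_file,       # control file listing the test utterances
            "-cepdir", audio_dir,   # directory holding the input files
            "-cepext", ".wav",      # input file extension
            "-adcin", "yes",        # input is raw audio, not precomputed features
            "-frate", str(rate),    # assumption: the deliberately mismatched frame rate
            "-hyp", f"hyp_{rate}.txt",  # one hypothesis file per rate, scored afterwards
        ])
    return commands


# Print the commands for the frame rates tried in this post.
for cmd in sweep_commands("model", "test.ctl", "wav", [70, 80, 83, 100, 110, 120]):
    print(shlex.join(cmd))
```

Each hypothesis file then gets scored against the reference transcripts with whatever scoring tool you already use, which gives one error rate per frame rate.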
I wasn't quite prepared for the results. With a particularly ugly corpus
(distant mic, automotive task, in the rain) and grammar, my peak accuracy was
at reduced framerates. Specifically, the best framerate for each model was:
• 83 - voxforge_en_sphinx.cd_cont_5000
• 75 - hub4wsj_sc_8k
The fact that two different models have two different minima shows how broken
the whole idea of exponential duration is, but that's a topic for another day.
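To make the exponential duration point concrete, here is a small sketch of the duration distribution implied by a single HMM state with a fixed self-loop probability. The value 0.7 is purely illustrative and not taken from any real model. The second function shows, under the same simplification, how the expected time spent in a state shifts when the same transition probabilities are run at a different frame rate.

```python
def duration_pmf(self_loop, max_frames):
    """P(state lasts exactly d frames) for d = 1..max_frames.

    With a fixed self-loop probability this is a geometric distribution:
    stay (d - 1) times, then leave once.
    """
    return [self_loop ** (d - 1) * (1.0 - self_loop)
            for d in range(1, max_frames + 1)]


def mean_duration_ms(self_loop, frame_rate):
    """Expected time spent in the state, in milliseconds, at a given frame rate."""
    mean_frames = 1.0 / (1.0 - self_loop)
    return 1000.0 * mean_frames / frame_rate


a = 0.7  # illustrative self-loop probability, not from any real model
pmf = duration_pmf(a, 10)

# The most likely duration is always a single frame, and each extra frame
# multiplies the probability by the same constant factor `a`.
print([round(p, 3) for p in pmf])

# Same transition probabilities, different frame rates: the implied mean
# duration in real time stretches or shrinks with the frame period.
for rate in (83, 100, 120):
    print(rate, round(mean_duration_ms(a, rate), 1))
```

The shape never changes, only the time scale, which is why a frame rate mismatch can at best shift the model toward more realistic durations rather than fix it.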
I'm more used to an increase in the framerate, typically about 20%, for peak
accuracy, but that's the opposite of what I saw. Looking at
voxforge_en_sphinx.cd_cont_5000, the frame rate vs. sentence error rate was...
70 - 29.78
80 - 30.07
83 - 28.33
100 - 33.25
110 - 36.43
120 - 41.50
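For anyone who wants to redo the comparison on their own sweep, a short sketch that picks the best frame rate and expresses each result relative to a baseline. The numbers are the ones from the table above; treating 100 as the baseline is an assumption (it's the usual default frame rate), and the function names are made up for illustration.

```python
# Frame rate -> sentence error rate, copied from the table above.
ser_by_rate = {70: 29.78, 80: 30.07, 83: 28.33, 100: 33.25, 110: 36.43, 120: 41.50}

BASELINE_RATE = 100  # assumption: the nominal (unmodified) frame rate


def best_rate(results):
    """Frame rate with the lowest error rate."""
    return min(results, key=results.get)


def relative_change(results, rate, baseline=BASELINE_RATE):
    """Change in error rate versus the baseline, as a fraction of the baseline."""
    return (results[rate] - results[baseline]) / results[baseline]


best = best_rate(ser_by_rate)
print("best frame rate:", best, "error rate:", ser_by_rate[best])

# Negative values mean fewer errors than at the baseline frame rate.
for rate in sorted(ser_by_rate):
    print(rate, f"{100 * relative_change(ser_by_rate, rate):+.1f}%")
```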
This could be looked at as:
1) A way to get PocketSphinx to run with a small (3-4%) decrease in word error
rate together with a 15-25% increase in speed (a win-win).
2) A single, tunable parameter that can probably be used for easy, low-overhead
speaker adaptation. My test corpus has a mix of 6 talkers; I didn't try tuning
for each talker individually. (Is there a better parameter to tweak for state
duration? A transition probability bias of some sort?)
3) A bug in need of squashing.
4) An argument as to just how badly Sphinx needs a decent duration model.
5) A thesis project, or at least a piece of one.
Hello
This is an interesting observation. To get feedback, though, it's better to
post it to the cmusphinx-devel mailing list.