Hello,
I am using pocketsphinx for spoken command recognition. The grammar is simple and the vocabulary set is small; however, recognition accuracy and performance are lower than expected. To check whether I was doing anything wrong, I compared pocketsphinx's performance and accuracy with the sphinx3 decoder, using models provided by cmusphinx (the US English TIDIGITS Telephone Acoustic Model and Voxforge English) and models I trained myself. In all cases I used an 8 kHz sampling frequency and continuous models (tidigits_cd_phone_201103 and voxforge_en_sphinx.cd_cont_3000). I also adapted the voxforge model with the TIDIGITS database using MLLR and MAP.
I also trained a model with the TIDIGITS speech database; I called this model "tidigits.8k.cd_cont_250".
Here are some of the performance metrics I obtained, along with some of the parameters I used for sphinx3 and pocketsphinx:
With pocketsphinx:
Ins Dels Subs WER    SER    ACC     CORR    BEAM       WBEAM      LW  WIP        speech     xRT   Model
125 151  323  2.10%  4.89%  97.90%  98.34%  1.00E-060  1.00E-040  12  1.00E-001  14862.32   0.03  tidigits_cd_phone_201103
112 167  1208 5.20%  13.05% 94.80%  95.19%  1.00E-060  1.00E-040  12  5.00E-001  14862.32   0.07  voxforge_en (adapted)
103 126  181  1.43%  3.38%  98.57%  98.93%  1.00E-060  1.00E-040  12  1.00E-001  14862.32   0.02  tidigits.8k.cd_cont_250
With sphinx3 decode:
Ins Dels Subs WER    SER    ACC     CORR    BEAM       WBEAM      LW  WIP        speech     xRT   Model
165 29   139  1.17%  3.23%  98.83%  99.41%  1.00E-060  1.00E-040  12  1.00E-001  14862.32   0.01  tidigits_cd_phone_201103
195 67   105  1.28%  3.86%  98.72%  99.40%  1.00E-060  1.00E-040  12  5.00E-001  14862.32   0.02  voxforge_en (adapted)
68  28   64   0.56%  1.67%  99.44%  99.68%  1.00E-060  1.00E-040  12  1.00E-001  14862.32   0.01  tidigits.8k.cd_cont_250
As you can see, sphinx3 outperforms pocketsphinx in every case. What can I do to get better performance with pocketsphinx? Are these results OK, or should I expect better performance from pocketsphinx?
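(For reference, here is a minimal sketch of how the accuracy and speed columns above are usually defined, assuming the standard sclite-style formulas; the word count and timings below are invented only for illustration.)

use strict;
use warnings;

# Word error rate, percent correct and percent accuracy from the error counts.
sub score {
    my ( $ins, $del, $sub, $n_ref ) = @_;                   # $n_ref = reference word count
    my $wer  = 100 * ( $ins + $del + $sub ) / $n_ref;       # WER
    my $corr = 100 * ( $n_ref - $del - $sub ) / $n_ref;     # CORR (insertions not counted)
    my $acc  = 100 - $wer;                                   # ACC
    return ( $wer, $corr, $acc );
}

# xRT is decoding time divided by audio duration.
sub xrt { my ( $cpu_s, $speech_s ) = @_; return $cpu_s / $speech_s; }

# Invented counts, only to show the calls:
my ( $wer, $corr, $acc ) = score( 5, 3, 10, 1000 );
printf "WER %.2f%%  CORR %.2f%%  ACC %.2f%%  xRT %.3f\n", $wer, $corr, $acc, xrt( 20, 1000 );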
Here is some additional tuning I did (a sketch of how such options can be passed from a driver script follows the option lists below).
With the tidigits_cd_phone_201103 and tidigits.8k.cd_cont_250 models I used:
-sendump models/tidigits/<model>/hmm/sendump
for pocketsphinx:
-fwdflat yes
-fwdtree yes
-bestpath yes
-pl_window 1
-fillprob 0.01
With the voxforge English (adapted) model I used:
Tuning - Reducing GMM computation
-ci_pbeam 1e-5
Tuning - Reducing HMM computation and Search
-maxcdsenpf 60
-maxhmmpf 70
-maxwpf 4
-topn 2
for pocketsphinx:
-fwdflat yes
-fwdtree yes
-bestpath yes
-pl_window 1
-fillprob 0.01
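(This is not my actual script, just a minimal sketch of how these options can be collected in a driver script and handed to pocketsphinx_batch; the model, dictionary, control-file, and output paths are placeholders.)

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: collect the tuning options in one hash and build the
# pocketsphinx_batch command from it, so nothing is hard-coded per run.
my %opts = (
    '-hmm'       => 'models/tidigits/hmm',       # placeholder acoustic model directory
    '-dict'      => 'models/tidigits/dict.dic',  # placeholder dictionary
    '-ctl'       => 'test/tidigits.fileids',     # placeholder control file
    '-hyp'       => 'results/tidigits.hyp',      # placeholder hypothesis output
    # tuning values from the tables above
    '-beam'      => '1e-60',
    '-wbeam'     => '1e-40',
    '-lw'        => '12',
    '-wip'       => '0.1',
    '-fwdflat'   => 'yes',
    '-fwdtree'   => 'yes',
    '-bestpath'  => 'yes',
    '-pl_window' => '1',
    '-fillprob'  => '0.01',
);

my @cmd = ( 'pocketsphinx_batch', map { ( $_, $opts{$_} ) } sort keys %opts );
print "@cmd\n";
system(@cmd) == 0 or die "pocketsphinx_batch failed: $?";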
Thank you very much in advance for your help.
Luciano
Hello
I can't confirm your experiment results. Accuracy is OK for all experiments: it's reasonable for the telephone model, for the voxforge adaptation result, and for your trained model, but the speed is not correct.
Considering the last experiment, with the cd_cont_250 database, the numbers should be the following with the latest SphinxTrain-0.7, pocketsphinx-0.7, and the latest sphinx3 on the TIDIGITS test set.
Yes, pocketsphinx is a little bit less accurate, but that is compensated by 4 times faster decoding. It's possible to tune pocketsphinx for accuracy (disable top Gaussian tracking, score shifting, etc.). For semi-continuous models pocketsphinx should be as accurate as sphinx3. In real-life conditions (noise, etc.) the difference will be far less significant.
Take into account that the following options are only useful for a large vocabulary. They reduce accuracy and waste time with TIDIGITS:
-fwdflat yes
-bestpath yes
-pl_window 1
It's better to disable all three.
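(As a minimal sketch, assuming a driver script keeps its decoder options in a hash like the one above, the small-vocabulary overrides would be:)

# Small-vocabulary overrides suggested above (sketch only; the thread below
# confirms that -pl_window already defaults to 0 in pocketsphinx).
my %small_vocab = (
    '-fwdflat'   => 'no',
    '-bestpath'  => 'no',
    '-pl_window' => '0',
);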
Thank you very much for your reply, Nickolay. I really appreciate it.
Now I understand that accuracy is fairly good with all the models I am working with. Regarding performance, you said that the speed is not correct.
I calculate speed from the timing lines in the log files. In the case of pocketsphinx, my calculation using Perl is:
while (<FILE>) {
    print;
    # pocketsphinx log summary line: total speech, CPU and wall-clock seconds
    if (/TOTAL\s+(\S+)\sseconds speech,\s+(\S+)\sseconds CPU,\s+(\S+)/) {
        $speechTime = $1;
        $CPUTime    = $2;
        $wallTime   = $3;
    }
}
$xRT = $CPUTime / $speechTime;
In the case of sphinx3, it is slightly different; my calculation using Perl is:
while (<FILE>) {
    ...
    # sphinx3 log summary line: overall CPU and wall-clock real-time factors
    if (/tot:\s+(\S+)\s+xCPU,\s+(\S+)\s+xClk/) {
        $xCPU = $1;
        $xClk = $2;
    }
}
$xRT = $xCPU;
Is there anything wrong with the way I extract xRT info from the log files?
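(As a quick sanity check that the two definitions agree, with invented numbers:)

# Sketch with made-up timings: xRT computed from raw seconds vs. the ratio
# that sphinx3 already reports as xCPU.
my $speech_s = 1000.0;   # seconds of audio in the test set (invented)
my $cpu_s    = 20.0;     # seconds of CPU time spent decoding (invented)
printf "from raw times: xRT = %.3f\n", $cpu_s / $speech_s;   # 0.020
printf "from sphinx3:   xRT = %.3f\n", 0.02;                 # xCPU is already this ratio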
And here are the results after using:
-fwdflat no
-bestpath no
-pl_window 1
Ins Dels Subs WER    SER    ACC     CORR    BEAM       WBEAM      LW  WIP        speech     xRT
91  116  182  1.36%  3.46%  98.64%  98.96%  1.00E-060  1.00E-040  12  1.00E-001  14862.32   0.0239
Just a bit better.
I am running the experiments on a PC with an i7-870 CPU at 2.93 GHz (quad core, 8 logical processors) and 4 GB of physical memory.
I'm also running pocketsphinx 0.7 on a Motorola Atrix smartphone (dual-core processor) using the demo for Android devices. It takes a very long time for the pocketsphinx demo to load the voxforge adapted model. Recognition is also significantly slower (compared with the TIDIGITS models). In all cases I am using the same configuration as on the PC. Should I use semicontinuous models?
Thank you again, Nickolay.
I don't think it's a speed calculation issue; I think the issue is the parameters you are using to invoke pocketsphinx. Our sphinx3 performance is more or less equivalent, but the pocketsphinx one is very different. Maybe you can share the full log so we can compare.
Yes, you need a semi-continuous model and a few more things. The default model hub4wsj_sc_8k should be better than voxforge from this point of view.
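(As a sketch of what that change amounts to in a driver script; the path below is a placeholder for wherever the hub4wsj_sc_8k model is installed.)

# Point the decoder at the default semi-continuous model instead of the
# voxforge continuous one (placeholder path).
my %ps_opts = ( '-hmm' => '/path/to/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k' );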
Nickolay, I am sharing the full log at: http://www.mediafire.com/?5st205862da4768
Thanks
Hey, I see in the log you are still using pl_window. That makes the decoder very slow. Are you sure you are passing all the arguments properly? Check my list.
You are absolutely right, that was the problem. In my script I had hardcoded some default values, and pl_window was among them. (The pl_window default value is 1 in S3 but 0 in PS.)
After correcting it I am obtaining:
Ins Dels Subs WER    SER    ACC     CORR    BEAM       WBEAM      LW  WIP        speech     xRT
50  59   164  1.30%  3.53%  98.70%  99.22%  1.00E-060  1.00E-040  12  1.00E-001  14862.32   0.0058
A bit less accurate than yours but almost the same speed.
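(For what it's worth, here is a minimal sketch of how I could keep the two decoders' options separate so a sphinx3 default does not leak into the pocketsphinx run again; the values are only illustrative.)

# Sketch: per-decoder option sets, so defaults are never shared by accident.
my %common  = ( '-beam' => '1e-60', '-wbeam' => '1e-40', '-lw' => '12', '-wip' => '0.1' );
my %s3_only = ( '-pl_window' => '1' );   # matches the sphinx3 default, harmless there
my %ps_only = ();                        # leave -pl_window unset so pocketsphinx keeps its default (0)

my %s3_opts = ( %common, %s3_only );
my %ps_opts = ( %common, %ps_only );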
Thank you very much, Nickolay.
Oh, by the way, in a previous message you also said:
Yes, you need semicont model and few more things.
Please let me know if you have any other suggestions apart from the ones you have already given me.
Thanks
Luciano