
pocketsphinx accuracy and performance

Help
luciano
2011-05-03
2012-09-22
  • luciano

    luciano - 2011-05-03

    Hello,
    I am using pocketsphinx for spoken command recognition. The grammar is simple
    and the vocabulary is small; however, recognition accuracy and speed are
    lower than expected. To check whether I was doing anything wrong, I
    compared pocketsphinx speed and accuracy with the sphinx3 decoder, using
    models provided by cmusphinx (US English Tidigits Telephone Acoustic Model and
    Voxforge English) and models I trained. In all
    cases I used an 8kHz sampling frequency and continuous models
    (tidigits_cd_phone_201103 and voxforge_en_sphinx.cd_cont_3000). I also adapted
    the voxforge model with the TIDIGITS database using MLLR and MAP.
    I trained a model with the TIDIGITS speech database. I called this model
    "tidigits.8k.cd_cont_250".
    Here are some performance metrics I obtained, along with some of the
    parameters I used for sphinx3 and pocketsphinx:

    With pocketsphinx
    Ins Dels Subs WER   SER    ACC    CORR   BEAM      WBEAM     LW WIP       speech   xRT  Model
    125 151  323  2.10% 4.89%  97.90% 98.34% 1.00E-060 1.00E-040 12 1.00E-001 14862.32 0.03 tidigits_cd_phone_201103
    112 167  1208 5.20% 13.05% 94.80% 95.19% 1.00E-060 1.00E-040 12 5.00E-001 14862.32 0.07 voxforge_en (adapted)
    103 126  181  1.43% 3.38%  98.57% 98.93% 1.00E-060 1.00E-040 12 1.00E-001 14862.32 0.02 tidigits.8k.cd_cont_250

    With the sphinx3 decoder
    Ins Dels Subs WER   SER   ACC    CORR   BEAM      WBEAM     LW WIP       speech   xRT  Model
    165 29   139  1.17% 3.23% 98.83% 99.41% 1.00E-060 1.00E-040 12 1.00E-001 14862.32 0.01 tidigits_cd_phone_201103
    195 67   105  1.28% 3.86% 98.72% 99.40% 1.00E-060 1.00E-040 12 5.00E-001 14862.32 0.02 voxforge_en (adapted)
    68  28   64   0.56% 1.67% 99.44% 99.68% 1.00E-060 1.00E-040 12 1.00E-001 14862.32 0.01 tidigits.8k.cd_cont_250

    As you can see, sphinx3 outperforms pocketsphinx in every case. What can I do
    to get better results with pocketsphinx? Are these results OK, or
    should I expect better performance from pocketsphinx?
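
For readers checking the tables above, the WER, ACC and CORR columns appear to follow the usual word-level definitions. The following is a minimal Python sketch of that arithmetic, not the scoring tool used in the thread. The reference word count is not stated in the post; `N_REF` below is an assumption, chosen only because it is consistent with the reported percentages.

```python
def error_rates(ins, dels, subs, n_ref):
    """Return (WER, ACC, CORR) as percentages for word-level error counts."""
    wer = 100.0 * (ins + dels + subs) / n_ref
    acc = 100.0 - wer                              # accuracy also penalises insertions
    corr = 100.0 * (n_ref - dels - subs) / n_ref   # percent correct ignores insertions
    return wer, acc, corr

# Assumption: number of reference words in the test set (not given in the thread).
N_REF = 28583

# First pocketsphinx row of the table: Ins=125, Dels=151, Subs=323
wer, acc, corr = error_rates(125, 151, 323, N_REF)
print(f"WER {wer:.2f}%  ACC {acc:.2f}%  CORR {corr:.2f}%")
```

Because insertions count against ACC but not CORR, a decoder with many insertions (such as the sphinx3 rows) can show a high CORR while ACC stays lower.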
    Here are the other tuning settings I used:

    with tidigits_cd_phone_201103 and tidigits.8k.cd_cont_250 models I used:

    -sendump models/tidigits/<model>/hmm/sendump

    for pocketsphinx:

    -fwdflat yes
    -fwdtree yes
    -bestpath yes
    -pl_window 1

    -fillprob 0.01

    with voxforge english (adapted) model I used

    Tuning - Reducing GMM computation

    -ci_pbeam 1e-5

    Tuning - Reducing HMM computation and Search

    -maxcdsenpf 60
    -maxhmmpf 70
    -maxwpf 4

    -topn 2

    for pocketsphinx:

    -fwdflat yes
    -fwdtree yes
    -bestpath yes
    -pl_window 1

    -fillprob 0.01

    Thank you very much in advance for your help
    Luciano

     
  • Nickolay V. Shmyrev

    Hello

    I can't confirm your experiment results. Accuracy is OK in all
    experiments: it's reasonable for the telephone model, for the voxforge
    adaptation result and for your trained model. But the speed is not correct.

    If we consider the last experiment, with the cd_cont_250 model, the
    numbers should be the following with the latest SphinxTrain-0.7,
    pocketsphinx-0.7 and the latest sphinx3 on the TIDIGITS test set:

                 WER   SER   RT     BEAM   WBEAM  LW  WIP
    pocketsphinx 1.1%  3.1%  0.006  1e-80  1e-40  12  0.1
    sphinx3      0.7%  1.8%  0.025  1e-80  1e-40  12  0.1
    

    Yes, pocketsphinx is a little less accurate, but that is compensated by
    4 times faster decoding. It's possible to tune pocketsphinx for accuracy
    (disable top gaussian tracking, score shifting, etc). For semicontinuous
    models pocketsphinx should be as accurate as sphinx3. In real-life
    conditions (noise, etc) the difference will become much less significant.
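
The trade-off described here can be checked with a few lines of arithmetic. This is only a sketch using the WER and RT figures quoted in the table above; the variable names are illustrative.

```python
# Figures quoted in the table above (RT = decoding time / audio duration).
ps_wer, ps_xrt = 1.1, 0.006   # pocketsphinx
s3_wer, s3_xrt = 0.7, 0.025   # sphinx3

speedup = s3_xrt / ps_xrt                       # how many times faster pocketsphinx decodes
rel_wer_increase = (ps_wer - s3_wer) / s3_wer   # relative rise in word error rate

print(f"speedup: {speedup:.1f}x, relative WER increase: {rel_wer_increase:.0%}")
```

In absolute terms the WER gap is 0.4 points, which is why it is described as small even though the relative increase is noticeable.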

    Take into account that the following options are only useful for a large
    vocabulary. They reduce accuracy and waste time with TIDIGITS:

    -fwdflat yes
    -bestpath yes
    -pl_window 1
    

    It's better to disable all three

     
  • luciano

    luciano - 2011-05-03

    Thank you very much for your reply, Nickolay. I really appreciate it.

    Now I understand that accuracy is fairly good with all the models I am working
    with. Regarding speed, you said:

    ... but speed is not correct.

    I calculate speed using the following lines from the log files:

    In the case of pocketsphinx:

    lines in the log file:
    INFO: batch.c(774): TOTAL 14862.32 seconds speech, 391.89 seconds CPU, 392.90
    seconds wall
    INFO: batch.c(776): AVERAGE 0.03 xRT (CPU), 0.03 xRT (elapsed)

    and my calculation using perl is:
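    # FILE is a filehandle already opened on the pocketsphinx log
    while (<FILE>) {
        print;
        # Capture speech, CPU and wall-clock seconds from the TOTAL line
        if (/TOTAL\s+(\S+)\sseconds speech,\s+(\S+)\sseconds CPU,\s+(\S+)/) {
            $speechTime = $1;
            $CPUTime    = $2;
            $wallTime   = $3;
        }
    }
    $xRT = $CPUTime / $speechTime;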

    in the case of sphinx3, it is slightly different:

    lines in the log file:
    INFO: stat.c(206): SUMMARY: 1486232 fr; 55 cdsen/fr, 102 cisen/fr, 440
    cdgau/fr, 816 cigau/fr, 0.01 xCPU 0.01 xClk ; 32 hmm/fr, 1 wd/fr, 0.00 xCPU
    0.00 xClk; tot: 0.01 xCPU, 0.02 xClk

    and my calculation using perl is:
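    # FILE is a filehandle already opened on the sphinx3 log
    while (<FILE>) {
        ...
        # Capture the overall xCPU and xClk figures from the SUMMARY line
        if (/tot:\s+(\S+)\s+xCPU,\s+(\S+)\s+xClk/) {
            $xCPU = $1;
            $xClk = $2;
        }
    }
    $xRT = $xCPU;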

    Is there anything wrong with the way I extract xRT info from the log files?
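
For comparison, the same extraction from the pocketsphinx log can be sketched in Python. This is a minimal sketch, not part of the original scripts; the function name `xrt_from_log` is hypothetical, and the log text is the TOTAL line quoted above.

```python
import re

# \s+ also matches a newline, so a TOTAL line wrapped by the forum still parses.
TOTAL_RE = re.compile(
    r"TOTAL\s+(\S+)\s+seconds speech,\s+(\S+)\s+seconds CPU,\s+(\S+)\s+seconds wall"
)

def xrt_from_log(text):
    """Return (cpu_xrt, wall_xrt) from the last TOTAL line found in the log text."""
    cpu_xrt = wall_xrt = None
    for match in TOTAL_RE.finditer(text):
        speech, cpu, wall = (float(g) for g in match.groups())
        cpu_xrt, wall_xrt = cpu / speech, wall / speech
    return cpu_xrt, wall_xrt

log = ("INFO: batch.c(774): TOTAL 14862.32 seconds speech, "
       "391.89 seconds CPU, 392.90 seconds wall\n")
cpu_xrt, wall_xrt = xrt_from_log(log)
print(f"{cpu_xrt:.4f} xRT (CPU), {wall_xrt:.4f} xRT (elapsed)")
```

Dividing the raw seconds gives more digits than the AVERAGE line, which rounds to two decimals (0.03); that matters when comparing against figures such as 0.006.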

    and here are the results after using

    -fwdflat no
    -bestpath no

    -pl_window 1

    Ins Dels Subs WER   SER   ACC    CORR   BEAM      WBEAM     LW WIP       speech   xRT
    91  116  182  1.36% 3.46% 98.64% 98.96% 1.00E-060 1.00E-040 12 1.00E-001 14862.32 0.0239

    Just a bit better.
    I am running the experiments on a PC with an i7 CPU 870/2.93GHz (quad core/8
    logical processors) and 4GB of physical memory.
    I'm also running pocketsphinx 0.7 on a Motorola Atrix smartphone (dual core
    processor) using the demo for Android devices. It takes a very long time for
    the pocketsphinx demo to load the voxforge adapted model. Recognition is also
    significantly slower (compared with the TIDIGITS models). In all cases I am
    using the same configuration as on the PC. Should I use semicontinuous models?
    Thank you again Nickolay

     
  • Nickolay V. Shmyrev

    INFO: batch.c(774): TOTAL 14862.32 seconds speech, 391.89 seconds CPU,
    392.90 seconds wall
    INFO: batch.c(776): AVERAGE 0.03 xRT (CPU), 0.03 xRT (elapsed)

    I don't think it's a speed calculation issue. I think the issue is the
    parameters you are using to invoke pocketsphinx. Our sphinx3 results are
    more or less equivalent, but the pocketsphinx ones are very different. Maybe
    you can share the full log so we can compare.

    Recognition is also significantly slow (comparing with TIDIGIT models). In
    all the cases I am using the same configuration as in the PC. Should I use
    semicontinuous models?

    Yes, you need a semicontinuous model and a few more things. The default model
    hub4wsj_sc_8k should be better than voxforge from this point of view.

     
  • luciano

    luciano - 2011-05-03

    Nickolay, I am sharing the full log at:
    http://www.mediafire.com/?5st205862da4768

    Thanks

     
  • Nickolay V. Shmyrev

    Hey, I see in the log that you are still using pl_window. That makes the
    decoder very slow. Are you sure you are passing all the arguments properly?
    Check my list.

    /home/nshmyrev/projects/cmusphinx-dist/tidigits/bin/pocketsphinx_batch \
            -hmm /home/nshmyrev/projects/cmusphinx-dist/tidigits/model_parameters/tidigits.cd_cont_250 \
            -lw 12 \
            -feat 1s_c_d_dd \
            -beam 1e-80 \
            -wbeam 1e-40 \
            -dict /home/nshmyrev/projects/cmusphinx-dist/tidigits/etc/tidigits.dic \
            -lm /home/nshmyrev/projects/cmusphinx-dist/tidigits/etc/tidigits.lm.DMP \
            -wip 0.2 \
            -ctl /home/nshmyrev/projects/cmusphinx-dist/tidigits/etc/tidigits_test.fileids \
            -ctloffset 0 \
            -ctlcount 8688 \
            -cepdir /home/nshmyrev/projects/cmusphinx-dist/tidigits/feat \
            -cepext .mfc \
            -hyp /home/nshmyrev/projects/cmusphinx-dist/tidigits/result/tidigits-1-1.match \
            -agc none \
            -varnorm no \
            -cmn current \
            -fwdflat no \
            -bestpath no
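
A quick way to catch this kind of leftover option is to audit the argument list before launching the decoder. The sketch below is illustrative only: `parse_flags` and `audit_args` are hypothetical helpers, and they inspect only flags that are explicitly present, without modelling the decoder's built-in defaults. The flags checked are the ones named earlier in this thread as costly on TIDIGITS.

```python
def parse_flags(argv):
    """Map each "-flag" token to the value that follows it."""
    opts = {}
    i = 0
    while i < len(argv) - 1:
        if argv[i].startswith("-"):
            opts[argv[i]] = argv[i + 1]
            i += 2
        else:
            i += 1  # skip the program name or any stray token
    return opts

def audit_args(argv):
    """Return the explicitly set options that the thread advises disabling."""
    opts = parse_flags(argv)
    warnings = []
    for flag in ("-fwdflat", "-bestpath"):
        if opts.get(flag) == "yes":
            warnings.append(f"{flag} yes")
    if opts.get("-pl_window", "0") != "0":
        warnings.append(f"-pl_window {opts['-pl_window']}")
    return warnings

argv = ["pocketsphinx_batch", "-beam", "1e-80", "-fwdflat", "no",
        "-bestpath", "no", "-pl_window", "1"]
print(audit_args(argv))
```

Running the audit on the exact list a wrapper script generates, rather than on the intended settings, is what exposes values that the script adds on its own.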
    
     
  • luciano

    luciano - 2011-05-03

    You are absolutely right. That was the problem. In my script I had hardcoded
    some default values and pl_window was among them (the pl_window default value
    is 1 in S3 but 0 in PS).
    After correcting it I am obtaining:

    Ins    Dels   Subs   WER    SER    ACC    CORR   BEAM      WBEAM     LW  WIP       speech    xRT1
    50    59     164    1.30%  3.53%  98.70% 99.22% 1.00E-060 1.00E-040 12  1.00E-001 14862.32  0.0058
    

    A bit less accurate than yours but almost the same speed.
    Thank you very much Nickolay

    Oh, by the way, in a previous message you also said:

    Yes, you need semicont model and few more things.

    Please let me know if you have any other suggestions apart from the ones you
    have already given me.
    Thanks
    Luciano

     
