Digits recognition with pocketsphinx

Zaur Aliev
2016-03-24
2016-04-25
  • Zaur Aliev

    Zaur Aliev - 2016-03-24

    Hello,

    I use pocketsphinx_continuous to recognize digits (0-9) from MP3 files. The files contain only numbers, pronounced by different persons (male/female), plus some noise. Pauses between digits are about 2 seconds.

    Recognition experts suggested that I use the en-us-8khz acoustic model together with a grammar file. It mostly works, but I found the recognition accuracy to be very low. I then tried the voxforge acoustic model (8 kHz) instead and got much more accurate results. I have also played with various options to improve accuracy.

    What I have so far (statistics over 100+ files):
    en-us-8kHz: only 24% of files are recognized correctly; the other 76% contain mistakes.
    voxforge: only 45% of files are recognized correctly; the other 55% contain mistakes.

    I feel I'm on the wrong track, but I cannot figure out how to use the en-us-8khz model effectively. I guess it should give even better results, since it is supposed to be the most accurate acoustic model...
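
    A quick way to gather such per-file statistics is to exact-match each hypothesis against its reference transcript. A minimal Python sketch (the file ids and transcripts below are hypothetical):

    ```python
    # Compare hypothesis transcripts against references and report the
    # percentage of files recognized exactly right (file-level accuracy).
    def file_accuracy(references, hypotheses):
        """references/hypotheses: dicts mapping file id -> digit string."""
        correct = sum(1 for fid, ref in references.items()
                      if hypotheses.get(fid, "").split() == ref.split())
        return 100.0 * correct / len(references)

    refs = {"f1": "3 5 8", "f2": "0 9 1", "f3": "2 2 7"}  # hypothetical data
    hyps = {"f1": "3 5 8", "f2": "0 9 9", "f3": "2 2 7"}
    print(round(file_accuracy(refs, hyps), 1))  # -> 66.7 (2 of 3 files match)
    ```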

    My Grammar file:

    #JSGF V1.0;
    grammar numbers;
    public <number> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ;
    

    My Dictionary (slightly tweaked):

    0       Z IH R OW
    1       W AH N
    2       T UW
    3       TH R IY
    3(2)    TH R IY IY
    4       F AO R
    5       F AY V
    6       S IH K S
    7       S EH V AH N
    8       EY T
    8(2)    EY T TH
    8(3)    EY TH
    9       N AY N
    

    My Commands:
    pocketsphinx_continuous -dict dict.txt -jsgf gram.txt -hmm voxforge -infile file.wav -remove_dc yes -remove_noise no -vad_threshold 3.4 -vad_prespeech 19 -vad_postspeech 37 -silprob 2.5 (values chosen experimentally)

    and

    pocketsphinx_continuous -dict dict.txt -jsgf gram.txt -hmm en-us-8kHz -infile file.wav -samprate 8000 (default values)

    Please find attached 10 test samples (mp3, wav), reference results, en-us results, voxforge results, and READMEs.

    Thank you for your help.

     

    Last edit: Zaur Aliev 2016-03-24
  • Zaur Aliev

    Zaur Aliev - 2016-03-25

    Nickolay,

    Could you please take a look at my samples and, if you have time, try them with the configuration you think is best? That would answer the question of where the problem lies.

    Thank you,
    Zaur

     
    • Nickolay V. Shmyrev

      I'm traveling today; I'll check tomorrow. Most likely you need to use the default parameters and experiment with -lw instead. All your vad_prespeech tweaks are certainly not needed.

       
      • Zaur Aliev

        Zaur Aliev - 2016-03-25

        Hi Nickolay,

        I have tried recognition with en-us-8kHz again, this time using -lw; here are some results:

        For each of 100 MP3s:

        ffmpeg -y -i input.mp3 -ar 8000 -ac 1 input.wav
        
        pocketsphinx_continuous -dict dict.txt -jsgf grammar.txt -hmm en-us-8kHz -infile input.wav  -samprate 8000  -lw X
        
        (where X is in the range [1.0 ... 10.0]; the default value is 6.5, I guess)
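
        The sweep itself is easy to script. A minimal Python sketch that only assembles the command lines (actually running them, e.g. with subprocess.run, is left out):

        ```python
        # Build one pocketsphinx_continuous command line per -lw value,
        # mirroring the command used above.
        def sweep_commands(lw_values, wav="input.wav"):
            base = ["pocketsphinx_continuous", "-dict", "dict.txt",
                    "-jsgf", "grammar.txt", "-hmm", "en-us-8kHz",
                    "-infile", wav, "-samprate", "8000"]
            return [base + ["-lw", f"{lw:.1f}"] for lw in lw_values]

        cmds = sweep_commands([float(x) for x in range(1, 11)])  # -lw 1.0 ... 10.0
        ```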
        

        I also reverted my dictionary to its initial state - I gathered the numbers from

        .\pocketsphinx\model\en-us\cmudict-en-us.dict
        

        And copied them to dict.txt:

        0 Z IH R OW
        0(2) Z IY R OW
        1 W AH N
        1(2) HH W AH N
        2 T UW
        3 TH R IY
        4 F AO R
        5 F AY V
        6 S IH K S
        7 S EH V AH N
        8 EY T
        8(2) EY TH
        9 N AY N
        

        But the results did not change significantly:

        LW value    Good files %
           1.0          21%
           2.0          23%
           3.0          22%
           4.0          23%
           5.0          25%
           6.0          25%
           7.0          24%
           8.0          24%
           9.0          24%
          10.0          21%
        

        I still think the best option would be for you to check my samples in your environment, if you get the chance.

        Thank you in advance.

         

        Last edit: Zaur Aliev 2016-03-26
  • Nickolay V. Shmyrev

    Hello Zaur

    You can run with the following arguments to get the best accuracy:

    pocketsphinx_batch -ctl test.ctl \
     -cepdir wav -cepext .wav -adcin yes -adchdr 44 \
     -hmm en-us-8khz -samprate 8000 -jsgf test.jsgf -hyp test.hyp \
     -vad_threshold 3.4 -silprob 1.0 -wip 1e-5
    

    You can find the full archive in the attachment.
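
    For reference, the test.ctl control file just lists utterance ids, one per line: each wav path relative to -cepdir with the -cepext suffix stripped. A small Python sketch that generates one from a directory of wav files (the flat directory layout is an assumption):

    ```python
    from pathlib import Path

    # Write a pocketsphinx_batch control file: one utterance id per line,
    # i.e. the file name without its extension.
    def write_ctl(wav_dir, ctl_path, ext=".wav"):
        ids = sorted(p.stem for p in Path(wav_dir).glob(f"*{ext}"))
        Path(ctl_path).write_text("\n".join(ids) + "\n")
        return ids
    ```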

    Then it will give you the following result:

    ~~~~
    TOTAL Words: 90 Correct: 80 Errors: 11
    TOTAL Percent correct = 88.89% Error = 12.22% Accuracy = 87.78%
    TOTAL Insertions: 1 Deletions: 0 Substitutions: 10
    ~~~~
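
    For reference, the three totals are related: "percent correct" charges only deletions and substitutions, while "accuracy" also charges insertions. A small Python sketch reproducing the numbers above from the raw counts:

    ```python
    # word_align.pl-style totals:
    #   percent correct = (N - Deletions - Substitutions) / N
    #   error (WER)     = (Insertions + Deletions + Substitutions) / N
    #   accuracy        = 100 - WER
    def align_totals(n_words, ins, dels, subs):
        correct = 100.0 * (n_words - dels - subs) / n_words
        error = 100.0 * (ins + dels + subs) / n_words
        return round(correct, 2), round(error, 2), round(100.0 - error, 2)

    print(align_totals(90, 1, 0, 10))  # -> (88.89, 12.22, 87.78)
    ```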

    It is not a good idea to adjust the standard dictionary. To improve accuracy further, you need to do several things:

    1) Collect more samples and perform acoustic model adaptation as described in our tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
    2) Avoid conversion to MP3 as much as possible; MP3 for 8 kHz audio is really harmful.

     
    • Nickolay V. Shmyrev

      And, since it seems you are doing this to crack captchas, it is worth noting that they artificially corrupt the spectrum by cutting frequency bands randomly, so it is hard to expect high accuracy from the recognizer. There are a few other approaches, like missing-feature reconstruction, that could improve accuracy in this case, but they would require development. Adaptation with sufficient data should help.

       
    • Zaur Aliev

      Zaur Aliev - 2016-03-28

      Hello Nickolay,
      Thank you.

      I noticed there is no dictionary parameter in your example. Is this OK?

      Did you use batch instead of continuous just to collect statistics, or does it affect the result? Can I use continuous with those parameters?

      And on adaptation - I've done adaptation for the samples: I split my wavs into separate words (digits), which gave about 1500 digit files. Then I performed adaptation as per the tutorial. It seems the result became worse... =( Is it possible that adaptation works badly for these corrupted audio files?

      You said the streams were artificially cut (frequency bands). What kind of development would resolving the issue require? C/C++ coding is not a problem if the material is accessible to an average brain =)

      And I'd like to ask you a question about confidence again - it's still not clear to me. Can I somehow estimate the probability of the results? If so, I will probably be satisfied with the current percentage.

      Sorry for so many questions.

      Regards,
      Zaur

       
      • Nickolay V. Shmyrev

        I noticed there is no dictionary parameter in your example. Is this OK?

        Yes, it used the default dictionary.

        Did you use batch instead of continuous just to collect statistics, or does it affect the result? Can I use continuous with those parameters?

        Yes, you can. Batch is used for testing, continuous for normal operation.

        And on adaptation - I've done adaptation for the samples: I split my wavs into separate words (digits), which gave about 1500 digit files. Then I performed adaptation as per the tutorial. It seems the result became worse... =( Is it possible that adaptation works badly for these corrupted audio files?

        Adaptation should work fine; you would need to provide the data to get help on this issue.

        You said the streams were artificially cut (frequency bands). What kind of development would resolving the issue require? C/C++ coding is not a problem if the material is accessible to an average brain =)

        There is a lot of research on similar problems. Like I said above, adaptation should help; then you can train the model with a band of 1-2 kHz, since the corruption happens above 2 kHz as far as I can see. Then you can read research like this: http://www.cs.cmu.edu/~robust/Papers/RajSeltzerStern04.pdf

        And I'd like to ask you a question about confidence again - it's still not clear to me. Can I somehow estimate the probability of the results? If so, I will probably be satisfied with the current percentage.

        Confidence for small vocabularies is a complex issue and is not supported in our codebase yet. You can use keyword spotting mode, but it will work only for 3-4 syllable phrases, not for digits.

         
        • Zaur Aliev

          Zaur Aliev - 2016-03-29

          It seems adaptation can greatly improve the accuracy (unexpectedly) =)
          My final setup is below.

           

          Last edit: Zaur Aliev 2016-03-30
  • Zaur Aliev

    Zaur Aliev - 2016-03-30

    So finally:

    I decode the MP3s with ffmpeg, specifying an 8000 Hz sample rate.

    Then, to adapt the model, I split all the obtained wavs into separate words (digits) and perform adaptation of en-us-8khz as per the tutorial. My problem here was that I couldn't perform adaptation using mllr_matrix, so I replaced the original en-us-8khz with en-us-8khz_adapt instead.
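
    The splitting step can be scripted. A minimal Python sketch that cuts a mono wav into fixed-length chunks with the stdlib wave module (real recordings usually need silence-based splitting, e.g. with sox or ffmpeg; the fixed chunk length here is an assumption for illustration):

    ```python
    import wave

    # Cut a wav file into fixed-length chunks, e.g. one digit per chunk,
    # writing each chunk as <out_prefix>_NNN.wav.
    def split_wav(path, chunk_seconds, out_prefix):
        out_paths = []
        with wave.open(path, "rb") as w:
            frames_per_chunk = int(w.getframerate() * chunk_seconds)
            params = w.getparams()
            i = 0
            while True:
                frames = w.readframes(frames_per_chunk)
                if not frames:
                    break
                out = f"{out_prefix}_{i:03d}.wav"
                with wave.open(out, "wb") as o:
                    o.setparams(params)   # frame count is patched on close
                    o.writeframes(frames)
                out_paths.append(out)
                i += 1
        return out_paths
    ```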

    After adaptation, 97% of digits are recognized correctly (with the default dictionary).

    I also had to tune the VAD parameters and -wip to get better results.

    Nickolay, thank you for your help again.

     

    Last edit: Zaur Aliev 2016-03-30
    • Alex Vanderpot

      Alex Vanderpot - 2016-04-05

      Hi Zaur,
      I'm also trying to recognize the same dataset. Would you mind sharing what parameters you ended up with for the VAD and -wip? It seems that attempting to adapt the default model has made my recognition less accurate as well.

       

      Last edit: Alex Vanderpot 2016-04-05
    • Alex Vanderpot

      Alex Vanderpot - 2016-04-05

      I also have about 10,000 data points to train the algorithm with, if you would like to use them.

       
  • Alex Vanderpot

    Alex Vanderpot - 2016-04-06

    Hi,
    I'm trying to accomplish something very similar to what is described above.
    I'm using the en-us-8khz model and the same grammar file that he is. He mentioned that he was able to achieve 97% recognition.

    Using the options suggested in that thread,

    pocketsphinx_batch -ctl test.ctl \
     -cepdir wav -cepext .wav -adcin yes -adchdr 44 \
     -hmm en-us-8khz -samprate 8000 -jsgf test.jsgf -hyp test.hyp \
     -vad_threshold 3.4 -silprob 1.0 -wip 1e-5
    

    I'm only getting about 10% correct recognition on full recordings of 10 digits. I attempted to adapt the model, but that only improved recognition marginally. I have attached a portion of the data I used to adapt the model, and the word_align.pl results for both the adapted model and the original en-us-8khz model. The input is decoded directly from MP3 (~11 kHz, 16 kbps source format) to 8000 Hz wav.

    What tweaks can I make to get better recognition? Is this a realistic goal? I have a set of about 1000 human-transcribed recordings of 10 digits that I have already used to attempt to train the model, but it didn't work.

     
    • Nickolay V. Shmyrev

      I attempted to adapt the model, but that only improved recognition marginally.

      I'd try adding <sil> between words in the adaptation transcript.
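
      The rewrite can be scripted. A small Python sketch, assuming the usual "<s> words </s> (fileid)" transcript line format from the adaptation tutorial:

      ```python
      import re

      # Insert <sil> between the words of one adaptation-transcript line.
      def add_sil(line):
          m = re.match(r"<s>\s+(.*?)\s+</s>\s+(\(\S+\))", line.strip())
          words = m.group(1).split()
          return "<s> " + " <sil> ".join(words) + " </s> " + m.group(2)

      print(add_sil("<s> 3 5 8 </s> (file_001)"))
      # -> <s> 3 <sil> 5 <sil> 8 </s> (file_001)
      ```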

       
      • Alex Vanderpot

        Alex Vanderpot - 2016-04-07
        ==> adapted.txt <==
        TOTAL Words: 9900 Correct: 9341 Errors: 581
        TOTAL Percent correct = 94.35% Error = 5.87% Accuracy = 94.13%
        TOTAL Insertions: 22 Deletions: 188 Substitutions: 371
        
        ==> unadapted.txt <==
        TOTAL Words: 9900 Correct: 8621 Errors: 1595
        TOTAL Percent correct = 87.08% Error = 16.11% Accuracy = 83.89%
        TOTAL Insertions: 316 Deletions: 252 Substitutions: 1027
        

        I think that was it. Seems much better now. Thank you.

         
        • Eladio Alvarez

          Eladio Alvarez - 2016-04-07

          Hi, I was trying to achieve the same thing. I was using the digit models and achieved around 55% in the best case, with many tweaks to the values. I also found that recognition was best when splitting into individual digits and sending only one at a time to the recognizer, which is useful especially with very short or quickly spoken digits.
          But I was never able to reach the 94.5% you achieved.
          Maybe we can exchange information and help each other. I would really appreciate it if you shared the adapted models and features.

           
          • Nickolay V. Shmyrev

            Where are you all from, guys?

             

            Last edit: Nickolay V. Shmyrev 2016-04-07
            • Eladio Alvarez

              Eladio Alvarez - 2016-04-07

              Latin America, why?

               
            • Alex Vanderpot

              Alex Vanderpot - 2016-04-07

              USA

               
        • Zaur Aliev

          Zaur Aliev - 2016-04-11

          Hi Alex,
          Sorry, I forgot to subscribe to this thread. Have you solved your problem?
          Zaur

           
  • Eladio Alvarez

    Eladio Alvarez - 2016-04-07

    I also used sox to segment the audio and performed adaptation on single-digit samples.
    I also tried normalizing the audio files.

     

    Last edit: Eladio Alvarez 2016-04-07
  • Eladio Alvarez

    Eladio Alvarez - 2016-04-08

    I'm still far from the goal:
    TOTAL Words: 1000 Correct: 861 Errors: 153
    TOTAL Percent correct = 86.10% Error = 15.30% Accuracy = 84.70%
    TOTAL Insertions: 14 Deletions: 42 Substitutions: 97

    Can someone help me out? I would really appreciate it, and I will help you guys with anything within my capabilities.

     
  • Alexis Molestos

    Alexis Molestos - 2016-04-25

    Greetings from Argentina!
    I'm working on similar things now using pocketsphinx.
    The problem I have is low accuracy for some numbers; worst of all is recognition of the number 6.
    I've written a lot of pronunciation variants in the dictionary file, but none of them work:

    6 S IH K S
    6(2) S IY K
    6(3) S II K S
    6(4) S EE K S
    6(5) SH IH K S
    6(6) SH EH K S
    6(7) S YH K S
    6(8) S YI K S
    6(9) S IY K S
    6(10) S EY K S

    How can I improve it? Can you share your dictionary file with me? Did you already try using <sil>?

     
