Hello,
I use pocketsphinx_continuous to recognize digits (0-9) from MP3 files. Those files contain only numbers pronounced by different persons (male/female), plus some noise. Pauses between digits are about 2 seconds.
Recognition experts suggested that I use the en-us-8khz acoustic model plus a grammar file. It mostly works, but I found the accuracy of recognition is very low. Then I tried the voxforge acoustic model (8 kHz) instead and got much more accurate results. I have also tried to play with various options to improve accuracy.
Finally, what I have is (statistics on 100+ files):
en-us-8kHz: only 24% of files are recognized correctly. The other 76% have mistakes.
voxforge: only 45% of files are recognized correctly. The other 55% have mistakes.
I feel I'm on the wrong track, but I cannot figure out how to use the en-us-8khz model effectively. I guess it should provide even better results, since it is the most accurate acoustic model...
My Grammar file:
My Dictionary (slightly tweaked):
My Commands:
pocketsphinx_continuous -dict dict.txt -jsgf gram.txt -hmm vortex -infile file.wav -remove_dc yes -remove_noise no -vad_threshold 3.4 -vad_prespeech 19 -vad_postspeech 37 -silprob 2.5
(chosen experimentally), and
pocketsphinx_continuous -dict dict.txt -jsgf gram.txt -hmm en-us-8kHz -infile file.wav -samprate 8000
(default values).
Please find 10 test samples (mp3, wav), reference results, en-us results, voxforge results + README's attached.
Thank you for help
Last edit: Zaur Aliev 2016-03-24
Nickolay,
Can you please take a look at my samples? And (if you have time) try these on the configuration which you think is the best to use. This will answer the question Who is guilty.
Thank you,
Zaur
I'm traveling today, I'll check tomorrow. Most likely you need to use default parameters and experiment with -lw instead. All your vad_prespeech are certainly not needed.
Hi Nickolay,
I have tried to perform recognition with en-us-8kHz again using -lw; here are some results:
For each of the 100 mp3's:

ffmpeg -y -i input.mp3 -ar 8000 -ac 1 input.wav
pocketsphinx_continuous -dict dict.txt -jsgf grammar.txt -hmm en-us-8kHz -infile input.wav -samprate 8000 -lw X

(where X is in the range [1.0 ... 10.0]; the default value is 6.5, I guess)
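A sweep like this is easy to script. Here is just a sketch that prints one command per -lw value (dict.txt, grammar.txt and input.wav are placeholder names), so you can inspect the commands or pipe the output to sh to run them:

```shell
# Print one pocketsphinx_continuous invocation per -lw value to try.
# dict.txt, grammar.txt and input.wav are placeholder names;
# pipe the output to `sh` to actually run the sweep.
for lw in 1.0 2.0 3.0 4.0 5.0 6.0 6.5 7.0 8.0 9.0 10.0; do
    echo "pocketsphinx_continuous -dict dict.txt -jsgf grammar.txt" \
         "-hmm en-us-8kHz -infile input.wav -samprate 8000 -lw $lw" \
         "> hyp_lw_$lw.txt"
done
```

Then compare each hyp_lw_*.txt against the reference transcript to pick the best value.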
Also I reverted my dictionary to the initial state - I gathered numbers from
And copied them to dict.txt:
But results were not significantly changed:
I still think it would be the best way if you have a chance to check my samples in your environment.
Thank you in advance.
Last edit: Zaur Aliev 2016-03-26
Hello Zaur
You can run with the following arguments to get best accuracy:
You can find the full archive in attachment
Then it will give you the following result:
~~~~
TOTAL Words: 90 Correct: 80 Errors: 11
TOTAL Percent correct = 88.89% Error = 12.22% Accuracy = 87.78%
TOTAL Insertions: 1 Deletions: 0 Substitutions: 10
~~~~
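For what it's worth, these totals follow the usual word_align.pl arithmetic; a quick awk check with the counts from the table above (variable names are mine):

```shell
# Recompute the word_align.pl percentages from the raw counts above:
# N = 90 reference words, S = 10 substitutions, D = 0 deletions, I = 1 insertion.
awk -v N=90 -v S=10 -v D=0 -v I=1 'BEGIN {
    printf "Percent correct = %.2f%%\n", 100 * (N - S - D) / N
    printf "Error = %.2f%%\n",           100 * (S + D + I) / N
    printf "Accuracy = %.2f%%\n",        100 * (N - S - D - I) / N
}'
# Percent correct = 88.89%
# Error = 12.22%
# Accuracy = 87.78%
```

Note that "Accuracy" also penalizes insertions, which is why it is lower than "Percent correct".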
It is not a very good idea to adjust the standard dictionary. To improve the accuracy further you need to do several things:
1) Collect more samples and perform acoustic model adaptation as described in our tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
2) Avoid conversion to mp3 as much as possible; mp3 is really harmful for 8khz audio.
And, since it seems you are doing this to crack captchas, it is worth noting that they artificially corrupt the spectrum by cutting frequency bands randomly, so it is hard to expect high accuracy from the recognizer. There could be a few other approaches, like missing-feature reconstruction, to improve accuracy in this case, but they would require development. Adaptation with sufficient data should help.
Hello Nickolay,
Thank you.
I noticed there is no dictionary parameter in your example. Is this ok?
Did you use batch instead of continuous just to collect statistics, or does it affect the result? Can I use continuous with those parameters?
And on adaptation - I've done adaptation for the samples: I split my wavs into separate words (digits). There were about 1500 digit files as a result. Then I performed adaptation as per the tutorial. It seems the result became worse... =( Is it possible that adaptation works badly for these corrupted audio files?
You said the streams were artificially cut (freq-bands). What kind of development would it require to resolve the issue? C/C++ coding is not a problem if the material is accessible to an average brain =)
And I'd like to ask you a question on confidence again - it's still not clear to me. Can I somehow estimate the probability of the results? If so, I will probably be satisfied with the current percentage.
Sorry for the huge questions.
Regards,
Zaur
Yes, it used the default dictionary.
Yes, you can. Batch is used for testing, continuous for normal operation.
Adaptation should work fine, you need to provide the data to get help on this issue.
There is a lot of research on similar problems. Like I said above, adaptation should help; then you can train the model with a band of 1-2khz, since the corruption happens above 2khz as far as I see. Then you can read research like this: http://www.cs.cmu.edu/~robust/Papers/RajSeltzerStern04.pdf
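If you try that band-limited route, the analysis band of the feature extraction is controlled by the filterbank settings in the model's feat.params (the -lowerf/-upperf/-nfilt parameters of the Sphinx front end). Purely as an illustration of the idea - the exact values here are my guess, not tested - capping the filterbank below the corrupted region might look like:

~~~~
-lowerf 130
-upperf 2000
-nfilt 20
~~~~

These settings must match between training/adaptation and decoding, otherwise the features will not line up with the model.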
Confidence for small vocabularies is a complex issue and is not supported in our codebase yet. You can use keyword spotting mode, but it will work only for 3-4 syllable phrases, not for digits.
It seems adaptation can greatly improve the accuracy (unexpected) =)
My final state is below
Last edit: Zaur Aliev 2016-03-30
So finally:
I decode the mp3's with ffmpeg, specifying an 8000 Hz sample rate.
Then, to adapt the model, I split all the obtained wavs into separate words (digits) and perform adaptation of en-us-8khz as per the tutorial. My problem here was that I couldn't perform adaptation using the mllr_matrix, so I replaced the original en-us-8khz with en-us-8khz_adapt instead.
After adaptation, 97% of the digits are recognized correctly (with the default dictionary).
I also had to tune the VAD parameters and -wip to get better results.
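The decode step above can be batched; a minimal sketch (it assumes the mp3s sit in the current directory, and does nothing if there are none):

```shell
# Convert every mp3 in the current directory to the 8 kHz mono wav
# the en-us-8khz model expects; skips cleanly if there are no mp3s.
for f in *.mp3; do
    [ -e "$f" ] || continue   # unexpanded glob: no mp3 files present
    ffmpeg -y -i "$f" -ar 8000 -ac 1 "${f%.mp3}.wav"
done
```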
Nickolay, thank you for help again.
Last edit: Zaur Aliev 2016-03-30
Hi Zaur,
I'm also trying to recognize the same dataset. Would you mind sharing what parameters you ended up with for the VAD and -wip? It seems that attempting to adapt the default model has made my recognition less accurate as well.
Last edit: Alex Vanderpot 2016-04-05
I also have about 10,000 data points to train the algorithm with, if you would like to use them.
Hi,
I'm attempting to accomplish something very similar to what is being attempted above.
I'm using the en-us-8khz model, and the same grammar file that he is. He mentioned that he was able to achieve 97% recognition.
Using the options suggested in that thread,
I'm only getting about 10% correct recognition on full recordings of 10 digits. I attempted to adapt the model, but that only improved recognition marginally. I have attached a portion of the data I used to adapt the model, and the results from word_align.pl for testing the adapted model and the original en-us-8khz model. The input is decoded directly from mp3 (~11khz, 16kbps source format) to 8000 Hz wav.
What tweaks can I make to get better recognition? Is this a realistic goal? I have a set of about 1000 human-transcribed recordings of 10 digits that I have already used to attempt to train the model, but it didn't work.
I'd try to add <sil> between words in the adaptation transcript.

I think that was it. Seems much better now. Thank you.
Hi, I was trying to achieve the same. I was using the digits models and achieved around 55% in the best case with many tweaks to the values. I also found that the best recognition was by splitting into individual digits and sending only one at a time to the recognizer, useful especially with very short or quickly spoken digits.
But I was never able to reach the 94.5% you achieved.
Maybe we can exchange information and help each other. I will really appreciate it if you share the adapted models and features.
Where are you all from guys?
Last edit: Nickolay V. Shmyrev 2016-04-07
Latin America, why?
USA
Hi Alex,
Sorry, I forgot to subscribe to this thread's tracking. Have you solved your problem?
Zaur
I also used sox to segment the audio and did an adaptation on single-digit samples.
I also tried normalizing the audio files.
Last edit: Eladio Alvarez 2016-04-07
I'm still far from the goal:
TOTAL Words: 1000 Correct: 861 Errors: 153
TOTAL Percent correct = 86.10% Error = 15.30% Accuracy = 84.70%
TOTAL Insertions: 14 Deletions: 42 Substitutions: 97
Can someone help me out? I would really appreciate it, and I will help you guys with anything within my capabilities.
Greetings from Argentina!
I'm working on similar things now using pocketsphinx.
The problem I have is low accuracy for some numbers. Worst of all is recognizing the number 6.
I've tried a lot of variants in the dictionary file but none of them work:
6 S IH K S
6(2) S IY K
6(3) S II K S
6(4) S EE K S
6(5) SH IH K S
6(6) SH EH K S
6(7) S YH K S
6(8) S YI K S
6(9) S IY K S
6(10) S EY K S
How can I improve it? Can you share your dictionary file with me? Did you try to use <sil> already?