Hello,
I use pocketsphinx_continuous to recognize digits (0-9) from MP3 files. Those files contain only numbers pronounced by different persons (male/female), plus some noise. Pauses between digits are about 2 seconds.
Recognition experts suggested that I use the en-us-8khz acoustic model plus a grammar file. It mostly works, but I found the accuracy of recognition is very low. Then I tried the voxforge acoustic model (8 kHz) instead and got much more accurate results. I have also tried to play with various options to improve accuracy.
Finally, what I have is (statistics on 100+ files):
en-us-8kHz: only 24% of files are recognized correctly. The other 76% have mistakes.
voxforge: only 45% of files are recognized correctly. The other 55% have mistakes.
I feel I'm on the wrong track, but I cannot figure out how to use the en-us-8khz model effectively. I guess it should provide even better results, since it is the most accurate acoustic model...
My Grammar file:
My Dictionary (slightly tweaked):
My Commands:
pocketsphinx_continuous -dict dict.txt -jsgf gram.txt -hmm vortex -infile file.wav -remove_dc yes -remove_noise no -vad_threshold 3.4 -vad_prespeech 19 -vad_postspeech 37 -silprob 2.5
(chosen experimentally), and
pocketsphinx_continuous -dict dict.txt -jsgf gram.txt -hmm en-us-8kHz -infile file.wav -samprate 8000
(default values).
Please find 10 test samples (mp3, wav), reference results, en-us results, voxforge results + README's attached.
Thank you for help
Last edit: Zaur Aliev 2016-03-24
Nickolay,
Can you please take a look at my samples? And (if you have time) try these on the configuration which you think is the best to use. This will answer the question Who is guilty.
Thank you,
Zaur
I'm traveling today, I'll check tomorrow. Most likely you need to use default parameters and experiment with -lw instead. All your vad_prespeech are certainly not needed.
Hi Nickolay,
I have tried to perform recognition with en-us-8kHz again using -lw; here are some results:
For each of the 100 mp3's:

ffmpeg -y -i input.mp3 -ar 8000 -ac 1 input.wav
pocketsphinx_continuous -dict dict.txt -jsgf grammar.txt -hmm en-us-8kHz -infile input.wav -samprate 8000 -lw X

(where X is in the range [1.0 ... 10.0]; the default value is 6.5, I guess)
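A sweep like this is easy to script. Here is just a sketch that prints one command per -lw value (dict.txt, grammar.txt and input.wav are placeholder names), so you can inspect the commands or pipe the output to sh to run them:

```shell
# Print one pocketsphinx_continuous invocation per -lw value to try.
# dict.txt, grammar.txt and input.wav are placeholder names;
# pipe the output to `sh` to actually run the sweep.
for lw in 1.0 2.0 3.0 4.0 5.0 6.0 6.5 7.0 8.0 9.0 10.0; do
    echo "pocketsphinx_continuous -dict dict.txt -jsgf grammar.txt" \
         "-hmm en-us-8kHz -infile input.wav -samprate 8000 -lw $lw" \
         "> hyp_lw_$lw.txt"
done
```

Then compare each hyp_lw_*.txt against the reference transcript to pick the best value.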
Also I reverted my dictionary to the initial state - I gathered numbers from
And copied them to dict.txt:
But results were not significantly changed:
I still think it would be the best way if you have a chance to check my samples in your environment.
Thank you in advance.
Last edit: Zaur Aliev 2016-03-26
Hello Zaur
You can run with the following arguments to get best accuracy:
You can find the full archive in attachment
Then it will give you the following result:
~~~~
TOTAL Words: 90 Correct: 80 Errors: 11
TOTAL Percent correct = 88.89% Error = 12.22% Accuracy = 87.78%
TOTAL Insertions: 1 Deletions: 0 Substitutions: 10
~~~~
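For what it's worth, these totals follow the usual word_align.pl arithmetic; a quick awk check with the counts from the table above (variable names are mine):

```shell
# Recompute the word_align.pl percentages from the raw counts above:
# N = 90 reference words, S = 10 substitutions, D = 0 deletions, I = 1 insertion.
awk -v N=90 -v S=10 -v D=0 -v I=1 'BEGIN {
    printf "Percent correct = %.2f%%\n", 100 * (N - S - D) / N
    printf "Error = %.2f%%\n",           100 * (S + D + I) / N
    printf "Accuracy = %.2f%%\n",        100 * (N - S - D - I) / N
}'
# Percent correct = 88.89%
# Error = 12.22%
# Accuracy = 87.78%
```

Note that "Accuracy" also penalizes insertions, which is why it is lower than "Percent correct".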
It is not a very good idea to adjust the standard dictionary. To improve the accuracy further you need to do several things:
1) Collect more samples and perform acoustic model adaptation as described in our tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
2) Avoid conversion to mp3 as much as possible; mp3 is really harmful for 8khz audio.
And, since it seems you are doing this to crack captchas, it is worth noting that they artificially corrupt the spectrum by cutting frequency bands randomly, so it is hard to expect high accuracy from the recognizer. There could be a few other approaches, like missing-feature reconstruction, to improve accuracy in this case, but they would require development. Adaptation with sufficient data should help.
Hello Nickolay,
Thank you.
I noticed there is no dictionary parameter in your example. Is this ok?
Did you use batch instead of continuous just to collect statistics, or does it affect the result? Can I use continuous with those parameters?
And on adaptation - I've done adaptation for the samples: I split my wavs into separate words (digits). There were about 1500 digit files as a result. Then I performed adaptation as per the tutorial. It seems the result became worse... =( Is it possible that adaptation works badly for these corrupted audio files?
You said the streams were artificially cut (freq-bands). What kind of development would it require to resolve the issue? C/C++ coding is not a problem if the material is accessible to an average brain =)
And I'd like to ask you a question on confidence again - it's still not clear to me. Can I somehow estimate the probability of the results? If so, I will probably be satisfied with the current percentage.
Sorry for the huge questions.
Regards,
Zaur
Yes, it used the default dictionary.
Yes, you can. Batch is used for testing, continuous for normal operation.
Adaptation should work fine, you need to provide the data to get help on this issue.
There is a lot of research on similar problems. Like I said above, adaptation should help; then you can train the model with a band of 1-2khz, since the corruption happens above 2khz as far as I see. Then you can read research like this: http://www.cs.cmu.edu/~robust/Papers/RajSeltzerStern04.pdf
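If you try that band-limited route, the analysis band of the feature extraction is controlled by the filterbank settings in the model's feat.params (the -lowerf/-upperf/-nfilt parameters of the Sphinx front end). Purely as an illustration of the idea - the exact values here are my guess, not tested - capping the filterbank below the corrupted region might look like:

~~~~
-lowerf 130
-upperf 2000
-nfilt 20
~~~~

These settings must match between training/adaptation and decoding, otherwise the features will not line up with the model.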
Confidence for small vocabularies is a complex issue and is not supported in our codebase yet. You can use keyword spotting mode, but it will work only for 3-4 syllable phrases, not for digits.
It seems adaptation can greatly improve the accuracy (unexpected) =)
My final state is below
Last edit: Zaur Aliev 2016-03-30
So finally:
I decode the mp3's with ffmpeg, specifying an 8000 Hz sample rate.
Then, to adapt the model, I split all the obtained wavs into separate words (digits) and perform adaptation of en-us-8khz as per the tutorial. My problem here was that I couldn't perform adaptation using the mllr_matrix, so I replaced the original en-us-8khz with en-us-8khz_adapt instead.
After adaptation, 97% of the digits are recognized correctly (with the default dictionary).
I also had to tune the VAD parameters and -wip to get better results.
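The decode step above can be batched; a minimal sketch (it assumes the mp3s sit in the current directory, and does nothing if there are none):

```shell
# Convert every mp3 in the current directory to the 8 kHz mono wav
# the en-us-8khz model expects; skips cleanly if there are no mp3s.
for f in *.mp3; do
    [ -e "$f" ] || continue   # unexpanded glob: no mp3 files present
    ffmpeg -y -i "$f" -ar 8000 -ac 1 "${f%.mp3}.wav"
done
```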
Nickolay, thank you for help again.
Last edit: Zaur Aliev 2016-03-30
Hi Zaur,
I'm also trying to recognize the same dataset. Would you mind sharing what parameters you ended up with for the VAD and -wip? It seems that attempting to adapt the default model has made my recognition less accurate as well.
Last edit: Alex Vanderpot 2016-04-05
I also have about 10,000 data points to train the algorithm with, if you would like to use them.
Hi,
I'm attempting to accomplish something very similar to what is being attempted above.
I'm using the en-us-8khz model, and the same grammar file that he is. He mentioned that he was able to achieve 97% recognition.
Using the options suggested in that thread,
I'm only getting about 10% correct recognition on full recordings of 10 digits. I attempted to adapt the model, but that only improved recognition marginally. I have attached a portion of the data I used to adapt the model, and the results from word_align.pl for testing the adapted model and the original en-us-8khz model. The input is decoded directly from mp3 (~11khz, 16kbps source format) to 8000 Hz wav.
What tweaks can I make to get better recognition? Is this a realistic goal? I have a set of about 1000 human-transcribed recordings of 10 digits that I have already used to attempt to train the model, but it didn't work.
I'd try to add <sil> between words in the adaptation transcript.

I think that was it. Seems much better now. Thank you.
Hi, I was trying to achieve the same. I was using the digits models and achieved around 55% in the best case with many tweaks to the values. I also found that the best recognition was by splitting into individual digits and sending only one at a time to the recognizer, useful especially with very short or quickly spoken digits.
But I was never able to reach the 94.5% you achieved.
Maybe we can exchange information and help each other. I will really appreciate it if you share the adapted models and features.
Where are you all from guys?
Last edit: Nickolay V. Shmyrev 2016-04-07
Latin America, why?
USA
Hi Alex,
Sorry, I forgot to subscribe to this thread's tracking. Have you solved your problem?
Zaur
I also used sox to segment the audio and did an adaptation on single-digit samples.
I also tried normalizing the audio files.
Last edit: Eladio Alvarez 2016-04-07
I'm still far from the goal:
TOTAL Words: 1000 Correct: 861 Errors: 153
TOTAL Percent correct = 86.10% Error = 15.30% Accuracy = 84.70%
TOTAL Insertions: 14 Deletions: 42 Substitutions: 97
Can someone help me out? I would really appreciate it, and I will help you guys with anything within my capabilities.
Greetings from Argentina!
I'm working on similar things now using pocketsphinx.
The problem I have is low accuracy for some numbers. Worst of all is recognizing the number 6.
I've tried a lot of variants in the dictionary file but none of them work:
6 S IH K S
6(2) S IY K
6(3) S II K S
6(4) S EE K S
6(5) SH IH K S
6(6) SH EH K S
6(7) S YH K S
6(8) S YI K S
6(9) S IY K S
6(10) S EY K S
How can I improve it? Can you share your dictionary file with me? Did you try to use <sil> already?