Hello,
Greetings!
I am new to speech recognition and PocketSphinx and need your help.
I started by recognizing an audio recording with pocketsphinx_continuous, which converts the speech to text.
For my input speech the output was:
"seventy one years ago
a bike
while most morning
death toll from the sky in the world was strange
the flash of white wall of fire
destroy the city
"
The highlighted words are the mistakes:
1. 'bike' should be 'bright'
2. 'while most' should be 'cloudless'
3. 'strange' should be 'changed'.
To improve the accuracy I followed this link:
https://bakerstreetsystems.com/blog/post/training-cmu-sphinx-speech-recognition-software-ubuntu-1404
which basically references https://cmusphinx.github.io/wiki/tutorialadapt/
So I cropped some sample .wav files out of the original .wav file; each individual sample .wav file contains only one of these words: bright, changed, etc.
I then created the files suggested by the above link:
1. sample.transcription
2. sample.fileids
and ran the decoder with the following command:
pocketsphinx_continuous -hmm cmusphinx-en-us-ajay -lm en-us.lm.bin -dict cmudict-en-us.dict -infile hiroshima-speech-60secs_part_1.wav
where:
1. cmusphinx-en-us-ajay is my adapted acoustic model
2. en-us.lm.bin is the default language model that came with my pocketsphinx installation
3. cmudict-en-us.dict is the default dictionary that came with my pocketsphinx installation
After applying the above steps, the new output is:
"say what years ago
a bright
while most morning
death toll from the sky the world was changed
flash of white wall of fire
destroy the city"
Now the words 'bright' and 'changed' are recognized correctly.
But the adaptation has distorted the sentence
"seventy one years ago" into "say what years ago".
So after applying the adaptation I get better accuracy for the words I adapted for, but there are side effects on other sentences.
Please let me know where I am making a mistake.
Thanks
It can happen. Speech recognition algorithms are statistical, and changes in parameters can cause side effects. Usually, the more adaptation data you use, the more accurate the system is.
I do not see any mistake here.
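To illustrate the statistical point: MAP adaptation (the map_adapt step below) moves each Gaussian mean toward the adaptation data, weighted by how many frames of that sound were observed. The sketch below is plain Python, not the actual SphinxTrain code, and the interpolation weight tau is an illustrative value; it only shows why a tiny adaptation set can shift the few senones it covers while leaving the rest alone, which is one source of the side effects.

```python
def map_update_mean(mu_prior, frames, tau=10.0):
    """Interpolate a Gaussian mean between the prior model and new data.

    mu_map = (tau * mu_prior + n * x_mean) / (tau + n)
    Small n -> the prior dominates; n = 0 -> the mean is untouched.
    """
    n = len(frames)
    if n == 0:
        return mu_prior  # senone unseen in adaptation data: stays at prior
    x_mean = sum(frames) / n
    return (tau * mu_prior + n * x_mean) / (tau + n)

# A senone observed in only 3 adaptation frames still moves 3/13 of the
# way toward the new data; everything else in the model stays put.
print(map_update_mean(0.0, [1.0, 1.0, 1.0]))
print(map_update_mean(0.0, []))
```

With only a handful of cropped word recordings, a few means shift and every word sharing those senones is decoded differently, which is consistent with "seventy one" degrading after adapting on "bright" and "changed".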
Hello Nickolay,
Thanks for the response!
Just to add more: I tried to replace the incorrect word 'strange' with the correct word 'changed'.
For this I added only the line below to the 'sample.transcription' file [because I do not want the other words to be affected]:
changed (hiroshimachanged)
where hiroshimachanged.wav is a recording of just the word 'changed'.
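For reference, the adaptation tutorial expects the .fileids and .transcription files to pair up line for line; with the file name from this thread that would look like:

```
# sample.fileids (one recording per line, no extension)
hiroshimachanged

# sample.transcription (matching line: text, then the fileid in parentheses)
changed (hiroshimachanged)
```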
Regarding your reply "Speech recognition algorithms are statistical and changes in parameters could cause side effects.":
Can you elaborate on what causes these side effects? How does it work internally? Which factors produce them?
Can you suggest how I can avoid these side effects?
Regarding your reply "Usually the more adaptation data you use, the more accurate system is.":
What do you mean by applying more adaptation data here?
I applied adaptation data only for the words that were recognized inaccurately, e.g. the word 'changed'.
To make my system accurate without any side effects, what additional adaptation data should I apply? So far I have only used the individual .wav files [containing the single words] that I cropped from the original .wav file of the whole speech.
Can you suggest how to proceed now?
Last edit: Ajay Kumar Sharma 2018-02-16
Hello,
The content of the 'sample.transcription' file did not print correctly above. It is:
[changed (hiroshimachanged)]
Hello Nickolay,
Adding here the commands that I executed:
Step 1:- Generate a set of acoustic model feature files from the WAV audio recordings.
sphinx_fe -argfile ./cmusphinx-en-us-ptm-5.2/feat.params \
    -samprate 16000 -c hiroshima.fileids \
    -di . -do . -ei wav -eo mfc -mswav yes
Step 2:- Collect statistics from the adaptation data.
./bw \
    -hmmdir cmusphinx-en-us-ptm-5.2 \
    -moddeffn cmusphinx-en-us-ptm-5.2/mdef \
    -ts2cbfn .ptm. \
    -feat 1s_c_d_dd \
    -svspec 0-12/13-25/26-38 \
    -cmn current \
    -agc none \
    -dictfn cmudict-en-us.dict \
    -ctlfn hiroshima.fileids \
    -lsnfn hiroshima.transcription \
    -accumdir .
Step 3:- To apply the adaptation, use the map_adapt program.
./map_adapt \
    -moddeffn cmusphinx-en-us-ptm-5.2/mdef \
    -ts2cbfn .ptm. \
    -meanfn cmusphinx-en-us-ptm-5.2/means \
    -varfn cmusphinx-en-us-ptm-5.2/variances \
    -mixwfn cmusphinx-en-us-ptm-5.2/mixture_weights \
    -tmatfn cmusphinx-en-us-ptm-5.2/transition_matrices \
    -accumdir . \
    -mapmeanfn cmusphinx-en-us-ajay/means \
    -mapvarfn cmusphinx-en-us-ajay/variances \
    -mapmixwfn cmusphinx-en-us-ajay/mixture_weights \
    -maptmatfn cmusphinx-en-us-ajay/transition_matrices
Step 4:- Run the pocketsphinx_continuous decoder.
pocketsphinx_continuous -hmm cmusphinx-en-us-ajay -lm en-us.lm.bin -dict cmudict-en-us.dict -infile hiroshima-speech-60secs_part_1.wav > hiroshima-speech.log
Please let me know if I am missing anything or if any correction is required here.
I have a few more queries/doubts:
1. Do I need to create a separate language model in this case? [Since I have not added a new word, I think the default language model should work here.]
2. Do I need to create a separate dictionary file in this case? [Since I have not added a new word, I think the default dictionary file should work here.]
3. I visited https://cmusphinx.github.io/wiki/tutorialtuning/ and encountered the -hyp test.hyp option of the pocketsphinx_batch command. How do I generate the test.hyp file for calculating WER (Word Error Rate)? I read somewhere that a pocketsphinx_decode tool is required to generate the .hyp file; if so, where can I get pocketsphinx_decode?
4. What if I need to add a new word, or a different pronunciation of an existing word, to the dictionary? How can I achieve that? And if the dictionary is updated, does the language model also need to be updated? If yes, how do I update the existing language model?
5. Can I pass two dictionary/language-model paths to the pocketsphinx_continuous decoder, for the case where I have created a new dictionary and a corresponding new language model and want to use both the new and the existing dict/lm for recognition?
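On the WER question: in the tuning tutorial, pocketsphinx_batch itself writes test.hyp when given the -hyp option, so no separate pocketsphinx_decode tool should be needed. Given a hypothesis and a reference transcript, WER is the word-level edit distance divided by the reference length. A self-contained sketch in plain Python (not the alignment script the tutorial uses; the example sentences are from this thread):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "seventy one years ago a bright cloudless morning"
hyp = "say what years ago a bright while most morning"
print(wer(ref, hyp))  # 3 substitutions + 1 insertion over 8 words -> 0.5
```

Comparing the WER of the baseline model against the adapted model over the whole recording (not just the adapted words) is the standard way to check whether adaptation helped overall.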
Hello Nickolay,
Greetings!!
I am stuck at this point and not able to proceed further.
It would be very helpful if you could guide me or give me some pointers on how to proceed.
Thanks!
Sorry, your questions are too basic, you can answer them yourself.