I did a training that resulted in error rates of 15% and 25%.
Now I'm testing with some live sounds/samples with pocketsphinx_continuous, and almost none of the results match.
Any suggestion where to look?
PS: I'm using the following two commands:
You can share your model training folder, the test sample and other required data files to get help on this issue.
Thanks for the support.
It would be really great if you could find something, because this is really a go/no-go moment for my project.
It seems to work in the 'decode' test after training, but not in the continuous test, even when reading the same .wav files that the decode test succeeded with.
Here is my Training folder and Testing folder (with pocketsphinx compiled for Win32):
https://onedrive.live.com/redir?resid=53DF68CA92747BA6%2173460
(the 'Testing' folder contains a few .wav files for each word that decoded successfully after training)
I used the following command to test:
My goal is to detect only these words, so without any grammar or other language modelling.
I really hope you can help me.
PS: The training is not optimized at all yet; there are a few .wav files that are too small, and I still need to tweak the training for better output.
Last edit: Toine db 2014-11-24
Sorry to ask, I don't want to put any pressure on you, but have you found time to look at the model I want to train?
(the good results that decode gives and the bad results that pocketsphinx_continuous gives)
Sorry for the delay. Add the following option to the model's feat.params:
It should work fine after that. The default cmninit value is not very accurate.
Since your sounds are very short, it might be helpful to train with -cmn none (in sphinx_train.cfg).
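For reference, here is a minimal sketch of what those two changes could look like (the -cmninit values and file locations below are illustrative placeholders, not the exact line suggested in this thread):
~~~
# feat.params (inside the trained model folder) -- give the decoder an
# initial channel estimate; one value per cepstral coefficient, placeholder
# values here, so tune them for your own recording setup:
-cmninit 40,3,-1

# sphinx_train.cfg -- train without cepstral mean normalisation:
$CFG_CMN = 'none';
~~~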
No problem, I'm glad you're willing to take a look at it.
And very, very glad if it really works :-)
I will try it as soon as I can.
PS: Is there a way I can understand what the values after -cmninit mean, and what cmninit does? (Maybe I can learn to do it myself.)
CMN is an estimate of the channel properties, basically the volume of the sound in each frequency band.
The cmninit parameter sets the initial channel estimate so that it is accurate from the start; otherwise it takes the decoder several seconds to update its channel estimate.
You can get more information about CMN and feature extraction in a textbook on speech recognition.
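As a rough sketch of how one might derive usable -cmninit values (assuming a decoder build that prints its running CMN estimate in the log; file and model names are placeholders): decode a representative recording, then copy the last reported cepstral-mean vector into feat.params.
~~~
# Decode a representative recording and keep the log:
pocketsphinx_continuous -hmm model_dir -dict my.dic -kws keywords.txt \
    -infile sample.wav -logfn decode.log

# The running cepstral-mean updates appear in the log; the last vector is a
# reasonable starting point for -cmninit:
grep -i cmn decode.log | tail -n 2
~~~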
Wow, I couldn't wait to tell you, but I'm still only halfway through testing...
But it works!!!
After adding -cmninit the recognition started to work, but not very well.
After changing $CFG_CMN to 'none', I get almost exactly the same result as in the decode after training! (with pocketsphinx_continuous)
And as an additional improvement, the error rate went from 25% to 10%...
You made me very happy!
Now hoping that the pocketsphinx in my code (Windows Phone) gives the same result!
to be continued....
OK, finished testing (for now).
And pocketsphinx_continuous gave great results: with recorded sounds, almost the same results as decode after training (let's say 95% the same).
Still struggling to get the same results from live microphone data.
Any tips on which settings to tweak/test?
Hello Toine
Your questions would be more productive if you provided the files (models, the audio file you are trying to decode, command line, reference). To emulate live recognition you can recognize a continuous recording in an audio file with pocketsphinx_continuous.
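A sketch of that approach (model and file names are placeholders): decoding a long recording from a file exercises essentially the same pipeline as live decoding, so the two can be compared directly.
~~~
# Decode a continuous recording from a file (emulates live recognition):
pocketsphinx_continuous -hmm model_dir -dict my.dic -kws keywords.txt \
    -infile long_recording.wav

# Decode straight from the microphone for comparison:
pocketsphinx_continuous -hmm model_dir -dict my.dic -kws keywords.txt \
    -inmic yes
~~~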
Thanks for the reply and your offer to help.
I first want to try it myself; I'm already asking a lot of questions...
The thing is, I'm at a stage where the concept works, but the error rate of a real implementation needs to go down (mainly because it's much higher than the error rate with the recorded sounds).
At the moment pocketsphinx_continuous gets a great error rate of about 15% with the recorded sounds. But when I run pocketsphinx_continuous with microphone data, playing the same recorded sounds, the error rate is much higher (let's say 50%).
I was wondering if this kind of behaviour is seen by other users as well?
Hi Nickolay,
I'm still struggling to get good microphone results.
(recorded .wav results are already great: 85% good / 15% error)
You asked about the training, but I think it is OK now.
Here is my Training folder and Testing folder (with pocketsphinx compiled for Win32):
https://onedrive.live.com/redir?resid=53DF68CA92747BA6%2179892
(the 'Testing' folder contains a few .wav files for each word that decoded successfully after training)
The config I load is:
-lowerf 130 \
-upperf 6800 \
-nfilt 25 \
-transform dct \
-lifter 22 \
-feat 1s_c_d_dd \
-agc none \
-cmn none \
-varnorm no
-cmninit 65,-1,-35,-10,-5,-24,8,-8,-21,-12,-32,-21,-29
extra for the phone:
-kws_threshold "1e-40"
The above config works great with .wav files, but not so well with live microphone data.
Hope you can help me.
PS: In real life the key is to recognize "the first vowel in the word"; maybe pocketsphinx/training can be tweaked to recognize that?
Hi Toine
Can you collect raw data from the microphone with -rawlogdir? Maybe it is a bit different.
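A sketch of how that could be run (model and directory names are placeholders): -rawlogdir makes the decoder dump the raw audio it actually receives, so the live input can be inspected afterwards.
~~~
# Dump the raw microphone audio the decoder sees into ./rawlog for later inspection:
pocketsphinx_continuous -hmm model_dir -dict my.dic -kws keywords.txt \
    -inmic yes -rawlogdir rawlog
~~~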
I recorded the 4 situations that happen most often.
https://onedrive.live.com/redir?resid=53DF68CA92747BA6%2182258
Maybe you can see/hear what is going wrong, and/or hopefully you can give me some advice on how to correct it.
Hope to hear from you.
Oh, and maybe you noticed (or not), but I made the PocketSphinx demo on GitHub work for Windows 8 apps as well (besides Windows Phone 8 apps).
PS: The key to the words I'm trying to distinguish is the first letter/vowel (N | E | O | EA | H). Maybe that can help with tweaking something?
Hi Nickolay,
First of all, happy holidays.
I was wondering if you have found some time to check the raw recordings?
(And is there a (doable) way to convert them into .wav files so I can hear and use them?)
Again, merry Christmas and happy New Year.
Regards, Toine
Hi Toine
Sorry, it's holiday time, so not much time for work.
You can open the raw files in WaveSurfer or Audacity (just select a sample rate of 16 kHz). You can also convert them to wav with sox:
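For example, a typical invocation for headerless 16 kHz, 16-bit signed, mono raw audio might look like this (file names are placeholders, not the exact command from this thread):
~~~
# Convert headerless 16 kHz 16-bit signed mono raw audio to wav:
sox -t raw -r 16000 -e signed-integer -b 16 -c 1 input.raw output.wav
~~~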
Merry Christmas and Happy New Year for you as well!
Thanks for the info.
Of course, no work but holiday time :-)
I'm still hoping you could take a look at / check into my raw audio files (to see what I could be doing wrong), after the holidays of course.
Kind Regards,
Toine
Last edit: Toine db 2014-12-25
Hi Nickolay,
Hope you had a nice holiday.
I have taken a look at the raw recordings, but they sound the same to my ears.
Is it possible for you to take a look? Maybe you can see what the main issue is.
PS: The key to the words I'm trying to distinguish is the first letter/vowel (N | E | O | EA | H). Maybe that can help with tweaking something?
Hi Toine
I checked the data you provided, thanks for that. Here are my thoughts:
1) You should separate the test set and the train set. Currently your train set includes your test set, and for that reason you get the wrong idea about accuracy. The actual error rate is about 30-40%, not 15%. More work on this is required; for example, you might adjust the features to shift them to the higher frequencies of a child's voice. (A sketch of the train/test split is given after this list.)
2) More data would help too
3) You can enable MGAU training
$CFG_CI_MGAU = 'yes';
$CFG_FINAL_NUM_DENSITIES = 4;
4) You can optimize the lw parameter. With -lw 1.0 I get the best results.
5) You need to provide an initial CMN estimate in the feat.params file. That way you will get more reliable recognition in continuous mode. In the trained model, add the following line to feat.params:
See how our en-us model is configured. Then your online samples will be recognized correctly.
6) As for your idea about the first vowel: you do not need any special treatment, the HMM framework should get it right. However, please note that it's not just a single sound but the whole recording that makes the difference. The most distinctive factor in such sounds is the movement of the formants; they define the distinction, and those formants change across the whole sound, not just at the beginning.
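On point 1, a minimal sketch of how the split is usually expressed in sphinxtrain (database name and utterance ids here are made up; the '#' lines are annotations, not part of the files): training and test utterances go into separate fileids/transcription files under etc/, and no utterance appears in both.
~~~
# etc/mydb_train.fileids -- utterances used only for training:
neh_001
eh_001
heh_001

# etc/mydb_test.fileids -- held-out utterances used only for evaluation:
neh_014
eh_009
~~~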
Last edit: Nickolay V. Shmyrev 2015-01-29
Also, I see I suggested that you use 'none' for CMN. That is indeed a good idea, but in your training setup I don't see you using 'none'; you are using 'current'.
So the proposed changes are:
1) Use CMN none
2) Use 4 gaussians for HMM instead of 1
3) Use -lw 1
Then my WER drops to only 20%.
Another idea is to remove the final HH phone from the dictionary; I think it is not really physically present. You need to consider how many distinct regions are present in your data and design the HMM based on that. For example:
~~~~
neh N_neh E_neh
eh E_eh
heh H_heh E_heh
owh O_owh W_owh
eairh E_eairh A_eairh I_eairh R_eairh
~~~~
That gives a bit more accuracy.
Another thing related to CMN: I noticed that your training db amplitude is about 7000-8000, while the recordings in the raw files are quieter (1000-2000). That means you will get a significant mismatch without CMN, and even CMN will not help a lot. I suggest you normalize the recordings to match the audio level of the training set.
Ideally you need a good recording-level normalizer; probably we need to improve AGC or implement a very quick CMN.
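As a rough sketch of that normalization step (file names are placeholders, and the target level should be tuned to match the training data), sox's gain effect can bring the quieter microphone recordings up towards the amplitude of the training set:
~~~
# Normalize a quiet microphone recording to roughly -3 dBFS peak level so its
# amplitude is closer to the training data (adjust the level as needed):
sox quiet_mic_recording.wav normalized.wav gain -n -3
~~~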
Nickolay,
I totally missed these replies; somehow I don't get messages from SourceForge... if there are any.
I will look at your suggestions as soon as possible; they look promising.
Hi Again Nickolay,
This thread is really helpful for me; many thanks for that and for your support.
I have been away for a little while, so I need to re-read the content, and this thread is getting so long and big that I'm going to start a new thread without so much overhead.
But before doing this I want to get some assumptions and definitions clarified, if possible.
I hope you can help me with some little questions:
* Is the CMN option for training, final recognition, or both?
* When CMN = 'none', will -cmninit be useless?
* What do you mean by AGC? (Volume control, by chance?)
  (Can the option '-agc none' be helpful with this, or do I need to make my own volume equalizer?)
Thanks for your help. I think I'm getting close, so I hope I get things right now.
Both
Yes
AGC is automatic gain control. The current AGC implementation is not functional, though.