I have my own language model, and I have successfully trained it to work for multiple speakers. Now there is an issue. Please note the recogniser works well in ideal conditions for multiple speakers, i.e. either in a normal silent room or in a room with a constant noise source (however loud). I use CMNInit at 60,3,1.
But when music or secondary speech is playing in the background during testing, pocketsphinx has difficulty triggering onEndOfSpeech. It goes on for several seconds even after the speaker has completed his command, and, needless to say, recognition fails.
I have downloaded the raw file from the Android device; the background music is very muted but the voice is reasonably clear and undistorted.
So I ran pocketsphinx on this file and the CMN came out very low, around 35,-15,3, and recognition failed.
I edited the file in Audacity, trimming it so there is just 1 second of ambient noise before and after the command.
Running the same file through pocketsphinx again, the CMN improved to 55 and the command was recognised.
From this I feel the background music is not the issue, as Android noise reduction has suppressed it to a great extent without distorting the voice. It appears to be the lengthy recording, with its redundant data, that is causing the trouble.
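The trimming result is consistent with how live CMN behaves: the running cepstral mean is estimated over all incoming frames, so a long stretch of music-filled audio drags the estimate away from the value that fits the voice, while trimming keeps it near the speech level. A toy numeric sketch in plain Python (the exponential update rule, the decay constant, and the c0 values are my assumptions for illustration, not PocketSphinx's actual code):

```python
# Toy model of live cepstral mean normalisation (CMN).
# The running mean of the c0 coefficient is updated frame by frame
# (100 frames ~ 1 second); frames of background music pull it away
# from the speech-only value.

def live_cmn(frames, init=60.0, alpha=0.005):
    """Exponentially decaying running mean, seeded like -cmninit."""
    mean = init
    for c0 in frames:
        mean = (1 - alpha) * mean + alpha * c0
    return mean

speech = [55.0] * 100     # ~1 s of speech-like c0 values
music = [30.0] * 2000     # ~20 s of quieter background music

# Long recording: the music dominates the running mean.
long_mean = live_cmn(music + speech)
# Trimmed recording (1 s of music before the command): mean stays
# much closer to the speech level.
trim_mean = live_cmn(music[:100] + speech)

print(round(long_mean, 1), round(trim_mean, 1))
```

With these made-up numbers the long recording ends near 40 while the trimmed one ends near 51, mirroring the 35-vs-55 difference observed above.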
I have a recogniser timeout of 5 seconds.
The utterance I am trying to recognise is at most 1 second long.
When there is NO background music, the recogniser doesn't wait for the full timeout: it returns as soon as the word is uttered and recognises it perfectly.
When there is background music, the listener doesn't time out, and onEndOfSpeech isn't called for a long time.
Last edit: Q3Varnam 2018-05-06
I have made some further tests: the recogniser timeout is not working as intended when there is continuous background music.
I have seen the timeouts working correctly with constant white noise and with no noise.
As part of testing I set the recogniser timeout to 1 second, with music playing in the background. The recogniser records the utterance in the first second but keeps trying to recognise for 15 seconds or more, and ultimately recognition fails. However, when I chop the audio manually, as before, down to 3 seconds, recognition works.
Can someone throw some light on how to get this timeout working as I need?
I have been experimenting with vad_threshold. It appears that if vad_threshold is set to 3.0 in very noisy environments, CMNInit needs to be lowered for recognition to occur; but reducing CMNInit to 40,3,1 in these conditions occasionally results in noise being treated as speech, and onEndOfSpeech never gets triggered until the noise stops (up to 30-40 seconds). Is there any solution to this?
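For context on why the noise keeps being "speech": end of speech is declared only after a run of consecutive non-speech frames, and if the background keeps every frame above the threshold that run never happens. A simplified sketch in plain Python (the frame levels and the 50-frame hangover are assumptions for illustration, not the library's exact logic):

```python
# Toy model of VAD-based end-of-speech detection. Frames whose level
# exceeds the threshold count as speech; end of speech fires only after
# a run of `hangover` consecutive non-speech frames.

def end_of_speech_frame(levels, threshold, hangover=50):
    """Index of the frame at which end-of-speech fires, or None."""
    in_speech = False
    quiet = 0
    for i, level in enumerate(levels):
        if level > threshold:
            in_speech, quiet = True, 0
        elif in_speech:
            quiet += 1
            if quiet >= hangover:
                return i
    return None

speech = [3.8] * 100      # ~1 s command, well above threshold
silence = [1.0] * 200     # quiet room after the command
music = [3.2] * 200       # background music just above a low threshold

print(end_of_speech_frame(speech + silence, threshold=3.0))  # fires
print(end_of_speech_frame(speech + music, threshold=3.0))    # never fires
print(end_of_speech_frame(speech + music, threshold=3.5))    # fires again
```

With the music level sitting above the threshold, every frame looks like speech and the function returns None; raising the threshold above the music level restores normal end-of-speech.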
Please see attached log file and the raw audio file.
My feat.params:
-lw 10
-wip 0.9
-lowerf 130
-upperf 6800
-nfilt 25
-transform dct
-lifter 22
-feat 1s_c_d_dd
-cmn live
-varnorm no
-cmninit 60,3,1
-beam 1e-80
-wbeam 1e-80
-pbeam 1e-80
-vad_threshold 2.5
There is no issue with recognition; the issue is with detecting end of speech. Could you please have a look at the audio?
Please review the attached raw audio file from the log dir; the spoken word is at the start of the file. The decoder thread continues to listen for another 2 minutes, until the song ends, and then correctly recognises the word spoken at the start of the file.
I tried shouting random words at 47 seconds and at 1 min 35 sec from the start of the file to make the end of speech kick in, but there doesn't seem to be any effect.
Reliably separating speech from a background song is a big research project. Ideally you would rework the frontend and probably integrate it with the decoder. For an easy solution you might want to simply move to a push-to-talk scenario.
I am already doing PTT.
The user presses the headset button; the recogniser is started with a 5 second timeout.
The user speaks within 5 seconds, or the timeout kicks in. This works even with the same music in the background.
In this scenario onSpeechStarted() is NOT called.
What is not working:
1. The same music is playing.
2. The user speaks and onSpeechStarted() gets called.
3. But onSpeechEnded() never gets called; the app keeps on waiting.
4. onSpeechEnded() gets called only if the music is stopped.
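This pattern matches the listener's timeout semantics as I understand them: the start-listening timeout only guards the wait for speech to begin; once speech start has fired, the recogniser waits for the VAD to declare end of speech, however long that takes. A toy state machine in plain Python (my assumption about the behaviour, not the library's actual code; frame counts stand in for time):

```python
# Toy model of a listening session. The timeout applies only BEFORE
# speech is detected; after that, only end-of-speech can finish it.

def session(events, timeout_frames):
    """events: per-frame flags, True = VAD says 'speech'.
    Returns how the session ends: 'timeout', 'end_of_speech', or 'stuck'."""
    started = False
    quiet = 0
    for i, is_speech in enumerate(events):
        if not started:
            if is_speech:
                started = True        # onSpeechStarted(): timeout no longer applies
            elif i + 1 >= timeout_frames:
                return "timeout"      # nothing was ever heard
        else:
            quiet = 0 if is_speech else quiet + 1
            if quiet >= 30:
                return "end_of_speech"
    return "stuck"                    # still listening when input ran out

# Silence only -> the timeout fires.
print(session([False] * 500, timeout_frames=500))
# Command then silence -> normal end of speech.
print(session([True] * 100 + [False] * 100, timeout_frames=500))
# Command, then music the VAD mistakes for speech -> neither fires.
print(session([True] * 100 + [True] * 2000, timeout_frames=500))
```

This would explain why the button-press timeout works with music (no speech detected, timeout path), while speaking over music leaves the app waiting forever (speech started, end-of-speech never declared).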
What is the maximum value I can specify for -vad_threshold? I have just set it to 3.5 at random, and with music on this seems to call onSpeechEnded() as soon as the speaker stops speaking.
I will confirm after a few more regression tests.
I have now run the test several times with the following settings: timeout 3 sec and vad_threshold 3.5. The same music which used to prevent onSpeechEnded() from being called was played continuously, several times over, in the background, at a volume that would prevent two humans from conversing.
With vad_threshold less than 3.5:
1. The song doesn't trigger the onSpeechStarted() call.
2. When the speaker speaks, onSpeechStarted() gets triggered.
3. onSpeechEnded() never gets called.
At vad_threshold 3.5:
1. The song doesn't trigger the onSpeechStarted() call.
2. When the speaker speaks, onSpeechStarted() gets triggered.
3. onSpeechEnded() gets called as soon as the speaker stops speaking.
I tried vad_threshold at 4.0: it doesn't trigger onSpeechStarted() even if the speaker shouts.
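The three thresholds trace out an operating window: -vad_threshold has to sit above the music's level but below the speaker's. One model that reproduces all three regimes is a start/stop hysteresis, where speech must rise some margin above the threshold to trigger onSpeechStarted() but only needs to fall below the threshold itself for onSpeechEnded(). This is purely my assumption for illustration (the levels and the 0.3 margin are made up, and hysteresis is not PocketSphinx's documented behaviour):

```python
# Assumed relative frame levels (arbitrary units) and a start margin.
MUSIC, SPEECH, MARGIN = 3.2, 4.0, 0.3

def regime(threshold):
    song_starts = MUSIC > threshold + MARGIN    # song alone fires onSpeechStarted()?
    voice_starts = SPEECH > threshold + MARGIN  # the command fires it?
    # End of speech can fire only if the trailing music drops below threshold.
    can_end = voice_starts and not (MUSIC > threshold)
    return song_starts, voice_starts, can_end

for th in (3.0, 3.5, 4.0):
    print(th, regime(th))
```

With these made-up numbers, 3.0 gives (no song start, voice starts, never ends), 3.5 gives (no song start, voice starts, ends normally), and 4.0 gives no start at all, matching the three observed regimes.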