I'm working on a project where I use pocketsphinx to recognize commands in real time. The commands have the following format (some examples):
Casandra, turn the lights off.
Casandra, turn the lights on.
Casandra, change the color of the light.
Using a grammar model with a garbage loop produces delays (the project runs on a Raspberry Pi).
From reading the forum topics, I understand that I should use the keyword spotting module to detect when the keyword "Casandra" is spoken and then switch to grammar mode to recognize the entire command.
So I tried the following:
When I start pocketsphinx_continuous I give it as arguments hmm, dict, samprate and inmic.
In the source code, in the main function, after the line "ps = ps_init (config)" I added the following lines:
ps_set_kws (ps, "kws", "keyword.txt");
ps_set_jsgf_file (ps, "jsgf", "grammar.gram");
Then, at the start of the recognize_from_microphone() function, I set kws as default searcher: "ps_set_search (ps, "kws")";
And here is how I changed the infinite for() loop:
    for (;;) {
        if ((k = ad_read(ad, adbuf, 2048)) < 0)
            E_FATAL("Failed to read audio\n");
        ps_process_raw(ps, adbuf, k, FALSE, FALSE);
        // check to see if currently collected audio is speech or not
        in_speech = ps_get_in_speech(ps);
        if (in_speech && !utt_started) {
            utt_started = TRUE;
            E_INFO("Listening...\n");
        }
        if (!in_speech && utt_started) {
            /* speech -> silence transition, time to start new utterance */
            ps_end_utt(ps);
            hyp = ps_get_hyp(ps, NULL);
            if (hyp != NULL) {
                printf("KWS hypothesis: %s\n", hyp);
                // if the detected keyword is casandra then we switch to the
                // grammar search mode and we decode the buffer again
                if (strstr(hyp, "casandra") != NULL) {
                    if (ps_set_search(ps, "jsgf") != 0) {
                        printf("ERROR: Cannot switch to jsgf mode\n");
                    } else {
                        printf("Switched to jsgf mode\n");
                    }
                    printf("Mode: %s\n", ps_get_search(ps));
                    hyp = ps_get_hyp(ps, NULL);
                    if (hyp != NULL) {
                        printf("ASR hypothesis: %s\n", hyp);
                    } else {
                        printf("ASR hypothesis: NULL\n");
                    }
                    if (ps_set_search(ps, "kws") != 0) {
                        printf("ERROR: Cannot switch to kws mode\n");
                    } else {
                        printf("Switched to kws mode\n");
                    }
                }
                fflush(stdout);
            }
            if (ps_start_utt(ps) < 0)
                E_FATAL("Failed to start utterance\n");
            utt_started = FALSE;
            E_INFO("Ready....\n");
        }
        sleep_msec(100);
    }
Is it possible to call "ps_get_hyp()" twice, once in kws mode, and then in jsgf mode, having the same audio buffer as input?
The code compiles, but it doesn't do what I wanted. Can you tell me whether the code logic is good or not? Can you give me a hint on exactly how I should change it?
Thank you,
Lucian Georgescu
The recognize_from_microphone() function now looks like this:
    static void
    recognize_from_microphone()
    {
        ad_rec_t *ad;
        int16 adbuf[2048];
        int16 adbuf2[2048];
        int32 dim;
        uint8 utt_started, in_speech;
        int32 k;
        char *hyp;
        time_t startTime, endTime;
        double processingTime;

        if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                              (int) cmd_ln_float32_r(config, "-samprate"))) == NULL)
            E_FATAL("Failed to open audio device\n");
        if (ad_start_rec(ad) < 0)
            E_FATAL("Failed to start recording\n");
        if (ps_start_utt(ps) < 0)
            E_FATAL("Failed to start utterance\n");
        utt_started = FALSE;
        E_INFO("Ready....\n");

        // at the beginning we are in the KWS-search state
        for (;;) {
            if ((k = ad_read(ad, adbuf, 2048)) < 0)
                E_FATAL("Failed to read audio\n");
            ps_process_raw(ps, adbuf, k, FALSE, FALSE);
            // check to see if currently collected audio is speech or not
            in_speech = ps_get_in_speech(ps);
            if (in_speech && !utt_started) {
                utt_started = TRUE;
                E_INFO("Listening...\n");
            }
            if (!in_speech && utt_started) {
                /* speech -> silence transition, time to start new utterance */
                ps_end_utt(ps);
                hyp = ps_get_hyp(ps, NULL);
                if (hyp != NULL) {
                    printf("KWS hypothesis: %s\n", hyp);
                    // if the detected keyword is casandra then we switch to the
                    // grammar search mode and we decode the buffer again
                    if (strstr(hyp, "casandra") != NULL) {
                        if (ps_set_search(ps, "jsgf") != 0) {
                            printf("ERROR: Cannot switch to jsgf mode\n");
                        } else {
                            printf("Switched to jsgf mode\n");
                        }
                        printf("Mode: %s\n", ps_get_search(ps));
                        ps_get_rawdata(ps, adbuf2, dim);
                        ps_start_utt(ps);
                        ps_process_raw(ps, adbuf2, dim, FALSE, FALSE);
                        ps_end_utt(ps);
                        hyp = ps_get_hyp(ps, NULL);
                        if (hyp != NULL) {
                            printf("ASR hypothesis: %s\n", hyp);
                        } else {
                            printf("ASR hypothesis: NULL\n");
                        }
                        if (ps_set_search(ps, "kws") != 0) {
                            printf("ERROR: Cannot switch to kws mode\n");
                        } else {
                            printf("Switched to kws mode\n");
                        }
                    }
                    fflush(stdout);
                }
                if (ps_start_utt(ps) < 0)
                    E_FATAL("Failed to start utterance\n");
                utt_started = FALSE;
                E_INFO("Ready....\n");
            }
            sleep_msec(100);
        }
        ad_close(ad);
    }
The logic is wrong.
You can use kws mode. Once the keyphrase is recognized, you can retrieve the audio buffer with ps_get_rawdata and process it again with a jsgf recognizer.
Okay, thank you for the answer. I have changed the code as follows. In the main() function, besides:
ps_set_kws (ps, "kws", "keyword.txt");
ps_set_jsgf_file (ps, "jsgf", "grammar.gram");
I added:
ps_set_search (ps, "kws");
ps_set_rawdata_size (ps, 500000);
to start by default in kws mode and to set the maximum buffer size (I put a random value; I have no idea what it should be).
Using -rawlogdir it saves raw audio files to disk. But I don't understand whether ps_get_rawdata() knows which audio files it needs to load. Should it be specified somewhere? Does it retrieve the latest raw audio saved to disk?
Every time I pronounce a command, it prints the KWS hypothesis correctly, but the ASR hypothesis is always NULL.
Thank you.
It does not load files; rawlogdir is irrelevant. Audio is stored in memory when you set rawdata_size and is retrieved from memory.
No
Your code is incomplete, but your audio buffer adbuf2 is probably too small; it should have enough room for several seconds of audio.