Hello,

I am still working on the ASR system for a social robot in Polish. I have trained the acoustic model and prepared the language model and the dictionary.

I have encountered a strange issue. For the test recordings the WER and SER are really low: 5% and 10%. When I use my models with pocketsphinx_continuous.exe in cmd, the recognition quality is good (I mean recognition from the microphone, with -inmic yes; I saw that -samprate is 16000 by default and -nfft is 512).

But when I use pocketsphinx in Qt Creator (on Windows) with the very same AM, LM and dictionary, it is surprisingly much worse... I thought it might be something with the sampling rate, but if the default settings work fine in cmd, why should I configure anything in my program?

While testing I noticed that the results are quite good for words at the beginnings or endings of my commands: I say something and it outputs only one word or some part of the command, as if the loop worked faster than it should(?).

I paste my code here. There are some additions to the original code, as I had to put it in a C++ class and emit the result as a QString to another class's slot, but I hope they shouldn't affect the ASR.
    void RsVoiceAnalysis::RecognizeFromMicrophone(cmd_ln_t *config, ps_decoder_t *ps)
    {
        ad_rec_t *ad;
        int16 adbuf[4096];
        uint8 utt_started, in_speech;
        int32 k;
        char const *hyp;
        // std::string strSpoken;

        if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                              (int) cmd_ln_float32_r(config, "-samprate"))) == NULL)
            E_FATAL("Failed to open audio device\n");
        if (ad_start_rec(ad) < 0)
            E_FATAL("Failed to start recording\n");
        if (ps_start_utt(ps) < 0)
            E_FATAL("Failed to start utterance\n");
        utt_started = FALSE;
        std::cout << "LM READY...." << std::endl;
        // boost::this_thread::interruption_point();

        for (;;) {
            if (!mIsRunning)
                break;
            if ((k = ad_read(ad, adbuf, 4096)) < 0)
                E_FATAL("Failed to read audio\n");
            ps_process_raw(ps, adbuf, k, FALSE, FALSE);
            in_speech = ps_get_in_speech(ps);
            if (in_speech && !utt_started) {
                utt_started = TRUE;
                printf("Listening...\n");
            }
            if (!in_speech && utt_started) {
                // We get here when the user stops speaking.
                ps_end_utt(ps);
                hyp = ps_get_hyp(ps, NULL);
                if (hyp != NULL) {
                    std::cout << "You said: " << hyp << std::endl;
                    // Convert the hypothesis (Windows-1250 here) to a QString.
                    QTextCodec *codec = QTextCodec::codecForName("Windows-1250");
                    QString qstrSpoken = codec->toUnicode(hyp);
                    qCDebug(va) << "Recognized, emitting signal";
                    emit VoiceRecognized(qstrSpoken);
                    break;    // leaves the loop after the first hypothesis
                }
                else
                    qCDebug(va) << "You didn't say a thing";
                if (ps_start_utt(ps) < 0)
                    E_FATAL("Failed to start utterance\n");
                utt_started = FALSE;
                printf("LM READY....\n");
            }
        }
        ad_close(ad);
    }

    // ...and the config (called in another function):
    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-samprate", "16000",
                         "-nfft", "512",
                         "-hmm", MODELDIR "/pl",
                         "-lm", MODELDIR "/pl/irys.lm",
                         "-dict", MODELDIR "/pl/irys.dic",
                         "-logfn", MODELDIR "log.txt",
                         NULL);
You can add an option -rawlogdir <dir> in order to store the recorded audio. You can listen to the recorded audio in Audacity to see if it was recorded properly. Most likely, because your thread runs in a GUI application, it has breaks in recording.
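For reference, a minimal sketch of how that option slots into the cmd_ln_init call from the first post; the C:/asrlogs directory is a placeholder, everything else is copied from the config above:

    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-samprate", "16000",
                         "-nfft", "512",
                         "-hmm", MODELDIR "/pl",
                         "-lm", MODELDIR "/pl/irys.lm",
                         "-dict", MODELDIR "/pl/irys.dic",
                         "-logfn", MODELDIR "log.txt",
                         "-rawlogdir", "C:/asrlogs",   /* placeholder directory */
                         NULL);

Each utterance is then dumped into that directory as a headerless 16-bit PCM .raw file.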
Dear Nickolay,
Thank you for your answer. I guess you may be right that it is an issue with the GUI and threading: when I ran recognize_from_microphone in a program containing just the main function, it worked well, with the same results as in the cmd.
The same problem occurs when I run keyword spotting mode. I was wondering why I have to repeat the word many times (while from the cmd it worked well), but it turns out the microphone just didn't record my words, right?
However, somehow my Audacity isn't able to import the .raw file from rawlogdir. I also tried to convert it to .wav with SoX, which didn't work either. Therefore, I cannot check whether what you suggested is happening (but presumably it is).
Would you have any suggestions on how to solve the GUI/thread problem?
Best regards,
Artur
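A minimal sketch of one common fix for the GUI/thread problem asked about above: move the blocking ad_read() loop onto a Qt worker thread, so the GUI event loop never starves it. The class, slot and signal names below are assumptions, not code from this project:

    #include <QObject>
    #include <QString>
    #include <QThread>

    // Hypothetical worker object: runs the microphone loop from the first
    // post off the GUI thread, so ad_read() is called without interruptions.
    class VoiceWorker : public QObject {
        Q_OBJECT
    public slots:
        void process() {
            // RecognizeFromMicrophone(config, ps);  // the loop from the first post
            emit finished();
        }
    signals:
        void finished();
    };

    // Wiring, e.g. in the main window constructor:
    //     QThread *thread = new QThread;
    //     VoiceWorker *worker = new VoiceWorker;
    //     worker->moveToThread(thread);
    //     connect(thread, &QThread::started, worker, &VoiceWorker::process);
    //     connect(worker, &VoiceWorker::finished, thread, &QThread::quit);
    //     thread->start();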
You can check again; see this page for the details:
http://manual.audacityteam.org/o/man/file_menu.html#raw
It is hard to say without understanding what your program does. You could at least share the recognizer logs and the raw files.
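A -rawlogdir dump is headerless PCM, so with the config above it should import via Audacity's File > Import > Raw Data as signed 16-bit PCM, little-endian, mono, 16000 Hz. As a fallback check, a small sketch that wraps such a dump in a standard WAV header so any player can open it (file names are placeholders; assumes a little-endian machine, which a Windows PC is):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Wraps a headerless .raw dump (16 kHz, 16-bit signed LE, mono, matching
    // the -samprate 16000 config above) in a canonical WAV header.
    int main() {
        const uint32_t rate = 16000;
        const uint16_t bits = 16, chans = 1;

        FILE *in = std::fopen("000000000.raw", "rb");   // placeholder name
        if (!in) { std::perror("raw"); return 1; }
        std::fseek(in, 0, SEEK_END);
        const uint32_t dataLen = (uint32_t) std::ftell(in);
        std::fseek(in, 0, SEEK_SET);
        std::vector<char> pcm(dataLen);
        if (std::fread(pcm.data(), 1, dataLen, in) != dataLen) return 1;
        std::fclose(in);

        const uint32_t byteRate = rate * chans * bits / 8;
        const uint16_t blockAlign = chans * bits / 8;
        const uint32_t riffLen = 36 + dataLen, fmtLen = 16;
        const uint16_t fmtTag = 1;                       // PCM

        FILE *out = std::fopen("check.wav", "wb");
        std::fwrite("RIFF", 1, 4, out); std::fwrite(&riffLen, 4, 1, out);
        std::fwrite("WAVEfmt ", 1, 8, out); std::fwrite(&fmtLen, 4, 1, out);
        std::fwrite(&fmtTag, 2, 1, out); std::fwrite(&chans, 2, 1, out);
        std::fwrite(&rate, 4, 1, out); std::fwrite(&byteRate, 4, 1, out);
        std::fwrite(&blockAlign, 2, 1, out); std::fwrite(&bits, 2, 1, out);
        std::fwrite("data", 1, 4, out); std::fwrite(&dataLen, 4, 1, out);
        std::fwrite(pcm.data(), 1, dataLen, out);
        std::fclose(out);
        return 0;
    }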
Dear Nickolay,
It seems my problem had nothing to do with threading, but rather (as I read on this forum) probably with the CMN init values. The first hypothesis after configuration wasn't good, and it was sent to the GUI, which made me think everything was being recognized badly (because after the first hypothesis it returned to the keyword spotting mode). When I switched the "return to the KWS mode" off (for testing), it started recognizing well from the second hypothesis onward; the first is still badly recognized.
Now I have made it a bit different: listening for the keyword (KWS) and for the LM commands in two simultaneous threads. Since the configuration is done only once (at the beginning), this is acceptable and it works quite OK. I mean, the LM commands are recognized well, but the keyword sometimes isn't. I guess I should try using a longer word (currently it is the name of the robot, which has only two syllables).
What I also tried is increasing the vad_threshold parameter. The results with a value of 3.0 seem much better than with the default 2.0 (with the default setting, sounds such as typing on the keyboard or moving papers made the system start an utterance; now these effects are mostly gone, and the same goes for background music). Is this a good way to improve the recognition rate? Should vad_threshold be changed?
Best regards,
Artur
A search mode change should not affect the CMN settings; you should see that in the logs. I don't think you need threads. Maybe you incorrectly restart the recognizer every time; that is not a good idea either. You can just switch the search mode between utterances with ps_set_search, which should keep the accuracy.
Yes, 3.0 has been found more reasonable by other users as well; you can change it temporarily in case you have more noise. We are still in the process of designing a better VAD, so this part might change soon.
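To illustrate the single-decoder approach, a minimal sketch assuming search names "kws" and "lm" and a hypothetical wake word; the LM path is the one from the config in the first post. The vad_threshold change discussed above can go into the same cmd_ln_init call as "-vad_threshold", "3.0".

    #include <pocketsphinx.h>
    #include <string.h>

    // Register both searches once on a single decoder (assumed names "kws"
    // and "lm"; "irys" stands in for the robot's actual name).
    static void setup_searches(ps_decoder_t *ps) {
        ps_set_keyphrase(ps, "kws", "irys");
        ps_set_lm_file(ps, "lm", MODELDIR "/pl/irys.lm");
        ps_set_search(ps, "kws");          // start in keyword-spotting mode
    }

    // Call right after ps_end_utt() in the loop from the first post,
    // instead of restarting the recognizer or running a second thread.
    static void switch_search(ps_decoder_t *ps, const char *hyp) {
        if (hyp != NULL && strcmp(ps_get_search(ps), "kws") == 0)
            ps_set_search(ps, "lm");       // keyword heard: decode a command next
        else
            ps_set_search(ps, "kws");      // command handled: back to spotting
    }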