
lower performance with API than with pocketsphinx_continuous.exe

  • Artur Zygadło

    Artur Zygadło - 2015-12-06

    Hello,

    I am still working on the ASR system for a social robot in Polish. I have trained the acoustic model and prepared the language model and dictionary.

    I have encountered a strange issue. For my test recordings the WER and SER are really low: 5% and 10%. When I use my models in pocketsphinx_continuous.exe in cmd, the recognition quality is good (I mean recognition from the microphone, with -inmic yes; I saw that -samprate is 16000 by default and -nfft is 512).

    But when I use pocketsphinx in Qt Creator (on Windows) with the very same AM, LM and dictionary, it is surprisingly much worse... I thought it might be something with the sampling rate, but if the default settings work fine in cmd, why should I have to configure anything in my program?

    While testing I noticed that the results are quite good for words at the beginnings or endings of my commands: I say something and it outputs only one word or some part of the command, as if the loop worked faster than it should(?).

    I paste my code here. There are some additions to the original code, as I had to put it in a C++ class and emit the result as a QString to another class's slot, but I hope they don't affect the ASR.

    void RsVoiceAnalysis::RecognizeFromMicrophone(cmd_ln_t *config, ps_decoder_t *ps)
    {
        ad_rec_t *ad;
        int16 adbuf[4096];
        uint8 utt_started, in_speech;
        int32 k;
        char const *hyp;

        //std::string strSpoken;

        // Open the audio device at the configured sampling rate.
        if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                              (int)cmd_ln_float32_r(config, "-samprate"))) == NULL)
            E_FATAL("Failed to open audio device\n");
        if (ad_start_rec(ad) < 0)
            E_FATAL("Failed to start recording\n");

        if (ps_start_utt(ps) < 0)
            E_FATAL("Failed to start utterance\n");
        utt_started = FALSE;

        std::cout << "LM READY...." << std::endl;

        //boost::this_thread::interruption_point();
        for (;;) {
            if (!mIsRunning)
                break;

            // Feed the next block of samples to the decoder.
            if ((k = ad_read(ad, adbuf, 4096)) < 0)
                E_FATAL("Failed to read audio\n");
            ps_process_raw(ps, adbuf, k, FALSE, FALSE);
            in_speech = ps_get_in_speech(ps);

            if (in_speech && !utt_started) {
                utt_started = TRUE;
                printf("Listening...\n");
            }

            if (!in_speech && utt_started) {
                // Entered when the speaker has stopped talking.
                ps_end_utt(ps);
                hyp = ps_get_hyp(ps, NULL);

                if (hyp != NULL) {
                    std::cout << "You said: " << hyp << std::endl;

                    // Convert the hypothesis to Unicode before emitting it.
                    QTextCodec *codec = QTextCodec::codecForName("Windows-1250");
                    QString qstrSpoken = codec->toUnicode(hyp);
                    qCDebug(va) << "Recognized, emitting signal";
                    emit VoiceRecognized(qstrSpoken);
                    break;
                }
                else {
                    qCDebug(va) << "You didn't say a thing";
                }

                // No final result yet: restart and keep listening.
                if (ps_start_utt(ps) < 0)
                    E_FATAL("Failed to start utterance\n");
                utt_started = FALSE;
                printf("LM READY....\n");
            }
        }
        ad_close(ad);
    }
    
    //and the config (called in another function):
    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-samprate", "16000",
                         "-nfft", "512",
                         "-hmm", MODELDIR "/pl",
                         "-lm", MODELDIR "/pl/irys.lm",
                         "-dict", MODELDIR "/pl/irys.dic",
                         "-logfn", MODELDIR "/log.txt",
                         NULL);
    
     

    • Nickolay V. Shmyrev

      You can add the option -rawlogdir <dir> in order to store the recorded audio. You can then listen to the recordings in Audacity to check whether the audio was captured properly.
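
      For example, on the configuration from the first post this could be one extra call before the decoder is created (the directory name here is just an example, and the directory must already exist):

      cmd_ln_set_str_r(config, "-rawlogdir", "C:/asr/rawlogs");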

      Most likely, because your thread runs in a GUI application, there are gaps in the recording.
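
      If that is the case, a common Qt pattern is to move the capture loop to a worker thread so GUI work cannot stall ad_read(). This is only a sketch: it assumes RsVoiceAnalysis is a QObject, and StartRecognition, mainWindow and MainWindow::OnVoiceRecognized are hypothetical names, not taken from the post above.

      QThread *thread = new QThread;
      RsVoiceAnalysis *worker = new RsVoiceAnalysis;
      worker->moveToThread(thread);

      // StartRecognition would call RecognizeFromMicrophone(config, ps).
      QObject::connect(thread, &QThread::started,
                       worker, &RsVoiceAnalysis::StartRecognition);
      // A queued connection delivers the result back to the GUI thread.
      QObject::connect(worker, &RsVoiceAnalysis::VoiceRecognized,
                       mainWindow, &MainWindow::OnVoiceRecognized);
      thread->start();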

       

      • Artur Zygadło

        Artur Zygadło - 2015-12-06

        Dear Nickolay,

        Thank you for your answer. I guess you may be right that it is an issue with the GUI and threading: when I ran recognize_from_microphone in a program containing just the main function, it worked well, with the same results as in cmd.
        The same problem occurs when I run keyword spotting mode. I was wondering why I had to repeat the word many times (while from cmd it worked well), but it turns out the microphone simply didn't record my words, right?

        However, somehow my Audacity isn't able to import the .raw file from the rawlogdir, and converting it to .wav with SoX didn't work either. Therefore I cannot check whether what you suggested is happening (but presumably yes).

        Would you have any suggestions on how to solve the GUI/thread problem?

        Best regards,
        Artur

         

  • Nickolay V. Shmyrev

    However, somehow my Audacity isn't able to import the .raw file from the rawlogdir, and converting it to .wav with SoX didn't work either. Therefore I cannot check whether what you suggested is happening (but presumably yes).

    You can check again; see this page for details:
    http://manual.audacityteam.org/o/man/file_menu.html#raw
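
    In Audacity the file has to be opened through File > Import > Raw Data with parameters matching the decoder settings (for the defaults above: signed 16-bit PCM, little-endian, mono, 16000 Hz). If that still fails, a small stand-alone converter can wrap the raw samples in a WAV header. This is only a sketch assuming those defaults; the file names are examples.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Wraps raw 16 kHz / 16-bit / mono samples in a minimal WAV header.
    // Assumes a little-endian machine (true for x86 Windows).
    int main()
    {
        const uint32_t sampleRate = 16000;  // must match -samprate
        const uint16_t channels = 1, bits = 16, pcm = 1;

        FILE *in = fopen("000000000.raw", "rb");
        if (!in) { perror("raw file"); return 1; }
        fseek(in, 0, SEEK_END);
        uint32_t dataSize = (uint32_t)ftell(in);
        fseek(in, 0, SEEK_SET);
        std::vector<char> data(dataSize);
        if (fread(data.data(), 1, dataSize, in) != dataSize) { fclose(in); return 1; }
        fclose(in);

        const uint32_t fmtSize = 16;
        const uint32_t byteRate = sampleRate * channels * bits / 8;
        const uint16_t blockAlign = (uint16_t)(channels * bits / 8);
        const uint32_t riffSize = 36 + dataSize;

        FILE *out = fopen("000000000.wav", "wb");
        if (!out) { perror("wav file"); return 1; }
        fwrite("RIFF", 1, 4, out); fwrite(&riffSize, 4, 1, out);
        fwrite("WAVE", 1, 4, out);
        fwrite("fmt ", 1, 4, out); fwrite(&fmtSize, 4, 1, out);
        fwrite(&pcm, 2, 1, out);   fwrite(&channels, 2, 1, out);
        fwrite(&sampleRate, 4, 1, out); fwrite(&byteRate, 4, 1, out);
        fwrite(&blockAlign, 2, 1, out); fwrite(&bits, 2, 1, out);
        fwrite("data", 1, 4, out); fwrite(&dataSize, 4, 1, out);
        fwrite(data.data(), 1, dataSize, out);
        fclose(out);
        return 0;
    }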

    Would you have any suggestions on how to solve the GUI/thread problem?

    It is hard to say without understanding what your program does. You could at least share the recognizer logs and the raw files.

     
    • Artur Zygadło

      Artur Zygadło - 2015-12-11

      Dear Nickolay,

      It seems my problem had nothing to do with threading but rather, as I read on this forum, probably with the CMN initialization values. The first hypothesis after configuration wasn't good; it was sent to the GUI and made me think everything was recognized badly (because after the first hypothesis the program returned to keyword spotting mode). When I switched the "return to KWS mode" step off (for testing), it started recognizing well (from the second hypothesis on; the first is still recognized badly).

      Now I have arranged it a bit differently, listening with KWS and the LM in two simultaneous threads. Since the configuration is done only once (at the beginning), this is acceptable and it works quite OK. I mean, the LM commands are recognized well, but the keyword sometimes isn't. I guess I should try a longer word (currently it is the name of the robot, which has only two syllables).

      I also tried increasing the vad_threshold parameter. The results with a value of 3.0 seem much better than with the default of 2.0 (with the default settings, sounds such as typing on the keyboard or shuffling papers made the system start an utterance; now these effects are mostly gone, and the same goes for background music). Is this a good way to improve the recognition rate? Should vad_threshold be changed?

      Best regards,
      Artur

       
  • Nickolay V. Shmyrev

    Now I have arranged it a bit differently, listening with KWS and the LM in two simultaneous threads. Since the configuration is done only once (at the beginning), this is acceptable and it works quite OK. I mean, the LM commands are recognized well, but the keyword sometimes isn't. I guess I should try a longer word (currently it is the name of the robot, which has only two syllables).

    A search mode change should not affect the CMN settings; you should be able to see that in the logs. I don't think you need threads. Maybe you incorrectly restarted the recognizer every time, which is not a good idea either. You can just switch the search mode between utterances with ps_set_search; that should keep the accuracy.
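
    For reference, a minimal sketch of that single-decoder approach (the search names, the keyphrase "irys" and the switching policy are assumptions, not taken from the posts above):

    // Register both searches once, right after ps_init():
    ps_set_keyphrase(ps, "kws", "irys");                // keyword-spotting search
    ps_set_lm_file(ps, "cmd", MODELDIR "/pl/irys.lm");  // command LM search
    ps_set_search(ps, "kws");                           // start by listening for the keyword

    // Then, in the utterance loop, after ps_end_utt() and before ps_start_utt():
    if (hyp != NULL && strcmp(ps_get_search(ps), "kws") == 0)
        ps_set_search(ps, "cmd");  // keyword spotted: decode a command next
    else
        ps_set_search(ps, "kws");  // command decoded (or silence): back to spotting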

    I also tried increasing the vad_threshold parameter. The results with a value of 3.0 seem much better than with the default of 2.0 (with the default settings, sounds such as typing on the keyboard or shuffling papers made the system start an utterance; now these effects are mostly gone, and the same goes for background music). Is this a good way to improve the recognition rate? Should vad_threshold be changed?

    Yes, 3.0 has been found more reasonable by other users as well; you can change it temporarily when there is more noise. We are still in the process of designing a better VAD, so this part might change soon.
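
    On the configuration from the first post that is, for example, one extra call before the decoder is created:

    cmd_ln_set_float_r(config, "-vad_threshold", 3.0);  // default is 2.0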

     
