CMU Sphinx / Forums / Help: [pocketsphinx] switch between kws and jsgf mode on the fly

Lucian Georgescu - 2017-09-07

Hello,

I'm working on a project where I use pocketsphinx to recognize commands in real time. The commands have the following format (here are some examples):

Casandra, turn the lights off.
Casandra, turn the lights on.
Casandra, change the color of the light.

Using a grammar model with a garbage loop introduces delays (the project runs on a Raspberry Pi).
From reading the forum topics, I understand that I should use the keyword spotting module to detect when the keyword "Casandra" is spoken, and then switch to grammar mode to recognize the entire command.

So I tried the following:
When I start pocketsphinx_continuous, I pass it the -hmm, -dict, -samprate and -inmic arguments.
In the source code of the main() function, right after the line "ps = ps_init(config);", I added the following lines:
ps_set_kws (ps, "kws", "keyword.txt");
ps_set_jsgf_file (ps, "jsgf", "grammar.gram");

Then, at the start of the recognize_from_microphone() function, I set kws as the default search with ps_set_search(ps, "kws");

And here is how I changed the infinite for() loop:

for (;;) {
    if ((k = ad_read(ad, adbuf, 2048)) < 0)
        E_FATAL("Failed to read audio\n");
    ps_process_raw(ps, adbuf, k, FALSE, FALSE);
    // check to see if currently collected audio is speech or not
    in_speech = ps_get_in_speech(ps);
    if (in_speech && !utt_started) {
        utt_started = TRUE;
        E_INFO("Listening...\n");
    }
    if (!in_speech && utt_started) {
        /* speech -> silence transition, time to start new utterance */
        ps_end_utt(ps);
        hyp = ps_get_hyp(ps, NULL);
        if (hyp != NULL) {
            printf("KWS hypothesis: %s\n", hyp);
            // if the detected keyword is casandra then we switch to the grammar search mode and we decode the buffer again
            if (strstr(hyp, "casandra") != NULL) {
                if (ps_set_search(ps, "jsgf") != 0) {
                    printf("ERROR: Cannot switch to jsgf mode \n");
                } else {
                    printf("Switched to jsgf mode \n");
                }
                printf("Mode: %s\n", ps_get_search(ps));
                hyp = ps_get_hyp(ps, NULL);
                if (hyp != NULL) {
                    printf("ASR hypothesis: %s\n", hyp);
                } else {
                    printf("ASR hypothesis: NULL\n");
                }
                if (ps_set_search(ps, "kws") != 0) {
                    printf("ERROR: Cannot switch to kws mode \n");
                } else {
                    printf("Switched to kws mode \n");
                }
            }
            fflush(stdout);
        }

        if (ps_start_utt(ps) < 0)
            E_FATAL("Failed to start utterance\n");
        utt_started = FALSE;
        E_INFO("Ready....\n");
    }
    sleep_msec(100);
}

Is it possible to call ps_get_hyp() twice, once in kws mode and then in jsgf mode, on the same audio buffer?
The code compiles, but it doesn't do what I want. Can you tell me if the code logic is good or not? Can you give me a hint about exactly how I should change it?

Thank you,

Lucian Georgescu
Nickolay V. Shmyrev - 2017-09-07

Can you tell me if the code logic is good or not?

The logic is wrong.

Can you give me a hint about exactly how I should change it?

You can use kws mode. Once the keyphrase is recognized, you can retrieve the audio buffer with ps_get_rawdata and process it again with a jsgf recognizer.
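Nickolay's suggestion amounts to a two-pass flow: the kws search only decides whether the utterance contains the keyphrase, and the stored audio of that utterance is then decoded a second time by the grammar search. Switching the search and calling ps_get_hyp() again, as in the first version of the loop, never feeds any audio to the grammar search. The sketch below models only this control flow in plain C so that it compiles without pocketsphinx; decode_fn, two_pass_decode() and demo_grammar_decode() are hypothetical names, and int16_t/int32_t stand in for the sphinxbase int16/int32 types. With the real library, the grammar pass would presumably wrap ps_set_search(), ps_start_utt(), ps_process_raw(), ps_end_utt() and ps_get_hyp().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: decode `nsamp` samples with the grammar search
 * and return its hypothesis, or NULL if nothing was recognized. */
typedef const char *(*decode_fn)(const int16_t *audio, int32_t nsamp);

/* Two-pass flow (sketch). `kws_hyp` is what the keyword-spotting pass
 * reported for the utterance that just ended; `audio` is the stored
 * audio of that same utterance. Returns the grammar hypothesis, or NULL
 * when there is no keyphrase or no audio to re-decode. */
const char *two_pass_decode(const char *kws_hyp, const char *keyword,
                            decode_fn grammar_decode,
                            const int16_t *audio, int32_t nsamp)
{
    assert(keyword != NULL && grammar_decode != NULL);

    /* Pass 1 only gates: no keyphrase means we stay in kws mode. */
    if (kws_hyp == NULL || strstr(kws_hyp, keyword) == NULL)
        return NULL;

    /* Pass 2 must be given the audio again; asking the grammar search
     * for a hypothesis without feeding it audio yields nothing useful. */
    if (audio == NULL || nsamp <= 0)
        return NULL;
    return grammar_decode(audio, nsamp);
}

/* Stand-in grammar decoder for trying the flow without a recognizer:
 * it ignores the audio and returns a fixed command string. */
const char *demo_grammar_decode(const int16_t *audio, int32_t nsamp)
{
    (void)audio;
    (void)nsamp;
    return "casandra turn the lights off";
}
```

For example, two_pass_decode("casandra", "casandra", demo_grammar_decode, buf, n) returns the demo command, while a kws hypothesis without the keyword returns NULL and the loop simply stays in kws mode.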
  

Okay, thank you for the answer. I have changed the code as follows. In the main() function, in addition to:

ps_set_kws (ps, "kws", "keyword.txt");
ps_set_jsgf_file (ps, "jsgf", "grammar.gram");

I added:

ps_set_search (ps, "kws");
ps_set_rawdata_size (ps, 500000);

to start in kws mode by default and to set the maximum size of the raw audio buffer (I picked an arbitrary value; I have no idea what it should be).
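On the sizing question: in the pocketsphinx sources I'm familiar with, ps_set_rawdata_size() takes its size in 16-bit samples rather than bytes, so the value needed is roughly the sample rate times the longest utterance you want to keep. A minimal arithmetic sketch, assuming that unit and a 16 kHz sample rate; the helper names are hypothetical, and int16_t/int32_t stand in for the sphinxbase int16/int32 types.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: ps_set_rawdata_size() counts 16-bit samples, not bytes,
 * so N seconds of audio needs samprate * N samples. */
int32_t rawdata_samples(int32_t samprate, int32_t seconds)
{
    assert(samprate > 0 && seconds > 0);
    return samprate * seconds;
}

/* Inverse: how many seconds of audio a given sample count covers. */
double rawdata_seconds(int32_t samples, int32_t samprate)
{
    assert(samprate > 0 && samples >= 0);
    return (double)samples / (double)samprate;
}

/* Memory taken by the buffer in bytes (one int16 sample = 2 bytes). */
int64_t rawdata_bytes(int32_t samples)
{
    return (int64_t)samples * (int64_t)sizeof(int16_t);
}
```

By this arithmetic, 500000 samples at 16 kHz is about 31 seconds of audio (roughly 1 MB), which is already generous for a short spoken command; 10 seconds would be 160000 samples, e.g. ps_set_rawdata_size(ps, rawdata_samples(16000, 10)).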

The recognize_from_microphone() function now looks like this:

static void
recognize_from_microphone()
{
    ad_rec_t *ad;
    int16 adbuf[2048];
    int16 adbuf2[2048];
    int32 dim;
    uint8 utt_started, in_speech;
    int32 k;
    char *hyp;
    time_t startTime, endTime;
    double processingTime;


    if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                          (int) cmd_ln_float32_r(config,
                                                 "-samprate"))) == NULL)
        E_FATAL("Failed to open audio device\n");
    if (ad_start_rec(ad) < 0)
        E_FATAL("Failed to start recording\n");

    if (ps_start_utt(ps) < 0)
        E_FATAL("Failed to start utterance\n");
    utt_started = FALSE;
    E_INFO("Ready....\n");

    // at the beginning we are in the KWS-search state
    for (;;) {
        if ((k = ad_read(ad, adbuf, 2048)) < 0)
            E_FATAL("Failed to read audio\n");
        ps_process_raw(ps, adbuf, k, FALSE, FALSE);
        // check to see if currently collected audio is speech or not
        in_speech = ps_get_in_speech(ps);
        if (in_speech && !utt_started) {
            utt_started = TRUE;
            E_INFO("Listening...\n");
        }
        if (!in_speech && utt_started) {
            /* speech -> silence transition, time to start new utterance  */
            ps_end_utt(ps);
            hyp = ps_get_hyp(ps, NULL);
            if (hyp != NULL) {
                printf("KWS hypothesis: %s\n", hyp);
                // if the detected keyword is casandra then we switch to the grammar search mode and we decode the buffer again
                if (strstr(hyp, "casandra") != NULL) {
                    if (ps_set_search(ps, "jsgf") != 0) {
                        printf("ERROR: Cannot switch to jsgf mode \n");
                    } else {
                        printf("Switched to jsgf mode \n");
                    }

                    printf("Mode: %s\n", ps_get_search(ps));

                    ps_get_rawdata(ps, adbuf2, dim);
                    ps_start_utt(ps);
                    ps_process_raw(ps, adbuf2, dim, FALSE, FALSE);
                    ps_end_utt(ps);
                    hyp = ps_get_hyp(ps, NULL);

                    if (hyp != NULL) {
                        printf("ASR hypothesis: %s\n", hyp);
                    } else {
                        printf("ASR hypothesis: NULL\n");
                    }

                    if (ps_set_search(ps, "kws") != 0) {
                        printf("ERROR: Cannot switch to kws mode \n");
                    } else {
                        printf("Switched to kws mode \n");
                    }
                }
                fflush(stdout);
            }

            if (ps_start_utt(ps) < 0)
                E_FATAL("Failed to start utterance\n");
            utt_started = FALSE;
            E_INFO("Ready....\n");
        }
        sleep_msec(100);
    }
    ad_close(ad);
}

With -rawlogdir, raw audio files are saved to disk. But I don't understand how ps_get_rawdata() knows which audio files it needs to load. Should they be specified somewhere? Does it retrieve the latest raw audio saved to disk?
Every time I pronounce a command, it prints the KWS hypothesis correctly, but the ASR hypothesis is always NULL.

Thank you.

Nickolay V. Shmyrev - 2017-09-10

With -rawlogdir, raw audio files are saved to disk. But I don't understand how ps_get_rawdata() knows which audio files it needs to load.

It does not load files; rawlogdir is irrelevant. When you set rawdata_size, audio is stored in memory and retrieved from memory.
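The in-memory storage described above can be pictured as a fixed-capacity buffer inside the decoder, filled as audio is processed and, by assumption, emptied when a new utterance starts. The sketch below is an illustration of that idea, not pocketsphinx code: utt_buffer_t and its functions are hypothetical, and dropping audio that no longer fits is an assumed overflow policy that this thread does not state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the decoder's raw-audio store. */
typedef struct {
    int16_t *data;
    int32_t capacity; /* in samples, like the rawdata_size setting */
    int32_t pos;      /* samples stored for the current utterance */
} utt_buffer_t;

int utt_buffer_init(utt_buffer_t *b, int32_t capacity)
{
    assert(capacity > 0);
    b->data = calloc((size_t)capacity, sizeof(int16_t));
    b->capacity = capacity;
    b->pos = 0;
    return b->data != NULL ? 0 : -1;
}

/* Start of a new utterance: audio from the previous one is discarded. */
void utt_buffer_reset(utt_buffer_t *b)
{
    b->pos = 0;
}

/* Append one chunk of processed audio. A chunk that would overflow the
 * buffer is dropped (assumed policy). Returns the samples stored. */
int32_t utt_buffer_append(utt_buffer_t *b, const int16_t *chunk, int32_t n)
{
    if (n <= 0 || b->pos + n > b->capacity)
        return 0;
    memcpy(b->data + b->pos, chunk, (size_t)n * sizeof(int16_t));
    b->pos += n;
    return n;
}

void utt_buffer_free(utt_buffer_t *b)
{
    free(b->data);
    b->data = NULL;
    b->capacity = b->pos = 0;
}
```

Nothing here touches the disk, which is the point of the answer: -rawlogdir writes files as a side effect, while retrieval only reads this kind of in-memory store.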

Does it retrieve the latest raw audio saved to disk?

No

Every time I pronounce a command, it prints the KWS hypothesis correctly, but the ASR hypothesis is always NULL.

Your code is incomplete, but your audio buffer adbuf2 is probably too small; it should hold enough data for several seconds of audio.
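It may also be worth checking the call itself. In the pocketsphinx 5prealpha header I'm aware of, the declaration is void ps_get_rawdata(ps_decoder_t *ps, int16 **buffer, int32 *size); the decoder hands back a pointer to its own internal buffer and writes the sample count through size. Passing adbuf2 and an uninitialized dim by value, as in the loop above, does not match that signature. The sketch below uses a hypothetical stand-in, mock_get_rawdata(), with the same parameter shape so it compiles without pocketsphinx, and copies the samples into caller-owned memory. The copy is there on the assumption that starting the next utterance reuses the decoder's buffer, so feeding that buffer straight back into ps_process_raw() after ps_start_utt() is risky.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the decoder's internal raw-audio store. */
typedef struct {
    int16_t *rawdata;
    int32_t rawdata_pos; /* samples collected in the last utterance */
} mock_decoder_t;

/* Same parameter shape as ps_get_rawdata(): returns a pointer to the
 * decoder-owned buffer and the sample count via out-parameters.
 * Nothing is copied here. */
void mock_get_rawdata(mock_decoder_t *d, int16_t **buffer, int32_t *size)
{
    if (buffer)
        *buffer = d->rawdata;
    if (size)
        *size = d->rawdata_pos;
}

/* Copy the last utterance into memory owned by the caller so it stays
 * valid after a new utterance starts. Returns the number of samples,
 * 0 if there is nothing, -1 on allocation failure. Caller frees *out. */
int32_t copy_last_utterance(mock_decoder_t *d, int16_t **out)
{
    int16_t *raw = NULL;
    int32_t nsamp = 0;

    assert(d != NULL && out != NULL);
    mock_get_rawdata(d, &raw, &nsamp); /* note: &raw and &nsamp */
    *out = NULL;
    if (raw == NULL || nsamp <= 0)
        return 0;
    *out = malloc((size_t)nsamp * sizeof(int16_t));
    if (*out == NULL)
        return -1;
    memcpy(*out, raw, (size_t)nsamp * sizeof(int16_t));
    return nsamp;
}
```

With the real decoder, the replay would then follow roughly this order: ps_end_utt(), copy the audio, ps_set_search(ps, "jsgf") (as far as I know this only succeeds between utterances), ps_start_utt(), ps_process_raw() on the copy, ps_end_utt(), ps_get_hyp(), ps_set_search(ps, "kws"), and finally free the copy.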
