CMU Sphinx / Forums / Help: pocketsphinx, first transcription always wrong

Hi Nickolay, I have a question. When I decode audio files with pocketsphinx, the first transcription after model initialization always seems to be wrong. I don't know whether it's my code or the way pocketsphinx works.

Say we have one audio file and decode it 5 times in a row with the same model. The hypothesis from the first decoding is always different from the other 4. This code reproduces the issue (I'm decoding 8 kHz audio files):

#include <cstdio>
#include <iostream>
#include <pocketsphinx.h>

using namespace std;


int main(int argc, char *argv[])
{

        ps_decoder_t *ps;
        cmd_ln_t *config;
        FILE *fh;
        char const *hyp ;
        int16 buf[512];
        int rv;
        int32 score;



        //Initialisation
        //-----------------------
        config = cmd_ln_init(NULL, ps_args(), TRUE,
                "-hmm", "path-to/en-us-8khz/",
                "-lm", "path-to/cmusphinx-5.0-en-us.lm",
                "-dict", "path-to/cmudict.0.6d",
                "-samprate", "8000",
        NULL);


        if (config == NULL){
                cout << "config returned NULL" << endl;
                return 1;
        }

        ps = ps_init(config);
        if (ps == NULL){
                cout << "ps returned NULL" << endl;
                return 1;
        }

        //Decoding
        //-----------------------
        for (int j = 0 ;  j < 5 ;  j++){

                fh = fopen("audioFile.wav", "rb");

                if (fh == NULL){
                        // Print the message before returning; it was unreachable.
                        cout << "fh returned NULL" << endl;
                        return -1;
                }


                fseek(fh, 0, SEEK_SET);
                rv = ps_start_utt(ps);

                if (rv < 0){
                        cout << "ps_start_utt failed" << endl;
                        return 1;
                }

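                // Loop on the fread() result rather than feof(), so an empty
                // or failed read is never passed on to the decoder.
                size_t nsamp;
                while ((nsamp = fread(buf, sizeof(int16), 512, fh)) > 0) {
                        rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
                }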

                rv = ps_end_utt(ps);
                if (rv < 0){
                        cout << "ps_end_utt failed" << endl;
                        return 1;
                }

                hyp = ps_get_hyp(ps, &score);

                if (hyp == NULL) {
                        cout << "hyp returned NULL" << endl;
                        return 1;
                }

                printf("Recognized: %s\n", hyp);

                fclose(fh);

                //----------------------------------
                cout << "Now going to decode again" << endl;
                cout << "______________________________________________" << endl;

        }

        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;

}

This happens with any audio file and any acoustic model, so it doesn't seem to be tied to a specific file; feel free to try it with any compatible 8 kHz audio file. This log shows how the first recognition hypothesis is totally unrelated to the others:

http://pastebin.com/YTCTYtMp

Is this expected behavior? Do you get the same behaviour?
Right now I'm decoding an audio file right after initialization as a workaround. I'd like to know whether I can avoid doing that, and how.

Last edit: Orest 2015-05-06

bic-user - 2015-05-06

Hi. You're applying online CMN in your decoding. At the beginning of the first utterance, the initial CMN values are applied, and they are not very reliable. You can specify more accurate initial values for your channel. According to your log, it would be something like "-cmninit 47,1,-2,-5,-10,-8,-8,-7,-7,-7,-3,-3".
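
The initial values come from the CMN lines in the decoder log, which have the form "INFO: cmn.c(183): CMN: 36.07 -9.53 ...". As a rough sketch, the helper below turns such a line into a comma-separated string that could be passed to "-cmninit". The function name cmninit_from_log_line is hypothetical and not part of pocketsphinx, and it keeps the logged values as-is rather than rounding them.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of pocketsphinx): take the numbers that follow
// "CMN:" in a decoder log line and join them with commas for "-cmninit".
// Returns an empty string if the line has no "CMN:" marker.
std::string cmninit_from_log_line(const std::string &line)
{
    const std::string marker = "CMN:";
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos)
        return "";

    std::istringstream values(line.substr(pos + marker.size()));
    std::string result;
    std::string token;
    while (values >> token) {      // tokens are separated by whitespace
        if (!result.empty())
            result += ',';
        result += token;           // keep the text as logged, no rounding
    }
    return result;
}
```

The returned string would then go next to "-cmninit" in the cmd_ln_init call, alongside "-hmm", "-lm" and the other options.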

Orest - 2015-05-06

Thanks bic-user. I deleted "-cmninit" from feat.params (it was overriding the initialization parameters in my code) and passed that parameter in the initialization code instead. It now works from the first attempt for that audio file. The natural question that comes to mind is:
Since all audio files have different CMN values, does it make sense to decode an audio file twice if the priority is accuracy rather than speed?

My naive guess is that the more diverse the noise in the audio files, the more decoding an audio file twice (for cmn_prior_update) helps accuracy. Does this make sense?

Last edit: Orest 2015-05-06

- bic-user - 2015-05-06
  
  "Since all audio files have different values for cmn, does it make sense to decode an audio file twice if the priority is accuracy rather than speed?"
  
  As I said, your decoding is configured to be online (-cmn prior). In this case the CMN values are updated once in a while during recognition. The other approach is to update the CMN values once for the whole utterance (-cmn current).
  
  "the more the noise of the audio files is diverse, the more decoding an audio-file twice helps accuracy, is this correct?"
  
  CMN is not about noise. The technique tries to neutralize convolutive (channel) disturbances. It is quite a simple technique; check some sources to get a solid understanding of what it is: http://goo.gl/jIPoF6
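
To make the difference between the two approaches concrete, here is a minimal sketch of cepstral mean normalization, assuming each frame is a vector of cepstral coefficients of the same length. It is an illustration only and not pocketsphinx's implementation: cmn_batch and cmn_streaming are hypothetical names, and the streaming version uses a plain running mean rather than whatever update schedule the decoder really applies.

```cpp
#include <cstddef>
#include <vector>

using Frame = std::vector<double>;   // one cepstral vector per frame

// Batch CMN: average each coefficient over the whole utterance, then
// subtract that average from every frame.
std::vector<Frame> cmn_batch(const std::vector<Frame> &frames)
{
    if (frames.empty())
        return {};
    const std::size_t dim = frames[0].size();
    Frame mean(dim, 0.0);
    for (const Frame &f : frames)
        for (std::size_t i = 0; i < dim; ++i)
            mean[i] += f[i];
    for (double &m : mean)
        m /= static_cast<double>(frames.size());

    std::vector<Frame> out = frames;
    for (Frame &f : out)
        for (std::size_t i = 0; i < dim; ++i)
            f[i] -= mean[i];
    return out;
}

// Streaming CMN (simplified): each frame is normalized with the mean known
// so far. The first frame can only use `init`, so a poor initial value
// distorts the start of the first utterance.
// Assumes init has the same length as every frame.
std::vector<Frame> cmn_streaming(const std::vector<Frame> &frames, const Frame &init)
{
    Frame mean = init;
    Frame sum(init.size(), 0.0);
    std::size_t seen = 0;
    std::vector<Frame> out;
    for (const Frame &f : frames) {
        Frame normalized(f.size());
        for (std::size_t i = 0; i < f.size(); ++i)
            normalized[i] = f[i] - mean[i];
        out.push_back(normalized);

        ++seen;                      // update the estimate after using it
        for (std::size_t i = 0; i < f.size(); ++i) {
            sum[i] += f[i];
            mean[i] = sum[i] / static_cast<double>(seen);
        }
    }
    return out;
}
```

With the batch version the output does not depend on any initial value, which fits the observation that only the first streamed utterance was affected.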
  
  - Orest - 2015-05-06
    
    Thanks for the link about CMN.
    
    "As I said your decoding is configured to be online (-cmn prior). In this case cmn values are updated during recognition once in a while. The other approach is to update cmn values once for whole utterance (-cmn current)."
    
    I think I am already using -cmn current. This is my full log with -cmninit set in the code:
    
    http://pastebin.com/Px3GxUGk
    
    With "-cmn current", the log still shows "cmn_prior_update" lines. Is this expected behavior?
    
    - bic-user - 2015-05-06
      
      Sorry, I confused you a bit. To update the CMN values once for the whole utterance, you also need to process the utterance at once, not as a stream. Check the API for details. You may use:
      
      long ps_decode_raw(ps_decoder_t *ps, FILE *rawfh, long maxsamps);
      
      or allocate a buffer large enough to read the whole file and set full_utt=TRUE when calling ps_process_raw.
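
For the second option, the whole file has to be in memory before the decoder sees it. A minimal sketch of that step, assuming the file holds raw 16-bit samples; read_all_samples is a hypothetical helper and not part of pocketsphinx:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: read every 16-bit sample of a raw audio file into
// memory. Returns an empty vector if the file cannot be opened.
std::vector<int16_t> read_all_samples(const char *path)
{
    std::vector<int16_t> samples;
    FILE *fh = std::fopen(path, "rb");
    if (fh == NULL)
        return samples;

    int16_t buf[512];
    std::size_t n;
    // fread() returns the number of complete samples read; 0 ends the loop.
    while ((n = std::fread(buf, sizeof(int16_t), 512, fh)) > 0)
        samples.insert(samples.end(), buf, buf + n);

    std::fclose(fh);
    return samples;
}
```

Between ps_start_utt and ps_end_utt, the buffer could then be handed over in a single call such as ps_process_raw(ps, samples.data(), samples.size(), FALSE, TRUE), where the last argument is full_utt.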
      

Orest - 2015-05-06

OK, I changed it to ps_decode_raw, and now I get only one CMN log line for each audio file,

for example:

INFO: cmn.c(183): CMN: 36.07 -9.53 -11.06 7.40 -12.21 -5.34 3.07 -10.89 1.22 -5.18 -1.77 1.30 -2.39

and I get exactly the same result in all iterations, which is awesome. Thank you!

Last edit: Orest 2015-05-06