
Partial hypothesis problem

Matt Hall
Created: 2011-03-25
Updated: 2012-09-22
  • Matt Hall

    Matt Hall - 2011-03-25

    Hi everyone,

    We're trying to do recognition on a longish, ongoing utterance, so we're
    getting hypotheses on a regular basis during that utterance. We are
    encountering a weird situation, though, where the hypothesis doesn't include
    the last few words from the audio, and won't include them until more
    non-silence audio comes in. When we do say something new, the hypothesis
    suddenly contains the missing words in addition to the new one. The strange
    thing is that we can just add some non-silent garbage audio and it will then
    add the words it was missing. So it's as if it has these words correctly
    identified and stored internally, but doesn't think they should be added to
    the hypothesis yet; any new data causes it to reassess and realize it has new
    words to add. Does this make any sense? Is there anything we can do to fix it?
    Our accuracy seems to be good, but we have these weird stalled recognitions
    because the hypothesis isn't up to date.

    Any advice greatly appreciated.
    Matt

     
  • Nickolay V. Shmyrev

    Sorry, which decoder are you talking about?

     
  • Matt Hall

    Matt Hall - 2011-03-28

    We're using a trigram model and the wsj8k acoustic model; is that what you're
    asking? Sorry if I misunderstood.

     
  • Nickolay V. Shmyrev

    I'm asking if you are using pocketsphinx or sphinx3 or something else. It's
    better to provide more information like logs, input data, exact command line,
    decoder versions. That will help to resolve the issue faster.

     
  • John Watkinson

    John Watkinson - 2011-03-28

    Hi, I can field this inquiry (I'm working with Matt on this):

    We are using PocketSphinx on iOS. We are making use of the OpenEars library,
    but have 'hacked' it somewhat to get hypotheses in realtime via polling (as
    opposed to waiting for the end of an utterance). We also turn all silence
    filtering off and are controlling the utterance starts/stops manually. It is
    understood that any given polling call to ps_get_hyp may result in poor
    quality or even null hypotheses. However, what we are finding is that it can
    somehow get 'behind' for a protracted period of time, where even after the
    user has stopped speaking, it is unable to recognize the last 3 or 4 words
    that were said. Even if there is a considerable silence after speaking, it
    will still not recognize these words. Then, as soon as the user starts
    speaking the next word, it suddenly matches the previously-spoken words
    correctly. This is unrelated to utterance starting/stopping, all of this
    happens within a single utterance. We are using a small trigram language
    model.

    Here is what our main code loop looks like (roughly, this is simplified a
    bit):

    ps_decoder_t *pocketSphinxDecoder;

    // ... pocketSphinxDecoder is initialized with default settings and with our
    // trigram language model

    // Initialize audio device and continuous listener (using raw mode), start recording
    pocketsphinxAudioDevice audioDevice = openAudioDevice("device", 16000);
    cont_ad_t *continuousListener = cont_ad_init_rawmode(audioDevice, readBufferContents);
    startRecording(audioDevice);

    // Start utterance
    ps_start_utt(pocketSphinxDecoder, NULL);

    // Main loop
    int32 speechData = 0;
    int32 remainingSpeechData = 0;
    int32 sampleCount = 0;
    for (;;) {
        // Pull any available audio (no silence filtering, since we use raw mode)
        speechData = cont_ad_read(continuousListener, audioDeviceBuffer, SPEECH_BUFFER);
        sampleCount += speechData;
        if (speechData > 0) {
            remainingSpeechData = ps_process_raw(pocketSphinxDecoder, audioDeviceBuffer,
                                                 speechData, FALSE, FALSE);
            // After we've read a threshold amount of data, recognize speech
            if (sampleCount > THRESHOLD) {
                sampleCount = 0;
                char const *hypothesis = ps_get_hyp(pocketSphinxDecoder,
                                                    &recognitionScore, &utteranceID);
                // ... Do things with the hypothesis.
            }
        }
        // ... Stop and restart utterance based on app logic
    }

     
  • Nickolay V. Shmyrev

    Hello guys

    OK, there might be an issue, but I'm not ready to say where it is. It might
    depend on the search method (are you using -fwdflat no?) or on whether you are
    calling

    ad_stop_rec(ad);
    while (ad_read(ad, adbuf, 4096) >= 0);
    cont_ad_reset(cont);

    exactly like in pocketsphinx_continuous.c. To check that, I basically need to
    set up a test case and try to reproduce your problem, which will take some
    time. If you can provide one, that will be helpful.

     
  • John Watkinson

    John Watkinson - 2011-03-31

    Thanks Nickolay. We'll work to get a test case put together that isolates our
    issue. We are using -fwdflat yes.

     
  • Matt Hall

    Matt Hall - 2011-04-01

    Hi Nickolay,

    In the process of writing the test case, I think we have figured out our
    issue. The test case we wrote didn't show the problem we were describing, so
    we were able to track it back to our own code. It came down to an error in how
    we were handling the returned hypothesis, plus some Sphinx configuration that
    wasn't a good idea. Going back to the defaults and handling the hypothesis
    properly seems to have cleared things up. So thanks very much for your
    assistance; it got us to focus on simplifying the problem until we could see
    where it was happening.

    Thanks again,
    Matt

     
  • Nickolay V. Shmyrev

    Nice. I'm glad it was helpful.

     
  • Matt Hall

    Matt Hall - 2011-04-05

    After further investigation via our test cases, we are seeing a big difference
    between the hypothesis returned while the utterance is still open and the
    hypothesis we get once the utterance has been ended. Here's how the test case
    gets interim hypotheses:

    int
    decode_raw_continuous(ps_decoder_t *ps, FILE *rawfh,
                         char const *uttid, long maxsamps)
    {
        long total, pos;
        int32 intscore;
    
        ps_start_utt(ps, uttid);
        printf("Decoding as a stream.\n");
        total = 0;
        while (!feof(rawfh)) {
            int16 data[HYP_GEN_SAMPLE_COUNT];
            size_t nread;
    
            nread = fread(data, sizeof(*data), sizeof(data)/sizeof(*data), rawfh);
            ps_process_raw(ps, data, nread, FALSE, FALSE);
            total += nread;
            char const *inthyp = ps_get_hyp(ps, &intscore, &uttid);
            printf("  Hypothesis at %i: %s\n", total, inthyp);
        }
        ps_end_utt(ps);
        char const *inthyp = ps_get_hyp(ps, &intscore, &uttid);
        printf("  End utterance hypothesis: %s\n", inthyp);
        return total;
    }
    

    The hypotheses that we get during the loop seem to get "stuck" and not
    improve, but when we end the utterance the hypothesis is pretty good. From
    reading and experimenting with the ngram_search code, it seems that it does
    some sort of globally optimal search over all the audio when the utterance is
    ended, whereas the interim hypotheses come from some incremental best path or
    something. Is this correct? Is there an easy way we can force the globally
    optimal search to occur every time without ending the utterance?

    The last hypothesis vs. the hypothesis after the utterance is ended looks like
    (text is from a story which we've made a custom language model and dictionary
    for):

    Hypothesis at 225547: WINKEN BLINKEN AND NOD ONE NIGHT IN A WOODEN SHOE DOWN FROM THE SO CAME THE THE THE
    End utterance hypothesis: WINKEN BLINKEN AND NOD ONE NIGHT LONG IN A WOODEN SHOE SAILED OFF IN A RIVER OF CRYSTAL LIGHT INTO A SEA
    

    You can see it gets stuck down one path of the tree and can't see a better
    path until it ends the utterance and looks at everything again. Or something
    like that.

    We are using all default parameters for pocketsphinx.

    Thanks for any advice!
    Matt

     
  • Nickolay V. Shmyrev

    From reading and experimenting with the ngram_search code, it seems that it
    does some sort of globally optimal search over all the audio when the
    utterance is ended, whereas the interim hypotheses come from some incremental
    best path or something. Is this correct?

    Yes, this is the fwdflat second pass, enabled by -fwdflat yes.

    Is there an easy way we can force the globally optimal search to occur every
    time without ending the utterance?

    That approach is implemented in multisphinx (in our Subversion repository).

    The hypotheses that we get during the loop seem to get "stuck" and not
    improve, but when we end the utterance the hypothesis is pretty good.

    It may look much better in some cases, but overall it's not dramatically
    better. If you set up a test, you'll see that fwdflat adds only about 5-10%
    to accuracy. One approach is to disable fwdflat but widen the beams in the
    fwdtree pass until you reach fwdflat-level accuracy, though that will probably
    make the search slower.
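
    As a rough sketch, that kind of configuration looks something like this. The
    flag names are standard pocketsphinx options, but the beam values and the
    model/LM paths here are only illustrative and would need tuning for your
    setup:

        #include <pocketsphinx.h>

        int main(void)
        {
            /* Skip the fwdflat second pass and widen the fwdtree beams instead.
               Paths and beam values below are placeholders, not recommendations. */
            cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                "-hmm", "/path/to/hub4wsj_sc_8k",
                "-lm", "story.lm",
                "-dict", "story.dic",
                "-fwdflat", "no",
                "-beam", "1e-60",
                "-wbeam", "1e-40",
                NULL);
            ps_decoder_t *ps = ps_init(config);
            if (ps == NULL)
                return 1;

            /* ... ps_start_utt() / ps_process_raw() / ps_get_hyp() as before ... */

            ps_free(ps);
            return 0;
        }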

     
  • Matt Hall

    Matt Hall - 2011-04-06

    I did disable fwdflat and it works as you said it would. It seems you're right
    about the 5-10% accuracy improvement in the case where the acoustic model is
    recognizing words with strong confidence, but on our more marginal test cases
    the hypothesis is quite a bit different (and better).

    As a result I think we have narrowed the problem down to some sort of speaker
    dependent acoustic model issue. It seems that our recognition is much worse
    for female speakers. So related to that:
    1) Is the hub4wsj_sc_8k model biased towards male speakers in some way?
    2) We are using an iPad for recognition; could its microphone be causing poor
    recognition vs. the WSJ model? Perhaps something related to this:
    http://sourceforge.net/tracker/?func=detail&atid=101904&aid=3117707&group_id=1904
    I've tried changing the cmninit setting there, with poor results.
    3) Since we are recognizing a limited dictionary, should we attempt to make or
    modify an acoustic model ourselves?

    Thanks very much again for your help!

    Matt

     
  • Matt Hall

    Matt Hall - 2011-04-06

    And a followup:
    4) How does pocketsphinx deal with silence? Is it something that should be
    showing up in the hypothesis as a recognized word potentially (<sil> or SIL
    for example)? We have never seen it and are wondering if we need to enable it
    in some way.

    Thanks!

     
  • Nickolay V. Shmyrev

    1) Is the hub4wsj_sc_8k model biased towards male speakers in some way?

    Maybe, but there is no evidence of that.

    2) We are using an iPad for recognition; could its microphone be causing poor
    recognition vs. the WSJ model? Perhaps something related to this:
    http://sourceforge.net/tracker/?func=detail&atid=101904&aid=3117707&group_id=1904

    Overall, the iPad microphone is more of a far-field microphone than the
    close-talking microphones that WSJ and HUB4 were collected with. Yes, there is
    a mismatch, but it's not just about CMN; it's about the spectral properties as
    a whole.

    3) Since we are recognizing a limited dictionary, should we attempt to make or
    modify an acoustic model ourselves?

    The conditions under which you need to make or modify a model are described in
    the tutorial:

    http://cmusphinx.sourceforge.net/wiki/tutorialam

    Model adaptation for the iPad is definitely required. I would train a model as
    well, but then the problem is how to collect the data.

    4) How does pocketsphinx deal with silence? Is it something that should be
    showing up in the hypothesis as a recognized word potentially (<sil> or SIL
    for example)? We have never seen it and are wondering if we need to enable it
    in some way.

    Silence is recognized, but it is not included in the hypothesis string. Try
    "-backtrace yes" to see it in the log. You need to dig into the API a bit to
    get at it from code.

     
  • Matt Hall

    Matt Hall - 2011-04-07

    I have successfully adapted the standard acoustic model with a recording that
    was giving poor recognition accuracy. I'm not sure if it's the speaker or the
    iPad microphone, but accuracy has really improved. So that was great advice,
    thanks.

    So now that we know it works, is there a recommended way to incorporate many
    speakers when adapting the acoustic model? Should we do:
    1) Adapt with several wav files for each sentence, one from each speaker - all
    processed together to modify the original acoustic model? Or
    2) Adapt the already adapted model in sequence, using the same steps but with
    the next speaker's data? So we'd end up with a series of adapted models with
    the final one including all speakers.

    I'm just not sure if the adaptation process expects all the wav files it's
    processing to come from the same speaker and recording session.

    Thanks again for all the great advice.

     
  • Matt Hall

    Matt Hall - 2011-04-07

    And one followup question: We're using the recommended acoustic model
    hub4wsj_sc_8k, but does the 8k there indicate it's optimized for telephone
    recognition? I see some things indicating that in other forum posts but
    nothing definitive. Would using this acoustic model on the iPad be the cause
    of our main recognition issues?

    Note: As a test I've tried the WSJ1 model and recognition accuracy was pretty
    bad.

    Thanks!

     
  • Nickolay V. Shmyrev

    So now that we know it works, is there a recommended way to incorporate many
    speakers when adapting the acoustic model?

    First adapt on the whole dataset from all speakers and all conditions. That
    will give you a new generic model. Then adapt the generic model for a
    particular speaker using that speaker's recordings only. That will give you a
    speaker-specific model.

    I'm just not sure if the adaptation process expects all the wav files it's
    processing to come from the same speaker and recording session.

    There is no such requirement

    But does the 8k there indicate it's optimized for telephone recognition?

    No, it doesn't. This model can decode 8 kHz speech, but it isn't specifically
    made for telephone audio.

    Would using this acoustic model on the iPad be the cause of our main
    recognition issues?

    Are there issues? I think you need to work with word error rate figures here.
    hub4wsj is one of the best models available. Of course there are better
    commercial models, but that's another story.

     
  • sarinsukumar

    sarinsukumar - 2011-06-25

    Hi Matt,
    I am also trying to achieve the same thing, but when I use get_hypothesis I am
    getting "nil"; only at the end do I get the proper hypothesis. I have tried
    -fwdflat no, but it made no difference. Can you please advise how you did
    this?

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-28

    Hi Sarinsukumar,

    Could you provide the code for what you have done and a sample output for
    debugging purposes?

     
  • sarinsukumar

    sarinsukumar - 2011-06-29

    Hi Anuj,
    I also found your question on the web; I am really thankful for your help.
    I have tracked down the issue: I was calling ps_process_raw(ps, data, nread,
    TRUE, FALSE). When I used ps_process_raw(ps, data, nread, FALSE, FALSE)
    instead, it worked fine.

    But I would like to get some more advice from you.
    1) When I get a partial hypothesis, I am not getting a probability or
    utterance ID; is there any way to enable that?
    2) I want to run live decoding to check how closely the user is speaking the
    expected words, and stop when they speak a wrong word.
    Note: I expect only one word at a time.
    Example: I am expecting the sentence "Cat met a dog" in that exact sequence,
    and want to indicate when the user speaks a wrong word. I have to stop at
    "Cat" when the user says "Cat bite a dog".

    Thanks in advance

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-29

    When I get partial hypothesis, I am not getting a probability or utterance
    id , is there any way to enable that?

    Did you take a look at the data structures in the ps_lattice_internal.h file?
    I believe ps_get_hyp returns an object of the type ps_lattice_s, which in turn
    has objects of the type ps_latnode_s. Each lattice node should have the ID,
    and a score with it.

    I want to run live decoding to check how closely the user is speaking the
    expected words, and stop when they speak a wrong word.

    You need to compare the partial hypothesis with your expected sentence and
    then make the application stop; however, the tricky part is that each frame in
    speech decoding is 10 ms, whereas a word spans many frames. So, even though
    the partial hypothesis at a particular instant may not have the correct
    output, or may be NULL, it's possible that with some more acoustic information
    you will get the expected output. So your top-level application should make
    those decisions.
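
    As a rough sketch of that top-level check (first_mismatch here is a
    hypothetical helper, not part of pocketsphinx):

        #include <stdio.h>
        #include <string.h>
        #include <strings.h>

        /* Return the index of the first word in hyp that differs from expected,
           or -1 if the (possibly NULL) partial hypothesis is still a prefix of
           the expected sentence. */
        static int
        first_mismatch(char const *hyp, char const *expected)
        {
            char hbuf[512], ebuf[512];
            char *hsave, *esave, *hw, *ew;
            int i = 0;

            if (hyp == NULL)                    /* no partial result yet */
                return -1;
            snprintf(hbuf, sizeof(hbuf), "%s", hyp);
            snprintf(ebuf, sizeof(ebuf), "%s", expected);
            hw = strtok_r(hbuf, " ", &hsave);
            ew = strtok_r(ebuf, " ", &esave);
            while (hw != NULL) {
                if (ew == NULL || strcasecmp(hw, ew) != 0)
                    return i;                   /* wrong or extra word */
                hw = strtok_r(NULL, " ", &hsave);
                ew = strtok_r(NULL, " ", &esave);
                i++;
            }
            return -1;                          /* everything so far matches */
        }

    In the polling loop you would call it on each partial hypothesis, e.g.
    first_mismatch(ps_get_hyp(ps, NULL, &uttid), "CAT MET A DOG"), and stop once
    it reports a mismatch; as noted above, you may want to wait a few more frames
    before deciding, since the partial hypothesis can lag behind the audio.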

     
  • sarinsukumar

    sarinsukumar - 2011-06-29

    Hi Anuj
    I went through that header file once and will definitely look there again.
    But we can pass utteranceID and path-score parameters to ps_get_hypothesis,
    can't we? They are coming back as zero, and the get_probability function also
    returns zero for a partial hypothesis.

    I have seen live decoding APIs for sphinx3 and sphinx4; does pocketsphinx have
    such APIs?
    Is there any documentation available for multisphinx, and can I use it for my
    purpose? Is it still at the development stage?

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-29

    But we can pass utteranceID and path-score parameters to ps_get_hypothesis,
    can't we? They are coming back as zero, and the get_probability function also
    returns zero for a partial hypothesis.

    Look at the recognize_from_microphone function in continuous.c. The call to
    ps_get_hyp there passes NULL for the best score and the address of an
    uninitialized char const * variable uttid. My guess is that when you call
    ps_get_hyp it stores the best score and the utterance ID in those variables
    when it returns. So you don't actually pass in any utteranceID or score
    values, just addresses of variables where you want the function to store those
    values for the best partial hypothesis. Correct me if I'm wrong, though.
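
    In other words, something like this sketch (using the three-argument
    ps_get_hyp signature from the 0.x API):

        int32 score = 0;
        char const *uttid = NULL;
        char const *hyp;

        /* ps_get_hyp() writes the best score and the utterance ID through the
           pointer arguments; you only hand it addresses to fill in. */
        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp != NULL)
            printf("hyp: %s (uttid %s, score %d)\n", hyp, uttid, score);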

    I have seen live decoding APIs for sphinx3 and sphinx4; does pocketsphinx have
    such APIs?

    Not to my knowledge; Nickolay and/or others could confirm.

    Is there any documentation available for multisphinx, and can I use it for my
    purpose? Is it still at the development stage?

    The best documentation on multisphinx is David's thesis.

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-29

    and the get_probability function also returns zero for a partial hypothesis.

    /**
     * Get posterior probability.
     *
     * @note Unless the -bestpath option is enabled, this function will
     * always return zero (corresponding to a posterior probability of
     * 1.0). Even if -bestpath is enabled, it will also return zero when
     * called on a partial result. Ongoing research into effective
     * confidence annotation for partial hypotheses may result in these
     * restrictions being lifted in future versions.
     *
     * @param ps Decoder.
     * @param out_uttid Output: utterance ID for this utterance.
     * @return Posterior probability of the best hypothesis.
     */
    POCKETSPHINX_EXPORT
    int32 ps_get_prob(ps_decoder_t *ps, char const **out_uttid);
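
    So, as a sketch (model and dictionary paths here are placeholders), -bestpath
    has to be enabled when the decoder is created, and ps_get_prob() only returns
    something meaningful once the utterance has been ended:

        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "/path/to/hub4wsj_sc_8k",
            "-lm", "your.lm",
            "-dict", "your.dic",
            "-bestpath", "yes",
            NULL);
        ps_decoder_t *ps = ps_init(config);

        /* ... ps_start_utt(), ps_process_raw() over the whole utterance ... */

        ps_end_utt(ps);
        char const *uttid;
        int32 logprob = ps_get_prob(ps, &uttid);   /* log-domain posterior */
        printf("posterior (log) = %d\n", logprob);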

     
  • sarinsukumar

    sarinsukumar - 2011-07-03

    Hi Anuj,
    Many thanks for the reply.
    I understand now that it won't give a probability measure for partial results.
    I would like to get one more piece of advice.

    My scenario is that I have at most 20-30 words, and I expect a sentence in
    exactly that order, so you could say I expect one word at a time. Do you have
    any advice for increasing the accuracy by changing the parameters, the number
    of HMMs, or the search method?

    Thanks in advance.

     