
Partial hypothesis problem

Matt Hall
Created: 2011-03-25
Updated: 2012-09-22
  • Matt Hall

    Matt Hall - 2011-03-25

    Hi everyone,

    We're trying to do recognition on a longish, ongoing utterance, so we're
    getting hypotheses on a regular basis during that utterance. We are
    encountering a weird situation, though, where the hypothesis doesn't include
    the last few words from the audio, and won't include them until more
    non-silence audio comes in. When we do say something new, the hypothesis
    suddenly contains the missing words in addition to the new one. The strange
    thing is that we can just add some non-silent garbage audio and it will then
    add the words it was missing. So it's as if it has these words correctly
    identified and stored internally, but doesn't think they should be added to
    the hypothesis yet; any new data causes it to reassess and realize it has new
    words to add. Does this make any sense? Is there anything we can do to fix it?
    Our accuracy seems to be good, but we have these weird stalled recognitions
    because the hypothesis isn't up to date.

    Any advice greatly appreciated.
    Matt

     
  • Nickolay V. Shmyrev

    Sorry, which decoder are you talking about?

     
  • Matt Hall

    Matt Hall - 2011-03-28

    We're using a trigram model and the wsj8k acoustic model; is that what you're
    asking? Sorry if I misunderstood.

     
  • Nickolay V. Shmyrev

    I'm asking if you are using pocketsphinx or sphinx3 or something else. It's
    better to provide more information like logs, input data, exact command line,
    decoder versions. That will help to resolve the issue faster.

     
  • John Watkinson

    John Watkinson - 2011-03-28

    Hi, I can field this inquiry (I'm working with Matt on this):

    We are using PocketSphinx on iOS. We are making use of the OpenEars library,
    but have 'hacked' it somewhat to get hypotheses in realtime via polling (as
    opposed to waiting for the end of an utterance). We also turn all silence
    filtering off and are controlling the utterance starts/stops manually. It is
    understood that any given polling call to ps_get_hyp may result in poor
    quality or even null hypotheses. However, what we are finding is that it can
    somehow get 'behind' for a protracted period of time, where even after the
    user has stopped speaking, it is unable to recognize the last 3 or 4 words
    that were said. Even if there is a considerable silence after speaking, it
    will still not recognize these words. Then, as soon as the user starts
    speaking the next word, it suddenly matches the previously-spoken words
    correctly. This is unrelated to utterance starting/stopping, all of this
    happens within a single utterance. We are using a small trigram language
    model.

    Here is what our main code loop looks like (roughly, this is simplified a
    bit):

    ps_decoder_t *pocketSphinxDecoder;

    // ... pocketSphinxDecoder is initialized with default settings and with our
    // trigram language model

    // Initialize audio device and continuous listener (using raw mode), start recording
    pocketsphinxAudioDevice audioDevice = openAudioDevice("device", 16000);
    cont_ad_t *continuousListener = cont_ad_init_rawmode(audioDevice, readBufferContents);
    startRecording(audioDevice);

    // Start utterance
    ps_start_utt(pocketSphinxDecoder, NULL);

    // Main loop
    int32 speechData = 0;
    int32 remainingSpeechData = 0;
    int32 sampleCount = 0;
    for (;;) {
        // Pull any available audio (no silence filtering, since we use raw mode)
        speechData = cont_ad_read(continuousListener, audioDeviceBuffer, SPEECH_BUFFER);
        sampleCount += speechData;
        if (speechData > 0) {
            remainingSpeechData = ps_process_raw(pocketSphinxDecoder, audioDeviceBuffer,
                                                 speechData, FALSE, FALSE);
            // After we've read a threshold amount of data, recognize speech
            if (sampleCount > THRESHOLD) {
                sampleCount = 0;
                char const *hypothesis = ps_get_hyp(pocketSphinxDecoder,
                                                    &recognitionScore, &utteranceID);
                // ... Do things with the hypothesis.
            }
        }
        // ... Stop and restart utterance based on app logic
    }

     
  • Nickolay V. Shmyrev

    Hello guys

    OK, there might be an issue, but I'm not ready to say where it is. It might
    depend on the search method (are you using -fwdflat no?) or on whether you are
    calling

    ad_stop_rec(ad);
    while (ad_read(ad, adbuf, 4096) >= 0);
    cont_ad_reset(cont);

    exactly like in pocketsphinx_continuous.c. To check that, I basically need to
    set up a test case and try to reproduce your problem, which will take some
    time. If you can provide one, that will be helpful.

     
  • John Watkinson

    John Watkinson - 2011-03-31

    Thanks Nickolay. We'll work to get a test case put together that isolates our
    issue. We are using -fwdflat yes.

     
  • Matt Hall

    Matt Hall - 2011-04-01

    Hi Nickolay,

    In the process of writing the test case, I think we have figured out our
    issue. The test case we wrote didn't show the problem we were describing, so
    we were able to track it back to our own code. It came down to an error in how
    we were handling the returned hypothesis, plus some Sphinx configuration that
    wasn't a good idea. Going back to the defaults and handling the hypothesis
    properly seems to have cleared things up. So thanks very much for your
    assistance; it got us to focus on simplifying the problem until we could see
    where it was happening.

    Thanks again,
    Matt

     
  • Nickolay V. Shmyrev

    Nice. I'm glad it was helpful.

     
  • Matt Hall

    Matt Hall - 2011-04-05

    After further investigation via our test cases, we are seeing a big difference
    between the hypothesis returned while the utterance is still open and the
    hypothesis we get once the utterance has been ended. Here's how the test case
    gets interim hypotheses:

    int
    decode_raw_continuous(ps_decoder_t *ps, FILE *rawfh,
                         char const *uttid, long maxsamps)
    {
        long total, pos;
        int32 intscore;
    
        ps_start_utt(ps, uttid);
        printf("Decoding as a stream.\n");
        total = 0;
        while (!feof(rawfh)) {
            int16 data[HYP_GEN_SAMPLE_COUNT];
            size_t nread;
    
            nread = fread(data, sizeof(*data), sizeof(data)/sizeof(*data), rawfh);
            ps_process_raw(ps, data, nread, FALSE, FALSE);
            total += nread;
            char const *inthyp = ps_get_hyp(ps, &intscore, &uttid);
            printf("  Hypothesis at %i: %s\n", total, inthyp);
        }
        ps_end_utt(ps);
        char const *inthyp = ps_get_hyp(ps, &intscore, &uttid);
        printf("  End utterance hypothesis: %s\n", inthyp);
        return total;
    }
    

    The hypotheses that we get during the loop seem to get "stuck" and not
    improve, but when we end the utterance the hypothesis is pretty good. From
    reading and experimenting with the ngram_search code, it seems that it does
    some sort of globally optimal search over all the audio when the utterance is
    ended, whereas the interim hypotheses come from some incremental best path or
    something. Is this correct? Is there an easy way we can force the globally
    optimal search to occur every time without ending the utterance?

    The last hypothesis vs. the hypothesis after the utterance is ended looks like
    (text is from a story which we've made a custom language model and dictionary
    for):

    Hypothesis at 225547: WINKEN BLINKEN AND NOD ONE NIGHT IN A WOODEN SHOE DOWN FROM THE SO CAME THE THE THE
    End utterance hypothesis: WINKEN BLINKEN AND NOD ONE NIGHT LONG IN A WOODEN SHOE SAILED OFF IN A RIVER OF CRYSTAL LIGHT INTO A SEA
    

    You can see it gets stuck down one path of the tree and can't see a better
    path until it ends the utterance and looks at everything again. Or something
    like that.

    We are using all default parameters for pocketsphinx.

    Thanks for any advice!
    Matt

     
  • Nickolay V. Shmyrev

    From reading and experimenting with the ngram_search code, it seems that it
    does some sort of globally optimal search over all the audio when the
    utterance is ended, whereas the interim hypotheses come from some incremental
    best path or something. Is this correct?

    Yes, this is the fwdflat second pass, enabled by -fwdflat yes.

    Is there an easy way we can force the globally optimal search to occur every
    time without ending the utterance?

    That approach is implemented in multisphinx (in our Subversion repository).

    The hypotheses that we get during the loop seem to get "stuck" and not
    improve, but when we end the utterance the hypothesis is pretty good.

    It may look much better in some cases, but overall it's not dramatically
    better. If you set up a test, you'll see that fwdflat adds only about 5-10%
    to accuracy. One approach is to disable fwdflat but widen the beams in the
    fwdtree pass until you reach fwdflat-level accuracy, though that will probably
    make the search slower.
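
    As a rough sketch, that kind of configuration looks something like this. The
    flag names are standard pocketsphinx options, but the beam values and the
    model/LM paths here are only illustrative and would need tuning for your
    setup:

        #include <pocketsphinx.h>

        int main(void)
        {
            /* Skip the fwdflat second pass and widen the fwdtree beams instead.
               Paths and beam values below are placeholders, not recommendations. */
            cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                "-hmm", "/path/to/hub4wsj_sc_8k",
                "-lm", "story.lm",
                "-dict", "story.dic",
                "-fwdflat", "no",
                "-beam", "1e-60",
                "-wbeam", "1e-40",
                NULL);
            ps_decoder_t *ps = ps_init(config);
            if (ps == NULL)
                return 1;

            /* ... ps_start_utt() / ps_process_raw() / ps_get_hyp() as before ... */

            ps_free(ps);
            return 0;
        }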

     
  • Matt Hall

    Matt Hall - 2011-04-06

    I did disable fwdflat and it works as you said it would. It seems you're right
    about the 5-10% accuracy improvement in the case where the acoustic model is
    recognizing words with strong confidence, but on our more marginal test cases
    the hypothesis is quite a bit different (and better).

    As a result I think we have narrowed the problem down to some sort of speaker
    dependent acoustic model issue. It seems that our recognition is much worse
    for female speakers. So related to that:
    1) Is the hub4wsj_sc_8k model biased towards male speakers in some way?
    2) We are using an iPad for recognition; could its microphone be causing poor
    recognition vs. the WSJ model? Perhaps something related to this:
    http://sourceforge.net/tracker/?func=detail&atid=101904&aid=3117707&group_id=1904
    I've tried changing the cmninit setting there, with poor results.
    3) Since we are recognizing a limited dictionary, should we attempt to make or
    modify an acoustic model ourselves?

    Thanks very much again for your help!

    Matt

     
  • Matt Hall

    Matt Hall - 2011-04-06

    And a followup:
    4) How does pocketsphinx deal with silence? Is it something that should be
    showing up in the hypothesis as a recognized word potentially (<sil> or SIL
    for example)? We have never seen it and are wondering if we need to enable it
    in some way.

    Thanks!

     
  • Nickolay V. Shmyrev

    1) Is the hub4wsj_sc_8k model biased towards male speakers in some way?

    Maybe, but there is no evidence of that.

    2) We are using an iPad for recognition; could its microphone be causing poor
    recognition vs. the WSJ model? Perhaps something related to this:
    http://sourceforge.net/tracker/?func=detail&atid=101904&aid=3117707&group_id=1904

    Overall, the iPad microphone is more of a far-field microphone than the
    close-talking microphones that WSJ and HUB4 were collected with. Yes, there is
    a mismatch, but it's not just about CMN; it's about the spectral properties as
    a whole.

    3) Since we are recognizing a limited dictionary, should we attempt to make or
    modify an acoustic model ourselves?

    The conditions under which you need to make or modify a model are described in
    the tutorial:

    http://cmusphinx.sourceforge.net/wiki/tutorialam

    Model adaptation for the iPad is definitely required. I would train a model as
    well, but then the problem is how to collect the data.

    4) How does pocketsphinx deal with silence? Is it something that should be
    showing up in the hypothesis as a recognized word potentially (<sil> or SIL
    for example)? We have never seen it and are wondering if we need to enable it
    in some way.

    Silence is recognized, but it is not included in the hypothesis string. Try
    "-backtrace yes" to see it in the log. You need to dig into the API a bit to
    get at it from code.

     
  • Matt Hall

    Matt Hall - 2011-04-07

    I have successfully adapted the standard acoustic model with a recording that
    was giving poor recognition accuracy. I'm not sure if it's the speaker or the
    iPad microphone, but accuracy has really improved. So that was great advice,
    thanks.

    So now that we know it works, is there a recommended way to incorporate many
    speakers when adapting the acoustic model? Should we do:
    1) Adapt with several wav files for each sentence, one from each speaker - all
    processed together to modify the original acoustic model? Or
    2) Adapt the already adapted model in sequence, using the same steps but with
    the next speaker's data? So we'd end up with a series of adapted models with
    the final one including all speakers.

    I'm just not sure if the adaptation process expects all the wav files it's
    processing to come from the same speaker and recording session.

    Thanks again for all the great advice.

     
  • Matt Hall

    Matt Hall - 2011-04-07

    And one followup question: We're using the recommended acoustic model
    hub4wsj_sc_8k, but does the 8k there indicate it's optimized for telephone
    recognition? I see some things indicating that in other forum posts but
    nothing definitive. Would using this acoustic model on the iPad be the cause
    of our main recognition issues?

    Note: As a test I've tried the WSJ1 model and recognition accuracy was pretty
    bad.

    Thanks!

     
  • Nickolay V. Shmyrev

    So now that we know it works, is there a recommended way to incorporate many
    speakers when adapting the acoustic model?

    First adapt on the whole dataset from all speakers and all conditions. That
    will give you a new generic model. Then adapt the generic model for a
    particular speaker using that speaker's recordings only. That will give you a
    speaker-specific model.

    I'm just not sure if the adaptation process expects all the wav files it's
    processing to come from the same speaker and recording session.

    There is no such requirement

    But does the 8k there indicate it's optimized for telephone recognition?

    No, it doesn't. This model can decode 8 kHz speech, but it isn't specifically
    made for telephone audio.

    Would using this acoustic model on the iPad be the cause of our main
    recognition issues?

    Are there issues? I think you need to work with word error rate figures here.
    hub4wsj is one of the best models available. Of course there are better
    commercial models, but that's another story.

     
  • sarinsukumar

    sarinsukumar - 2011-06-25

    Hi Matt,
    I am also trying to achieve the same thing, but when I use get_hypothesis I am
    getting "nil"; only at the end do I get the proper hypothesis. I have tried
    -fwdflat no, but it made no difference. Can you please advise how you did
    this?

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-28

    Hi Sarinsukumar,

    Could you provide the code for what you have done and a sample output for
    debugging purposes?

     
  • sarinsukumar

    sarinsukumar - 2011-06-29

    Hi Anuj,
    I also found your question on the web; I am really thankful for your help.
    I have tracked down the issue: I was calling ps_process_raw(ps, data, nread,
    TRUE, FALSE). When I used ps_process_raw(ps, data, nread, FALSE, FALSE)
    instead, it worked fine.

    But I would like to get some more advice from you.
    1) When I get a partial hypothesis, I am not getting a probability or
    utterance ID; is there any way to enable that?
    2) I want to run live decoding to check how closely the user is speaking the
    expected words, and stop when they speak a wrong word.
    Note: I expect only one word at a time.
    Example: I am expecting the sentence "Cat met a dog" in that exact sequence,
    and want to indicate when the user speaks a wrong word. I have to stop at
    "Cat" when the user says "Cat bite a dog".

    Thanks in advance

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-29

    When I get partial hypothesis, I am not getting a probability or utterance
    id , is there any way to enable that?

    Did you take a look at the data structures in the ps_lattice_internal.h file?
    I believe ps_get_hyp returns an object of the type ps_lattice_s, which in turn
    has objects of the type ps_latnode_s. Each lattice node should have the ID,
    and a score with it.

    I want to run live decoding to check how closely the user is speaking the
    expected words, and stop when they speak a wrong word.

    You need to compare the partial hypothesis with your expected sentence and
    then make the application stop; however, the tricky part is that each frame in
    speech decoding is 10 ms, whereas a word spans many frames. So, even though
    the partial hypothesis at a particular instant may not have the correct
    output, or may be NULL, it's possible that with some more acoustic information
    you will get the expected output. So your top-level application should make
    those decisions.
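
    As a rough sketch of that top-level check (first_mismatch here is a
    hypothetical helper, not part of pocketsphinx):

        #include <stdio.h>
        #include <string.h>
        #include <strings.h>

        /* Return the index of the first word in hyp that differs from expected,
           or -1 if the (possibly NULL) partial hypothesis is still a prefix of
           the expected sentence. */
        static int
        first_mismatch(char const *hyp, char const *expected)
        {
            char hbuf[512], ebuf[512];
            char *hsave, *esave, *hw, *ew;
            int i = 0;

            if (hyp == NULL)                    /* no partial result yet */
                return -1;
            snprintf(hbuf, sizeof(hbuf), "%s", hyp);
            snprintf(ebuf, sizeof(ebuf), "%s", expected);
            hw = strtok_r(hbuf, " ", &hsave);
            ew = strtok_r(ebuf, " ", &esave);
            while (hw != NULL) {
                if (ew == NULL || strcasecmp(hw, ew) != 0)
                    return i;                   /* wrong or extra word */
                hw = strtok_r(NULL, " ", &hsave);
                ew = strtok_r(NULL, " ", &esave);
                i++;
            }
            return -1;                          /* everything so far matches */
        }

    In the polling loop you would call it on each partial hypothesis, e.g.
    first_mismatch(ps_get_hyp(ps, NULL, &uttid), "CAT MET A DOG"), and stop once
    it reports a mismatch; as noted above, you may want to wait a few more frames
    before deciding, since the partial hypothesis can lag behind the audio.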

     
  • sarinsukumar

    sarinsukumar - 2011-06-29

    Hi Anuj
    I went through that header file once and will definitely look there again.
    But we can pass utteranceID and path-score parameters to ps_get_hypothesis,
    can't we? They are coming back as zero, and the get_probability function also
    returns zero for a partial hypothesis.

    I have seen live decoding APIs for sphinx3 and sphinx4; does pocketsphinx have
    such APIs?
    Is there any documentation available for multisphinx, and can I use it for my
    purpose? Is it still at the development stage?

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-29

    But we can pass utteranceID and path-score parameters to ps_get_hypothesis,
    can't we? They are coming back as zero, and the get_probability function also
    returns zero for a partial hypothesis.

    Look at the recognize_from_microphone function in continuous.c. The call to
    ps_get_hyp there passes NULL for the best score and the address of an
    uninitialized char const * variable uttid. My guess is that when you call
    ps_get_hyp it stores the best score and the utterance ID in those variables
    when it returns. So you don't actually pass in any utteranceID or score
    values, just addresses of variables where you want the function to store those
    values for the best partial hypothesis. Correct me if I'm wrong, though.
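
    In other words, something like this sketch (using the three-argument
    ps_get_hyp signature from the 0.x API):

        int32 score = 0;
        char const *uttid = NULL;
        char const *hyp;

        /* ps_get_hyp() writes the best score and the utterance ID through the
           pointer arguments; you only hand it addresses to fill in. */
        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp != NULL)
            printf("hyp: %s (uttid %s, score %d)\n", hyp, uttid, score);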

    I have seen live decoding APIs for sphinx3 and sphinx4; does pocketsphinx have
    such APIs?

    Not to my knowledge; Nickolay and/or others could confirm.

    Is there any documentation available for multisphinx, and can I use it for my
    purpose? Is it still at the development stage?

    The best documentation on multisphinx is David's thesis.

     
  • Anuj Kumar

    Anuj Kumar - 2011-06-29

    and the get_probability function also returns zero for a partial hypothesis.

    /**
     * Get posterior probability.
     *
     * @note Unless the -bestpath option is enabled, this function will
     * always return zero (corresponding to a posterior probability of
     * 1.0). Even if -bestpath is enabled, it will also return zero when
     * called on a partial result. Ongoing research into effective
     * confidence annotation for partial hypotheses may result in these
     * restrictions being lifted in future versions.
     *
     * @param ps Decoder.
     * @param out_uttid Output: utterance ID for this utterance.
     * @return Posterior probability of the best hypothesis.
     */
    POCKETSPHINX_EXPORT
    int32 ps_get_prob(ps_decoder_t *ps, char const **out_uttid);
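
    So, as a sketch (model and dictionary paths here are placeholders), -bestpath
    has to be enabled when the decoder is created, and ps_get_prob() only returns
    something meaningful once the utterance has been ended:

        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "/path/to/hub4wsj_sc_8k",
            "-lm", "your.lm",
            "-dict", "your.dic",
            "-bestpath", "yes",
            NULL);
        ps_decoder_t *ps = ps_init(config);

        /* ... ps_start_utt(), ps_process_raw() over the whole utterance ... */

        ps_end_utt(ps);
        char const *uttid;
        int32 logprob = ps_get_prob(ps, &uttid);   /* log-domain posterior */
        printf("posterior (log) = %d\n", logprob);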

     
  • sarinsukumar

    sarinsukumar - 2011-07-03

    Hi Anuj,
    Many thanks for the reply.
    I understand now that it won't give a probability measure for partial results.
    I would like to get one more piece of advice.

    My scenario is that I have at most 20-30 words, and I expect a sentence in
    exactly that order, so you could say I expect one word at a time. Do you have
    any advice for increasing the accuracy by changing the parameters, the number
    of HMMs, or the search method?

    Thanks in advance.

     