Hi,
How do you get a partial hypothesis from the decoder in PocketSphinx, i.e. get
the hypothesis while the user is still speaking? Is there an example and/or
link that covers this for PocketSphinx? I know that PocketSphinx supports
this, but I don't know how to do it.
Thanks for any help in advance,
Anuj
You can get the partial hypothesis during recognition, before the utterance
has ended, with the same function, ps_get_hyp.
http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html#partial_results
From the above link I found:
It is possible to configure Sphinx-4 to generate partial results, that is, to
inform you periodically of what it thinks is the best possible hypothesis so
far, even before the user has stopped speaking. To get this information, add
a result listener to the recognizer. Your listener will receive a result
(which may or may not be a final result). The hypothesis text can be
extracted from the result. There is a good example of this in
sphinx4/tests/live/Live.java.
You can control how often the result listener is fired by setting the
configuration variable 'featureBlockSize' in the decoder. The default setting
of 50 means that the listener is called after every 50 frames. Since each
frame represents 10 ms of speech, the listener is called every 500 ms.
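The timing claim above is simple arithmetic; as a quick sketch (frame length of 10 ms as stated in the FAQ; the function name is mine):

```python
def listener_interval_ms(feature_block_size, frame_ms=10):
    """How often the result listener fires, in milliseconds:
    one callback per block of `feature_block_size` frames."""
    return feature_block_size * frame_ms

if __name__ == "__main__":
    print(listener_interval_ms(50))   # the default block size -> 500 ms
```

So lowering featureBlockSize makes the listener fire more often, at the cost of more frequent (partial) searches.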
Do I have featureBlockSize available in PocketSphinx? I tried to find it but
couldn't.
I am currently getting partial results, but only after a 5 to 6 second delay,
whereas ideally I would like to get them near-instantaneously.
Any help?
Regards,
Shamsa.
PocketSphinx uses "push" processing, so you decide yourself how many frames
to push into the decoder. Sphinx-4 uses "pull" processing, which is why it
has the feature block size parameter. In PocketSphinx you just call
ps_process_raw with the required number of frames, and then call ps_get_hyp
to get a partial result as soon as those frames are processed. You decide
when the utterance ends yourself, with the flag to ps_process_raw.
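The push pattern described above can be sketched as a runnable mock. MockDecoder below merely counts frames and fakes one recognized word per 50 frames pushed; it stands in for the real ps_start_utt / ps_process_raw / ps_get_hyp / ps_end_utt calls, so only the control flow (push a buffer, poll the partial hypothesis) is representative, not the recognition itself:

```python
class MockDecoder:
    """Toy stand-in for a PocketSphinx decoder: it counts frames and
    pretends to recognize one fixed word per 50 frames pushed."""
    WORDS = ["can", "you", "send", "me"]

    def __init__(self):
        self.frames = 0
        self.in_utt = False

    def start_utt(self):                     # cf. ps_start_utt()
        self.frames = 0
        self.in_utt = True

    def process_raw(self, samples):          # cf. ps_process_raw()
        # The caller pushes whatever amount of audio it has available;
        # 160 samples = one 10 ms frame at 16 kHz.
        assert self.in_utt
        self.frames += len(samples) // 160

    def get_hyp(self):                       # cf. ps_get_hyp(), valid mid-utterance
        n = min(self.frames // 50, len(self.WORDS))
        return " ".join(self.WORDS[:n]) or None

    def end_utt(self):                       # cf. ps_end_utt()
        self.in_utt = False


def run_utterance(audio_buffers):
    """Push buffers one at a time and collect the changing partial hypotheses."""
    dec = MockDecoder()
    dec.start_utt()
    partials = []
    for buf in audio_buffers:
        dec.process_raw(buf)
        hyp = dec.get_hyp()
        if hyp is not None and (not partials or hyp != partials[-1]):
            partials.append(hyp)
    dec.end_utt()
    return partials


if __name__ == "__main__":
    # Four buffers of 8000 samples (0.5 s each) -> 50 frames per buffer.
    buffers = [[0] * 8000 for _ in range(4)]
    print(run_utterance(buffers))
    # ['can', 'can you', 'can you send', 'can you send me']
```

With PocketSphinx itself the loop is the same: each ps_process_raw call is followed by a ps_get_hyp poll, and you control the latency entirely by how much audio you push per call.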
I am getting a partial hypothesis of null more often than a proper word
during voice input.
Maybe you are not setting the do_search parameter of ps_process_raw to TRUE.
It must be TRUE.
This would mean that, in the case of null, no matches have been found for the
current block.
That is a normal situation.
What if I want to get a partial hypothesis based on a character I supply?
Would it be possible for ps_process_raw to return one or more matches from
the partial hypothesis result starting with the character 'a'? I sort of want
a filtered partial hypothesis based on a character parameter. Am I headed in
the right direction?
If you want to place restrictions on the results, you need to set them
through the grammar, not through the code.
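For instance, a hypothetical JSGF grammar that only admits a handful of words starting with 'a' (the grammar name and word list here are invented for illustration; a real application would list its actual vocabulary) could look like:

```jsgf
#JSGF V1.0;
grammar a_words;
public <a_word> = apple | answer | android | audio;
```

With such a grammar loaded, the decoder can only ever hypothesize those words, which achieves the filtering in the model rather than in application code.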
http://www.youtube.com/watch?v=OEUeJb6Pwt4
At the above link is a video by David Huggins showing the very thing I am
trying to do on Android. In that video you can see partial results as the
user proceeds with his speech.
What amazes me is that:
1) It is very fast (with a vocabulary of 4000 words).
2) It gets a result every time, when in some cases even null can be returned.
3) During speech the user can pick suggestions from a popup as corrections.
Where do these come from? A lattice? I tried to get a lattice while in
listening mode, but apparently it is not available until the utterance is
completed.
Please guide me on how to get results from the decoder during speech, and on
the suggestions as well. I need to figure out the logic behind this, that is,
what is happening here?
Yes, PocketSphinx can run very fast.
Usually, once you have passed some speech into the recognizer, there is a
partial result.
Yes. In the video the variants are accessed when the utterance is already
completed (after a pause).
You said the variants are accessed after a pause. How do you make/identify a
PAUSE? I used ps.endUtterance() during my listening mode; it didn't give me
any results.
Right now I am trying to get useful results from my partial hypothesis. I
start recognition and say:
"can you send me"
With the following code, what I get as partial hypotheses is
"can"
and then
"can you"
and then
"can you an"
However, I would actually like to have it like
"can"
and then
"you"
and then
"send"
How can I break up my partial hypothesis? I tried stopping the utterance and
restarting it at the end, but it didn't work. I tried stopping the audio
(before processRaw) and starting a new one, and that didn't help either. How
does this code need to be changed? So basically, during the listening state I
want my hypstring to contain only the latest word that I have spoken. I also
tried accessing the n-best list from the decoder at this stage, but couldn't.
if (state == State.LISTENING) {
    assert this.audio != null;
    try {
        // Take the next audio buffer off the queue and push it to the decoder.
        short[] buf = this.audioq.take();
        Log.d(getClass().getName(), "Reading " + buf.length + " samples from queue");
        this.ps.processRaw(buf, buf.length, false, false);
        Hypothesis hyp = this.ps.getHyp();
        if (hyp != null) {
            String hypstr = hyp.getHypstr();
            // Compare string contents, not references, so the listener only
            // fires when the partial hypothesis has actually changed.
            if (!hypstr.equals(partial_hyp)) {
                if (this.rl != null) {
                    Bundle b = new Bundle();
                    b.putString("hyp", hypstr);
                    this.rl.onPartialResults(b);
                }
            }
            partial_hyp = hypstr;
        }
    } catch (InterruptedException e) {
        Log.d(getClass().getName(), "Interrupted in audioq.take");
    }
}
A pause is identified by the endpointer from the energy of the signal (when
cont_ad_read returns 0 for some amount of time). After the pause is
identified, the utterance is ended with ps_end_utt.
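A minimal sketch of that idea: declare a pause once the frame energy has stayed below a threshold for some run of frames. The threshold and frame counts below are arbitrary, and the real cont_ad endpointer is more sophisticated (it calibrates itself to the background noise), so this only illustrates the principle:

```python
def detect_pause(frames, energy_threshold=100.0, silence_frames=30):
    """Return the index of the frame at which a pause is declared, i.e.
    the mean power has stayed below `energy_threshold` for
    `silence_frames` consecutive frames, or None if no pause occurs.
    Each element of `frames` is a list of audio samples."""
    silent_run = 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)  # mean power
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= silence_frames:
                return i
        else:
            silent_run = 0
    return None

if __name__ == "__main__":
    speech = [[500] * 160] * 50   # loud frames, mean power 250000
    silence = [[1] * 160] * 40    # quiet frames, mean power 1
    print(detect_pause(speech + silence))   # pause declared at frame 79
```

In the PocketSphinx flow, the moment detect_pause fires corresponds to the point where you would call ps_end_utt and read the final hypothesis.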
This requires modification of the PocketSphinx source. It is also a fuzzily
stated problem, because you don't quite understand how the search is
performed to create a hypothesis. The last word spoken can only be identified
after some delay, with a partial traceback. You can read more about search in
this book:
http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
Specifically you could read about:
In the sample above the partial result "can you an" is retrieved. Is it
possible that any part of that changes in the next partial result? E.g.
"canyon blah blah"?
Yes, sure. A partial traceback is exactly what is supposed to return the
"certain" part; everything else may change.
OK. So the result isn't simply appended to the previous result?
E.g. NOT this:
"Hi"
"Hi there"
"Hi there little"
and so on, but this:
"Hi"
"High the"
"Hi there"
"Hi there lit"
Decoding considers many thousands of candidate sequences at once, not just
one best sequence. You can check
https://en.wikipedia.org/wiki/Viterbi_algorithm for details.
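To make the "many hypotheses at once" idea concrete, here is a minimal Viterbi decoder over a toy two-state HMM. The states and probabilities are invented for illustration; a real recognizer does the same bookkeeping over a vastly larger state space, which is why the best path can still change as more audio arrives:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations `obs`.
    At every time step the decoder keeps a score for *every* state
    (the "many hypotheses at once"), then backtracks the single best path."""
    # V[t][s] = (best score of any path ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                       prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Made-up model: is each frame "speech" or "silence"?
STATES = ("speech", "silence")
START = {"speech": 0.5, "silence": 0.5}
TRANS = {"speech": {"speech": 0.8, "silence": 0.2},
         "silence": {"speech": 0.2, "silence": 0.8}}
EMIT = {"speech": {"loud": 0.9, "quiet": 0.1},
        "silence": {"loud": 0.2, "quiet": 0.8}}

if __name__ == "__main__":
    print(viterbi(["loud", "loud", "quiet", "quiet"],
                  STATES, START, TRANS, EMIT))
    # ['speech', 'speech', 'silence', 'silence']
```

Note that until the final backtrack, every state still carries a live score, which is exactly why a partial best path is not guaranteed to survive into the final result.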
Thanks. I'm just wondering how stable this "partial hypothesis" is. I was
considering building some logic around the received partial hypotheses, but
if they can change completely, that means they are not stable.
If you want a stable partial hypothesis you need to implement a partial
traceback (this is not implemented in PocketSphinx).
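Short of modifying the decoder, a crude client-side approximation (not a true partial traceback, which works on the decoder's internal search state) is to treat only the words agreed on by the last few partial hypotheses as stable. A hypothetical sketch:

```python
def stable_prefix(partials, window=3):
    """Return the word-level prefix shared by the last `window` partial
    hypotheses; only this part is treated as 'stable'."""
    recent = [p.split() for p in partials[-window:]]
    prefix = []
    for words in zip(*recent):           # walk word positions in lockstep
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break                        # first disagreement ends the stable part
    return prefix

if __name__ == "__main__":
    print(stable_prefix(["can", "can you", "can you an"]))       # ['can']
    # If a later partial revises the start ("canyon ..."), nothing is stable:
    print(stable_prefix(["can you", "can you an", "canyon blah"]))  # []
```

This only approximates stability: as the thread shows, even an agreed-on prefix can in principle still be revised by the decoder.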
Ok. Thanks for your help.