Occasionally, when stopping and then immediately restarting a listening task, PocketSphinx (for Android) seems to hang for ~40 seconds before throwing the titular error and then recovering successfully. This is probably a bug that should be fixed. The log around the event is below, with my comments inserted:
Any suggestions? Is there a different place I should submit this (e.g. Github)?
I am having an eerily similar problem. Will post logs as well.
Bumping this thread. Were you able to figure this out?
No I was not. I hopefully will hear back from someone on the project soon.
Bump.
I'm having a similar problem.
I'm working with pre-recorded files. First, I perform VAD using WebRTC. Then I feed the "active" audio segments to PocketSphinx for word recognition -- every segment as a separate utterance. Sometimes, WebRTC will give a false positive and detect a short sequence of background noise (about 0.1s) as active. When I pass this audio sequence to PocketSphinx, it reproducibly logs the error you mentioned.
What's worse, it seems that at this point, some internal decoder state is getting corrupted: the next utterance I feed to PocketSphinx is always detected as garbage (something like "[UM] <sil> you <sil> know" instead of "marley was dead"). After that, the decoder seems to have recovered: the following utterances are usually recognized just fine.
Here's a shortened version of what I'm doing:

    for (/* each VAD segment... */) {
        ps_start_stream(decoder);
        ps_start_utt(decoder);
        while (/* more data... */) {
            ps_process_raw(decoder, /* data */);
        }
        ps_end_utt(decoder);
        for (ps_seg_t *it = ps_seg_iter(decoder); it; it = ps_seg_next(it)) {
            /* processing */
        }
    }
My actual code checks every single function call for failure. There is no error until the call to ps_seg_iter to get the recognition result. This is where the actual error occurs. Here is the relevant call stack:

    find_start_node(ngram_search_s * ngs, ps_lattice_s * dag) Line 1139
    ngram_search_lattice(ps_search_s * search) Line 1244
    ngram_search_seg_iter(ps_search_s * search) Line 1016
    ps_seg_iter(ps_decoder_s * ps) Line 1216

Please let me know if there is any more information I can supply!
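As an aside, one possible pre-filter -- not something the code above does -- would be to treat VAD segments below a minimum duration as false positives and never hand them to the decoder at all. A rough sketch; the 16 kHz sample rate and the 200 ms threshold are assumptions, not values from this thread:

    #include <stddef.h>

    /* Hypothetical pre-filter: drop VAD segments that are too short to
     * plausibly contain a word. Sample rate and threshold are assumptions. */
    #define SAMPLE_RATE          16000
    #define MIN_SEGMENT_MS       200
    #define MIN_SEGMENT_SAMPLES  ((size_t)SAMPLE_RATE * MIN_SEGMENT_MS / 1000)

    static int segment_is_plausible_speech(size_t n_samples)
    {
        return n_samples >= MIN_SEGMENT_SAMPLES;
    }

Such a check would wrap the ps_start_utt/ps_end_utt block in the loop above, so that suspiciously short segments are simply skipped.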
Last edit: Daniel Wolf 2016-06-28
I did some more digging. find_start_node (in ngram_search.c) doesn't find any nodes and logs that as an error (accompanied by the comment "This is probably impossible."). dag->nodes is NULL. ngram_search_lattice calls create_dag_nodes beforehand, and create_dag_nodes only adds nodes if ngs->bpidx > 0. This isn't the case; it's 0.

So I'm wondering two things:
This is exactly where we are in our thinking. Shouldn't the case of no words recognized be valid? I suppose it could be a philosophical dilemma for a speech recognizer to recognize "not speech"...
We've been trying to find a workaround (calling start/stop less often, modifying LM data), but have had no luck thus far.
So far, I couldn't find a workaround, either. What I tried is checking ngs->bpidx right after ps_end_utt. If it's 0, I skip the ps_seg_iter call because I know there won't be any result.

This gets rid of the error message, of course. But the bigger problem still remains -- the following utterance is detected as garbage.
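For comparison, here is a rough sketch of an equivalent guard that stays within the public API, checking ps_get_hyp() instead of reading the internal ngs->bpidx field. The assumption (not verified against the sources) is that ps_get_hyp() returns NULL whenever nothing was recognized, so ps_seg_iter() is never reached in the empty-lattice case:

    #include <stdio.h>
    #include <pocketsphinx.h>

    /* Only walk the segment iterator when a hypothesis exists. */
    static void print_segments_if_any(ps_decoder_t *decoder)
    {
        int32 score;

        if (ps_get_hyp(decoder, &score) == NULL)
            return; /* nothing recognized, e.g. the segment was only noise */

        for (ps_seg_t *it = ps_seg_iter(decoder); it; it = ps_seg_next(it)) {
            int sf, ef;
            ps_seg_frames(it, &sf, &ef);
            printf("%s [%d..%d]\n", ps_seg_word(it), sf, ef);
        }
    }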
One theory I have is that after processing an "utterance" containing only background noise, the decoder's noise statistics are way off, so it takes some time for them to re-calibrate. If that's the case, maybe there is a way to reset these values after a 0-word recognition.
I'm running out of ideas, though. Maybe Nickolay can help?
I've implemented a rather ugly hack: right after ps_end_utt, I check ngs->bpidx. If it's 0, I skip the ps_seg_iter call as described above. In addition, I then discard the decoder altogether and re-load it.

Re-loading the decoder takes some time, but at least it means that the next utterance will be recognized correctly.
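For reference, a minimal sketch of that re-load step, assuming config is the cmd_ln_t the decoder was originally created with, and with error handling reduced to a placeholder:

    #include <pocketsphinx.h>

    /* Discard the decoder and create a fresh one from the same configuration.
     * ps_reinit(decoder, NULL), which re-initializes the existing decoder with
     * its current configuration, might be a cheaper alternative, but whether
     * it also clears the problematic state has not been verified. */
    static ps_decoder_t *reload_decoder(ps_decoder_t *decoder, cmd_ln_t *config)
    {
        ps_free(decoder);
        decoder = ps_init(config);
        if (decoder == NULL) {
            /* handle initialization failure */
        }
        return decoder;
    }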
Nickolay, please let me know what you think on the matter. There must be a simple way of fixing this and I'm willing to help in any way I can!
Daniel,
How long does the reloading take? Are we talking milliseconds, seconds, or tens of seconds?
Thanks!
Last edit: Dan Yaeger 2016-07-05
I'd say one or two seconds, but I didn't measure it.
Thanks. We'll try that and see how it affects the behavior we've been experiencing.
Daniel:
After looking at your fixes and investigating a bit further from there, we were able to determine that our issue, a lengthy pause of 30+ seconds followed by the ngram_search.c error, was being caused by Android's AudioRecord freezing during AudioRecord.read. The behavior only seems to appear on Samsung devices running stock Android L, which is what our main development device uses.
Thanks for your help on this! Sorry that we couldn't be more helpful with your issue.
Dan,
It's good to hear that you found the cause of your problem and a solution! Now I'm hoping for a word from Nickolay regarding the decoder problem. It's uncommon for him not to reply within days.
Dear Daniel, I'll check this issue; it will just take a little more time to set up the test to reproduce it.
Hi Nickolay,
That's great to hear! Please let me know if there's anything I can do to help.
Hi Nickolay,
I just saw your commit "Properly finalizes utt even if there was no speech". Is that the fix for this issue?
No, sorry, that was a different thing in gstreamer.
Pity. And I thought I could toss out my workaround.
Now I can indeed remove my workaround. It seems that at least in my case, the problem was this:
For me, the solution was to switch to 'batch' mode instead of 'continuous' mode. See this discussion for details.
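In case it helps anyone who lands here: if 'batch' mode means decoding each pre-recorded segment in a single pass instead of streaming it in small chunks, a minimal sketch would look like the one below. That interpretation is an assumption on my part; the linked discussion has the actual details.

    #include <stddef.h>
    #include <stdint.h>
    #include <pocketsphinx.h>

    /* Sketch, assuming "batch" means one ps_process_raw call per segment with
     * full_utt = TRUE rather than a streaming loop over small buffers. */
    static const char *decode_whole_segment(ps_decoder_t *decoder,
                                            const int16_t *samples,
                                            size_t n_samples)
    {
        int32 score;

        ps_start_utt(decoder);
        ps_process_raw(decoder, samples, n_samples, FALSE, TRUE);
        ps_end_utt(decoder);
        return ps_get_hyp(decoder, &score); /* NULL if nothing was recognized */
    }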