Some background:
We have been using PocketSphinx quite successfully for about three years in various applications, including voice commands (using n-gram search and grammar mapping), keyword spotting for a wakeup word and voice commands, and some limited dictation (n-gram search).
We are currently building an eyes-free VUI for completing forms, built around a wakeup-word and voice-command dialogue workflow. However, some form fields require longer text dictation, so we have trained a language model to cover our specific domain, and the results from it are quite good. For these fields we use a "wakeup word" recognizer to control audio input and recognition state (receiving speech, endOfUtterance, etc.), and we pass the raw audio on to the n-gram recognizer.
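To make that concrete, here is a simplified, sanitized sketch of the two-recognizer setup (not our production code; the model paths, file names, search name, and keyphrase are placeholders): the SpeechRecognizer drives the microphone and recognition state via a keyphrase search, while a separate Decoder loaded with our domain language model is used only for the n-gram pass over raw utterance audio.

```java
import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.Config;
import edu.cmu.pocketsphinx.Decoder;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

public class Recognizers {
    // Placeholder search name and keyphrase, not our real wakeup word.
    public static final String WAKEUP_SEARCH = "wakeup";
    public static final String KEYPHRASE = "hey assistant";

    // Live recognizer: owns the mic, handles wakeup word and state callbacks.
    public static SpeechRecognizer buildLiveRecognizer(File assetsDir) throws IOException {
        SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
                .getRecognizer();
        recognizer.addKeyphraseSearch(WAKEUP_SEARCH, KEYPHRASE);
        return recognizer;
    }

    // Offline decoder: loaded with the domain LM, fed raw audio per utterance.
    public static Decoder buildNgramDecoder(File assetsDir) {
        Config config = Decoder.defaultConfig();
        config.setString("-hmm", new File(assetsDir, "en-us-ptm").getPath());
        config.setString("-dict", new File(assetsDir, "cmudict-en-us.dict").getPath());
        config.setString("-lm", new File(assetsDir, "domain.lm").getPath());
        return new Decoder(config);
    }
}
```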
Our issue is that users tend to speak to the system in many separate utterances, adding up to as much as 1-2 minutes of total time. For efficiency's sake, we'd like to decode results from the language model at the end of each smaller utterance. To do this, on endOfSpeech we call stopListening and queue the raw audio for processing in the n-gram recognizer. We then begin listening again and repeat the process, continuously queuing audio for each utterance.
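Here is a simplified sketch of that hand-off (again, placeholders only, not our production code): how we capture the raw PCM for each utterance is omitted, and the class, method, and search names are illustrative. On endOfSpeech we stop the live recognizer, queue the utterance audio, and restart listening; a worker thread drains the queue and runs each buffer through the n-gram Decoder.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import edu.cmu.pocketsphinx.Decoder;
import edu.cmu.pocketsphinx.Hypothesis;
import edu.cmu.pocketsphinx.SpeechRecognizer;

public class UtteranceQueue {
    private final BlockingQueue<short[]> pending = new LinkedBlockingQueue<>();
    private final SpeechRecognizer recognizer; // live wakeup-word recognizer
    private final Decoder ngramDecoder;        // loaded with the domain LM

    public UtteranceQueue(SpeechRecognizer recognizer, Decoder ngramDecoder) {
        this.recognizer = recognizer;
        this.ngramDecoder = ngramDecoder;
        startWorker();
    }

    // Called from our RecognitionListener.onEndOfSpeech(); capturing the raw
    // utterance audio itself is not shown here.
    public void onUtteranceEnded(short[] rawUtteranceAudio) {
        recognizer.stopListening();           // this call is where we lose time
        pending.offer(rawUtteranceAudio);     // hand the audio to the worker
        recognizer.startListening("wakeup");  // placeholder search name; resume capture
    }

    private void startWorker() {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    short[] audio = pending.take();
                    // Decode the full queued utterance with the domain LM.
                    ngramDecoder.startUtt();
                    ngramDecoder.processRaw(audio, audio.length, false, true);
                    ngramDecoder.endUtt();
                    Hypothesis hyp = ngramDecoder.hyp();
                    if (hyp != null) {
                        handleResult(hyp.getHypstr());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "ngram-decode");
        worker.setDaemon(true);
        worker.start();
    }

    private void handleResult(String text) {
        // Fill the form field, append to the running transcript, etc.
    }
}
```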
However, when doing this we have noticed that, because of the time stopListening takes to complete, we are losing parts of each following utterance. Is it possible to queue each utterance successfully without the user having to wait for the recognizer to be ready?
I'm limited in how much code I can share due to the sensitive nature of the project (the sketches above are sanitized placeholders), but I can provide further isolated snippets if that would help.
Thanks in advance for any assistance!
Bumping this post for visibility. Surprised that no one is using PocketSphinx for transcription in this manner.