I have a small-vocabulary application (about 30 words) and I've been experimenting with PocketSphinx. Based on my preliminary tests, I've achieved very low error rates using a JSGF grammar and reasonably low error rates using a custom statistical language model I built with LMTool (which I must say is a fantastic resource!).
My biggest concern at this point is "controlling" the error. My application has a fairly low tolerance for incorrect predictions: I would much rather discard some predictions entirely so that the ones I do act on are more accurate. I hope that makes sense.
To further illustrate, I'm considering using some form of confidence threshold, and simply ignoring any predictions from PocketSphinx that fall below the threshold. It seems that PocketSphinx exposes probability, confidence, and score attributes with predictions. I can only make vague guesses at what these values represent, so I would appreciate any insight into what these mean.
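For concreteness, here is roughly what I have in mind, written against the pocketsphinx-python Decoder bindings. The model paths and the threshold value are just placeholders, and I'm only guessing at what the attributes mean from the examples I've seen:

    # Rough sketch of the thresholding idea, using the pocketsphinx-python Decoder API.
    # Paths and the threshold value are placeholders for illustration only.
    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/acoustic-model')   # placeholder path
    config.set_string('-lm', 'commands.lm')                # the LMTool-built LM
    config.set_string('-dict', 'commands.dic')
    decoder = Decoder(config)

    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:                 # 16 kHz, 16-bit mono raw audio
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()

    hyp = decoder.hyp()
    if hyp is not None:
        logmath = decoder.get_logmath()
        # hyp.hypstr     -> the recognized text
        # hyp.best_score -> combined acoustic/LM path score (decoder-internal log units)
        # hyp.prob       -> posterior probability of the hypothesis, in the log domain
        print(hyp.hypstr, hyp.best_score, hyp.prob, logmath.exp(hyp.prob))
        LOG_PROB_THRESHOLD = -5000   # made-up value; would have to be tuned on real recordings
        if hyp.prob > LOG_PROB_THRESHOLD:
            print('accept:', hyp.hypstr)
        else:
            print('reject as low confidence')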
And that brings me to my main question: Is my proposed approach likely to do what I want? Is there some other way to accomplish this?
For what it's worth, I'm using PocketSphinx now, but I can move to Sphinx4 if that will provide an advantage in this situation.
Thanks for your time!
My team has been very successful with that approach, although small LMs will still give you trouble with false-positive recognitions, particularly on short commands. We relied on the probability score specifically, which generally tracked best: higher scores for on-target utterances and lower scores for OOV utterances.
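In case it's useful, the way we thought about picking the cutoff was simply to compare score distributions on labeled audio. This is only a simplified illustration of that idea, not our production code; the decode_prob callable and the file layout are hypothetical, and it assumes decode_prob(path) returns hyp.prob for one recording:

    import glob

    def collect_probs(audio_glob, decode_prob):
        # decode_prob(path) is assumed to return hyp.prob (log posterior) for one file.
        return [decode_prob(path) for path in glob.glob(audio_glob)]

    def pick_threshold(in_grammar_probs, oov_probs):
        # Crude heuristic: split the gap between the worst-scoring in-grammar
        # utterance and the best-scoring OOV utterance. A proper sweep over
        # candidate thresholds (ROC-style) would be more principled.
        return (min(in_grammar_probs) + max(oov_probs)) / 2.0

    # Example usage (hypothetical layout):
    #   in_grammar = collect_probs('test/in_grammar/*.raw', decode_prob)
    #   oov        = collect_probs('test/oov/*.raw', decode_prob)
    #   print('suggested log-prob threshold:', pick_threshold(in_grammar, oov))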
Thanks. I'll give that a shot.
Have you tried using this approach with JSGF grammars by chance?
As far as I know, the JSGF recognizer doesn't provide a usable confidence score, so we don't use it that way. Instead, we used the SRILM and Sphinx tools to extract all of the possible commands from a JSGF grammar and train an LM with those phrases plus a reasonable number of "junk" phrases to add phonemic complexity. We then run an n-gram search with that LM and map the results back to the original JSGF grammar. If we get an exact match with a probability score over a certain threshold, we act on the result.
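To make the last step concrete, the decision logic boils down to something like this. It's a simplified sketch, not our actual code; in_grammar_phrases stands for the full set of sentences expanded from the JSGF, and the threshold is a placeholder you'd tune on real audio:

    def accept_result(hyp, in_grammar_phrases, log_prob_threshold):
        # hyp is the result of the n-gram search (Decoder.hyp()); in_grammar_phrases
        # is the set of every sentence the JSGF can generate, extracted offline.
        if hyp is None:
            return None
        text = hyp.hypstr.strip().lower()
        # Act only on results that are both an exact in-grammar sentence and
        # scored confidently enough by the n-gram search.
        if text in in_grammar_phrases and hyp.prob > log_prob_threshold:
            return text
        return None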
Hope that is helpful!
Yes! I appreciate it.
Could you elaborate on what you mean by "junk" phrases, and why adding "phonemic complexity" in this case is beneficial?
Our setup operated in an always-on, trailing-silence-based workflow, so it needed to be good at rejecting OOV utterances. We found that using small LMs for voice commands was accurate and efficient for in-grammar utterances, but resulted in more false positives on OOV utterances than was practical. We developed a strategy, similar to a phone-loop or garbage-model approach, where we injected data into the model's training set to provide a wider variety of possible phrases without making the model so large that we sacrificed in-grammar accuracy and responsiveness. We called that data "Automated Model Optimization Data," or AMOD. We published a paper on this in 2017 at the Advanced Human Factors and Ergonomics conference in L.A. I've attached that paper, which explains the process and reasoning behind it. We had a unique situation, so it may not be completely necessary for your use case, but hopefully it is helpful to you anyway.
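Just to sketch the corpus side of it (this isn't the AMOD tooling itself, only a toy illustration; the one-sentence-per-line format is what SRILM's ngram-count expects, and the mixing ratio is made up):

    def write_lm_training_corpus(command_phrases, junk_phrases, out_path, junk_per_command=3):
        # Writes one sentence per line for LM training (e.g., SRILM's ngram-count).
        # The junk phrases give OOV speech somewhere to go without swamping the
        # probability mass assigned to the real commands.
        with open(out_path, 'w') as out:
            for cmd in command_phrases:
                out.write(cmd.lower() + '\n')
            for junk in junk_phrases[:len(command_phrases) * junk_per_command]:
                out.write(junk.lower() + '\n')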
Dan
Hi Dan,
I have read your paper; it's really good work. I am using the keyword-spotting (KWS) search in PocketSphinx to spot keywords, but the results are not good, especially on OOV utterances. Did you test KWS using the method in your paper? If so, how were the results?
Thanks!