I'm looking for some guidance regarding the following project:
OBJECTIVES
- Voice recognition engine should run offline on an Android phone/tablet.
- Engine should do continuous voice recognition.
- Engine should wake up on ten specific words (left, right, ...).
- It would be acceptable to have a unique, easily distinguishable one- or two-syllable magic word preceding the wake-up words: a kind of wake-up for the wake-up words.
- Engine should be optimized for a single speaker (myself) speaking English with a French accent.
ACHIEVED SO FAR
- I have pocketsphinx running on Android with the standard en_US acoustic model.
- I have added a timer to put the engine in continuous mode (onPartialResult forces onResult, followed by a 1-second stop/start sequence) so that I get results continuously; a sketch of this restart loop follows this list.
- I first tested the engine with the default large standard US dictionary. Recognition was not very good: a similar experience to what I encounter when speaking with an 800-number voice recognition system.
- I then replaced the dictionary with one containing only the ten wake-up words. But obviously there is then no distinction between a wake-up command and ordinary talk.
- I have strictly followed the long adaptation tutorial at http://cmusphinx.sourceforge.net/wiki/tutorialadapt, which I ran with the ten wake-up words as the arctic inputs. Perhaps it has improved things a little, but I'm not sure whether the improvement is real or a placebo.
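For reference, here is a minimal sketch of that restart loop, written against the pocketsphinx-android SpeechRecognizer API as I understand it; the search name "continuous" and the use of android.os.Handler are my own choices for the example, not anything prescribed by the library:

import android.os.Handler;

import edu.cmu.pocketsphinx.Hypothesis;
import edu.cmu.pocketsphinx.RecognitionListener;
import edu.cmu.pocketsphinx.SpeechRecognizer;

public class ContinuousLoop implements RecognitionListener {
    private static final String SEARCH = "continuous"; // placeholder search name

    private final SpeechRecognizer recognizer;
    private final Handler handler = new Handler();

    public ContinuousLoop(SpeechRecognizer recognizer) {
        this.recognizer = recognizer;
        recognizer.addListener(this);
        recognizer.startListening(SEARCH);
    }

    @Override
    public void onPartialResult(Hypothesis hypothesis) {
        // Force a final result as soon as a partial hypothesis shows up:
        // stop() makes the decoder deliver onResult().
        if (hypothesis != null)
            recognizer.stop();
    }

    @Override
    public void onResult(Hypothesis hypothesis) {
        if (hypothesis != null) {
            // handle hypothesis.getHypstr() here
        }
        // Restart after one second so recognition effectively runs continuously.
        handler.postDelayed(new Runnable() {
            public void run() {
                recognizer.startListening(SEARCH);
            }
        }, 1000);
    }

    @Override public void onBeginningOfSpeech() { }
    @Override public void onEndOfSpeech() { }
    @Override public void onError(Exception e) { }
    @Override public void onTimeout() { }
}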
PROBLEM
- In the FAQ at http://cmusphinx.sourceforge.net/wiki/faq, the question "How to implement 'Wake-up listening'" points to a paper (http://www.ece.umassd.edu/Faculty/acosta/ICASSP/Icassp_2000/pdf/346_610.PDF) that can no longer be found.
PLANS
- I intend to go back to the full dictionary so that I can distinguish the wake-up words from continuous talk.
QUESTIONS
1. Where can I find the paper by Sahar E. Bou-Ghazale and Ayman O. Asadi about wake-up listening?
2. Doing the model adaptation for 1100+ sentences is a crazy and painful effort, so I want to make sure it's worthwhile before starting. Will it be very useful given my objectives and constraints?
2b. Should I include the ten wake-up words separately in the arctic inputs? Is there a way to increase the weight of those 10 sentences in the model?
3. Any other good ideas?
Thanks in advance for any pointers,
Grégoire
This FAQ entry probably needs to be updated.
There is not much to look for there, actually. What you need is keyword spotting functionality together with some activation logic, so that commands can be decoded after the activation keyword.
Keyword spotting is not supported in pocketsphinx out of the box, but it can be implemented. Many methods for implementing keyword spotting can be found on the web. The right thing to do is to implement a garbage loop search that matches all speech other than the keyword.
The activation keyword should be three syllables long for better detection.
Keyword spotting was also discussed on this forum before. There is a simple workaround described here, but a proper implementation will take more work:
http://sourceforge.net/p/cmusphinx/discussion/help/thread/1c6cb941/?limit=25#1952
As for your adaptation questions: I don't think there is much sense in adapting on 1100+ sentences; adaptation on about 20 sentences should be enough. And yes, you can include the wake-up words in several variations in the adaptation inputs; there is no need to increase their weight.
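Coming back to the garbage loop: to make the idea concrete, a minimal grammar could look like the sketch below. The keyword "charlie" and the particular filler phones are illustrative assumptions only; a real grammar would list one filler per phone of the acoustic model, each mapped to a single phone in the dictionary.

#JSGF V1.0;

grammar wakeup;

// Absorb arbitrary speech before and after the keyword.
public <spot> = <garbage>* charlie <garbage>* ;

// Each garbage "word" is a dictionary entry mapping to exactly one phone,
// so the loop can soak up any speech that is not the keyword.
<garbage> = AA | AE | AH | B | D | EH | IY | K | L | M | N | R | S | T ;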
In answer to your question above, Nickolay referred to a post discussing some experiments done by us. I want to add more to it here:
a) We use pocketsphinx and the hub4wsj_sc_8k acoustic model.
b) In the keyword experiments, we see that using wip = 1e-4 and silprob = 0.1 instead of their default values makes the spotting accuracy go up significantly.
c) We use the bestpath = 0 option to reduce the recognition response time. (A sketch of how these options might be set on Android follows this list.)
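For illustration, the options above might be passed through the Android wrapper's SpeechRecognizerSetup roughly as follows. This is a sketch, not our actual code: the file paths are placeholders, and it assumes the wrapper's generic setFloat/setBoolean option setters.

import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

public class SpotterFactory {
    // Build a recognizer tuned for keyword spotting, per the settings above.
    public static SpeechRecognizer makeSpotter(File dataDir) throws IOException {
        return SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(dataDir, "hub4wsj_sc_8k"))
                .setDictionary(new File(dataDir, "keyword_charlie.dic"))
                .setFloat("-wip", 1e-4)          // word insertion penalty
                .setFloat("-silprob", 0.1)       // silence transition probability
                .setBoolean("-bestpath", false)  // skip the DAG pass: faster response
                .getRecognizer();
    }
}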
Following are our observations:
a) In the first set, we removed from the garbage those phonemes that were present in the keyword.
jsgf file: https://www.dropbox.com/s/fhiotpcqz616oka/keyword_charlie.jsgf
dictionary file: https://www.dropbox.com/s/9blgq7hcrm1igcc/keyword_charlie.dic
We got excellent spotting accuracy, but the false trigger rate was observed to be high too.
b) In the second set, we added all the phonemes to the garbage (including the ones present in the keyword).
jsgf file: same as above.
dictionary file: https://www.dropbox.com/s/7dvo0pm38cw8ym3/keyword_charlie_full_garbage.dic
This gave a good mix of spotting accuracy and false trigger rate. This approach is also more conducive to the multiple-keyword requirement that you have. (A hypothetical dictionary fragment illustrating the two variants follows.)
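To make the difference between the two dictionaries concrete, here is a hypothetical, abridged fragment (these are not the actual Dropbox files): the keyword entry plus one filler entry per phone. In the first variant the fillers for the keyword's own phones (CH AA R L IY) would be removed; in the full-garbage variant they stay in.

charlie CH AA R L IY
AA AA
AE AE
AH AH
B B
CH CH
D D
EH EH
IY IY
K K
L L
M M
N N
R R
S S
T T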
A couple of other observations:
a) Adaptation doesn't consistently increase keyword spotting accuracy across various keywords and users, and even when it does, the increase is not worth the effort.
b) Using the DAG search (bestpath = 1) is found to be just as inconsistent as adaptation, and since it also increases the response time, it is not worthwhile for keyword spotting.
If you want, you can try these approaches and share your observations on this forum. There are many who would like to see a reasonably good workaround to use with pocketsphinx.
In real-life applications, I guess a grammar/dictionary switch would be required to proceed with the regular application once the device has woken up; a minimal sketch of such a switch follows.
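For what it's worth, such a switch might look roughly like the sketch below with pocketsphinx-android. The search names, file names, and the keyword "charlie" are assumptions for the example, and a single dictionary covering both grammars is assumed, since the wrapper sets the dictionary once at setup time.

import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.Hypothesis;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

public class WakeUpSwitch {
    private static final String WAKEUP = "wakeup";     // garbage-loop spotting search
    private static final String COMMANDS = "commands"; // regular application grammar

    private final SpeechRecognizer recognizer;

    public WakeUpSwitch(File dir) throws IOException {
        recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(dir, "hub4wsj_sc_8k"))
                .setDictionary(new File(dir, "keyword_charlie_full_garbage.dic"))
                .getRecognizer();
        recognizer.addGrammarSearch(WAKEUP, new File(dir, "keyword_charlie.jsgf"));
        recognizer.addGrammarSearch(COMMANDS, new File(dir, "commands.jsgf"));
        recognizer.startListening(WAKEUP);
    }

    // Call this from onResult() in the RecognitionListener (other callbacks omitted).
    // onResult() fires once listening has stopped, so the next search can be
    // started right away.
    void handleResult(Hypothesis hypothesis) {
        if (hypothesis != null
                && WAKEUP.equals(recognizer.getSearchName())
                && hypothesis.getHypstr().contains("charlie")) {
            recognizer.startListening(COMMANDS); // woken up: switch to the command grammar
        } else {
            recognizer.startListening(WAKEUP);   // otherwise keep spotting
        }
    }
}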
Thanks and regards,
I wonder if anybody has managed to get reasonably good results with the simplistic approach.
By reasonably good I mean, e.g., precision >= 0.95 and recall >= 0.7 (at most about one false alarm per twenty detections, while still catching at least 70% of the true keywords) in moderately noisy environments (car cabin).
If the way to get better results is to use Sphinx4, I wonder whether adapting it to Android might be the easier approach.
Your feedback is very welcome!