I am working on a project for which I need an instance of PocketSphinx set up
so that it can recognize at most 100 different English words, spoken one at a
time. So I do not need natural language recognition.
The user will see a menu of, let's say, 5 options and will speak the name of
their preferred option. That menu option will then be selected and its submenus
displayed. The user will do the same thing again to choose their desired option
from the submenu.
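For concreteness, since PocketSphinx can be constrained with a JSGF grammar, I imagine the top-level menu could be described by something like the sketch below (the option names are just placeholders for whatever the menu actually shows):

```
#JSGF V1.0;

grammar menu;

// one utterance = one menu option, spoken on its own
public <item> = settings | profile | messages | help | exit;
```

Each submenu would presumably get its own small grammar like this, switched in when the submenu is displayed.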
From what I understand of speech recognition so far, a limited vocabulary like
this should be fairly easy to recognize. However, there will be background
noise during recognition. On top of that, I would prefer the recognition to be
open mic, meaning the application is always listening for the user's commands
and the user doesn't need to notify the application before giving it speech
input.
So now I am wondering how I should go about setting this all up. I have the
PocketSphinxDemo working at the moment, but its accuracy is really bad. My
Android application has to be speaker independent and has to cater to a lot of
accents, which is probably why the PocketSphinxDemo with the en_US model files
doesn't work very well.
So any suggestions about the following would be really helpful:
- using one of the existing models or creating a new model
- adapting one of the existing models or training a new one
- how to allow the application to do noise reduction
- how to implement an open-mic approach
This is a pretty long post, so please let me know if I haven't been clear or
have left out any information that would help you answer the question.
Cheers!
See the FAQ
http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor
http://cmusphinx.sourceforge.net/wiki/faq#qhow_to_do_the_noise_reduction
http://cmusphinx.sourceforge.net/wiki/faq#qcan_pocketsphinx_reject_out-of-grammar_words_and_noises
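For the open-mic part specifically, pocketsphinx's keyword-spotting search takes a plain-text keyword file with a per-phrase detection threshold; anything not in the list is ignored, which also helps with background chatter and out-of-grammar speech. A sketch (the words are placeholders and the threshold values are illustrative — each one needs tuning on test recordings):

```
settings /1e-20/
profile /1e-20/
messages /1e-30/
help /1e-10/
exit /1e-15/
```

As a rule of thumb, shorter words need larger (less strict) thresholds and longer phrases need smaller ones, traded off between false alarms and missed detections.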