I would like to write a speech recognition program and appreciate anyone
pointing me in the right direction. I would like it to be able to recognize
about 6000 English technical words and it does not need to be able to
recognize anything else. When a person is using the app, they will enter a
list, so the first word will go into the first structured field, the second
word into another field, and so on. I could even have the person say the word
"next" or something else to signify divisions in the words so the program will
not need to understand context.
It seems to me that this would be substantially easier than a general speech
recognition program that needs to be able to understand context, a large
number of words, etc. Is this the case, and how would I build my language
model? Is there a program I can use that would generate the pronunciations for
these 6000 words, or do I need to find pronunciations on the Internet or record
people saying them?
Also, the idea behind my app is for lay users to be able to use speech
recognition in this technical field, so I don't care whether the pronunciations
are strictly accurate; rather, I want to match how lay people think the words
should be pronounced to the actual words. This is why I could potentially use
an automated pronunciation procedure to create my dictionary.
Thanks for any insight!!
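Since the vocabulary is closed and the words are dictated one at a time with
"next" as a separator, one way to constrain the recognizer is a finite grammar
(JSGF) rather than a full statistical language model. Here is a minimal sketch,
not from the thread, that generates such a grammar from a word list; the file
names technical_words.txt and list_entry.gram are placeholder assumptions.

```python
# Minimal sketch: build a JSGF grammar that accepts one vocabulary word,
# optionally followed by "next" and another word, any number of times.
# "technical_words.txt" (one word per line) and "list_entry.gram" are
# hypothetical file names.

def write_jsgf(wordlist_path="technical_words.txt",
               grammar_path="list_entry.gram"):
    with open(wordlist_path) as f:
        words = [w.strip().lower() for w in f if w.strip()]
    alternatives = " | ".join(words)  # all ~6000 words as alternatives in one rule
    with open(grammar_path, "w") as g:
        g.write("#JSGF V1.0;\n")
        g.write("grammar list_entry;\n")
        g.write("public <entry> = <word> ( next <word> )*;\n")
        g.write("<word> = %s;\n" % alternatives)

if __name__ == "__main__":
    write_jsgf()
```

Every word used in the grammar, including "next", would also need an entry in
the pronunciation dictionary.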
Please read the tutorial
http://cmusphinx.sourceforge.net/wiki/tutorial
Thanks for your reply; I did read the tutorial but did not completely
understand it. Re-reading the following page
http://cmusphinx.sourceforge.net/wiki/tutorialdict
does answer my question about building a dictionary. My takeaway from that page
is that my task is easier than general speech recognition because of the
constrained problem I described above, but I wanted to confirm this. I also was
not able to find any section on "wake-up" phrases, but the voice-while-driving
app uses CMU Sphinx and does this, so I think it could work for me as well.
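For the "wake-up" phrase, pocketsphinx has a keyword-spotting search configured
through the -keyphrase and -kws_threshold options. Below is a minimal sketch,
assuming the stock US English model and 16 kHz 16-bit mono raw audio; the
paths, phrase, and threshold are placeholders, and the exact Python import path
depends on the pocketsphinx release installed.

```python
# Minimal keyword-spotting ("wake-up phrase") sketch with pocketsphinx.
# Paths, the phrase, and the threshold are placeholders to adapt.
import os
from pocketsphinx.pocketsphinx import Decoder

MODELDIR = "/usr/local/share/pocketsphinx/model"  # wherever the models are installed

config = Decoder.default_config()
config.set_string('-hmm', os.path.join(MODELDIR, 'en-us/en-us'))  # stock acoustic model
config.set_string('-dict', os.path.join(MODELDIR, 'en-us/cmudict-en-us.dict'))
config.set_string('-keyphrase', 'start list entry')               # the wake-up phrase
config.set_float('-kws_threshold', 1e-20)  # tune on test data: balance misses vs. false alarms
decoder = Decoder(config)

decoder.start_utt()
with open('test.raw', 'rb') as audio:      # 16 kHz, 16-bit mono raw PCM
    while True:
        buf = audio.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
        if decoder.hyp() is not None:      # in keyword search, hyp() fires on detection
            print("wake-up phrase detected")
            decoder.end_utt()              # reset and keep listening
            decoder.start_utt()
decoder.end_utt()
```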
To confirm, it sounds like I do not need to find speakers to say each of the
6000 words I want; I can use an automatic text-to-pronunciation
(grapheme-to-phoneme) tool that will guess at the pronunciations, and use some
machine learning to improve them once the app is in use.
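On the dictionary itself: a large share of the 6000 words may already be in the
stock CMU pronouncing dictionary, and only the remainder needs an automatic
grapheme-to-phoneme tool such as the ones the dictionary tutorial above points
to. A sketch of splitting the word list that way follows; the file names are
assumptions.

```python
# Sketch: reuse pronunciations that already exist in the stock CMU dictionary
# and write the remaining words to a list for an automatic grapheme-to-phoneme
# tool. "technical_words.txt", "technical.dict" and "needs_g2p.txt" are
# hypothetical file names; cmudict-en-us.dict ships with pocketsphinx.

def split_wordlist(wordlist="technical_words.txt",
                   cmudict="cmudict-en-us.dict",
                   known_out="technical.dict",
                   unknown_out="needs_g2p.txt"):
    # cmudict lines look like: "word W ER D" (the word, then its ARPAbet phones);
    # alternate pronunciations appear as "word(2)", "word(3)", ...
    pronunciations = {}
    with open(cmudict) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            head = parts[0].lower()
            if head.endswith(")") and "(" in head:
                head = head[:head.index("(")]   # map "word(2)" back to "word"
            pronunciations.setdefault(head, []).append(" ".join(parts[1:]))

    with open(wordlist) as f, \
         open(known_out, "w") as known, \
         open(unknown_out, "w") as unknown:
        for word in (w.strip().lower() for w in f if w.strip()):
            if word in pronunciations:
                for i, phones in enumerate(pronunciations[word]):
                    entry = word if i == 0 else "%s(%d)" % (word, i + 1)
                    known.write("%s %s\n" % (entry, phones))
            else:
                unknown.write(word + "\n")      # feed these to a g2p tool

if __name__ == "__main__":
    split_wordlist()
```

The words left in needs_g2p.txt could then be run through whatever g2p tool the
tutorial recommends, and the guessed pronunciations checked against how lay
users actually say them.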
There is no need to train an acoustic model for new words unless you have very
different recording conditions; the current acoustic model is good enough for
dictation. There is also no need to record every word that is supposed to be
recognized.
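A sketch of what this reply implies, not code from the thread: the decoder
keeps the stock en-us acoustic model, and only the dictionary and the grammar
(or language model) are swapped in; the paths below are placeholders.

```python
# Sketch: stock acoustic model, custom dictionary and grammar.
# Paths are placeholders; '-lm' with a statistical language model could be
# used instead of '-jsgf'.
import os
from pocketsphinx.pocketsphinx import Decoder

MODELDIR = "/usr/local/share/pocketsphinx/model"

config = Decoder.default_config()
config.set_string('-hmm', os.path.join(MODELDIR, 'en-us/en-us'))  # unchanged acoustic model
config.set_string('-dict', 'technical.dict')                      # custom 6000-word dictionary
config.set_string('-jsgf', 'list_entry.gram')                     # grammar from the earlier sketch
decoder = Decoder(config)
```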
Thanks for your replies; I find this helpful. Is the current acoustic model
good even for technical vocabulary? It seemed to me that it would be easier to
achieve accuracy by building an acoustic/language model just for this purpose.
Does this mean that I would only need a language model for the actual words,
and not an acoustic model? The current solutions on my smartphone do not do
well with a technical vocabulary, so it seems that something needs to be done.
Yes.
This issue is covered in the tutorial.
Yes.