After doing a fair amount of reading, it is my understanding there are 3
components to Sphinx recognition:
- dictionary: defines each possible word as a group of sounds (phonemes)
- language model: groups together up to 3 words at a time to define
  probabilities for sentence recognition
- acoustic model: maps the waveform sound to the phonemes
If these definitions are correct, then the acoustic model is independent of
the dictionary and language model. In other words, if you have a complete
acoustic model, you should be able to use that one model with any
dictionary/language model. Therefore, what is the best acoustic model to use
for US English speech (assuming no heavy dialect)?
Second, does increasing the size of the acoustic model degrade recognition
performance? In my testing, it is the size of the language model that matters,
not the size of the acoustic model. I've used very large acoustic models and
very large dictionaries with a small language model, and recognition was
faster.
This leads me to my question: if I need to recognize a limited set of about 50
English words, plus all numbers between 0 and 1,000,000, what combination of
open source HMM, LM, and dictionary gives the best and fastest recognition?
The problem is that some of the ~50 predefined English words are product names
that probably haven't been recorded or used to train any acoustic model. If
the acoustic models contain many of the same phonemes that occur in these
custom words, will the words still be recognized, or do I need to create an
adapted acoustic model trained on those additional words?
Thank You,
Eric
> Therefore, what is the best acoustic model to use for US English speech
> (assuming no heavy dialect)?
Acoustic models are trained for specific recording conditions. A model trained
to recognize broadcast speech is not suitable for telephone speech. There is
no such thing as a single best model.
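One recording condition that is easy to check is the sample rate of your audio. The sketch below reads it from a WAV header with Python's standard `wave` module. The helper names and the 8 kHz / 16 kHz split are assumptions made for illustration (telephone audio is commonly sampled at 8 kHz, microphone audio at 16 kHz); this is not a rule taken from the Sphinx documentation.

```python
import wave


def sample_rate_hz(path):
    """Return the sample rate stored in the header of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def suggested_model_family(rate_hz):
    """Rough, assumed rule of thumb: map a sample rate to a model family.

    The labels are illustrative only; they are not names of real models.
    """
    if rate_hz <= 8000:
        return "narrowband (telephone-style audio)"
    return "wideband (microphone-style audio)"
```

For example, `suggested_model_family(sample_rate_hz("call.wav"))` would report which family of models is worth trying first, where `call.wav` is a hypothetical file name. Channel noise and microphone distance also matter, and a sample rate check does not capture them.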
> Second, does the increasing size of the acoustic model degrade recognition
> performance?
The size of the model and recognition accuracy are not related. Size does,
obviously, affect recognition speed.
> This leads me to my question, if I need to recognize a limited set of 50
> English words, and all numbers between 0 and 1,000,000, what is the best
> combination of open source HMM, LM, and Dictionary to achieve the best and
> fastest recognition?
It depends on the type of speech: telephone, microphone recording, or
far-distance recording.
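Whatever the recording conditions, the number part of this task needs only a small vocabulary. The sketch below spells out integers from 0 to 1,000,000 in US style (no "and") and lists the distinct words involved. The function and variable names are made up for this example and are not part of any Sphinx tool.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Every distinct word needed to say any number in the supported range.
NUMBER_VOCABULARY = sorted(set(ONES) | set(TENS[2:]) |
                           {"hundred", "thousand", "million"})


def below_thousand(n):
    """Spell out 0..999 as a list of words (empty list for 0)."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return words


def number_to_words(n):
    """Spell out an integer between 0 and 1,000,000 inclusive."""
    if not 0 <= n <= 1_000_000:
        raise ValueError("supported range is 0..1,000,000")
    if n == 0:
        return ["zero"]
    if n == 1_000_000:
        return ["one", "million"]
    thousands, rest = divmod(n, 1000)
    words = []
    if thousands:
        words += below_thousand(thousands) + ["thousand"]
    return words + below_thousand(rest)
```

`NUMBER_VOCABULARY` contains 31 words, so together with the ~50 custom words the whole task stays well under 100 dictionary entries. How the number phrases are constrained (a grammar or a small language model) is a separate choice.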
> The problem is that of the predefined set of ~50 English words, some of
> those words are product names that probably haven't been recorded or trained
> in any acoustic models. But if the acoustic models contain many of the same
> phonemes that are in these custom words, will they still get recognized, or do
> I need to create an adapted acoustic model that trains those additional words?
There is no problem here. Most acoustic models are generic enough to let you
recognize any word that is transcribed in the dictionary.
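As a minimal sketch of what "transcribed in the dictionary" means in practice, the code below formats dictionary lines for custom words and checks that each phoneme belongs to the 39-phone ARPAbet set used by the CMU Pronouncing Dictionary. The product names and their pronunciations are hypothetical, and the assumption that your acoustic model uses exactly this phone set should be checked against the model you download.

```python
# The 39 ARPAbet phones used by the CMU Pronouncing Dictionary (no stress marks).
ARPABET_PHONES = {
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER",
    "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW",
    "OY", "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
}


def format_entry(word, phones):
    """Return one dictionary line: the word followed by its phonemes.

    Raises ValueError if a phoneme is not in the assumed phone set, since
    the acoustic model would have nothing to match it against.
    """
    unknown = [p for p in phones if p not in ARPABET_PHONES]
    if unknown:
        raise ValueError(f"unknown phones for {word!r}: {unknown}")
    return f"{word} {' '.join(phones)}"


# Hypothetical product names with hand-written pronunciations.
CUSTOM_WORDS = {
    "zyntra": ["Z", "IH", "N", "T", "R", "AH"],
    "quorbix": ["K", "W", "AO", "R", "B", "IH", "K", "S"],
}

custom_dictionary_lines = [format_entry(w, p) for w, p in CUSTOM_WORDS.items()]
```

Appending lines like these to the dictionary is what makes a new word recognizable; the acoustic model itself does not need to have seen the word, only its phonemes.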