Hi -
I'd like to train an acoustic model for a system with a very small vocabulary, say 20 words or fewer. WSJ1 (the one in the pocketsphinx trunk) works great with my little dictionary and language model, but what I want is an incredibly fast implementation, and the size of the acoustic model should certainly play into that.
How much audio do I need to record to get a good acoustic model? Also, should I train a word model instead of a phone model? If I understand this correctly, the dictionary would simply repeat each word, and instead of phones the .phone file would contain the same word list.
Thanks!
M
OK, let me be explicit. What you are saying is that if I want the recognizer to work well for ONE GUY, I could have that one guy read the following sentence a whole bunch of times (10? 50?) while recording:
WAKEUP EVA HOME PAGE PAGE FORWARD PAGE BACK NEXT STEP PREVIOUS STEP I NEED HELP CLOSE HELP HOME PAGE GO TO SLEEP
These are the words/commands I care about (and only these). Can I just have him read this sentence over and over? Do I need to break it down (i.e. each audio file would contain just one command, like "go to sleep"), or can the whole sentence be in each file?
M
> I could have that one guy read the following sentence a whole bunch of times (10? 50?)
100
> Can I just have them read this sentence over and over? Do I need to break it down
You need to break them up. If you want each command recognized separately, it's better to read them one at a time; don't join them into sentences.
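As a rough sketch of how such a recording session could be laid out: the script below builds a prompt list with one command per utterance, plus lines in the `<s> ... </s> (utterance_id)` style used by SphinxTrain transcription files. The split of the sentence into nine commands, the `cmd_` ID scheme, the shuffling, and the function names are assumptions made for illustration; none of them come from this thread.

```python
import random

# Assumed split of the sentence from the question into separate commands.
COMMANDS = [
    "WAKEUP EVA", "HOME PAGE", "PAGE FORWARD", "PAGE BACK",
    "NEXT STEP", "PREVIOUS STEP", "I NEED HELP", "CLOSE HELP",
    "GO TO SLEEP",
]


def build_prompts(commands, repetitions=100, seed=0):
    """Return (utterance_id, text) pairs, one command per recording."""
    prompts = [text for text in commands for _ in range(repetitions)]
    # Shuffle so the speaker does not read the same command many times in a row.
    random.Random(seed).shuffle(prompts)
    # Hypothetical ID scheme: cmd_0001, cmd_0002, ...
    return [(f"cmd_{i:04d}", text) for i, text in enumerate(prompts, start=1)]


def transcription_line(utt_id, text):
    """Format one utterance as a transcription-file line."""
    return f"<s> {text} </s> ({utt_id})"


if __name__ == "__main__":
    for utt_id, text in build_prompts(COMMANDS)[:3]:
        print(transcription_line(utt_id, text))
```

With the suggested 100 repetitions this gives 900 short recordings, each holding exactly one command.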
> How much audio do I need to record to get a good acoustic model?
Several hours total from around 400 speakers, say a minute per speaker.
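For a sense of scale, the figures above work out as follows. This is only the arithmetic on the numbers quoted in the reply; the function name is made up for the example.

```python
def total_audio_hours(speakers, minutes_per_speaker):
    """Total recorded audio in hours."""
    return speakers * minutes_per_speaker / 60


if __name__ == "__main__":
    # 400 speakers at about one minute each
    print(f"{total_audio_hours(400, 1):.1f} hours")  # prints "6.7 hours"
```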
> Also, should I do a word model instead of phone model?
Yes
> If I understand this, the dictionary would look simply repeat the word and instead of phones in the .phone file, you would put the same word list.
No. This is covered in the docs; check the tidigits dictionary for an example.
> Several hours total from around 400 speakers say a minute per speaker
What if I am not making a general system - I don't need any Joe to be able to walk up and speak and the system to recognize it. Rather, I'll have say 10 people that I want the system to be REALLY GOOD at recognizing (or even just 1 to start). How would that affect your above statement?
Does it matter what you record them saying? In other words, if I have 15 words, can I just record someone saying those 15 words a bunch of times?
> Rather, I'll have say 10 people that I want the system to be REALLY GOOD at recognizing (or even just 1 to start). How would that affect your above statement?
It wouldn't. The amount of audio is comparable: you need a lot of samples from each speaker you want to recognize.
> Does it matter what you record them saying? In other words, if I have 15 words, can I just record someone saying those 15 words a bunch of times?
No, you need each speaker to be represented in the training data.
Also, could you point me to directions in the documentation on training a word model? I can't seem to find them...
SphinxTrain has a templates folder with a template for tidigits. Just use a dictionary like this one and a small number of senones:
eight EY_eight T_eight
five F_five AY_five V_five
four F_four OW_four R_four
nine N_nine AY_nine N_nine_2
oh OW_oh
one W_one AX_one N_one
seven S_seven EH_seven V_seven E_seven N_seven
six S_six I_six K_six S_six_2
three TH_three R_three II_three
two T_two OO_two
zero Z_zero II_zero R_zero OW_zero
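The naming pattern in the listing above (each phone tagged with its word, and a numeric suffix when a phone repeats within the word) can be generated mechanically for your own command list. Here is a minimal sketch, assuming you start from ordinary phone-level pronunciations; the function names and the `BASE` sample are hypothetical, and `SIL` is added to the phone list on the assumption that the trainer expects a silence phone there.

```python
from collections import Counter


def word_specific_pronunciation(word, phones):
    """Tag each phone with its word; number repeats within the word."""
    seen = Counter()
    units = []
    for phone in phones:
        seen[phone] += 1
        unit = f"{phone}_{word}"
        if seen[phone] > 1:
            unit += f"_{seen[phone]}"  # e.g. N_nine, then N_nine_2
        units.append(unit)
    return units


def build_word_model_dictionary(base_dict):
    """Return (dictionary entries, sorted phone list) for a word-model setup."""
    entries = {}
    phoneset = {"SIL"}  # assumption: a silence phone belongs in the phone list
    for word, phones in base_dict.items():
        units = word_specific_pronunciation(word, phones)
        entries[word] = units
        phoneset.update(units)
    return entries, sorted(phoneset)


# Sample pronunciations taken from the listing above, with the word tags removed.
BASE = {
    "nine": ["N", "AY", "N"],
    "six": ["S", "I", "K", "S"],
    "two": ["T", "OO"],
}

if __name__ == "__main__":
    entries, phone_list = build_word_model_dictionary(BASE)
    for word, units in entries.items():
        print(word, " ".join(units))
    print(len(phone_list), "units in the phone list")
```

Because no unit is shared between words, every word gets its own models, which is what makes this a word model even though the dictionary still looks phone-based.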