CMU Sphinx / Forums / Help: Training a very small acoustic model

Mike Medved - 2009-06-23

Hi -

I'd like to train an acoustic model for a system with a very small number of words, say 20 or less. WSJ1 (the one in trunk of pocketsphinx) works great with my little dictionary and language model, but what I'd like is an incredibly fast implementation and the size of the acoustic model should certainly play into this.

How much audio do I need to record to get a good acoustic model? Also, should I do a word model instead of a phone model? If I understand this correctly, the dictionary would simply repeat each word, and instead of phones in the .phone file you would put the same word list.

Thanks!
M

- Mike Medved - 2009-06-25
  
  OK, let me be explicit. What you are saying is that if I want to have the recognizer work for ONE GUY well, I could have that one guy read the following sentence a whole bunch of times (10? 50?), while recording:
  
  WAKEUP EVA HOME PAGE PAGE FORWARD PAGE BACK NEXT STEP PREVIOUS STEP I NEED HELP CLOSE HELP HOME PAGE GO TO SLEEP
  
  These are the words/commands I care about (and only these). Can I just have them read this sentence over and over? Do I need to break it down (i.e. in one audio file would just be one command, like "go to sleep"), or can the whole sentence be in each file?
  
  M
  
  - Nickolay V. Shmyrev - 2009-06-26
    
    > I could have that one guy read the following sentence a whole bunch of times (10? 50?),
    
    100
    
    > Can I just have them read this sentence over and over? Do I need to break it down
    
    You need to break them up. It's better to read word by word if you want them recognized as separate commands; don't join them into sentences.
    
- Nickolay V. Shmyrev - 2009-06-24
  
  > How much audio do I need to record to get a good acoustic model?
  
  Several hours total from around 400 speakers, say a minute per speaker.
  
  > Also, should I do a word model instead of phone model?
  
  Yes
  
  > If I understand this correctly, the dictionary would simply repeat each word, and instead of phones in the .phone file you would put the same word list.
  
  No, this question is covered in docs. Check tidigits dictionary for example.
  
  - Mike Medved - 2009-06-24
    
    > Several hours total from around 400 speakers say a minute per speaker
    
    What if I am not making a general system - I don't need any Joe to be able to walk up and speak and the system to recognize it. Rather, I'll have say 10 people that I want the system to be REALLY GOOD at recognizing (or even just 1 to start). How would that affect your above statement?
    
    Does it matter what you record them saying? In other words, if I have 15 words, can I just record someone saying those 15 words a bunch of times?
    
    - Nickolay V. Shmyrev - 2009-06-24
      
      > Rather, I'll have say 10 people that I want the system to be REALLY GOOD at recognizing (or even just 1 to start). How would that affect your above statement?
      
      It wouldn't. The amount of audio is comparable, you need a lot of samples from each speaker you want to recognize.
      
      > Does it matter what you record them saying? In other words, if I have 15 words, can I just record someone saying those 15 words a bunch of times?
      
      No, you need to have representation of each speaker.
      
- Mike Medved - 2009-06-25
  
  Also, could you point me to directions in the documentation on training a word model? I can't seem to find them...
  
  - Nickolay V. Shmyrev - 2009-06-26
    
    SphinxTrain has a template folder with a template for tidigits. Just use a dictionary like this one and a small number of senones:
    
    eight EY_eight T_eight
    five F_five AY_five V_five
    four F_four OW_four R_four
    nine N_nine AY_nine N_nine_2
    oh OW_oh
    one W_one AX_one N_one
    seven S_seven EH_seven V_seven E_seven N_seven
    six S_six I_six K_six S_six_2
    three TH_three R_three II_three
    two T_two OO_two
    zero Z_zero II_zero R_zero OW_zero
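
The naming pattern in the dictionary above (each phone suffixed with its word, and a repeated phone within a word numbered `_2`) can be generated mechanically. What follows is only a minimal sketch of that pattern, not part of SphinxTrain: the function names `word_units` and `build_word_dictionary` are hypothetical, the input pronunciations are the ones implied by the tidigits entries quoted above, and the assumption that the phone list should also contain `SIL` should be checked against the SphinxTrain documentation.

```python
from collections import Counter


def word_units(word, phones):
    """Turn a phone sequence into word-specific units, e.g. N -> N_nine.

    A phone that occurs more than once in the same word gets a numeric
    suffix from its second occurrence on (N_nine, then N_nine_2).
    """
    seen = Counter()
    units = []
    for phone in phones:
        seen[phone] += 1
        unit = f"{phone}_{word}"
        if seen[phone] > 1:
            unit += f"_{seen[phone]}"
        units.append(unit)
    return units


def build_word_dictionary(pronunciations):
    """Return (dictionary lines, sorted list of units) for a word model."""
    lines = []
    unit_set = set()
    for word in sorted(pronunciations):
        units = word_units(word, pronunciations[word])
        unit_set.update(units)
        lines.append(f"{word} {' '.join(units)}")
    # Assumption: the phone list also needs a silence unit named SIL.
    phone_list = sorted(unit_set | {"SIL"})
    return lines, phone_list


# Pronunciations taken from the tidigits entries in the post above.
PRONUNCIATIONS = {
    "nine": ["N", "AY", "N"],
    "six": ["S", "I", "K", "S"],
    "two": ["T", "OO"],
}

dict_lines, phone_list = build_word_dictionary(PRONUNCIATIONS)

if __name__ == "__main__":
    print("\n".join(dict_lines))
    print("\n".join(phone_list))
```

For a command set like the one earlier in the thread, the same function would be fed one entry per command word with whatever phone sequence the existing phone-based dictionary lists for it.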
    