CMU Sphinx / Forums / Help: SphinxTrain and Sphinx3

Héctor Delgado Flores - 2008-04-07

Hello,

I'm getting started with SphinxTrain and Sphinx3. I'm developing a system that recognizes three keywords in an audio input file. For training, I'm using 45 audio files: each word is recorded 3 times by 3 different speakers. I also use the default settings in sphinx_train.cfg. When I run sphinx3_livepretend for decoding, the accuracy is very poor. I use the same recordings as in the training phase and no words are recognized, only silence.

Do you think this number of recordings is sufficient for training the system?

Should I change any parameters from the sphinx_train.cfg or use the default settings?

For this amount of data, do I need subvector quantization?

Should I use default settings for sphinx3_livepretend command?

Are there any important considerations?

Thanks a lot.
- Héctor Delgado Flores - 2008-04-21
  
  Thank you for your help.
  
  I tried with -fillprob 0.9, but the results I get are poor.
  
  It seems that the models provided are for Mexican Spanish (I'm using Spanish from Spain), and there are phones that don't appear in the phone list. Could that be a problem? For example, we need the phone Z for the word "ACERQUESEALMICROFONO", and there is no phone for it in the phone list.
  
  Thanks.
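
A quick way to find such gaps is to compare every pronunciation in the dictionary against the phone set of the acoustic model. The sketch below is only an illustration: the function name is hypothetical, the phone set must be supplied by the caller, and the sample pronunciation used in the note after the block is an assumption, not the poster's actual dictionary entry.

```python
def missing_phones(dictionary_text, phone_list):
    """Map each dictionary word to the phones it uses that are not in phone_list.

    dictionary_text: lines of the form "WORD PHONE1 PHONE2 ..."
    phone_list: iterable of phone names known to the acoustic model
    """
    known = set(phone_list)
    missing = {}
    for line in dictionary_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0], parts[1:]
        unknown = [p for p in phones if p not in known]
        if unknown:
            missing[word] = unknown
    return missing
```

For example, with a phone set that has no Z, a hypothetical entry such as `ACERQUESEALMICROFONO A Z E R K E S E A L M I K R O F O N O` would be reported with `Z` as the unknown phone. Such words either need a pronunciation written with the phones the model does have, or a model trained with the missing phone.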
  
- Nickolay V. Shmyrev - 2008-04-07
  
  > Do you think is this number of recordings sufficient for trainning the system?
  
  No, for each word you need around 20 samples from 200 speakers.
  
  > Should I change any parameters from the sphinx_train.cfg or use the default settings?
  
  Sure. For example, you don't need context-dependent models. Also, I'm not sure you extracted the features correctly; remember that your files must be in a particular format: 16 kHz mono.
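
The 16 kHz mono requirement can be checked before feature extraction with Python's standard `wave` module. This is a minimal sketch with a hypothetical function name. The 16-bit sample width is an assumption, since the thread only states the sample rate and channel count.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1, expected_width=2):
    """Return a list of format problems found in a WAV file (empty if none).

    expected_width is in bytes; 2 bytes (16-bit samples) is an assumption,
    the thread itself only mentions 16 kHz mono.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if wav.getnchannels() != expected_channels:
            problems.append(
                f"{wav.getnchannels()} channel(s), expected {expected_channels}"
            )
        if wav.getsampwidth() != expected_width:
            problems.append(
                f"sample width is {wav.getsampwidth()} byte(s), expected {expected_width}"
            )
    return problems
```

Running this over every training and test recording and printing any non-empty result is usually enough to rule out a format mismatch.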
  
  > For this amount of data, do I need subvector quantization?
  
  no
  
  > Should I use default settings for sphinx3_livepretend command?
  
  Mostly yes, though I'd recommend using sphinx3_decode with current CMN instead of sphinx3_livepretend with prior CMN.
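
As the terms are commonly used, "current" cepstral mean normalization (CMN) subtracts the mean cepstral vector computed over the whole utterance, while "prior" CMN subtracts a mean estimated from audio seen earlier, which is all a live decoder has available. The sketch below illustrates the difference on plain lists of numbers. It is a simplified illustration with hypothetical function names, not Sphinx3's actual implementation.

```python
def cmn_current(frames):
    """Batch CMN: subtract this utterance's own mean from every frame."""
    if not frames:
        return []
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]
    return [[frame[i] - mean[i] for i in range(dim)] for frame in frames]


def cmn_prior(frames, prior_mean):
    """Live-style CMN: subtract a mean estimated from earlier audio.

    If prior_mean does not match the recording conditions of this
    utterance, the normalized features are offset from what the
    acoustic model expects.
    """
    dim = len(prior_mean)
    return [[frame[i] - prior_mean[i] for i in range(dim)] for frame in frames]
```

With only a handful of short recordings, a prior mean has little data to settle on, which is one plausible reason for preferring per-utterance normalization in this setup.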
  
  > Are there any important considerations?
  
  What language are you training for? I'd rather use one of the existing models instead. Also, if you are not sure about the steps you've taken, just upload all your files somewhere and give us a link.
  
- Héctor Delgado Flores - 2008-04-08
  
  Thanks for your answer.
  
  I want the system to recognize 3 "keywords" inside a "continuous" speech file in Spanish. These keywords are not single words but short phrases. For example, "Con la venia" is the keyword "conlavenia".
  
  For this, do I need CI models only?
  
  Here are all my files: http://rapidshare.com/files/105874204/files.tar.gz.html
  
  This is the process I have chosen:
  
  The audio files are 16 kHz mono.
  
  I ran the training with the default settings.
  
  I have tried sphinx3_decode with this argfile:
  
  -hmm /home/hector/sphinx/prueba/model_parameters/prueba.cd_cont_1000
  -lm /home/hector/sphinx/prueba/lm/lm.arpa.DMP
  -dict /home/hector/sphinx/prueba/etc/prueba.dic
  -fdict /home/hector/sphinx/prueba/etc/prueba.filler
  -ctl /home/hector/sphinx/prueba/ctlfile
  -hyp /home/hector/sphinx/prueba/resultado
  
  For the -ctl flag, I listed the feature files of the audio I want to decode. For this test, they are the same files used in the training phase.
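
Before running the decoder, it can help to confirm that every input path in the argfile exists. The sketch below parses `-flag value` lines like the ones above and reports missing inputs. The function names are hypothetical, and `-hyp` is left out of the check because it names an output file.

```python
import os


def parse_argfile(text):
    """Parse lines of the form '-flag value' into a dict."""
    args = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].startswith("-"):
            args[parts[0]] = parts[1].strip()
    return args


def missing_paths(args, input_flags=("-hmm", "-lm", "-dict", "-fdict", "-ctl")):
    """Return the input flags whose path does not exist on disk."""
    return [
        flag
        for flag in input_flags
        if flag in args and not os.path.exists(args[flag])
    ]
```

An empty list from `missing_paths` only shows that the paths exist; it says nothing about whether the model, dictionary and language model are consistent with each other.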
  
  I ran tests with three different language models, one for each keyword I want to recognize:
  
  For the word "conlavenia" the result is:
  
  (microfono01)
  (microfono02)
  (microfono03)
  (microfono04)
  (microfono05)
  (microfono06)
  (microfono07)
  (microfono08)
  (microfono09)
  (microfono10)
  (microfono11)
  (microfono12)
  (microfono13)
  (microfono14)
  (microfono15)
  CONLAVENIA (preguntas01)
  CONLAVENIA (preguntas02)
  CONLAVENIA (preguntas03)
  CONLAVENIA (preguntas04)
  CONLAVENIA (preguntas05)
  CONLAVENIA (preguntas06)
  CONLAVENIA (preguntas07)
  CONLAVENIA (preguntas08)
  CONLAVENIA (preguntas09)
  CONLAVENIA (preguntas10)
  CONLAVENIA (preguntas11)
  CONLAVENIA (preguntas12)
  CONLAVENIA (preguntas13)
  CONLAVENIA (preguntas14)
  CONLAVENIA (preguntas15)
  (venia01)
  (venia02)
  (venia03)
  (venia04)
  (venia05)
  (venia06)
  (venia07)
  (venia08)
  (venia09)
  (venia10)
  (venia11)
  (venia12)
  (venia13)
  (venia14)
  (venia15)
  
  For the word "acerquesealmicrofono" the result is:
  
  ACERQUESEALMICROFONO (microfono01)
  ACERQUESEALMICROFONO (microfono02)
  ACERQUESEALMICROFONO (microfono03)
  ACERQUESEALMICROFONO (microfono04)
  ACERQUESEALMICROFONO (microfono05)
  ACERQUESEALMICROFONO (microfono06)
  ACERQUESEALMICROFONO (microfono07)
  (microfono08)
  ACERQUESEALMICROFONO (microfono09)
  ACERQUESEALMICROFONO (microfono10)
  (microfono11)
  ACERQUESEALMICROFONO (microfono12)
  ACERQUESEALMICROFONO (microfono13)
  (microfono14)
  ACERQUESEALMICROFONO (microfono15)
  (preguntas01)
  (preguntas02)
  (preguntas03)
  (preguntas04)
  (preguntas05)
  (preguntas06)
  (preguntas07)
  (preguntas08)
  (preguntas09)
  (preguntas10)
  (preguntas11)
  (preguntas12)
  (preguntas13)
  (preguntas14)
  (preguntas15)
  (venia01)
  (venia02)
  (venia03)
  (venia04)
  (venia05)
  (venia06)
  (venia07)
  (venia08)
  (venia09)
  (venia10)
  (venia11)
  (venia12)
  (venia13)
  (venia14)
  (venia15)
  
  For the word "nohaymaspreguntas" the result is:
  
  (microfono01)
  (microfono02)
  (microfono03)
  (microfono04)
  (microfono05)
  (microfono06)
  (microfono07)
  (microfono08)
  (microfono09)
  (microfono10)
  (microfono11)
  (microfono12)
  (microfono13)
  (microfono14)
  (microfono15)
  (preguntas01)
  (preguntas02)
  (preguntas03)
  (preguntas04)
  (preguntas05)
  (preguntas06)
  (preguntas07)
  (preguntas08)
  (preguntas09)
  (preguntas10)
  (preguntas11)
  (preguntas12)
  (preguntas13)
  (preguntas14)
  (preguntas15)
  NOHAYMASPREGUNTAS (venia01)
  NOHAYMASPREGUNTAS (venia02)
  NOHAYMASPREGUNTAS (venia03)
  NOHAYMASPREGUNTAS (venia04)
  NOHAYMASPREGUNTAS (venia05)
  NOHAYMASPREGUNTAS (venia06)
  (venia07)
  NOHAYMASPREGUNTAS (venia08)
  NOHAYMASPREGUNTAS (venia09)
  NOHAYMASPREGUNTAS (venia10)
  NOHAYMASPREGUNTAS (venia11)
  NOHAYMASPREGUNTAS (venia12)
  NOHAYMASPREGUNTAS (venia13)
  NOHAYMASPREGUNTAS (venia14)
  NOHAYMASPREGUNTAS (venia15)
  
  As you can see, recognition fails for the 1st and 3rd language models.
  
  Thanks for any help!
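
The three listings above are easier to compare when tallied per group of files. The sketch below counts, for each utterance-name prefix, how many hypotheses were non-empty. It assumes the file-name prefix (microfono, preguntas, venia) indicates the phrase that was spoken, which the thread does not state explicitly; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "CONLAVENIA (preguntas01)" or "(venia07)".
LINE_RE = re.compile(r"^\s*(?P<hyp>[^()]*?)\s*\((?P<utt>[^()]+)\)\s*$")


def tally_by_prefix(hyp_text):
    """Return {prefix: (non_empty_hypotheses, total_utterances)}."""
    hits, totals = Counter(), Counter()
    for line in hyp_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore lines that are not decoder output
        prefix = match.group("utt").rstrip("0123456789")
        totals[prefix] += 1
        if match.group("hyp"):
            hits[prefix] += 1
    return {prefix: (hits[prefix], totals[prefix]) for prefix in totals}
```

Read this way, the "conlavenia" run outputs CONLAVENIA only for the preguntas files, and the "nohaymaspreguntas" run outputs NOHAYMASPREGUNTAS only for the venia files. If the prefix assumption holds, that pattern points to those two labels being swapped somewhere in the transcripts or file names rather than to a random recognition failure.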
  
  - Nickolay V. Shmyrev - 2008-04-09
    
    > I want the system to recognize 3 "keywords" inside a "continuous" speech file for spanish. These keywords are not simple words, but little sentences. For example: "Con la venia" is the keyword "conlavenia".
    
    How many speakers will you have in production?
    
    > For this, do I need CI models only?
    
    Well, as I said, it's too early to talk about training. This hub4 model will work perfectly for you and will be much more stable than your homemade model:
    
    http://www.speech.cs.cmu.edu/sphinx/models/hub4spanish_itesm/
    
    For an example of its usage, check the files I uploaded for you. It recognizes your test set almost perfectly, and I suppose that with a little adjustment it will recognize it perfectly.
    
    http://www.mediafire.com/?yq1fmywy7xy
    
    - xavic383 - 2008-04-15
      
      Hi,
      
      A few questions, please:
      
      1) How do you generate the "test.fsg" file in the del5-test zip? Is it related to the "test.gram" file?
      
      2) How does "test.gram" get involved in the recognition process? I don't see any flag in "test_sphinx.sh" pointing to "test.gram"...
      
      Thanks a lot!
      
      - Nickolay V. Shmyrev - 2008-04-15
        
        Use sphinx_jsgf2fsg from sphinxbase to convert test.gram into test.fsg.
        
- Héctor Delgado Flores - 2008-04-11
  
  Thank you very much for your help!
  
  Now I know how to run the test, and it works. The next step is, for an audio file of continuous speech, to detect only the words from my dictionary that are spoken in the input and to reject the rest of the speech. What should my language model look like for this purpose?
  
  Thanks
  
  - Nickolay V. Shmyrev - 2008-04-14
    
    For every phone, add a single-phone word to the filler dictionary and make the FSG use fillers. Or construct a phone loop with a small probability in the FSG yourself.
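
The first suggestion can be scripted from the model's phone list. This is a minimal sketch with a hypothetical function name. Naming each filler word after its phone and keeping the three silence entries are assumptions about what the filler dictionary should contain.

```python
def build_filler_dict(phones, silence_phone="SIL"):
    """Build filler-dictionary text with one single-phone word per phone.

    Each line has the form "WORD PHONE". Using the phone name as the
    word name is an arbitrary choice made for this sketch.
    """
    lines = [
        f"<s> {silence_phone}",
        f"</s> {silence_phone}",
        f"<sil> {silence_phone}",
    ]
    lines += [f"{phone} {phone}" for phone in phones if phone != silence_phone]
    return "\n".join(lines) + "\n"
```

Writing the returned text to the file passed with `-fdict` gives the decoder one filler word per phone to absorb speech that is not a keyword.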
    
- Héctor Delgado Flores - 2008-04-16
  
  Thanks!!
  
  I have added the phones to the filler dictionary as:
  
  <s> SIL
  </s> SIL
  <sil> SIL
  A A
  B B
  CH CH
  D D
  E E
  F F
  G G
  GN GN
  I I
  J J
  K K
  L L
  LL LL
  M M
  N N
  O O
  P P
  R R
  RR RR
  S S
  T T
  U U
  V V
  X X
  Y Y
  
  And I have written this grammar:
  
  #JSGF V1.0;
  
  grammar test;
  
  public <public> = ( ACERQUESEALMICROFONO | CONLAVENIA | NOHAYMASPREGUNTAS | <filler> )*;
  <filler> = (A | B | CH | D | E | F | G | GN | I | J | K | L | LL | M | N | O | P | R | RR | S | T | U | V | X | Y);
  
  But it doesn't work well. When I run sphinx3_decode, it recognizes words from the dictionary that are not said in the input file.
  Am I writing the grammar incorrectly?
  
  Thanks a lot.
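
One way to express the "phone loop with a small probability" idea directly in the grammar is with JSGF weights, which the JSGF specification writes as `/weight/` in front of each alternative. The sketch below generates such a grammar. The function name, the rule name `<utterance>` and the default weight are hypothetical, and whether a given version of sphinx_jsgf2fsg carries these weights into the FSG is an assumption that needs checking.

```python
def build_keyword_grammar(keywords, phones, filler_weight=0.1, name="test"):
    """Return JSGF text for a keyword loop with a weighted phone filler.

    The keywords share the probability mass left over after
    filler_weight is assigned to the <filler> alternative.
    """
    keyword_weight = (1.0 - filler_weight) / len(keywords)
    alternatives = [f"/{keyword_weight:.4f}/ {word}" for word in keywords]
    alternatives.append(f"/{filler_weight:.4f}/ <filler>")
    lines = [
        "#JSGF V1.0;",
        "",
        f"grammar {name};",
        "",
        f"public <utterance> = ( {' | '.join(alternatives)} )*;",
        f"<filler> = ( {' | '.join(phones)} );",
    ]
    return "\n".join(lines) + "\n"
```

Varying `filler_weight` shifts how readily the decoder explains audio with filler phones instead of keywords, so it is a natural parameter to sweep when keywords are reported that were never spoken.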
  
  - Nickolay V. Shmyrev - 2008-04-16
    
    Could you provide a full example, please?
    
- Héctor Delgado Flores - 2008-04-16
  
  Sorry.
  
  Here is the example: http://rapidshare.com/files/107937656/test_sphinx.tar.gz.html
  
  There are 3 sentences, each recorded 3 times:
  
  For frase11.wav, frase12.wav and frase13.wav the result must be "ACERQUESEALMICROFONO NOHAYMASPREGUNTAS"
  For frase21.wav, frase22.wav and frase23.wav the result must also be "ACERQUESEALMICROFONO NOHAYMASPREGUNTAS"
  For frase31.wav, frase32.wav and frase33.wav the result must be "CONLAVENIA NOHAYMASPREGUNTAS ACERQUESEALMICROFONO NOHAYMASPREGUNTAS CONLAVENIA ACERQUESEALMICROFONO"
  
  My grammar only works with frase31, frase32 and frase33, because they contain no other words, only words from the dictionary.
  
  I have tried changing the probabilities of the state2-state4 and state2-state5 transitions, but the results are still poor.
  
  Thanks
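
To compare tuning attempts with a number instead of by eye, each hypothesis can be scored against the expected keyword sequence with a word-level edit distance. The sketch below does this after dropping any non-keyword words (for example filler words) from the hypothesis. The function names and the grouping of recordings under `frase1`, `frase2` and `frase3` are hypothetical; the expected sequences are the ones listed in the post.

```python
# Expected keyword sequences, copied from the post above.
EXPECTED = {
    "frase1": "ACERQUESEALMICROFONO NOHAYMASPREGUNTAS",
    "frase2": "ACERQUESEALMICROFONO NOHAYMASPREGUNTAS",
    "frase3": "CONLAVENIA NOHAYMASPREGUNTAS ACERQUESEALMICROFONO "
              "NOHAYMASPREGUNTAS CONLAVENIA ACERQUESEALMICROFONO",
}
KEYWORDS = {"ACERQUESEALMICROFONO", "CONLAVENIA", "NOHAYMASPREGUNTAS"}


def keep_keywords(hypothesis):
    """Drop every word that is not one of the keywords."""
    return [word for word in hypothesis.split() if word in KEYWORDS]


def keyword_errors(reference, hypothesis):
    """Word-level Levenshtein distance between reference and hypothesis keywords."""
    ref = reference.split()
    hyp = keep_keywords(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # reference keyword missed
                cur[j - 1] + 1,                       # extra keyword in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]
```

A result of 0 means the keywords came out exactly as expected. Summing the errors over all nine recordings gives a single figure to track while adjusting transition probabilities or the filler probability.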
  
  - Nickolay V. Shmyrev - 2008-04-17
    
    Hm, indeed it needs some tuning. I'm working on this but have no results yet; I'll try to get something soon. As a quick fix, one can also increase the filler probability with -fillprob 0.9, but the results aren't perfect.
    
