SphinxTrain and Sphinx3

Help
2008-04-07
2012-09-22
  • Héctor Delgado Flores

    Hello,

    I'm getting started with SphinxTrain and Sphinx3. I'm developing a system that recognizes three keywords in an audio input file. For training, I'm using 45 audio files: each word is recorded three times by 3 different speakers. I also use the default settings in sphinx_train.cfg. When I run sphinx3_livepretend for decoding, the accuracy is very poor. I decode the same recordings used in the training phase, and no words are recognized, only silence.

    • Do you think this number of recordings is sufficient for training the system?

    • Should I change any parameters from the sphinx_train.cfg or use the default settings?

    • For this amount of data, do I need subvector quantization?

    • Should I use default settings for sphinx3_livepretend command?

    • Are there any other important considerations?

    Thanks a lot.

     
    • Héctor Delgado Flores

      Thank you for your help.

      I tried -fillprob 0.9, but the results I get are still poor.

      It seems that the provided models are for Mexican Spanish (I'm using the Spanish of Spain), and some phones we need do not appear in the phone list. Could this be a problem? For example, we need the phone Z for the word "ACERQUESEALMICROFONO", and there is no phone for it in the phone list.

      Thanks.
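One way to check whether missing phones are an issue is to compare the phones used in the pronunciation dictionary with the phone list of the acoustic model. The following is a minimal sketch: the function name and the pronunciations are hypothetical (invented for illustration, not taken from the uploaded files), and the phone list is the one quoted in the filler dictionary elsewhere in this thread.

```python
def find_missing_phones(dict_lines, phone_list):
    """Map each dictionary word to the phones that are absent from phone_list."""
    known = set(phone_list)
    missing = {}
    for line in dict_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0], parts[1:]
        unknown = [p for p in phones if p not in known]
        if unknown:
            missing[word] = unknown
    return missing


# Phone list as quoted in the thread's filler dictionary (plus SIL).
PHONES = ("SIL A B CH D E F G GN I J K L LL M N O P R RR S T U V X Y").split()

# Hypothetical pronunciations, for illustration only.
EXAMPLE_DICT = [
    "ACERQUESEALMICROFONO A Z E R K E S E A L M I K R O F O N O",
    "CONLAVENIA K O N L A B E N I A",
]

print(find_missing_phones(EXAMPLE_DICT, PHONES))
```

Any word reported here uses a phone the model was not trained on, so its pronunciation would have to be rewritten with phones from the list (or the model retrained).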

       
    • Nickolay V. Shmyrev

      > Do you think this number of recordings is sufficient for training the system?

      No, for each word you need around 20 samples from 200 speakers.

      > Should I change any parameters from the sphinx_train.cfg or use the default settings?

      Sure. For example, you don't need context-dependent models. Also, I'm not sure you extracted the features correctly; remember that your files must have a particular format: 16 kHz mono.
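The format requirement can be verified before feature extraction. This is a minimal sketch using Python's standard `wave` module; the function name is hypothetical, and the 16-bit sample width is an assumption (the reply only states 16 kHz mono).

```python
import wave


def wav_format_problems(path, rate=16000, channels=1, sample_width=2):
    """Return a list of format problems for a WAV file (empty if it looks fine).

    sample_width=2 (16-bit samples) is an assumption, not stated in the thread.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channel(s), expected {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"{wav.getsampwidth()} byte(s) per sample, expected {sample_width}")
    return problems
```

Calling `wav_format_problems("venia01.wav")` on each training file (the file name here is only an example) and printing any non-empty result would flag recordings that need resampling or downmixing.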

      > For this amount of data, do I need subvector quantization?

      No.

      > Should I use default settings for sphinx3_livepretend command?

      Mostly yes, though I'd recommend using sphinx3_decode with current CMN instead of sphinx3_livepretend with prior CMN.

      > Are there any important considerations?

      What language are you training for? I'd rather use one of the existing models instead. Also, if you are not sure about the steps you've taken, just upload all your files somewhere and give us a link.

       
    • Héctor Delgado Flores

      Thanks for your answer.

      I want the system to recognize 3 "keywords" inside a "continuous" speech file in Spanish. These keywords are not single words but short phrases. For example, "Con la venia" is the keyword "conlavenia".

      • For this, do I need CI models only?

      Here are all my files: http://rapidshare.com/files/105874204/files.tar.gz.html

      This is the process I have chosen:

      • The audio files are 16 kHz mono.

      • I ran the training with the default settings.

      • I tried sphinx3_decode with this argfile:

        -hmm /home/hector/sphinx/prueba/model_parameters/prueba.cd_cont_1000
        -lm /home/hector/sphinx/prueba/lm/lm.arpa.DMP
        -dict /home/hector/sphinx/prueba/etc/prueba.dic
        -fdict /home/hector/sphinx/prueba/etc/prueba.filler
        -ctl /home/hector/sphinx/prueba/ctlfile
        -hyp /home/hector/sphinx/prueba/resultado

      For the -ctl flag, I listed the feature files of the audio files I want to decode. For this test, they are the same files used in the training phase.

      I ran tests with three different language models, one for each word I want to recognize:

      • For the word "conlavenia" the result is:

      (microfono01)
      (microfono02)
      (microfono03)
      (microfono04)
      (microfono05)
      (microfono06)
      (microfono07)
      (microfono08)
      (microfono09)
      (microfono10)
      (microfono11)
      (microfono12)
      (microfono13)
      (microfono14)
      (microfono15)
      CONLAVENIA (preguntas01)
      CONLAVENIA (preguntas02)
      CONLAVENIA (preguntas03)
      CONLAVENIA (preguntas04)
      CONLAVENIA (preguntas05)
      CONLAVENIA (preguntas06)
      CONLAVENIA (preguntas07)
      CONLAVENIA (preguntas08)
      CONLAVENIA (preguntas09)
      CONLAVENIA (preguntas10)
      CONLAVENIA (preguntas11)
      CONLAVENIA (preguntas12)
      CONLAVENIA (preguntas13)
      CONLAVENIA (preguntas14)
      CONLAVENIA (preguntas15)
      (venia01)
      (venia02)
      (venia03)
      (venia04)
      (venia05)
      (venia06)
      (venia07)
      (venia08)
      (venia09)
      (venia10)
      (venia11)
      (venia12)
      (venia13)
      (venia14)
      (venia15)

      • For the word "acerquesealmicrofono" the result is:

      ACERQUESEALMICROFONO (microfono01)
      ACERQUESEALMICROFONO (microfono02)
      ACERQUESEALMICROFONO (microfono03)
      ACERQUESEALMICROFONO (microfono04)
      ACERQUESEALMICROFONO (microfono05)
      ACERQUESEALMICROFONO (microfono06)
      ACERQUESEALMICROFONO (microfono07)
      (microfono08)
      ACERQUESEALMICROFONO (microfono09)
      ACERQUESEALMICROFONO (microfono10)
      (microfono11)
      ACERQUESEALMICROFONO (microfono12)
      ACERQUESEALMICROFONO (microfono13)
      (microfono14)
      ACERQUESEALMICROFONO (microfono15)
      (preguntas01)
      (preguntas02)
      (preguntas03)
      (preguntas04)
      (preguntas05)
      (preguntas06)
      (preguntas07)
      (preguntas08)
      (preguntas09)
      (preguntas10)
      (preguntas11)
      (preguntas12)
      (preguntas13)
      (preguntas14)
      (preguntas15)
      (venia01)
      (venia02)
      (venia03)
      (venia04)
      (venia05)
      (venia06)
      (venia07)
      (venia08)
      (venia09)
      (venia10)
      (venia11)
      (venia12)
      (venia13)
      (venia14)
      (venia15)

      • For the word "nohaymaspreguntas" the result is:

      (microfono01)
      (microfono02)
      (microfono03)
      (microfono04)
      (microfono05)
      (microfono06)
      (microfono07)
      (microfono08)
      (microfono09)
      (microfono10)
      (microfono11)
      (microfono12)
      (microfono13)
      (microfono14)
      (microfono15)
      (preguntas01)
      (preguntas02)
      (preguntas03)
      (preguntas04)
      (preguntas05)
      (preguntas06)
      (preguntas07)
      (preguntas08)
      (preguntas09)
      (preguntas10)
      (preguntas11)
      (preguntas12)
      (preguntas13)
      (preguntas14)
      (preguntas15)
      NOHAYMASPREGUNTAS (venia01)
      NOHAYMASPREGUNTAS (venia02)
      NOHAYMASPREGUNTAS (venia03)
      NOHAYMASPREGUNTAS (venia04)
      NOHAYMASPREGUNTAS (venia05)
      NOHAYMASPREGUNTAS (venia06)
      (venia07)
      NOHAYMASPREGUNTAS (venia08)
      NOHAYMASPREGUNTAS (venia09)
      NOHAYMASPREGUNTAS (venia10)
      NOHAYMASPREGUNTAS (venia11)
      NOHAYMASPREGUNTAS (venia12)
      NOHAYMASPREGUNTAS (venia13)
      NOHAYMASPREGUNTAS (venia14)
      NOHAYMASPREGUNTAS (venia15)

      As you can see, the recognition fails for the 1st and 3rd language models.

      Thanks for any help!
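Long hypothesis listings like the ones above are easier to compare when tallied per group of files. The following is a minimal sketch, assuming the line format shown in this post (optional words followed by the utterance ID in parentheses); the function name and the grouping by file-name prefix are illustrative choices.

```python
import re
from collections import defaultdict

# Matches e.g. "CONLAVENIA (preguntas01)" or "(microfono01)".
HYP_LINE = re.compile(r"^(?P<words>.*?)\s*\((?P<utt>[^()\s]+)\)\s*$")


def tally_hypotheses(hyp_lines):
    """Count, per utterance-name prefix, how many hypotheses are non-empty or empty."""
    counts = defaultdict(lambda: {"recognized": 0, "empty": 0})
    for line in hyp_lines:
        match = HYP_LINE.match(line.strip())
        if not match:
            continue  # ignore lines that do not look like hypotheses
        group = re.sub(r"\d+$", "", match.group("utt"))  # "venia07" -> "venia"
        key = "recognized" if match.group("words").strip() else "empty"
        counts[group][key] += 1
    return dict(counts)


sample = ["(microfono01)", "CONLAVENIA (preguntas01)", "CONLAVENIA (preguntas02)", "(venia01)"]
for group, result in tally_hypotheses(sample).items():
    print(group, result)
```

Run over a full -hyp output file, the tally shows at a glance which file groups produced a keyword for each language model and which produced only silence.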

       
      • Nickolay V. Shmyrev

        > I want the system to recognize 3 "keywords" inside a "continuous" speech file in Spanish. These keywords are not single words but short phrases. For example, "Con la venia" is the keyword "conlavenia".

        How many speakers will you have in production?

        > For this, do I need CI models only?

        Well, as I said, it's too early to talk about training. This hub4 model will work perfectly for you and will be much more stable than your homemade model:

        http://www.speech.cs.cmu.edu/sphinx/models/hub4spanish_itesm/

        For an example of its usage, check the files I uploaded for you. It recognizes your test almost perfectly, and I suppose that with a little adjustment it will recognize it perfectly.

        http://www.mediafire.com/?yq1fmywy7xy

         
        • xavic383

          xavic383 - 2008-04-15

          Hi,

          A few questions, please:

          1) How do you generate the "test.fsg" file in the del5-test zip? Is it related to the "test.gram" file?

          2) How does "test.gram" get involved in the recognition process? I don't see any flag in "test_sphinx.sh" pointing to "test.gram"...

          Thanks a lot!

           
          • Nickolay V. Shmyrev

            Use sphinx_jsgf2fsg from sphinxbase

             
    • Héctor Delgado Flores

      Thank you very much for your help!

      Now I know how to run the test, and it works. The next step I want to achieve is, for a continuous speech audio file, to detect only the words from my dictionary that are spoken in the input audio and to reject the rest of the speech. What should my language model look like for this purpose?

      Thanks

       
      • Nickolay V. Shmyrev

        For every phone, add a single-phone word to the filler dictionary and make the FSG use fillers. Or just construct a phone loop with a small probability in the FSG yourself.
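The first suggestion can be sketched as a small generator of filler dictionary lines, one single-phone word per phone. This is only an illustration: the function name is hypothetical, the phone list is passed in by the caller, and naming each filler word after its phone is an assumption rather than a Sphinx requirement.

```python
def filler_dictionary_lines(phones, silence_words=("<s>", "</s>", "<sil>"), silence_phone="SIL"):
    """Build filler dictionary lines: silence entries first, then one word per phone."""
    lines = [f"{word} {silence_phone}" for word in silence_words]
    # Assumption: each filler word is simply named after the phone it contains.
    lines += [f"{phone} {phone}" for phone in phones]
    return lines


for entry in filler_dictionary_lines(["A", "B", "CH"]):
    print(entry)
```

The resulting lines would be written to the file passed with -fdict; whether the decoder then treats them as garbage models depends on the filler probability and on how the FSG is set up.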

         
    • Héctor Delgado Flores

      Thanks!!

      I have added the phones to the filler dictionary as:

      <s> SIL
      </s> SIL
      <sil> SIL
      A A
      B B
      CH CH
      D D
      E E
      F F
      G G
      GN GN
      I I
      J J
      K K
      L L
      LL LL
      M M
      N N
      O O
      P P
      R R
      RR RR
      S S
      T T
      U U
      V V
      X X
      Y Y

      And I have written this grammar:

      JSGF V1.0;

      grammar test;

      public <public> = ( ACERQUESEALMICROFONO | CONLAVENIA | NOHAYMASPREGUNTAS | <filler> )*;
      <filler> = (A | B | CH | D | E | F | G | GN | I | J | K | L | LL | M | N | O | P | R | RR | S | T | U | V | X | Y);

      But it doesn't work well. When I run sphinx3_decode, it recognizes words from the dictionary that are not said in the input file.
      Am I writing the grammar incorrectly?

      Thanks a lot.

       
      • Nickolay V. Shmyrev

        Could you provide a full example, please?

         
    • Héctor Delgado Flores

      Sorry.

      Here is the example: http://rapidshare.com/files/107937656/test_sphinx.tar.gz.html

      There are 3 sentences, each recorded 3 times:

      For frase11.wav, frase12.wav and frase13.wav the result must be "ACERQUESEALMICROFONO NOHAYMASPREGUNTAS"
      For frase21.wav, frase22.wav and frase23.wav the result must also be "ACERQUESEALMICROFONO NOHAYMASPREGUNTAS"
      For frase31.wav, frase32.wav and frase33.wav the result must be "CONLAVENIA NOHAYMASPREGUNTAS ACERQUESEALMICROFONO NOHAYMASPREGUNTAS CONLAVENIA ACERQUESEALMICROFONO"

      My grammar only works with frase31, frase32 and frase33 because they contain no other words, only words from the dictionary.

      I have tried changing the probabilities of the state2-state4 and state2-state5 transitions, but the results are poor.

      Thanks

       
      • Nickolay V. Shmyrev

        Hm, indeed it needs some tuning. I'm working on this but have no results yet. I'll try to get something soon. As a quick fix, one can also increase the filler probability with -fillprob 0.9, but the results aren't perfect.

         
