Sphinx for Nanodesktop: SphinxTrain utility?

  • pegasus2000

    pegasus2000 - 2008-12-19

    I'm creating a model for Nanodesktop PocketSphinx.

    My model is composed of just a few words:

    ENABLE ROOM RECOGNITION
    ENABLE FACE RECOGNITION
    ENABLE OCR SYSTEM

    So I thought of using a predefined model (wsj) after
    changing the dictionary file so that the software can
    recognize only the words that are part of my set.

    It works, but I don't know whether this solution is the best
    in terms of memory consumption.

    So, I've tried to create my Sphinx model using this tutorial:

    http://www.ce.unipr.it/~ghelfi/Sphinx/body.php?page=train

    The procedure is quite hard, so I have some questions:

    a) Is there a utility that creates the model files
    automatically?

    b) I'm using the make_feats.pl Perl script. My tutorial
    says that I must type:

    bin/make_feats etc/prova.fileids

    But I believe that the correct syntax is

    ./bin/make_feats -c etc/prova.fileids

    Am I right?

    c) My tutorial says that SphinxTrain doesn't support
    wav files, so they must be converted to raw format with SoX
    before calling make_feats. Is that right?

    d) I've started make_feats, but it stops on WAVE0001.RAW
    without proceeding. Is that normal? How much time does the
    processing take?

     
    • pegasus2000

      pegasus2000 - 2008-12-29

      > The initial footprint for this task is around 10 MB. To use it, you just
      > need to build a dictionary with the required words and write a jsgf. That's
      > all. You can use existing acoustic models.

      OK, thanks for the information. Can you tell me where I can
      find a how-to for these operations? For example, how can I
      build a dictionary? Using the CMU site? (I have done a similar
      thing in the document that I've written, stage 8; is that the
      operation you are referring to?)

      And how can I write the jsgf? And which operations do I have to
      do in pocketsphinx_continuous after all?

      > For a small dictionary, the model can be made smaller by using a smaller
      > number of senones. The tidigits model is 800 kB, but it's not trivial to
      > build. Tidigits was trained from the data of 300 speakers. None of your
      > users will be able to do this correctly.

      Can the tidigits model also be used with words like "recognizer",
      "mail", etc.? I believed that it could be used only for the terms
      one, two, three, etc.

       
      • Nickolay V. Shmyrev

        > For example, how can I build a dictionary?

        # pull the pronunciation of each listed word out of cmudict
        for word in $(cat wordlist); do grep -w "$word" cmudict.0.6d; done > your.dict

        > And how can I write the jsgf

        http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html
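
        For example, a minimal grammar covering the three ENABLE commands you
        listed at the start of the thread might look something like this (the
        grammar name and layout here are only illustrative):

        #JSGF V1.0;

        grammar commands;

        public <command> = ENABLE ( ROOM RECOGNITION | FACE RECOGNITION | OCR SYSTEM );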

        > Which operations do I have to do in pocketsphinx_continuous after all?

        I'm not really sure what "in pocketsphinx_continuous" means

        > I believed that it could be used only for the terms one, two, three, etc.

        Yes, it's only supposed to recognize digits. It's still not a trivial model, though.

         
    • suresh chandra sekaran

      As far as feature extraction is concerned, you don't need to worry about the format of your audio files. By default, make_feats.pl is set up for the "sph" audio format. So open the make_feats.pl file and change the extension to the format you want (the extension is mentioned at line 77 of make_feats.pl), and that's all. Now you can convert your raw files to features. Moreover, the feature extraction process won't take more than five minutes to convert some 100 wave files of approximately 5 MB each.
       
    • pegasus2000

      pegasus2000 - 2008-12-29

      Thanks for your help. We are working on it, but the training
      operation appears very difficult.

      We're writing a guide that assists Nanodesktop users step by
      step with the training operation.

      I have something to ask: could you look at the manual and correct
      any errors? You can download it here:

      http://rapidshare.com/files/177865262/Nanodesktop_ndPocketSphinx_User_Guide.odt.html

      I'm trying to work around the various problems that I've found
      during training using the documentation.

      At the current stage, I've tried to execute the RunAll.pl script,
      but the system fails and reports:

      CTL line does not parse correctly

      I've uploaded the .html file that contains the SphinxTrain log here:

      http://rapidshare.com/files/177866192/FirstDictionary.html

      I don't understand why it fails. The operations that I have performed
      are the same as those indicated in the guide.

      I have another question. Actually, for me it would be sufficient
      to have a utility that creates a subset of an existing vocal
      dictionary (like the WSJ dictionary), containing only the
      diphones of my few words.

      Is there a procedure that can be used to build such a reduced
      subset of an existing vocal dictionary, thus avoiding the complex
      procedure of training from the wave files?

       
      • Nickolay V. Shmyrev

        > We're writing a guide that assists Nanodesktop users step by
        > step with the training operation.

        It's like giving a guide on how to play the piano to people who just want to listen to music. It will take significant time to do the training properly. It's not recommended to train your own model.

        You are doing many things wrong in your howto. For example, you are using the English phoneset with Italian words. For a limited vocabulary, things must be done differently, as described in our FAQ.

        Other things are wrong as well:

        1) You don't need to set up the Perl INC path; it's done automatically.
        2) You don't need to convert your files to sph; Sphinx can work with wav directly.
        3) The data for training even a five-word database must be around 10 hours of speech from 200 speakers. Your five utterances in the ctl file are certainly not enough.
        4) It's much easier to use a jsgf instead of a language model.

        > I have another question. Actually, for me it would be sufficient
        > to have a utility that creates a subset of an existing vocal
        > dictionary (like the WSJ dictionary), containing only the
        > diphones of my few words.

        Your guide has an enormous amount of incorrect terminology. There is no such thing as a "vocal" dictionary. The cmudict is a phonetic dictionary. This dictionary has no diphones; a diphone is a completely different thing. It contains transcriptions of the words with English phones from the CMU phoneset.

        I don't quite understand your request to create a "subset" of the dictionary. You can just do it with grep or a little python script.
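
        For example, a little Python sketch along these lines would do it (the
        file names "wordlist", "cmudict.0.6d" and "your.dict" are just
        placeholders for your own files):

        # Keep only the cmudict entries for the words you actually use.
        wanted = set()
        with open("wordlist") as f:
            for line in f:
                word = line.strip().upper()
                if word:
                    wanted.add(word)

        with open("cmudict.0.6d") as src, open("your.dict", "w") as dst:
            for line in src:
                parts = line.split()
                if not parts or line.startswith(";;;"):  # skip blanks and comments
                    continue
                base = parts[0].split("(")[0]            # keep WORD(2) alternate pronunciations
                if base in wanted:
                    dst.write(line)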

        > WARNING: CTL line does not parse correctly:

        Your fileids file has an empty line at the end. That's not allowed.
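
        For reference (the names and the wav/ prefix below are made up, adapt
        them to your own layout), the fileids file should contain nothing but
        one utterance id per line, without the file extension and without a
        trailing blank line:

        wav/WAVE0001
        wav/WAVE0002
        wav/WAVE0003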

        In general, you probably need something different from this guide. Try to define your requirements first.

         
        • pegasus2000

          pegasus2000 - 2008-12-29

          The central point is this. Let's suppose
          I want to write an nd application that
          manages mail.

          The program has voice control and the set
          of voice commands is this:

          SEND MY MAIL
          RECEIVE MY MAIL
          DELETE ALL MAILS
          HALT THE SYSTEM

          So, I need Sphinx to recognize these
          commands, but without wasting memory
          (only the data required by the words
          that are part of my command set must be
          in RAM, not the data for the whole wsj
          dictionary).

          What is the right procedure in this case?

          I thought that the user had to create a new
          dictionary using the procedure that I've
          found here:

          http://www.ce.unipr.it/~ghelfi/Sphinx/body.php?page=train

           
          • Nickolay V. Shmyrev

            > So, I need Sphinx to recognize these commands, but without wasting memory (only the data required by the words that are part of my command set must be in RAM, not the data for the whole wsj dictionary).

            The initial footprint for this task is around 10 MB. To use it, you just need to build a dictionary with the required words and write a jsgf. That's all. You can use existing acoustic models.
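
            For example, the dictionary for your command set is just a plain text file with one word per line followed by its phones, taken from cmudict (the entries below follow cmudict with the stress markers stripped; double-check them against the dictionary that ships with your acoustic model):

            ALL     AO L
            MAIL    M EY L
            MAILS   M EY L Z
            MY      M AY
            SEND    S EH N D
            ...

            A jsgf for the four command phrases can then be written along the lines of the grammar sketch earlier in the thread.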

            The memory footprint can be reduced to around 3 MB, but it requires some advanced coding. There is http://www.cs.berkeley.edu/~eomer/SphinxTiny/SphinxTiny-0.7.html for example.

            For a small dictionary, the model can be made smaller by using a smaller number of senones. The tidigits model is 800 kB, but it's not trivial to build. Tidigits was trained from the data of 300 speakers. None of your users will be able to do this correctly.

            > http://www.ce.unipr.it/~ghelfi/Sphinx/body.php?page=train

            It's not related to your task.

             
