Sphinx for Nanodesktop: SphinxTrain utility?

  • pegasus2000

    pegasus2000 - 2008-12-19

    I'm creating a model for Nanodesktop PocketSphinx.

    My model is composed of just a few words:

    ENABLE ROOM RECOGNITION
    ENABLE FACE RECOGNITION
    ENABLE OCR SYSTEM

    So I thought of using a predefined model (wsj) after
    changing the dictionary file so that the software can
    recognize only the words that are part of my set.

    It works, but I don't know whether this solution is the best
    in terms of memory consumption.

    So, I've tried to create my Sphinx model using this tutorial:

    http://www.ce.unipr.it/~ghelfi/Sphinx/body.php?page=train

    The procedure is quite hard, so I have some questions:

    a) Is there a utility that creates the model files
    automatically?

    b) I'm using the make_feats.pl Perl script. My tutorial
    says that I must type:

    bin/make_feats etc/prova.fileids

    But I believe that the correct syntax is

    ./bin/make_feats -c etc/prova.fileids

    Am I right?

    c) My tutorial says that SphinxTrain doesn't support
    wav files, so they must be converted to raw format with SoX
    before calling make_feats. Is that right?

    d) I've started make_feats, but it stops on WAVE0001.RAW
    without proceeding. Is that normal? How much time does the
    processing take?

     
    • pegasus2000

      pegasus2000 - 2008-12-29

      > The initial footprint for this task is around 10 MB. To use it, you just
      > need to build a dictionary with the required words and write a jsgf. That's
      > all. You can use existing acoustic models.

      OK, thanks for the information. Can you tell me where I can
      find a how-to for these operations? For example, how can I
      build a dictionary? Using the CMU site? (I have done a similar
      thing in the document that I've written, stage 8; is that the
      operation you are referring to?)

      And how can I write the jsgf? And which operations do I have to
      do in pocketsphinx_continuous after all?

      > For a small dictionary, the model can be made smaller by using a smaller
      > number of senones. The tidigits model is 800 kB, but it's not trivial to
      > build. Tidigits was trained from the data of 300 speakers. None of your
      > users will be able to do this correctly.

      Can the tidigits model also be used with words like "recognizer",
      "mail", etc.? I believed that it could be used only for the terms
      one, two, three, etc.

       
      • Nickolay V. Shmyrev

        > For example, how can I build a dictionary?

        # pull the pronunciation of each listed word out of cmudict
        for word in $(cat wordlist); do grep -w "$word" cmudict.0.6d; done > your.dict

        > And how can I write the jsgf

        http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html
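
        For example, a minimal grammar covering the three ENABLE commands you
        listed at the start of the thread might look something like this (the
        grammar name and layout here are only illustrative):

        #JSGF V1.0;

        grammar commands;

        public <command> = ENABLE ( ROOM RECOGNITION | FACE RECOGNITION | OCR SYSTEM );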

        > Which operations do I have to do in pocketsphinx_continuous after all?

        I'm not really sure what "in pocketsphinx_continuous" means

        > I believed that it could be used only for the terms one, two, three, etc.

        Yes, it's only supposed to recognize digits. It's still not a trivial model, though.

         
    • suresh chandra sekaran

      As far as feature extraction is concerned, you don't need to worry about the format of your audio files. By default, make_feats.pl is set up for the "sph" audio format. So open the make_feats.pl file and change the extension to the format you want (the extension is mentioned at line 77 of make_feats.pl), and that's all. Now you can convert your raw files to features. Moreover, the feature extraction process won't take more than five minutes to convert some 100 wave files of approximately 5 MB each.
       
    • pegasus2000

      pegasus2000 - 2008-12-29

      Thanks for your help. We are working on it, but the training
      operation appears very difficult.

      We're writing a guide that assists Nanodesktop users step by
      step with the training operation.

      I have something to ask: could you look at the manual and correct
      any errors? You can download it here:

      http://rapidshare.com/files/177865262/Nanodesktop_ndPocketSphinx_User_Guide.odt.html

      I'm trying to work around the various problems that I've found
      during training using the documentation.

      At the current stage, I've tried to execute the RunAll.pl script,
      but the system fails and reports:

      CTL line does not parse correctly

      I've uploaded the .html file that contains the SphinxTrain log here:

      http://rapidshare.com/files/177866192/FirstDictionary.html

      I don't understand why it fails. The operations that I have performed
      are the same as those indicated in the guide.

      I have another question. Actually, for me it would be sufficient
      to have a utility that creates a subset of an existing vocal
      dictionary (like the WSJ dictionary), containing only the
      diphones of my few words.

      Is there a procedure that can be used to build such a reduced
      subset of an existing vocal dictionary, thus avoiding the complex
      procedure of training from the wave files?

       
      • Nickolay V. Shmyrev

        > We're writing a guide that assists Nanodesktop users step by
        > step with the training operation.

        It's like giving a guide on how to play the piano to people who just want to listen to music. It will take significant time to do the training properly. It's not recommended to train your own model.

        You are doing many things wrong in your howto. For example, you are using the English phoneset with Italian words. For a limited vocabulary, things must be done differently, as described in our FAQ.

        Other things are wrong as well:

        1) You don't need to set up the Perl INC path; it's done automatically.
        2) You don't need to convert your files to sph; Sphinx can work with wav directly.
        3) The data for training even a five-word database must be around 10 hours of speech from 200 speakers. Your five utterances in the ctl file are certainly not enough.
        4) It's much easier to use a jsgf instead of a language model.

        > I have another question. Actually, for me it would be sufficient
        > to have a utility that creates a subset of an existing vocal
        > dictionary (like the WSJ dictionary), containing only the
        > diphones of my few words.

        Your guide has an enormous amount of incorrect terminology. There is no such thing as a "vocal" dictionary. The cmudict is a phonetic dictionary. This dictionary has no diphones; a diphone is a completely different thing. It contains transcriptions of the words with English phones from the CMU phoneset.

        I don't quite understand your request to create a "subset" of the dictionary. You can just do it with grep or a little python script.
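
        For example, a little Python sketch along these lines would do it (the
        file names "wordlist", "cmudict.0.6d" and "your.dict" are just
        placeholders for your own files):

        # Keep only the cmudict entries for the words you actually use.
        wanted = set()
        with open("wordlist") as f:
            for line in f:
                word = line.strip().upper()
                if word:
                    wanted.add(word)

        with open("cmudict.0.6d") as src, open("your.dict", "w") as dst:
            for line in src:
                parts = line.split()
                if not parts or line.startswith(";;;"):  # skip blanks and comments
                    continue
                base = parts[0].split("(")[0]            # keep WORD(2) alternate pronunciations
                if base in wanted:
                    dst.write(line)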

        > WARNING: CTL line does not parse correctly:

        Your fileids file has an empty line at the end. That's not allowed.
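
        For reference (the names and the wav/ prefix below are made up, adapt
        them to your own layout), the fileids file should contain nothing but
        one utterance id per line, without the file extension and without a
        trailing blank line:

        wav/WAVE0001
        wav/WAVE0002
        wav/WAVE0003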

        In general, you probably need something different from this guide. Try to define your requirements first.

         
        • pegasus2000

          pegasus2000 - 2008-12-29

          The central point is this. Let's suppose
          I want to write an nd application that
          manages mail.

          The program has voice control and the set
          of voice commands is this:

          SEND MY MAIL
          RECEIVE MY MAIL
          DELETE ALL MAILS
          HALT THE SYSTEM

          So, I need Sphinx to recognize these
          commands, but without wasting memory
          (only the data required by the words
          that are part of my command set must be
          in RAM, not the data for the whole wsj
          dictionary).

          What is the right procedure in this case?

          I thought that the user had to create a new
          dictionary using the procedure that I've
          found here:

          http://www.ce.unipr.it/~ghelfi/Sphinx/body.php?page=train

           
          • Nickolay V. Shmyrev

            > So, I need Sphinx to recognize these commands, but without wasting memory (only the data required by the words that are part of my command set must be in RAM, not the data for the whole wsj dictionary).

            The initial footprint for this task is around 10 MB. To use it, you just need to build a dictionary with the required words and write a jsgf. That's all. You can use existing acoustic models.
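
            For example, the dictionary for your command set is just a plain text file with one word per line followed by its phones, taken from cmudict (the entries below follow cmudict with the stress markers stripped; double-check them against the dictionary that ships with your acoustic model):

            ALL     AO L
            MAIL    M EY L
            MAILS   M EY L Z
            MY      M AY
            SEND    S EH N D
            ...

            A jsgf for the four command phrases can then be written along the lines of the grammar sketch earlier in the thread.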

            The memory footprint can be reduced to around 3 MB, but it requires some advanced coding. There is http://www.cs.berkeley.edu/~eomer/SphinxTiny/SphinxTiny-0.7.html for example.

            For a small dictionary, the model can be made smaller by using a smaller number of senones. The tidigits model is 800 kB, but it's not trivial to build. Tidigits was trained from the data of 300 speakers. None of your users will be able to do this correctly.

            > http://www.ce.unipr.it/~ghelfi/Sphinx/body.php?page=train

            It's not related to your task.

             
