Hello,
I tried to train a test model to get familiar with SphinxTrain. I'm working under Windows XP with Cygwin. I did not use the win32 sources but built with the make program provided with Cygwin.
I had some expected problems with the way the tools handle files and strings under Windows and corrected them myself. But now I'm at a point where I have no idea where the failure could lie...
The failure occurs in the 6th module of the trainer, in the step where the tied states should be created. An assertion fails and I have no idea why...
I placed my modified code and the results at http://mitglied.lycos.de/Germi/cmu/ . test.rar contains the created files, along with test.html (the output log file and the results printed on the shell by Cygwin).
In SprinxTrain.rar you can find the CVS checkout with my small changes. I only changed code in:
src/libs/libcommon/acmod_set.c
src/libs/libcommon/mk_phone_list.c
src/libs/libcommon/mk_wordlist.c
src/libs/libio/corpus.c
src/libs/libs2io/areadfloat.c
src/programs/bw/baum_welch.c
src/programs/bw/main.c
src/programs/tiestate/main.c
Could you please give me a hint? Perhaps I just have some misunderstanding or something like that. Or, if you aren't the right contact person, could you give me the address of the right one who can help me?
Thanks a lot,
Sebastian
Using Windows XP, I don't seem to be able to interpret your .rar files. They are clearly not plain text files. Can you rewrite them as plain text so they'll be easier to read? Thanks.
ok, sorry for the confusion. I uploaded the plain text files in the same directory...
greetz,
Sebastian
Sebastian -- this is the correct forum for obtaining help. I'm not a "contact person", just someone with some experience using SphinxTrain. I took a quick look at your test directory. I don't understand the assertion failure either, but let me suggest:
Your training set consists of only 5 short utterances made up of only three distinct words (and 11 phones, including SIL). SphinxTrain is for training statistical models, and there's not much "statistics" that can be derived from such a tiny amount of speech! Perhaps you are starting so small to confirm that the algorithms work, but I suggest that this is far too small. The training algorithms may do degenerate things when working with such a tiny amount of data -- things may not scale down to the point that you have taken them.
Therefore, one thing to try would be to produce more training utterances -- perhaps 30 or more with that set of words. Better yet, expand to more than 3 words, and even more utterances.
You have $CFG_N_TIED_STATES set to 45. I don't know what a reasonable number of tied states might be, given the very small number of phones and phonetic contexts in your training data, but you might try setting this higher.
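For example, the relevant line in sphinx_train.cfg would look like this (the value 200 is purely illustrative; I don't know what the right number is for your data):
$CFG_N_TIED_STATES = 200;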
cheers,
jerry
Hi jerry,
first of all: thanks for the fast answer!
So, I think the best thing I can do at the moment is to try SphinxTrain on a Linux machine and with some more training data...
Just to understand better how SphinxTrain works: do you mean more than one speaker (wav file) for the same words (utterances), or a bigger vocabulary?
thx a lot,
Sebastian
I forgot one more thing. When I view your plain text files using my web browser, I see <return>s in places where I don't expect them. This would not be a problem in log files, but it could be in data files. For example, test/model_architecture/test.phonelist begins:
AO
D
EY
It should look like:
AO - - -
D - - -
EY - - -
And there are other examples in the model definition files (.mdef). If this is true on your machine, this could be causing problems. See for example, test/model_architecture/test.alltriphones.mdef. I see:
D D EY b n/a 1 36 37 38 N
D EY EY b n/a 1 39 40 41 N
D NG EY b n/a 1 42 43 44 N
D SIL EY b n/a 1 45 46 47 N
D SIL
EY b n/a 1 48 49 50 N
The last 2 (or 3) lines appear to show the same triphone "D SIL EY b" with distinct state id's, except that one line has an extra <return>. This is the sort of thing that can happen when porting unix code to the PC. I do not know Cygwin and how it may or may not handle these line-ending problems, but that could be another source of your problems.
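If the files do turn out to have DOS line endings, a quick way to strip the stray carriage returns under Cygwin or Linux is something like this (dos2unix would do the same job, if you have it):
tr -d '\r' < test/model_architecture/test.phonelist > test.phonelist.unix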
It depends on the goal of your acoustic model. The data requirements can be quite small if your goal is a very limited model, but for a general, speaker-independent model, the training set can grow to encompass tens of hours of speech by hundreds of speakers.
If you add more utterances by the same speaker, then your model will accumulate statistics on the different ways that this speaker says the words and phones. Since no two utterances, even of the same words, are alike, this is useful. Ideally, a training set should contain multiple examples of everything. If you wish your model to represent only how this one speaker says just this small set of words, this is OK, but if you wish your model to be usable for many different words, then you must expand your training set to include phones in many contexts.
If you add more utterances by different speakers, then the training set will show how many speakers realize those words and phones. Speech of many (100+) speakers is essential if you wish your model to be speaker-independent.
cheers,
jerry
Ok,
news: ;-)
I tried to install everything under Debian.
It worked better; not the whole thing yet (some ERROR messages occurred), but some of the failures noted above are gone...
Another question: I first tried to use .WAV files as input (created with Windows), but that was not a good idea... So I re-encoded the files with sox and used the "raw" switch... But there are seemingly still some problems... Perhaps a wrong setting with sox. So: which input files should I use, or what should the sox command look like?
Thanks,
Sebastian
P.S.: There is not much training data because this is only a test. The real training corpus is much larger, but for getting familiar with this program I would prefer the smaller one.
There was a time (prior to November 2004) when wave2feat would work with raw and NIST format input files only, but apparently MS .WAV capability was added (for 1-channel files only) on 23 Nov 04. I have used wave2feat with raw and NIST format files only.
If you wish to convert a single channel MS .wav file to raw, you don't need to specify a lot of format switches to sox:
sox foo.wav foo.raw
A very useful sox option to know about is -V ("verbose"?), which causes sox to print out the details of what it's doing:
> sox -V foo.wav foo.raw
sox: Detected file format type: wav
sox: Chunk fmt
sox: Chunk data
sox: Reading Wave file: Microsoft PCM format, 1 channel, 8000 samp/sec
sox: 16000 byte/sec, 2 block align, 16 bits/samp, 96218 data bytes
sox: Input file foo.wav: using sample rate 8000
size shorts, encoding signed (2's complement), 1 channel
sox: Input file foo.wav: comment "foo.wav"
sox: Output file foo.raw: using sample rate 8000
size shorts, encoding signed (2's complement), 1 channel
sox: Output file: comment "foo.wav"
I have used wave2feat with .wav files. Remember to set '-mswav yes'.
In case you're not aware of it, the script bin/make_feats is just a wrapper for wave2feat, set up for nist format audio.
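For illustration, a wave2feat call for one of your 8 kHz .wav files might look roughly like this (flag names from memory of the wave2feat usage message, so check them against wave2feat's own help output, and match the sample rate to your audio):
wave2feat -mswav yes -srate 8000 -i foo.wav -o foo.mfc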
Ivan
So, everything worked well... (I'm so happy :-) )
But what do I do now with it?
I tried to follow the instructions on
http://cmusphinx.sourceforge.net/sphinx4/doc/UsingSphinxTrainModels.html
But the described files have not been created... (e.g. cd_continous_...)
What do I do now with my trained files to use them with Sphinx4?
thx, greetz,
Sebastian
Sebastian -- If your SphinxTrain run went all the way through Module 07, you'll find the files that constitute the acoustic model in:
model_architecture/XXX.<n_senones>.mdef
model_parameters/XXX.cd_cont_<n_senones>_<n_gau>/*
where XXX is the name of your model, <n_senones> is the number of tied states, and <n_gau> is the number of Gaussian densities.
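For a continuous model, the files under that model_parameters directory typically have names like the following (standard SphinxTrain output; check your own run):
means
variances
mixture_weights
transition_matrices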
In order to use this data with Sphinx-4, you need to arrange these files (or copies of them) as described in steps 1 and 2 of "How to Use Models...", create a model.props file (step 3), and do the rest. It's a number of manual steps, and a typo anywhere is fatal.
I wrote a little perl script that automates steps 1, 2, and 3. See the Sphinx-4 Open Discussion thread http://sourceforge.net/forum/forum.php?thread_id=1259365&forum_id=382337
cheers,
jerry
So, OK... meanwhile I was able to build the JAR file with the Ant scripts in Sphinx...
That worked fine (with the help of your script, @jjwolf2, thanks a lot)...
But now, while trying to get something running with that trained data, I'm running into problems too :-(
I need something like the WavFile demo. So I decided to copy its files and change the places where the two examples differ...
But when I try to start it, I get the following result:
java -jar bin/Test.jar
Loading Recognizer...
Problem configuring WavFile: Property Exception component:'flatLinguist' property:'acousticModel' - Can't instantiate: acousticModel Can't find class edu.cmu.sphinx.model.acoustic.test_6000sen_8gau_13dCep_8k_31mel_200Hz_3500Hz.Model object:acousticModel
Property Exception component:'flatLinguist' property:'acousticModel' - Can't instantiate: acousticModel Can't find class edu.cmu.sphinx.model.acoustic.test_6000sen_8gau_13dCep_8k_31mel_200Hz_3500Hz.Model object:acousticModel
at edu.cmu.sphinx.util.props.ValidatingPropertySheet.getComponent(ValidatingPropertySheet.java:414)
at edu.cmu.sphinx.linguist.flat.FlatLinguist.setupAcousticModel(FlatLinguist.java:299)
at edu.cmu.sphinx.linguist.flat.FlatLinguist.newProperties(FlatLinguist.java:246)
at edu.cmu.sphinx.util.props.ConfigurationManager.lookup(ConfigurationManager.java:214)
at edu.cmu.sphinx.util.props.ValidatingPropertySheet.getComponent(ValidatingPropertySheet.java:403)
at edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager.newProperties(SimpleBreadthFirstSearchManager.java:180)
at edu.cmu.sphinx.util.props.ConfigurationManager.lookup(ConfigurationManager.java:214)
at edu.cmu.sphinx.util.props.ValidatingPropertySheet.getComponent(ValidatingPropertySheet.java:403)
at edu.cmu.sphinx.decoder.Decoder.newProperties(Decoder.java:71)
at edu.cmu.sphinx.util.props.ConfigurationManager.lookup(ConfigurationManager.java:214)
at edu.cmu.sphinx.util.props.ValidatingPropertySheet.getComponent(ValidatingPropertySheet.java:403)
at edu.cmu.sphinx.recognizer.Recognizer.newProperties(Recognizer.java:93)
at edu.cmu.sphinx.util.props.ConfigurationManager.lookup(ConfigurationManager.java:214)
at demo.sphinx.test.Test.main(Test.java:62)
Both the trained data and the changed demo (Test) are placed at http://mitglied.lycos.de/Germi/cmu/ .
Could you please look at it and tell me what I'm doing wrong... I think the failure lies in my misunderstanding of the config.xml file...
best greetz,
Sebastian
Hi Sebastian -- for one thing, test.Manifest must contain the new acoustic model .jar file (compare with wavfile/wavfile.Manifest).
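A sketch of what test.Manifest might then contain (the Main-Class is taken from your stack trace, but the relative jar paths and the model jar name are only guesses; copy the real layout from wavfile.Manifest):
Main-Class: demo.sphinx.test.Test
Class-Path: ../../lib/sphinx4.jar ../../lib/test_6000sen_8gau_13dCep_8k_31mel_200Hz_3500Hz.jar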
If that solves your "can't find the acoustic model" problem, there will be other problems when you try to run. The name of your new acoustic model suggests that it's for 8 kHz-sampled speech. Therefore, your Sphinx4 mfcFrontEnd must be configured to do the same signal processing on the input speech. For a previous discussion, see http://sourceforge.net/forum/message.php?msg_id=3010772 .
In streamDataSource, set sampleRate to 8000. Since you don't have an endpointer, you probably don't have to set the bytesPerRead. But you must set the melFilterBank properties to match the processing done in SphinxTrain -- I don't know what parameters you used in wave2feat, but the Sphinx4 front end properties must match them.
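For illustration, the relevant config.xml entries might look roughly like this (class and property names as I remember them from the Sphinx4 demos, and the filter bank values guessed from your model name, ..._31mel_200Hz_3500Hz; verify against the WavFile demo's config.xml and your actual wave2feat settings):
<component name="streamDataSource" type="edu.cmu.sphinx.frontend.util.StreamDataSource">
    <property name="sampleRate" value="8000"/>
</component>
<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
    <property name="numberFilters" value="31"/>
    <property name="minimumFrequency" value="200"/>
    <property name="maximumFrequency" value="3500"/>
</component>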
cheers,
jerry
Thanks... of course, I had forgotten to put in the link to the .jar file, and now it works.
But the recognition doesn't do what I expected...
The result is:
$ java -jar bin/Test.jar goodmorning-01.wav
Loading Recognizer...
Decoding /c:/bachelor/sphinx/sphinx4/goodmorning-01.wav
WAVE (.wav) file, byte length: 6458, data format: PCM_UNSIGNED 8000.0 Hz, 8 bit,
mono, 1 bytes/frame, , frame length: 6400
RESULT: 0019 -2,8779415E06 0,0000000E00 0,0000000E00 0,0000000E00 *S2_U1<UH[G,D]
>_P(good[SIL,G])-G110
with both a .au file and a .wav file as input...
What could that be? Is it because of the small amount of training data I have, or just a wrong setting in the config.xml file?
greetz,
Sebastian
Sebastian -- If you are still using an acoustic model trained from only 5 utterances, then I don't think you can hope for success using it in recognition.
But the "RESULT:" that is being output is not what one would expect, so I suspect something else is wrong. I suggest that you add some instrumentation http://cmusphinx.sourceforge.net/sphinx4/sphinx4-1.0beta/javadoc/edu/cmu/sphinx/instrumentation/doc-files/Instrumentation.html (start with the logger to get errors and warnings) so you can see what's happening.
cheers,
jerry
I'm so stupid... ;-)
The unexpected output was just the token, not the word...
Fixed and it works...
Yes, I have only 5 utterances, but that was just to get familiar with SphinxTrain.
Now I will go on and train with much more training data... :-)
Thanks a lot,
Sebastian
.. just another question... :-)
This isn't quite the right place, but I think it's not important enough to start a new thread...
The make_s4_models.pl script produces more than one model, one for each number of Gaussians (1, 2, 4, 8, 16)... What are they for? Which one should be used? And what are the differences between them?
Greetz,
Sebastian
When you train the tied-state CHMM using SphinxTrain Module 07, it first trains a model with 1 Gaussian for each state. Then it splits those Gaussians and trains a new 2-Gaussian model (that is, the output PDF is a Gaussian mixture model with 2 Gaussians), then 4, 8, ... until it reaches the desired $CFG_FINAL_N_DENSITIES. The question of how many Gaussians is optimum is an interesting one, which I've asked on the Open Discussion forum, and to which the best answer was "try it and see".
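In sphinx_train.cfg that target is a line like the following (the variable name is the one mentioned above; the value 8 is just an example):
$CFG_FINAL_N_DENSITIES = 8;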
cheers,
jerry