
sphinx3_livepretend does not work correctly

chotty
2007-11-28
2012-09-22
  • chotty

    chotty - 2007-11-28

    Hi all,

    I just started experimenting with sphinx3_livepretend and am running into some problems with this tool.

    I did the following steps:

    1. Installation of sphinxbase and sphinx3 under Linux
    2. I downloaded the open source models from http://www.speech.cs.cmu.edu/sphinx/models/
      cmudict.06d
      fillerdict
      language_model.arpaformat.DMP
      language_model.vocabulary

    3. Extracted the audio stream from a WMV video file to .wav using ffmpeg

    4. Converted this .wav file to .raw with a sample rate of 16000 Hz using sox.

    5. Executed
      $SPHINX_ROOT/sphinx3/src/programs/sphinx3_livepretend \
        $SPHINX_ROOT/wav/wavfiles_noext.txt \
        $AUDIO_DIRECTORY \
        $SPHINX_ROOT/_CFG \
        &> /dev/stdout | tee dump.txt

    $SPHINX_ROOT/_CFG contains

    -samprate 16000
    -hmm /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd
    -dict /usr/local/mir/sphinx/cmu/cmudict.06d
    -fdict /usr/local/mir/sphinx/cmu/fillerdict
    -lm /usr/local/mir/sphinx/cmu/language_model.arpaformat.DMP
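    In case it helps anyone reproducing this, steps 3 and 4 can be sketched like so. All filenames are placeholders; the ffmpeg line is shown as a comment since it needs a real video file, so the example synthesizes a test tone with sox instead to stay self-contained:

    ```shell
    # Step 3 would be something like this (needs an actual video, so left as a comment):
    #   ffmpeg -i input.wmv -vn -ac 1 -ar 16000 audio.wav
    # To keep the example runnable, synthesize a 1-second 440 Hz tone instead:
    sox -n -r 16000 -c 1 -b 16 audio.wav synth 1 sine 440

    # Step 4: strip the WAV header down to 16-bit signed raw PCM,
    # which is what sphinx3_livepretend reads by default
    sox audio.wav -t raw -r 16000 -c 1 -b 16 -e signed-integer audio.raw
    ```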

    My output file, which I uploaded to http://us.share.geocities.com/ww.ranger/sphinx_output.txt, contains several errors:

    INFO: kbcore.c(404): Begin Initialization of Core Models:
    INFO: cmd_ln.c(599): Cannot open configuration file /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/feat.params for reading

    > What is this file for and how do I include it?

    INFO: cont_mgau.c(505): Reading mixture weights file '/usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights'
    ERROR: "cont_mgau.c", line 645: Weight normalization failed for 3 senones

    ...

    INFO: lm.c(681): The LM routine is operating at 16 bits mode
    ERROR: "wid.c", line 282: <UNK> is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABIDJAN is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABIMAEL is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABIQUIU is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABRIDGING is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABSCOND is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABSCONDED is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABSCONDING is not a word in dictionary and it is not a class tag.
    ....
    ERROR: "wid.c", line 282: ZOOLOGISTS is not a word in dictionary and it is not a class tag.
    INFO: wid.c(292): 711 LM words not in dictionary; ignored

    > What's wrong with these (711!) words?

    The program stops with the message:

    lt-sphinx3_livepretend: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.

    What could be the reason for this? I thought that maybe the audio file is corrupt? I read somewhere that using ffmpeg and sox together might cause problems!?

    Thanks a lot

     
    • Masrur Doostdar

      Masrur Doostdar - 2007-12-01

      Hi chotty,

      I made a test with your configuration:

      INFO: kbcore.c(404): Begin Initialization of Core Models:
      INFO: cmd_ln.c(599): Cannot open configuration file /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/feat.params for reading

      > What is this file for and how do I include it?

      I don't get this error. But I addressed the model files individually in the config file. I hadn't heard of the feat.params file before. I don't think it's important...

      > INFO: cont_mgau.c(505): Reading mixture weights file '/usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights'
      > ERROR: "cont_mgau.c", line 645: Weight normalization failed for 3 senones

      I get this one too. I don't know what the problem is...

      >INFO: lm.c(681): The LM routine is operating at 16 bits mode
      >ERROR: "wid.c", line 282: <UNK> is not a word in dictionary and it is not a class tag.
      ...
      > What's wrong with these (711!) words?

      I also get this error. It seems the language model file contains words that the dictionary does not. But I think this won't cause problems if you don't need to recognize these fairly infrequent words.
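      If you want to see exactly which LM words the dictionary is missing, standard coreutils are enough. The file names and contents below are toy stand-ins for language_model.vocabulary and cmudict, just to illustrate the idea:

      ```shell
      # Toy stand-ins for the real vocabulary and dictionary files
      printf 'ABIDJAN\nHELLO\nWORLD\n' | sort > vocab.txt
      printf 'HELLO HH AH L OW\nWORLD W ER L D\n' > dict.txt

      # The first field of each dictionary line is the word; compare sorted lists.
      # Lines only in vocab.txt are LM words the dictionary lacks.
      awk '{print $1}' dict.txt | sort > dictwords.txt
      comm -23 vocab.txt dictwords.txt     # prints: ABIDJAN
      ```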

      > lt-sphinx3_livepretend: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.

      > What could be the reason for this? I thought that maybe the audio file is corrupt? I read somewhere that it might cause problems to use ffmpeg and sox together!?

      I had a look at your output file. I think your utterance is just too long.
      But what I'm interested in is whether the output transcribed so far has any similarity to the audio file. If yes, do you have an estimated WER? My experience so far is that with this big language model and dictionary, I cannot hope to get output that is anywhere near the spoken text. I don't know if I'm doing something wrong...

      regards

       
      • Nickolay V. Shmyrev

        > INFO: cmd_ln.c(599): Cannot open configuration file /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/feat.params for reading
        > What is this file for and how do I include it?
        >
        > I don't get this error. But i adressed the model files individually in the config-file. Never heard about feat.params file yet. Dont think its important...

        This file has been created by SphinxTrain since this summer and describes the feature extraction settings the model was created with. For older models like hub4 it's not required.
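        For reference, a feat.params written by newer SphinxTrain versions is just a plain list of feature-extraction flags, one per line. The values below only illustrate the format (hub4 predates the file, so these are not its real settings):

        ```
        -feat 1s_c_d_dd
        -agc none
        -cmn current
        -varnorm no
        -lowerf 133.33334
        -upperf 6855.4976
        -nfilt 40
        ```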

        >> What could be the reason for this? I thought that maybe the audio file is corrupt? I read somewhere that it might cause problems to use ffmpeg and sox together!?
        > I had a look at your output file. Think your utterance is just too long.

        Exactly. livepretend decodes only short utterances; if you need to decode long recordings, use sphinx3_continuous instead.

        > i can not hope to get an output being near to the spoken text in any kind.

        For a big vocabulary you can't expect much; even the very advanced IBM decoder gets only around 80% of words correct on the Switchboard test set. I also suggest you use wsj instead of hub4 now. The wsj models were trained on more data and are more up to date than hub4, which is very outdated.

         
        • chotty

          chotty - 2007-12-03

          Yep, the recognition is quite bad, unfortunately. As I am trying to get a transcript of (British) news videos, which model and/or dictionary would you recommend? I am currently using the following:

          -mdef /usr/local/sphinx/model_architecture/wsj_all_cont_3no_8000.mdef \
          -mean /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/means \
          -var /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/variances \
          -mixw /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/mixture_weights \
          -tmat /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/transition_matrices \
          -dict /usr/local/sphinx/lm_giga_5k_nvp_3gram/lm_giga_5k_nvp.sphinx.dic \
          -fdict /usr/local/sphinx/lm_giga_5k_nvp_3gram/lm_giga_5k_nvp.sphinx.filler \
          -lm /usr/local/sphinx/lm_giga_5k_nvp_3gram/lm_giga_5k_nvp_3gram.arpa.DMP \
          -feat s3_1x39 \
          -lw 12.0 \
          -wip 0.2 \
          -beam 1e-60 \
          -wbeam 1e-50 \
          -pbeam 1e-60 \
          -maxhmmpf 20000 \
          -maxwpf 20 \
          -maxhistpf 100
           
          • Nickolay V. Shmyrev

            British English is very different from US English, so I wonder if you can get reasonable results with any US model.

             
            • David Huggins-Daines

              Yes, it is okay to use US English models for British English in command-and-control and simple dialog systems, but not so much for dictation or transcription.

              At the very least you'd want to create a new dictionary which maps the British pronunciations to some approximation of them using American phones, e.g.

              ORDER AO D AH

              instead of

              ORDER AO R D ER
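              A crude first pass at such a mapping can even be scripted, for example dropping a postvocalic R and rewriting a final ER as AH. This sed one-liner is only a sketch to show the idea; a usable British dictionary would need real phonological care and hand checking:

              ```shell
              # Rewrite one cmudict-style line: "ORDER AO R D ER" -> "ORDER AO D AH".
              # s/ ER$/ AH/ maps a word-final ER phone to AH;
              # s/ R / /g then drops remaining (postvocalic) R phones.
              echo 'ORDER AO R D ER' | sed -E 's/ ER$/ AH/; s/ R / /g'
              ```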

               
            • chotty

              chotty - 2007-12-03

              Yep, I noticed that. :( But which model and dictionary would you use for American news videos then?

               
    • David Huggins-Daines

      I should mention that the dictionary-mapping method can actually be surprisingly effective for simple tasks even across languages - for example, one researcher here at CMU has been using English models to recognize a limited vocabulary in Urdu.

      But for dictation and transcription, where the language model is not so constrained, it is not going to work very well, although acoustic model adaptation can get you a pretty long way.

       
