
sphinx3_livepretend does not work correctly

chotty
2007-11-28
2012-09-22
  • chotty

    chotty - 2007-11-28

    Hi all,

    I just started experimenting with sphinx3_livepretend and am running into some problems with this tool.

    I did the following steps:

    1. Installation of sphinxbase and sphinx3 under Linux
    2. I downloaded the open source models from http://www.speech.cs.cmu.edu/sphinx/models/
      cmudict.06d
      fillerdict
      language_model.arpaformat.DMP
      language_model.vocabulary

    3. Extracted the audio stream from a WMV video file to .wav using ffmpeg

    4. Converted this .wav file to .raw with a sample rate of 16000 Hz using sox.

    5. Executed
      $SPHINX_ROOT/sphinx3/src/programs/sphinx3_livepretend \
        $SPHINX_ROOT/wav/wavfiles_noext.txt \
        $AUDIO_DIRECTORY \
        $SPHINX_ROOT/_CFG \
        &> /dev/stdout | tee dump.txt

    $SPHINX_ROOT/_CFG contains

    -samprate 16000
    -hmm /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd
    -dict /usr/local/mir/sphinx/cmu/cmudict.06d
    -fdict /usr/local/mir/sphinx/cmu/fillerdict
    -lm /usr/local/mir/sphinx/cmu/language_model.arpaformat.DMP
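    In case it helps anyone reproducing this, steps 3 and 4 can be sketched like so. All filenames are placeholders; the ffmpeg line is shown as a comment since it needs a real video file, so the example synthesizes a test tone with sox instead to stay self-contained:

    ```shell
    # Step 3 would be something like this (needs an actual video, so left as a comment):
    #   ffmpeg -i input.wmv -vn -ac 1 -ar 16000 audio.wav
    # To keep the example runnable, synthesize a 1-second 440 Hz tone instead:
    sox -n -r 16000 -c 1 -b 16 audio.wav synth 1 sine 440

    # Step 4: strip the WAV header down to 16-bit signed raw PCM,
    # which is what sphinx3_livepretend reads by default
    sox audio.wav -t raw -r 16000 -c 1 -b 16 -e signed-integer audio.raw
    ```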

    My output file, which I uploaded to http://us.share.geocities.com/ww.ranger/sphinx_output.txt, contains several errors:

    INFO: kbcore.c(404): Begin Initialization of Core Models:
    INFO: cmd_ln.c(599): Cannot open configuration file /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/feat.params for reading

    > What is this file for and how do I include it?

    INFO: cont_mgau.c(505): Reading mixture weights file '/usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights'
    ERROR: "cont_mgau.c", line 645: Weight normalization failed for 3 senones

    ...

    INFO: lm.c(681): The LM routine is operating at 16 bits mode
    ERROR: "wid.c", line 282: <UNK> is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABIDJAN is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABIMAEL is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABIQUIU is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABRIDGING is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABSCOND is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABSCONDED is not a word in dictionary and it is not a class tag.
    ERROR: "wid.c", line 282: ABSCONDING is not a word in dictionary and it is not a class tag.
    ....
    ERROR: "wid.c", line 282: ZOOLOGISTS is not a word in dictionary and it is not a class tag.
    INFO: wid.c(292): 711 LM words not in dictionary; ignored

    > What's wrong with these (711!) words?

    The program stops with the message:

    lt-sphinx3_livepretend: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.

    What could be the reason for this? I thought that maybe the audio file is corrupt? I read somewhere that using ffmpeg and sox together might cause problems!?

    Thanks a lot

     
    • Masrur Doostdar

      Masrur Doostdar - 2007-12-01

      Hi chotty,

      I made a test with your configuration:

      INFO: kbcore.c(404): Begin Initialization of Core Models:
      INFO: cmd_ln.c(599): Cannot open configuration file /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/feat.params for reading

      > What is this file for and how do I include it?

      I don't get this error. But I addressed the model files individually in the config file. I hadn't heard of the feat.params file before. I don't think it's important...

      > INFO: cont_mgau.c(505): Reading mixture weights file '/usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights'
      > ERROR: "cont_mgau.c", line 645: Weight normalization failed for 3 senones

      I get this one too. I don't know what the problem is...

      >INFO: lm.c(681): The LM routine is operating at 16 bits mode
      >ERROR: "wid.c", line 282: <UNK> is not a word in dictionary and it is not a class tag.
      ...
      > What's wrong with these (711!) words?

      I also get this error. It seems the language model file contains words that the dictionary does not. But I think this won't cause problems if you don't need to recognize these fairly infrequent words.
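      If you want to see exactly which LM words the dictionary is missing, standard coreutils are enough. The file names and contents below are toy stand-ins for language_model.vocabulary and cmudict, just to illustrate the idea:

      ```shell
      # Toy stand-ins for the real vocabulary and dictionary files
      printf 'ABIDJAN\nHELLO\nWORLD\n' | sort > vocab.txt
      printf 'HELLO HH AH L OW\nWORLD W ER L D\n' > dict.txt

      # The first field of each dictionary line is the word; compare sorted lists.
      # Lines only in vocab.txt are LM words the dictionary lacks.
      awk '{print $1}' dict.txt | sort > dictwords.txt
      comm -23 vocab.txt dictwords.txt     # prints: ABIDJAN
      ```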

      > lt-sphinx3_livepretend: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.

      > What could be the reason for this? I thought that maybe the audio file is corrupt? I read somewhere that it might cause problems to use ffmpeg and sox together!?

      I had a look at your output file. I think your utterance is just too long.
      But what I'm interested in is whether the output transcribed so far has any similarity to the audio file. If yes, do you have an estimated WER? My experience so far is that with this big language model and dictionary, I cannot hope to get output that is anywhere near the spoken text. I don't know if I'm doing something wrong...

      regards

       
      • Nickolay V. Shmyrev

        > INFO: cmd_ln.c(599): Cannot open configuration file /usr/local/mir/sphinx/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/feat.params for reading
        > What is this file for and how do I include it?
        >
        > I don't get this error. But i adressed the model files individually in the config-file. Never heard about feat.params file yet. Dont think its important...

        This file has been created by SphinxTrain since this summer and describes the feature extraction settings the model was created with. For older models like hub4 it's not required.
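        For reference, a feat.params written by newer SphinxTrain versions is just a plain list of feature-extraction flags, one per line. The values below only illustrate the format (hub4 predates the file, so these are not its real settings):

        ```
        -feat 1s_c_d_dd
        -agc none
        -cmn current
        -varnorm no
        -lowerf 133.33334
        -upperf 6855.4976
        -nfilt 40
        ```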

        >> What could be the reason for this? I thought that maybe the audio file is corrupt? I read somewhere that it might cause problems to use ffmpeg and sox together!?
        > I had a look at your output file. Think your utterance is just too long.

        Exactly. livepretend decodes only short utterances; if you need to decode long recordings, use sphinx3_continuous instead.

        > i can not hope to get an output being near to the spoken text in any kind.

        For a big vocabulary you can't expect much; even the very advanced IBM decoder gets only around 80% of words correct on the Switchboard test set. I also suggest you use wsj instead of hub4 now. The wsj models were trained on more data and are more up to date than hub4, which is very outdated.

         
        • chotty

          chotty - 2007-12-03

          Yep, the recognition is quite bad, unfortunately. As I am trying to get a transcript of (British) news videos, which model and/or dictionary would you recommend? I am currently using the following:

          -mdef /usr/local/sphinx/model_architecture/wsj_all_cont_3no_8000.mdef \
          -mean /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/means \
          -var /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/variances \
          -mixw /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/mixture_weights \
          -tmat /usr/local/sphinx/model_parameters/wsj_all_cont_3no_8000_16.cd/transition_matrices \
          -dict /usr/local/sphinx/lm_giga_5k_nvp_3gram/lm_giga_5k_nvp.sphinx.dic \
          -fdict /usr/local/sphinx/lm_giga_5k_nvp_3gram/lm_giga_5k_nvp.sphinx.filler \
          -lm /usr/local/sphinx/lm_giga_5k_nvp_3gram/lm_giga_5k_nvp_3gram.arpa.DMP \
          -feat s3_1x39 \
          -lw 12.0 \
          -wip 0.2 \
          -beam 1e-60 \
          -wbeam 1e-50 \
          -pbeam 1e-60 \
          -maxhmmpf 20000 \
          -maxwpf 20 \
          -maxhistpf 100
           
          • Nickolay V. Shmyrev

            British English is very different from US English, so I wonder if you can get reasonable results with any US model.

             
            • David Huggins-Daines

              Yes, it is okay to use US English models for British English in command-and-control and simple dialog systems, but not so much for dictation or transcription.

              At the very least you'd want to create a new dictionary which maps the British pronunciations to some approximation of them using American phones, e.g.

              ORDER AO D AH

              instead of

              ORDER AO R D ER
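              A crude first pass at such a mapping can even be scripted, for example dropping a postvocalic R and rewriting a final ER as AH. This sed one-liner is only a sketch to show the idea; a usable British dictionary would need real phonological care and hand checking:

              ```shell
              # Rewrite one cmudict-style line: "ORDER AO R D ER" -> "ORDER AO D AH".
              # s/ ER$/ AH/ maps a word-final ER phone to AH;
              # s/ R / /g then drops remaining (postvocalic) R phones.
              echo 'ORDER AO R D ER' | sed -E 's/ ER$/ AH/; s/ R / /g'
              ```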

               
            • chotty

              chotty - 2007-12-03

              Yep, I noticed that. :( But which model and dictionary would you use for American news videos then?

               
    • David Huggins-Daines

      I should mention that the dictionary-mapping method can actually be surprisingly effective for simple tasks even across languages - for example, one researcher here at CMU has been using English models to recognize a limited vocabulary in Urdu.

      But for dictation and transcription, where the language model is not so constrained, it is not going to work very well, although acoustic model adaptation can get you a pretty long way.

       
