Sphinx 3 language model problem

2010-10-25
2012-09-22
  • haojin yang

    haojin yang - 2010-10-25

    Hello everyone,

    I have trained acoustic models using the Sphinx trainer. In order to evaluate
    them I also created several language models. With an LM trained on the
    transcripts, the Sphinx trainer's evaluation step finished successfully, but
    when I used a large LM trained on much more text, I got this error:

    INFO: utt.c(196): Processing: meinel_02_1300
    INFO: feat.c(1148): At directory
    /home/haojin/acoustictrain/trainingworkspace/tutorial/hpi_de/feat
    INFO: feat.c(378): Reading mfc file: '/home/haojin/acoustictrain/trainingworks
    pace/tutorial/hpi_de/feat/meinel/02/meinel_02_1300.mfc'
    INFO: cmn.c(175): CMN: 12.73 0.36 -0.12 0.04 -0.42 -0.28 -0.16 -0.28 -0.29
    -0.34 -0.26 -0.07 -0.15
    .FATAL_ERROR: "lm.c", line 1260: Bad lw2 argument (268435455) to lm_bg_score

    The Sphinx trainer was checked out from SVN, so it should be the latest
    version. Is there a language model parameter that needs to be set?

    Can anyone help me? Thanks in advance!

    Ian

     
  • Nickolay V. Shmyrev

    You can try converting the model to DMP32 format with sphinx3_lm_convert;
    that will probably work. There is little sense in using such a big model,
    though.

     
  • haojin yang

    haojin yang - 2010-10-25

    Thanks for your reply.

    I have tried to convert the model to DMP32 like this:
    ./sphinx3_lm_convert -i leipzig+train/de_LM_35419.arpa -ienc utf8 -o
    leipzig+train/de_LM_35419.DMP32 -oenc utf8 -ofmt DMP32

    But the same error occurred again, and the language model is not so big; the
    DMP32 file is about 78 MB. Can you give me more direction?

    Kind regards,
    Ian

     
  • Nickolay V. Shmyrev

    "big" here means that vocabulary is big, not the model is big itself. The
    typical solution is to split compound words in German. See for example:

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4359

    You also need to limit the lexicon size. Anything above 100k words will simply
    be slow; it's recommended to keep it around 60k.
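
    For example, a rough way to check the vocabulary size of an ARPA model
    (assuming its \data\ header is intact) is to read the unigram count, here
    using the file name from your earlier post:

    grep "ngram 1=" leipzig+train/de_LM_35419.arpa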

     
  • haojin yang

    haojin yang - 2010-10-25

    The paper is very interesting to me, thanks!
    Hmm, the lexicon I used for LM training contains just 35149 words, so it does
    not exceed the 65535 limit.

     
  • Nickolay V. Shmyrev

    That must be a bug then. Please zip your ARPA model, upload it somewhere, and
    send me a link to download it. My mail is nshmyrev@nexiwave.com

     
  • Nickolay V. Shmyrev

    Hello

    I've checked your model. It's a bug in the Sphinx code which I will probably
    fix soon.

    But I can already give you a few recommendations on how to proceed without the
    fix:

    1. Check the case of your fillers; there is no advantage in keeping both
    upper- and lower-case variants.
    2. Make sure you are training a closed-vocabulary model. That is the
    -vocab_type 1 option of idngram2lm, which is the default in the latest
    cmuclmtk (see the sketch below). Alternatively, you can filter all n-grams
    containing <unk> out of the model with SRILM. There is no sense in keeping
    them anyway; the decoder doesn't take any advantage of them.

    Once your model no longer contains <unk>, it will be much smaller and will
    work as expected.
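
    As a rough sketch (the file names are placeholders, not from your setup), the
    cmuclmtk pipeline with that option looks like this; check your version's
    idngram2lm documentation for the exact meaning of each -vocab_type value:

    text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
    text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
    idngram2lm -vocab_type 1 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa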

     
  • hindiasradmin

    hindiasradmin - 2010-12-28

    I have also faced the "Bad lw2 argument" error many times, even in cases where
    the vocabulary size was small. Is this bug fixed now? Where can we get the
    updated version?

     
