Sphinx 3 language model problem

2010-10-25
2012-09-22
  • haojin yang

    haojin yang - 2010-10-25

    Hello everyone,

    I have trained acoustic models using the Sphinx trainer. In order to evaluate
    them I also created several language models. With an LM trained on the
    transcripts, the Sphinx trainer's evaluation step finished successfully, but
    when I used a large LM trained on much more text, I got this error:

    INFO: utt.c(196): Processing: meinel_02_1300
    INFO: feat.c(1148): At directory
    /home/haojin/acoustictrain/trainingworkspace/tutorial/hpi_de/feat
    INFO: feat.c(378): Reading mfc file: '/home/haojin/acoustictrain/trainingworks
    pace/tutorial/hpi_de/feat/meinel/02/meinel_02_1300.mfc'
    INFO: cmn.c(175): CMN: 12.73 0.36 -0.12 0.04 -0.42 -0.28 -0.16 -0.28 -0.29
    -0.34 -0.26 -0.07 -0.15
    .FATAL_ERROR: "lm.c", line 1260: Bad lw2 argument (268435455) to lm_bg_score

    The Sphinx trainer was checked out from SVN, so it should be the latest
    version. Is there a language model parameter that needs to be set?

    Can anyone help me? Thanks in advance!

    Ian

     
  • Nickolay V. Shmyrev

    You can try converting the model to DMP32 format with sphinx3_lm_convert;
    that will probably work. There is little sense in using such a big model,
    though.

     
  • haojin yang

    haojin yang - 2010-10-25

    Thanks for your reply.

    I have tried to convert the model to DMP32 like this:
    ./sphinx3_lm_convert -i leipzig+train/de_LM_35419.arpa -ienc utf8 -o
    leipzig+train/de_LM_35419.DMP32 -oenc utf8 -ofmt DMP32

    But the same error occurred again, and the language model is not so big; the
    DMP32 file is about 78 MB. Can you give me more direction?

    Kind regards,
    Ian

     
  • Nickolay V. Shmyrev

    "big" here means that vocabulary is big, not the model is big itself. The
    typical solution is to split compound words in German. See for example:

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4359

    You also need to limit the lexicon size. Anything above 100k words will simply
    be slow; it's recommended to keep it around 60k.
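
    For example, a rough way to check the vocabulary size of an ARPA model
    (assuming its \data\ header is intact) is to read the unigram count, here
    using the file name from your earlier post:

    grep "ngram 1=" leipzig+train/de_LM_35419.arpa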

     
  • haojin yang

    haojin yang - 2010-10-25

    The paper is very interesting to me, thanks!
    Hmm, the lexicon I used for LM training contains just 35149 words, so it does
    not exceed the 65535 limit.

     
  • Nickolay V. Shmyrev

    That must be a bug then. Please zip your ARPA model, upload it somewhere, and
    send me a link to download it. My mail is nshmyrev@nexiwave.com

     
  • Nickolay V. Shmyrev

    Hello

    I've checked your model. It's a bug in the Sphinx code which I will probably
    fix soon.

    But I can already give you a few recommendations on how to proceed without the
    fix:

    1. Check the case of your fillers; there is no advantage in keeping both
    upper- and lower-case variants.
    2. Make sure you are training a closed-vocabulary model. That is the
    -vocab_type 1 option of idngram2lm, which is the default in the latest
    cmuclmtk (see the sketch below). Alternatively, you can filter all n-grams
    containing <unk> out of the model with SRILM. There is no sense in keeping
    them anyway; the decoder doesn't take any advantage of them.

    Once your model no longer contains <unk>, it will be much smaller and will
    work as expected.
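
    As a rough sketch (the file names are placeholders, not from your setup), the
    cmuclmtk pipeline with that option looks like this; check your version's
    idngram2lm documentation for the exact meaning of each -vocab_type value:

    text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
    text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
    idngram2lm -vocab_type 1 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa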

     
  • hindiasradmin

    hindiasradmin - 2010-12-28

    I have also faced the "Bad lw2 argument" error many times, even in cases where
    the vocabulary size was small. Is this bug fixed now? Where can we get the
    updated version?

     
