Hello everyone,
I have trained acoustic models using the Sphinx trainer. In order to evaluate them, I have also created several language models. With an LM trained on the transcripts, the Sphinx trainer evaluation program finished successfully, but when I used a large LM trained on much more text, I got this error:
INFO: utt.c(196): Processing: meinel_02_1300
INFO: feat.c(1148): At directory /home/haojin/acoustictrain/trainingworkspace/tutorial/hpi_de/feat
INFO: feat.c(378): Reading mfc file: '/home/haojin/acoustictrain/trainingworkspace/tutorial/hpi_de/feat/meinel/02/meinel_02_1300.mfc'
INFO: cmn.c(175): CMN: 12.73 0.36 -0.12 0.04 -0.42 -0.28 -0.16 -0.28 -0.29 -0.34 -0.26 -0.07 -0.15
.FATAL_ERROR: "lm.c", line 1260: Bad lw2 argument (268435455) to lm_bg_score
The Sphinx trainer was checked out from SVN, so it should be the latest version.
Is there a language model parameter that should be set?
Can anyone help me? Thanks in advance!
Ian
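A side observation on the message itself: the offending lw2 value is not a random number. 268435455 is 2^28 - 1, an all-ones bit pattern (0x0FFFFFFF), which suggests an overflowed or sentinel packed value inside the LM code rather than a genuine score. This is plain arithmetic, nothing Sphinx-specific:

# 268435455 from "Bad lw2 argument (268435455)" is an all-ones
# 28-bit pattern, hinting at overflow/sentinel rather than a real score.
assert 268435455 == 2**28 - 1 == 0x0FFFFFFF
print(hex(268435455))  # -> 0xfffffff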
You can try to convert the model to DMP32 format with sphinx3_lm_convert; that will probably work. There is little sense in using such a big model, though.
Thanks for your reply.
I have tried to convert the model to DMP32 like this:
./sphinx3_lm_convert -i leipzig+train/de_LM_35419.arpa -ienc utf8 -o leipzig+train/de_LM_35419.DMP32 -oenc utf8 -ofmt DMP32
but the same error occurred again, and the language model is not that big; the DMP32 file is about 78 MB.
Can you give me more direction?
Best regards,
Ian
"big" here means that vocabulary is big, not the model is big itself. The
typical solution is to split compound words in German. See for example:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4359
You also need to limit the lexicon size. Something above 100k will be just
slow. It's recommended to keep it 60k.
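To illustrate the idea, here is a minimal greedy decompounding sketch. It is only an illustration, not the method from the paper above; the toy base lexicon, the minimum part length, and the linking-"s" handling are all assumptions:

def split_compound(word, base_vocab, min_len=3):
    """Greedily split `word` into parts found in base_vocab.
    Returns the parts, or the word itself if no full split exists."""
    w = word.lower()
    if w in base_vocab or len(w) < 2 * min_len:
        return [w]
    for i in range(min_len, len(w) - min_len + 1):
        head, tail = w[:i], w[i:]
        # also try dropping a German linking "s" (Fugen-s), e.g. Arbeits|amt
        for h in ((head, head[:-1]) if head.endswith("s") else (head,)):
            if h in base_vocab:
                rest = split_compound(tail, base_vocab, min_len)
                if all(p in base_vocab for p in rest):
                    return [h] + rest
    return [w]

base_vocab = {"arbeit", "amt", "sprach", "modell"}  # toy base lexicon
print(split_compound("Arbeitsamt", base_vocab))     # ['arbeit', 'amt']
print(split_compound("Sprachmodell", base_vocab))   # ['sprach', 'modell']

With splits like these, a large compound vocabulary collapses onto a much smaller base lexicon, which is exactly what keeps the word count under control.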
The paper is very interesting to me. Thanks!!
Hmm... the lexicon that I used for LM training contains just 35149 words, so it did not exceed the 65535 limit.
That must be a bug then. Please zip your ARPA model and upload it somewhere, then give me a link to download it. My mail is nshmyrev@nexiwave.com
Hello
I've checked your model. It's a bug in the Sphinx code which I will probably fix soon.
But I can already give you a few recommendations on how to proceed without the fix:
- Make sure your fillers are case-insensitive. There is no advantage to using both.
- Make sure you are training a closed-vocabulary model. That's the -vocab_type 1 option of idngram2lm, which is the default in the latest cmuclmtk. Alternatively, you can filter all n-grams containing <unk> from the model with SRILM (see the sketch below). There is no sense in using them anyway; the decoder doesn't take any advantage of them.
Once your model no longer contains <unk>, it will be much smaller and will work as expected.
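For the <unk> filtering, SRILM is the proper tool, since it can also renormalize the model. Purely to show what the filtering amounts to, here is a minimal Python sketch that drops every n-gram entry containing <unk> from an ARPA file and rewrites the count header. Unlike SRILM it does not renormalize, and the file names are just placeholders based on the commands above:

import re

def strip_unk(in_path, out_path, unk="<unk>"):
    """Drop every n-gram entry containing `unk` and rewrite the header."""
    sections = {}          # n-gram order -> surviving entry lines
    order = None
    with open(in_path, encoding="utf8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            m = re.match(r"\\(\d+)-grams:", line)
            if m:                                  # "\N-grams:" section start
                order = int(m.group(1))
                sections[order] = []
            elif order is not None and line.strip() and line != "\\end\\":
                if unk not in line.split():        # token-wise check of the entry
                    sections[order].append(line)
    with open(out_path, "w", encoding="utf8") as f:
        f.write("\\data\\\n")
        for n in sorted(sections):                 # updated "ngram N=count" header
            f.write("ngram %d=%d\n" % (n, len(sections[n])))
        for n in sorted(sections):
            f.write("\n\\%d-grams:\n" % n)
            f.write("\n".join(sections[n]) + "\n")
        f.write("\n\\end\\\n")

strip_unk("de_LM_35419.arpa", "de_LM_35419.nounk.arpa")

The counts in the \data\ header have to match the surviving entries, otherwise the conversion and decoding tools will reject the file.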
I have also faced the "Bad lw2 argument" error many times, even in cases where the vocabulary size was small. Is this bug fixed now? Where can we get the updated version?