Dear All
I have built an acoustic model and a language model for a small-vocabulary single-word recogniser. I am now testing it using the Python test frame in sphinx3/python/_sphinx3_test.py (changing filenames as appropriate).
Sphinx reads in the acoustic model and the various ancillary files, but rejects the LM. The LM is an ARPA/text-format LM created by following the "Typical Use" section of the cmuclmtk documentation. As I'm building a single-word recogniser, I set n=1 (in both text2idngram and idngram2lm; the commands are sketched below, after the log). Here's the relevant Sphinx output:
...
INFO: lm.c(606): LM read('model/lm/wordrec.lm_arpa', lw= 9.50, wip= 0.70, uw= 0.70)
INFO: lm.c(608): Reading LM file model/lm/wordrec.lm_arpa (LM name "default")
INFO: lm_3g_dmp.c(471): Bad magic number: 589505315(23232323), not an LM dumpfile??
ERROR: "lm_3g_dmp.c", line 1274: Error in reading the header of the DUMP file.
INFO: lm.c(616): In lm_read, LM is not a DMP file. Trying to read it as a txt file
WARNING: "lm.c", line 618: On-disk LM not supported for text files, reading it into memory.
INFO: lm_3g.c(831): Reading LM file model/lm/wordrec.lm_arpa
WARNING: "lm_3g.c", line 261: Bad or missing ngram count
WARNING: "lm_3g.c", line 842: Couldnt' read the ngram count
INFO: lm.c(636): Lm is both not DMP and TXT format
FATAL_ERROR: "lmset.c", line 295: lm_read_advance(model/lm/wordrec.lm_arpa, 9.500000e+00, 7.000000e-01, 7.000000e-01 20 [Arbitrary Fmt], Weighted Apply) failed
Using a binary LM produces the same error, as does attempting to convert the ARPA LM to DMP format with sphinx3_lm_convert.
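Roughly, the cmuclmtk commands were as follows (filenames are illustrative and this is a sketch from memory rather than a verbatim transcript):

text2idngram -vocab wordrec.vocab -idngram wordrec.idngram -n 1 < wordrec.txt
idngram2lm -vocab_type 0 -n 1 -idngram wordrec.idngram -vocab wordrec.vocab -arpa wordrec.lm_arpa

(-vocab_type 0 gives the closed-vocabulary model described in the LM header below.)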
The actual LM I used is below, in the p.s.
I have retried with a trigram model. Sphinx now reads in the LM, but the _sphinx3.init() function in _sphinx3module.c generates the following error and actually crashes Python (the last line is from the Python interpreter):
...
INFO: kb.c(306): SEARCH MODE INDEX 4
INFO: srch.c(373): Search Initialization.
WARNING: "srch_time_switch_tree.c", line 283: -Nstalextree is omitted in TST search.
INFO: lextree.c(222): Creating Unigram Table for lm (name: default)
INFO: lextree.c(235): Size of word table after unigram + words in class: 17.
INFO: lextree.c(244): Size of word table after adding alternative prons: 17.
Assertion failed: (n_emit_state <= MAX_HMM_NSTATE), function hmm_context_init, file hmm.c, line 111.
Abort trap
Can anyone suggest what I might be doing wrong, or see what is wrong with the LM below?
Thanks.
Best wishes
Ivan
p.s.: the LM I am using:
Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
Ronald Rosenfeld and Philip Clarkson
Version 3, Copyright (c) 2006, Carnegie Mellon University
Contributors includes Wen Xu, Ananlada Chotimongkol,
David Huggins-Daines, Arthur Chan and Alan Black
=============================================================================
=============== This file was produced by the CMU-Cambridge ===============
=============== Statistical Language Modeling Toolkit ===============
=============================================================================
This is a 1-gram language model, based on a vocabulary of 17 words,
which begins "eight", "eleven", "fifteen"...
This is a CLOSED-vocabulary model
(OOVs eliminated from training data and are forbidden in test data)
Good-Turing discounting was applied.
1-gram frequency of frequency : 18
1-gram discounting ratios :
This file is in the ARPA-standard format introduced by Doug Paul.
p(wd3|wd1,wd2)= if(trigram exists)           p_3(wd1,wd2,wd3)
                else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
                else                         p(wd3|w2)
p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
            else              bo_wt_1(wd1)*p_1(wd2)
All probs and back-off weights (bo_wt) are given in log10 form.
Data formats:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
\1-grams:
p_1 wd_1
end of data mark: \end\
\data\
ngram 1=17
\1-grams:
-1.2304 eight
-1.2304 eleven
-1.2304 fifteen
-1.2304 five
-1.2304 four
-1.2304 fourteen
-1.2304 nine
-1.2304 one
-1.2304 seven
-1.2304 seventeen
-1.2304 six
-1.2304 sixteen
-1.2304 ten
-1.2304 thirteen
-1.2304 three
-1.2304 twelve
-1.2304 two
\end\
Hi,
This is, unfortunately, a known bug in Sphinx3. It is unable to use unigram language models. One work-around for this is to use the finite-state grammar search mode instead.
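For a single-word recogniser the grammar is trivial: a start state, a final state, and one transition per word. A minimal grammar in the Sphinx FSG format would look something like this (a sketch only; the word list is truncated, and the uniform 1/17 probabilities are just one reasonable choice):

FSG_BEGIN wordrec
NUM_STATES 2
START_STATE 0
FINAL_STATE 1
# one TRANSITION per vocabulary word: from-state to-state probability word
TRANSITION 0 1 0.058824 eight
TRANSITION 0 1 0.058824 eleven
TRANSITION 0 1 0.058824 fifteen
... (one line for each of the remaining words) ...
TRANSITION 0 1 0.058824 two
FSG_END

If I remember the decoder flags correctly, you then select the FSG search with -op_mode 2 and point -fsg at the grammar file, but do check the exact flag names against your Sphinx3 version.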
Thanks! I'll give it a go.
Ivan