
lm_3g.c not reading my LM

Forum: Help | Created: 2009-01-05 | Updated: 2012-09-22
  • Ivan Uemlianin - 2009-01-05

    Dear All

    I have built an acoustic model and a language model for a small-vocabulary, single-word recogniser. I am now testing it using the Python test script sphinx3/python/_sphinx3_test.py (changing filenames as appropriate).

    Sphinx reads in the acoustic model and the various ancillary files, but rejects the LM. The LM is an ARPA/text-format LM created by following the Typical Use section of the cmuclmtk documentation. As I'm building a single-word recogniser, I set n=1 (i.e., in text2idngram and idngram2lm; the build commands are sketched below, after the log output). Here's the relevant Sphinx output:

    ...
    INFO: lm.c(606): LM read('model/lm/wordrec.lm_arpa', lw= 9.50, wip= 0.70, uw= 0.70)
    INFO: lm.c(608): Reading LM file model/lm/wordrec.lm_arpa (LM name "default")
    INFO: lm_3g_dmp.c(471): Bad magic number: 589505315(23232323), not an LM dumpfile??
    ERROR: "lm_3g_dmp.c", line 1274: Error in reading the header of the DUMP file.
    INFO: lm.c(616): In lm_read, LM is not a DMP file. Trying to read it as a txt file
    WARNING: "lm.c", line 618: On-disk LM not supported for text files, reading it into memory.
    INFO: lm_3g.c(831): Reading LM file model/lm/wordrec.lm_arpa
    WARNING: "lm_3g.c", line 261: Bad or missing ngram count
    WARNING: "lm_3g.c", line 842: Couldnt' read the ngram count
    INFO: lm.c(636): Lm is both not DMP and TXT format
    FATAL_ERROR: "lmset.c", line 295: lm_read_advance(model/lm/wordrec.lm_arpa, 9.500000e+00, 7.000000e-01, 7.000000e-01 20 [Arbitrary Fmt], Weighted Apply) failed
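
    For reference, I built the LM with the usual cmuclmtk pipeline, roughly as follows (filenames are illustrative, not my exact paths; -vocab_type 0 for the closed vocabulary):

        text2wfreq < corpus.txt > corpus.wfreq
        wfreq2vocab < corpus.wfreq > corpus.vocab
        text2idngram -vocab corpus.vocab -n 1 -idngram corpus.idngram < corpus.txt
        idngram2lm -idngram corpus.idngram -vocab corpus.vocab -vocab_type 0 -n 1 -arpa model/lm/wordrec.lm_arpa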

    Using a binary LM produces the same error, as does attempting to convert the ARPA LM to DMP format with sphinx3_lm_convert.
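
    The conversion attempt was along these lines (exact flag names may differ; check sphinx3_lm_convert -help), and it fails with the same "Bad or missing ngram count" warning:

        sphinx3_lm_convert -input model/lm/wordrec.lm_arpa -inputfmt TXT -output model/lm/wordrec.lm.DMP -outputfmt DMP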

    The actual LM I used is included below, in the p.s.

    I have retried with a trigram model. Sphinx reads in the LM, but the _sphinx3.init() function in _sphinx3module.c generates the following error and actually crashes Python (the last line is from the Python interpreter):

    ...
    INFO: kb.c(306): SEARCH MODE INDEX 4
    INFO: srch.c(373): Search Initialization.
    WARNING: "srch_time_switch_tree.c", line 283: -Nstalextree is omitted in TST search.
    INFO: lextree.c(222): Creating Unigram Table for lm (name: default)
    INFO: lextree.c(235): Size of word table after unigram + words in class: 17.
    INFO: lextree.c(244): Size of word table after adding alternative prons: 17.
    Assertion failed: (n_emit_state <= MAX_HMM_NSTATE), function hmm_context_init, file hmm.c, line 111.
    Abort trap

    Can anyone suggest what I might be doing wrong, or see what is wrong with the LM below?

    Thanks.

    Best wishes

    Ivan

    p.s.: the LM I am using:

    Ronald Rosenfeld and Philip Clarkson
    Contributors includes Wen Xu, Ananlada Chotimongkol,
    David Huggins-Daines, Arthur Chan and Alan Black

    =============================================================================
    =============== This file was produced by the CMU-Cambridge ===============
    =============== Statistical Language Modeling Toolkit ===============
    =============================================================================
    This is a 1-gram language model, based on a vocabulary of 17 words,
    which begins "eight", "eleven", "fifteen"...
    This is a CLOSED-vocabulary model
    (OOVs eliminated from training data and are forbidden in test data)
    Good-Turing discounting was applied.
    1-gram frequency of frequency : 18
    1-gram discounting ratios :
    This file is in the ARPA-standard format introduced by Doug Paul.

    p(wd3|wd1,wd2)= if(trigram exists)           p_3(wd1,wd2,wd3)
                    else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
                    else                         p(wd3|wd2)

    p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
                else              bo_wt_1(wd1)*p_1(wd2)
    All probs and back-off weights (bo_wt) are given in log10 form.

    Data formats:

    Beginning of data mark: \data\
    ngram 1=nr # number of 1-grams

    \1-grams:
    p_1 wd_1

    end of data mark: \end\

    \data\
    ngram 1=17

    \1-grams:
    -1.2304 eight
    -1.2304 eleven
    -1.2304 fifteen
    -1.2304 five
    -1.2304 four
    -1.2304 fourteen
    -1.2304 nine
    -1.2304 one
    -1.2304 seven
    -1.2304 seventeen
    -1.2304 six
    -1.2304 sixteen
    -1.2304 ten
    -1.2304 thirteen
    -1.2304 three
    -1.2304 twelve
    -1.2304 two

    \end\

    • David Huggins-Daines

      Hi,

      This is, unfortunately, a known bug in Sphinx3: it cannot use unigram language models. One workaround is to use the finite-state grammar (FSG) search mode instead.
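
      For a single-word vocabulary like yours, the grammar is just a loop of one-word paths. A sketch of what the FSG file could look like (word list abbreviated; probabilities illustrative, 1/17 each):

          FSG_BEGIN wordrec
          NUM_STATES 2
          START_STATE 0
          FINAL_STATE 1
          TRANSITION 0 1 0.0588 eight
          TRANSITION 0 1 0.0588 eleven
          ...
          TRANSITION 0 1 0.0588 two
          FSG_END

      You would then run the decoder in FSG search mode (in sphinx3, -op_mode 2) with -fsg pointing at this file; check your decoder's -help for the exact option names.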

    • Ivan Uemlianin - 2009-01-09

      Thanks! I'll give it a go.

      Ivan

