
create language model

Arslan
2007-06-11
2012-09-22
  • Arslan

    Arslan - 2007-06-11

    Hi everyone,

    I am trying to create my own language model.
    I'm using the lmtool:
    http://www.speech.cs.cmu.edu/tools/lmtool.html

    It created a .lm language model, so I converted it to a .DMP file using lm3g2dmp.

    Here is my problem: when I start the sphinx_decoder application using Sphinx3 and my .DMP file, it gives me this error message:

    FATAL_ERROR: "lextree.c", line 243: 0 active words in default

    Do you have any idea what I might have done wrong, or whether I have missed a step?

    Thanks
    Best regards

     
    • Nickolay V. Shmyrev

      You should update the dictionary with the words from the language model. Look into the .lm file and check that the words are listed properly there and have non-zero probabilities. Then update the dictionary with transcriptions of the words in the language model.
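The check described above can be sketched in Python. This is a minimal sketch, not part of Sphinx: the function names are hypothetical, and it assumes an ARPA-format .lm file like the one lmtool produces and a dictionary with one `WORD PHONE PHONE ...` entry per line, where alternate pronunciations are written `WORD(2)`.

```python
import re

def read_arpa_unigrams(lm_lines):
    """Return {word: log10 probability} from the 1-grams section of an ARPA LM."""
    unigrams = {}
    in_unigrams = False
    for raw in lm_lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()
            unigrams[fields[1]] = float(fields[0])
    return unigrams

def read_dict_words(dict_lines):
    """Return the set of base words in a pronunciation dictionary."""
    words = set()
    for raw in dict_lines:
        fields = raw.split()
        if fields:
            # Strip the "(2)", "(3)" suffix used for alternate pronunciations.
            words.add(re.sub(r"\(\d+\)$", "", fields[0]))
    return words

def missing_words(lm_lines, dict_lines, skip=("<s>", "</s>")):
    """List LM unigrams that have no entry in the dictionary file."""
    dict_words = read_dict_words(dict_lines)
    return sorted(
        word for word in read_arpa_unigrams(lm_lines)
        if word not in dict_words and word not in skip
    )
```

Note that this only compares the two files. The decoder can still drop a dictionary entry while loading it (for example, when it rejects a phone), so an empty result here does not guarantee that every word is usable.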

      Paste the full log next time; clues to the problem often appear earlier in the log, not only in the last line with the error.
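To spot earlier problems in a long log, one option is to list every warning and error line in order. The sketch below is only an illustration; it assumes the `ERROR:`, `FATAL_ERROR:` and `WARNING:` prefixes seen in Sphinx3 output, and the function name is hypothetical.

```python
import re

# Prefixes that mark a problem line in the decoder output.
PROBLEM = re.compile(r"^\s*(ERROR|FATAL_ERROR|WARNING):")

def problem_lines(log_lines):
    """Yield (line_number, text) for each warning or error line, in log order."""
    for number, line in enumerate(log_lines, start=1):
        if PROBLEM.match(line):
            yield number, line.strip()
```

The first entries returned are usually the ones to read first, since a fatal error at the end of the log is often a consequence of something reported earlier.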

       
    • Arslan

      Arslan - 2007-06-12

      Hi Nickolay,
      Thanks for answering.
      The words from my language model are in my dictionary. Indeed, the dictionary was built from that model.
      Still, I can't figure out what is going wrong.
      Here are my logs:

      INFO: dict.c(463): Reading main dictionary: ../../Sphinx3/sphinx3-0.6/model/hmm/tidigits/Message2/1821.dic
      ERROR: "dict.c", line 251: Line 1: Bad ciphone: AX; word A ignored
      ERROR: "dict.c", line 251: Line 2: Bad ciphone: EY; word A(2) ignored
      ERROR: "dict.c", line 251: Line 3: Bad ciphone: AX; word ABOUT ignored
      ERROR: "dict.c", line 251: Line 4: Bad ciphone: B; word BOOK ignored
      ...
      ...
      INFO: lm.c(592): LM read('../../Sphinx3/sphinx3-0.6/model/hmm/tidigits/test_message.lm.DMP', lw= 9.50, wip= 0.70, uw= 0.70)
      INFO: lm.c(594): Reading LM file ../../Sphinx3/sphinx3-0.6/model/hmm/tidigits/test_message.lm.DMP (LM name "default")
      INFO: lm_3g_dmp.c(618): Reading LM in 16 bits format
      INFO: lm_3g_dmp.c(674): Read 30 unigrams [in memory]
      INFO: lm_3g_dmp.c(747): 47 bigrams [on disk]
      INFO: lm_3g_dmp.c(820): 58 bigrams [on disk]
      INFO: lm_3g_dmp.c(890): 15 bigram prob entries
      INFO: lm_3g_dmp.c(924): 14 trigram bowt entries
      INFO: lm_3g_dmp.c(955): 8 trigram prob entries
      INFO: lm_3g_dmp.c(986): 1 trigram segtable entries (512 segsize)
      INFO: lm_3g_dmp.c(1041): 30 word strings
      INFO: lm.c(685): The LM routine is operating at 16 bits mode
      ERROR: "wid.c", line 282: A is not a word in dictionary and it is not a class tag.
      ERROR: "wid.c", line 282: ABOUT is not a word in dictionary and it is not a class tag.
      ERROR: "wid.c", line 282: BOOK is not a word in dictionary and it is not a class tag.
      ERROR: "wid.c", line 282: BUY is not a word in dictionary and it is not a class tag.
      ERROR: "wid.c", line 282: CHECK is not a word in dictionary and it is not a class tag.
      ERROR: "wid.c", line 282: DESTINATION is not a word in dictionary and it is not a class tag.
      ...
      ...
      INFO: wid.c(292): 28 LM words not in dictionary; ignored
      INFO: Initialization of fillpen_t, report:
      INFO: Language weight =9.500000
      INFO: Word Insertion Penalty =0.700000
      INFO: Silence probability =0.100000
      INFO: Filler probability =0.020000
      INFO:
      INFO: dict2pid.c(567): Building PID tables for dictionary
      INFO: Initialization of dict2pid_t, report:
      INFO: Dict2pid is in composite triphone mode
      INFO: 3 composite states; 1 composite sseq
      INFO:
      INFO: kbcore.c(645): Inside kbcore: Verifying models consistency ......
      INFO: kbcore.c(667): End of Initialization of Core Models:
      INFO: Initialization of beam_t, report:
      INFO: Parameters used in Beam Pruning of Viterbi Search:
      INFO: Beam=-307006
      INFO: PBeam=-230254
      INFO: WBeam=-153503 (Skip=0)
      INFO: WEndBeam=-7675
      INFO: No of CI Phone assumed=34
      INFO:
      INFO: Initialization of fast_gmm_t, report:
      INFO: Parameters used in Fast GMM computation:
      INFO: Frame-level: Down Sampling Ratio 1, Conditional Down Sampling? 0, Distance-based Down Sampling? 0
      INFO: GMM-level: CI phone beam -38375. MAX CD 100000
      INFO: Gaussian-level: GS map would be used for Gaussian Selection? =1, SVQ would be used as Gaussian Score? =0 SubVQ Beam -15350
      INFO:
      INFO: Initialization of pl_t, report:
      INFO: Parameters used in phoneme lookahead:
      INFO: Phoneme look-ahead type = 0
      INFO: Phoneme look-ahead beam size = 65945
      INFO: No of CI Phones assumed=34
      INFO:
      INFO: Initialization of ascr_t, report:
      INFO: No. of CI senone =102
      INFO: No. of senone = 602
      INFO: No. of composite senone = 3
      INFO: No. of senone sequence = 308
      INFO: No. of composite senone sequence=1
      INFO: Parameters used in phoneme lookahead:
      INFO: Phoneme lookahead window = 1
      INFO:
      INFO: vithist.c(167): Initializing Viterbi-history module
      INFO: Initialization of vithist_t, report:
      INFO: Word beam = -153503
      INFO: Bigram Mode =0
      INFO: Rescore Mode =1
      INFO: Trace sil Mode =1
      INFO:
      INFO: srch.c(447): Search Initialization.
      WARNING: "srch_time_switch_tree.c", line 166: -Nstalextree is omitted in TST search.
      INFO: lextree.c(226): Creating Unigram Table for lm (name: default)
      INFO: lextree.c(239): Size of word table after unigram + words in class: 0.
      FATAL_ERROR: "lextree.c", line 243: 0 active words in default

      I am also showing you my .lm file, since some bigrams have a 0.0000 value; I don't know whether the problem could come from there:

      Language model created by QuickLM on Tue Jun 12 04:55:33 EDT 2007
      Copyright (c) 1996-2000
      Carnegie Mellon University and Alexander I. Rudnicky

      This model based on a corpus of 1 sentences and 30 words
      The (fixed) discount mass is 0.5

      \data\
      ngram 1=30
      ngram 2=47
      ngram 3=58

      \1-grams:
      -2.4771 </s> -0.3010
      -2.4771 <s> -0.2833
      -1.5740 A -0.2893
      -2.1761 ABOUT -0.2680
      -1.8751 BOOK -0.2893
      -2.1761 BUY -0.2893
      -1.8751 CHECK -0.2863
      -2.1761 DESTINATION -0.2680
      -2.1761 DOES -0.2981
      -2.1761 FLIGHT -0.2680
      -1.6990 HAS -0.2923
      -1.3979 I -0.2833
      -2.1761 IF -0.2893
      -2.1761 IN -0.2981
      -2.1761 IS -0.2981
      -1.8751 LANDED -0.2833
      -1.4771 LIKE -0.2553
      -1.6990 MADRID -0.2617
      -1.6990 OFF -0.2680
      -1.4771 PLANE -0.2803
      -1.8751 SYDNEY -0.2818
      -1.8751 TAKE -0.2923
      -2.1761 TAKEN -0.2923
      -1.5740 THE -0.2863
      -2.1761 THIS -0.2863
      -1.6990 TICKET -0.2680
      -1.1347 TO -0.2648
      -2.1761 WANT -0.2680
      -2.1761 WHEN -0.2981
      -1.4771 WOULD -0.2863

      \2-grams:
      -0.3010 <s> I -0.0669
      -0.9031 A FLIGHT -0.1891
      -0.4260 A TICKET -0.1891
      -0.3010 ABOUT TO -0.2808
      -0.3010 BOOK A -0.1891
      -0.3010 BUY A -0.0969
      -0.6021 CHECK IF -0.1891
      -0.6021 CHECK THE -0.1891
      -0.3010 DESTINATION TO -0.2374
      -0.3010 DOES THIS -0.1891
      -0.3010 FLIGHT TO -0.2374
      -0.4771 HAS LANDED -0.1891
      -0.7782 HAS TAKEN -0.1891
      -1.0792 I WANT -0.1891
      -0.3802 I WOULD -0.1891
      -0.3010 IF THE -0.1891
      -0.3010 IN DESTINATION -0.1891
      -0.3010 IS ABOUT -0.1891
      -0.3010 LANDED I -0.0669
      -1.0000 LIKE A -0.0969
      -0.3979 LIKE TO -0.1891
      -0.7782 MADRID HAS -0.2218
      -0.7782 MADRID I -0.2632
      -0.7782 MADRID THE -0.1891
      -0.7782 OFF I -0.0669
      -0.7782 OFF THE -0.1891
      -0.7782 OFF WHEN -0.1891
      -0.6990 PLANE HAS -0.1249
      -1.0000 PLANE IN -0.1891
      -1.0000 PLANE IS -0.1891
      -1.0000 PLANE TAKE -0.1891
      -0.9031 SYDNEY </s> -0.3010
      -0.4260 SYDNEY I -0.0669
      -0.3010 TAKE OFF -0.1249
      -0.3010 TAKEN OFF -0.2218
      -0.3010 THE PLANE -0.0792
      -0.3010 THIS PLANE -0.2553
      -0.3010 TICKET TO -0.1891
      -1.0414 TO BOOK -0.1891
      -1.3424 TO BUY -0.1891
      -1.0414 TO CHECK -0.1891
      -0.8653 TO MADRID -0.1891
      -1.0414 TO SYDNEY -0.1891
      -1.3424 TO TAKE -0.1891
      -0.3010 WANT TO -0.2596
      -0.3010 WHEN DOES 0.0000
      -0.3010 WOULD LIKE 0.0000

      \3-grams:
      -0.3010 <s> I WOULD
      -0.3010 A FLIGHT TO
      -0.3010 A TICKET TO
      -0.3010 ABOUT TO TAKE
      -0.6021 BOOK A FLIGHT
      -0.6021 BOOK A TICKET
      -0.3010 BUY A TICKET
      -0.3010 CHECK IF THE
      -0.3010 CHECK THE PLANE
      -0.3010 DESTINATION TO MADRID
      -0.3010 DOES THIS PLANE
      -0.3010 FLIGHT TO MADRID
      -0.3010 HAS LANDED I
      -0.3010 HAS TAKEN OFF
      -0.3010 I WANT TO
      -0.3010 I WOULD LIKE
      -0.3010 IF THE PLANE
      -0.3010 IN DESTINATION TO
      -0.3010 IS ABOUT TO
      -0.3010 LANDED I WOULD
      -0.3010 LIKE A TICKET
      -0.6021 LIKE TO BOOK
      -0.9031 LIKE TO BUY
      -0.9031 LIKE TO CHECK
      -0.3010 MADRID HAS TAKEN
      -0.3010 MADRID I WANT
      -0.3010 MADRID THE PLANE
      -0.3010 OFF I WOULD
      -0.3010 OFF THE PLANE
      -0.3010 OFF WHEN DOES
      -0.3010 PLANE HAS LANDED
      -0.3010 PLANE IN DESTINATION
      -0.3010 PLANE IS ABOUT
      -0.3010 PLANE TAKE OFF
      -0.3010 SYDNEY I WOULD
      -0.6021 TAKE OFF I
      -0.6021 TAKE OFF THE
      -0.3010 TAKEN OFF WHEN
      -0.6021 THE PLANE HAS
      -0.9031 THE PLANE IN
      -0.9031 THE PLANE IS
      -0.3010 THIS PLANE TAKE
      -0.7782 TICKET TO MADRID
      -0.4771 TICKET TO SYDNEY
      -0.3010 TO BOOK A
      -0.3010 TO BUY A
      -0.6021 TO CHECK IF
      -0.6021 TO CHECK THE
      -0.7782 TO MADRID HAS
      -0.7782 TO MADRID I
      -0.7782 TO MADRID THE
      -0.9031 TO SYDNEY </s>
      -0.4260 TO SYDNEY I
      -0.3010 TO TAKE OFF
      -0.3010 WANT TO CHECK
      -0.3010 WHEN DOES THIS
      -1.0000 WOULD LIKE A
      -0.3979 WOULD LIKE TO

      \end\

      Best regards.
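A note on the 0.0000 values: in the ARPA format, the first column is the log10 probability of the n-gram and the optional last column is the log10 backoff weight. A 0.0000 in the last column therefore means a backoff weight of 1.0, not a zero probability. The small sketch below converts one line back to linear values; the function name is hypothetical.

```python
def parse_ngram_line(line, order):
    """Split an ARPA n-gram line into (words, probability, backoff weight).

    `order` is the n-gram order (1, 2, 3, ...). The backoff weight is None
    when the line has no trailing column, as in the highest-order section.
    """
    fields = line.split()
    log_prob = float(fields[0])
    words = fields[1:1 + order]
    has_backoff = len(fields) > 1 + order
    backoff = 10.0 ** float(fields[1 + order]) if has_backoff else None
    return words, 10.0 ** log_prob, backoff

words, prob, backoff = parse_ngram_line("-0.3010 WHEN DOES 0.0000", 2)
print(words, round(prob, 3), backoff)
```

For this line the probability comes out close to 0.5 and the backoff weight is exactly 1.0.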

       
      • Nickolay V. Shmyrev

        ERROR: "dict.c", line 251: Line 2: Bad ciphone: EY; word A(2) ignored

        Look: is EY a part of your phoneset? It's in etc/something.phone. Probably your phoneset is different or has the wrong format.

         
    • Arslan

      Arslan - 2007-06-12

      Sorry, I didn't find any ".phone" file,
      but I guess this is not the problem: the error you quoted appears for every phone, not just EY.
      I use this dictionary:
      A AX
      A(2) EY
      ABOUT AX B AW T
      BOOK B UH K
      BUY B AY
      CHECK CH EH K
      DESTINATION D EH S T AX N EY SH AX N
      DESTINATION(2) D EH S T IX N EY SH AX N
      DOES D AH Z
      DOES(2) D IX Z
      FLIGHT F L AY T
      HAS HH AE Z
      HAS(2) HH AX Z
      I AY
      IF IH F
      IF(2) IX F
      IN IH N
      IN(2) IX N
      IS IH Z
      IS(2) IX Z
      LANDED L AE N D AX D
      LANDED(2) L AE N D IX D
      LIKE L AY K
      MADRID M AX D R IH D
      OFF AO F
      PLANE P L EY N
      SYDNEY S IH D N IY
      TAKE T EY K
      TAKEN T EY K AX N
      THE DH AH
      THE(2) DH AX
      THE(3) DH IY
      THIS DH IH S
      THIS(2) DH IX S
      TICKET T IH K AX T
      TICKET(2) T IH K IX T
      TO T AX
      TO(2) T IX
      TO(3) T UW
      WANT W AA N T
      WANT(2) W AO N T
      WHEN HH W EH N
      WHEN(2) HH W IH N
      WHEN(3) W EH N
      WHEN(4) W IH N
      WOULD W UH D

      But even when I use the CMU dictionary, it gives me the same kind of trouble.

      I really don't know what's wrong. I've tried changing many things, but I didn't get any results.

       
      • Nickolay V. Shmyrev

        Hm, I thought there should be a phone file. It should just contain the list of phones used. It should be mentioned in etc/sphinx_train.cfg:

        $CFG_RAWPHONEFILE = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";

        See http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html too:

        phonelist, which is a list of all acoustic units that you want to train models for. The SPHINX does not permit you to have units other than those in your dictionaries. All units in your two dictionaries must be listed here. In other words, your phonelist must have exactly the same units used in your dictionaries, no more and no less. Each phone must be listed on a separate line in the file, beginning from the left, with no extra spaces after the phone. An example:

        AA
        AE
        OW
        B
        CH
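A phone list in the format quoted above can be derived from the dictionaries themselves. The following is a rough sketch under the assumption that each dictionary line is `WORD PHONE PHONE ...`; the function names and any file name passed in are hypothetical.

```python
def phones_in_dictionary(dict_lines):
    """Return the sorted, de-duplicated phones used in the dictionary lines."""
    phones = set()
    for raw in dict_lines:
        fields = raw.split()
        # The first field is the word; everything after it is a phone.
        phones.update(fields[1:])
    return sorted(phones)

def write_phonelist(phones, path):
    """Write one phone per line, flush left, with no trailing spaces."""
    with open(path, "w", encoding="ascii") as handle:
        for phone in phones:
            handle.write(phone + "\n")
```

Since the quoted text requires the units of both dictionaries, the lines of the main dictionary and of the filler dictionary would be passed in together.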

         
    • Nickolay V. Shmyrev

      Oops, I thought you were working with SphinxTrain :( Are you trying your model with tidigits? The tidigits models are based on a different phoneset (from the .mdef file):

      AX_one - - - n/a 0 0 1 2 N
      AY_five - - - n/a 1 3 4 5 N
      AY_nine - - - n/a 2 6 7 8 N
      EH_seven - - - n/a 3 9 10 11 N
      EY_eight - - - n/a 4 12 13 14 N
      E_seven - - - n/a 5 15 16 17 N
      F_five - - - n/a 6 18 19 20 N
      F_four - -

      So if you'd like to use tidigits, you need to transcribe your dictionary in terms of the phoneset above:
      AX_one and so on. If you need a generic phoneset, use the hub4 model in the config file.
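The mismatch can be confirmed before running the decoder by comparing the dictionary against the phones in the .mdef file. This is only a sketch: it assumes context-independent entries look like the excerpt above (phone name followed by three `-` fields), and the function names are hypothetical.

```python
def ci_phones_from_mdef(mdef_lines):
    """Collect context-independent phone names from .mdef lines."""
    phones = set()
    for raw in mdef_lines:
        if raw.lstrip().startswith("#"):
            continue  # comment line
        fields = raw.split()
        # CI entries have "-" for left context, right context and position.
        if len(fields) >= 4 and fields[1:4] == ["-", "-", "-"]:
            phones.add(fields[0])
    return phones

def unknown_phones(dict_lines, known_phones):
    """Map each dictionary entry to the phones the acoustic model lacks."""
    problems = {}
    for raw in dict_lines:
        fields = raw.split()
        if not fields:
            continue
        missing = [phone for phone in fields[1:] if phone not in known_phones]
        if missing:
            problems[fields[0]] = missing
    return problems
```

With the tidigits excerpt and the dictionary posted earlier, every entry would be reported, which matches the "Bad ciphone" errors in the log.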

       
    • Arslan

      Arslan - 2007-06-12

      Yep,
      you were right.
      I was using the tidigits model.
      I was sure I wasn't; newbie mistake ;)
      Anyway, thanks for coming to my rescue.

      It's working fine now.

      Best regards.

       
