
ARPA model training with SRILM

  • kk_huk

    kk_huk - 2016-07-20

    I have followed the tutorial at http://cmusphinx.sourceforge.net/wiki/tutoriallm.

    When I run the command below, it gives me the error "one of modified KneserNey discounts is negative" from the discount estimator for order 2:

    ngram-count -kndiscount -interpolate -text train-text.txt -lm your.lm

    How can I solve this problem?

     
  • Arseniy Gorin

    Arseniy Gorin - 2016-07-20

    Your train-text.txt is likely too small. Try a different discounting method in this case (i.e., drop the -kndiscount -interpolate options); see the sketch below.
    See item C3 of the FAQ for more.
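
    For example, a sketch of the same command using SRILM's Witten-Bell discounting instead (file names as above; whether this helps depends on your data):

    ngram-count -wbdiscount -text train-text.txt -lm your.lm

    Or drop the discounting options entirely and let ngram-count fall back to its default Good-Turing discounting:

    ngram-count -text train-text.txt -lm your.lm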

     

    Last edit: Arseniy Gorin 2016-07-20
    • kk_huk

      kk_huk - 2016-07-20

      Thanks a lot for your quick response, Arseniy.

      I removed those options from the command line, and it created the "your.lm" file.

      But there are some warnings:

      warning: discount coeff 1 is out of range: 0
      warning: count of count 8 is zero -- lowering maxcount
      warning: count of count 7 is zero -- lowering maxcount
      warning: count of count 6 is zero -- lowering maxcount
      warning: count of count 5 is zero -- lowering maxcount
      warning: discount coeff 1 is out of range: 0
      warning: discount coeff 3 is out of range: 3.24638
      warning: count of count 8 is zero -- lowering maxcount
      warning: count of count 7 is zero -- lowering maxcount
      warning: count of count 6 is zero -- lowering maxcount
      warning: count of count 5 is zero -- lowering maxcount
      warning: count of count 4 is zero -- lowering maxcount
      warning: count of count 3 is zero -- lowering maxcount
      warning: discount coeff 1 is out of range: 0
      

      Is that normal?

       
      • Arseniy Gorin

        Arseniy Gorin - 2016-07-20

        It just looks like your data are too small for proper n-gram training. If you have just a few phrases, consider building a grammar instead.
        You can also set "-gtnmax 0" to drop discounting completely.

         

        Last edit: Arseniy Gorin 2016-07-20
        • kk_huk

          kk_huk - 2016-07-21

          Hi Arseniy,

          I have to use a language model for adaptation with the default acoustic model.

          I tried setting the gtnmax parameter just like you said:

          ngram-count -gtnmax 0 -text train-text.txt -lm your.lm

          The output is:

          Unknown option "-gtnmax";  type "ngram-count -help" for information
          warning: discount coeff 1 is out of range: 0
          warning: count of count 8 is zero -- lowering maxcount
          warning: count of count 7 is zero -- lowering maxcount
          warning: count of count 6 is zero -- lowering maxcount
          warning: count of count 5 is zero -- lowering maxcount
          warning: discount coeff 1 is out of range: 0
          warning: discount coeff 3 is out of range: 3.24638
          warning: count of count 8 is zero -- lowering maxcount
          warning: count of count 7 is zero -- lowering maxcount
          warning: count of count 6 is zero -- lowering maxcount
          warning: count of count 5 is zero -- lowering maxcount
          warning: count of count 4 is zero -- lowering maxcount
          warning: count of count 3 is zero -- lowering maxcount
          warning: discount coeff 1 is out of range: 0
          
           
          • Arseniy Gorin

            Arseniy Gorin - 2016-07-21

            Sorry, my fault. The warnings are removed with "-gt3max 0 -gt2max 0 -gt1max 0".
            However, I think you had better keep the default settings; your command still produces the LM.

            To sum up, go with "ngram-count -text train-text.txt -lm your.lm" (see the recap below).
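
            For clarity, the two variants discussed above side by side (same file names as before; the -gt*max 0 flags are what disable Good-Turing discounting per n-gram order):

            # Default Good-Turing discounting (recommended here):
            ngram-count -text train-text.txt -lm your.lm

            # Discounting disabled for orders 1-3 (ngram-count reports "GT discounting disabled"):
            ngram-count -gt1max 0 -gt2max 0 -gt3max 0 -text train-text.txt -lm your.lm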

            By the way, I am not sure what you mean by adaptation of the default acoustic model, but normally you do not need a language model for that. You need audio with transcripts and a dictionary covering the words in the transcriptions.

             
            • kk_huk

              kk_huk - 2016-07-21

              I have tested adding the "-gt3max 0 -gt2max 0 -gt1max 0" parameters to the command line.

              The output is worse than the former one, I guess:

              GT discounting disabled
              GT discounting disabled
              GT discounting disabled
              

              It would be better if my corpus had more sentences.

              "By the way, I am not sure what you mean by adaptation of the default acoustic model, but normally you do not need a language model for that. You need audio with transcripts and a dictionary covering the words in the transcriptions."

              You are totally right. It was stuck in my mind wrongly.

              I have one more question. The tutorial says:

              "for book-like texts you need to use Kneser-Ney discounting. For command-like texts you should use Witten-Bell discounting or absolute discounting. You can try different methods and see which gives better perplexity on a test set."

              That means for book-like texts we should add the "-kndiscount" parameter.
              My corpus is formed from command-like texts. What should I pass as a parameter?

              -wbdiscount? Or something else...

              Thanks a lot for your quick response, Arseniy.

               
              • Arseniy Gorin

                Arseniy Gorin - 2016-07-21

                Yes, using another type of discounting is also possible. If your commands are fixed, using no discounting should also work fine. However, I'd prefer building a grammar in this case (see the sketch below).
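
                ("Building a grammar" here means a JSGF file listing your fixed command phrases; a minimal hypothetical sketch, with made-up phrases standing in for your actual commands:)

                # Write a tiny JSGF grammar file; the phrases below are placeholders.
                printf '%s\n' \
                  '#JSGF V1.0;' \
                  'grammar commands;' \
                  'public <command> = (turn on | turn off) (the light | the radio);' \
                  > commands.gram
                # PocketSphinx can then decode against it, e.g. via its -jsgf option.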

                According to the SRILM documentation, for WB discounting you add:
                "-wbdiscount -wbdiscount1 -wbdiscount2"

                I am trying to reproduce your error with the acoustic model. I will be back when it is done.
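
                Tying this back to the tutorial's advice about perplexity, here is a sketch of comparing the two discounting methods, assuming a held-out file test-text.txt and output names wb.lm / gt.lm (all three names are just illustrative, not from this thread):

                # Build one LM with Witten-Bell and one with SRILM's default Good-Turing discounting
                ngram-count -wbdiscount -text train-text.txt -lm wb.lm
                ngram-count -text train-text.txt -lm gt.lm

                # Compare perplexity on the held-out text (lower is better)
                ngram -lm wb.lm -ppl test-text.txt
                ngram -lm gt.lm -ppl test-text.txt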

                 
                • kk_huk

                  kk_huk - 2016-07-21

                  Thanks a lot Arseniy, you are the man :)

                   

