
ARPA model training with SRILM

  • kk_huk

    kk_huk - 2016-07-20

    I have followed the tutorial at http://cmusphinx.sourceforge.net/wiki/tutoriallm.

    When I run the command below, it gives me the error "one of modified KneserNey discounts is negative" from the discount estimator for order 2:

    ngram-count -kndiscount -interpolate -text train-text.txt -lm your.lm

    How can I solve this problem?

     
  • Arseniy Gorin

    Arseniy Gorin - 2016-07-20

    Your train-text.txt is likely too small. Try a different discounting method in this case (i.e., drop the -kndiscount -interpolate options); see the sketch below.
    See item C3 of the FAQ for more.
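
    For example, a sketch of the same command using SRILM's Witten-Bell discounting instead (file names as above; whether this helps depends on your data):

    ngram-count -wbdiscount -text train-text.txt -lm your.lm

    Or drop the discounting options entirely and let ngram-count fall back to its default Good-Turing discounting:

    ngram-count -text train-text.txt -lm your.lm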

     

    Last edit: Arseniy Gorin 2016-07-20
    • kk_huk

      kk_huk - 2016-07-20

      Thanks a lot for your quick response, Arseniy.

      I removed those options from the command line, and it created the "your.lm" file.

      But there are some warnings:

      warning: discount coeff 1 is out of range: 0
      warning: count of count 8 is zero -- lowering maxcount
      warning: count of count 7 is zero -- lowering maxcount
      warning: count of count 6 is zero -- lowering maxcount
      warning: count of count 5 is zero -- lowering maxcount
      warning: discount coeff 1 is out of range: 0
      warning: discount coeff 3 is out of range: 3.24638
      warning: count of count 8 is zero -- lowering maxcount
      warning: count of count 7 is zero -- lowering maxcount
      warning: count of count 6 is zero -- lowering maxcount
      warning: count of count 5 is zero -- lowering maxcount
      warning: count of count 4 is zero -- lowering maxcount
      warning: count of count 3 is zero -- lowering maxcount
      warning: discount coeff 1 is out of range: 0
      

      Is that normal?

       
      • Arseniy Gorin

        Arseniy Gorin - 2016-07-20

        It just looks like your data are too small for proper n-gram training. If you have just a few phrases, consider building a grammar instead.
        You can also set "-gtnmax 0" to drop discounting completely.

         

        Last edit: Arseniy Gorin 2016-07-20
        • kk_huk

          kk_huk - 2016-07-21

          Hi Arseniy,

          I have to use a language model for adaptation with the default acoustic model.

          I tried setting the gtnmax parameter just like you said:

          ngram-count -gtnmax 0 -text train-text.txt -lm your.lm

          The output is:

          Unknown option "-gtnmax";  type "ngram-count -help" for information
          warning: discount coeff 1 is out of range: 0
          warning: count of count 8 is zero -- lowering maxcount
          warning: count of count 7 is zero -- lowering maxcount
          warning: count of count 6 is zero -- lowering maxcount
          warning: count of count 5 is zero -- lowering maxcount
          warning: discount coeff 1 is out of range: 0
          warning: discount coeff 3 is out of range: 3.24638
          warning: count of count 8 is zero -- lowering maxcount
          warning: count of count 7 is zero -- lowering maxcount
          warning: count of count 6 is zero -- lowering maxcount
          warning: count of count 5 is zero -- lowering maxcount
          warning: count of count 4 is zero -- lowering maxcount
          warning: count of count 3 is zero -- lowering maxcount
          warning: discount coeff 1 is out of range: 0
          
           
          • Arseniy Gorin

            Arseniy Gorin - 2016-07-21

            Sorry, my fault. The warnings are removed with "-gt3max 0 -gt2max 0 -gt1max 0".
            However, I think you had better keep the default settings; your command still produces the LM.

            To sum up, go with "ngram-count -text train-text.txt -lm your.lm" (see the recap below).
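
            For clarity, the two variants discussed above side by side (same file names as before; the -gt*max 0 flags are what disable Good-Turing discounting per n-gram order):

            # Default Good-Turing discounting (recommended here):
            ngram-count -text train-text.txt -lm your.lm

            # Discounting disabled for orders 1-3 (ngram-count reports "GT discounting disabled"):
            ngram-count -gt1max 0 -gt2max 0 -gt3max 0 -text train-text.txt -lm your.lm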

            By the way, I am not sure what you mean by adaptation of the default acoustic model, but normally you do not need a language model for that. You need audio with transcripts and a dictionary covering the words in the transcriptions.

             
            • kk_huk

              kk_huk - 2016-07-21

              I have tested adding the "-gt3max 0 -gt2max 0 -gt1max 0" parameters to the command line.

              The output is worse than the former one, I guess:

              GT discounting disabled
              GT discounting disabled
              GT discounting disabled
              

              It would be better if my corpus had more sentences.

              "By the way, I am not sure what you mean by adaptation of the default acoustic model, but normally you do not need a language model for that. You need audio with transcripts and a dictionary covering the words in the transcriptions."

              You are totally right. It was stuck in my mind wrongly.

              I have one more question. The tutorial says:

              "for book-like texts you need to use Kneser-Ney discounting. For command-like texts you should use Witten-Bell discounting or absolute discounting. You can try different methods and see which gives better perplexity on a test set."

              That means for book-like texts we should add the "-kndiscount" parameter.
              My corpus is formed from command-like texts. What should I pass as a parameter?

              -wbdiscount? Or something else...

              Thanks a lot for your quick response, Arseniy.

               
              • Arseniy Gorin

                Arseniy Gorin - 2016-07-21

                Yes, using another type of discounting is also possible. If your commands are fixed, using no discounting should also work fine. However, I'd prefer building a grammar in this case (see the sketch below).
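
                ("Building a grammar" here means a JSGF file listing your fixed command phrases; a minimal hypothetical sketch, with made-up phrases standing in for your actual commands:)

                # Write a tiny JSGF grammar file; the phrases below are placeholders.
                printf '%s\n' \
                  '#JSGF V1.0;' \
                  'grammar commands;' \
                  'public <command> = (turn on | turn off) (the light | the radio);' \
                  > commands.gram
                # PocketSphinx can then decode against it, e.g. via its -jsgf option.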

                According to the SRILM documentation, for WB discounting you add:
                "-wbdiscount -wbdiscount1 -wbdiscount2"

                I am trying to reproduce your error with the acoustic model. I will be back when it is done.
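
                Tying this back to the tutorial's advice about perplexity, here is a sketch of comparing the two discounting methods, assuming a held-out file test-text.txt and output names wb.lm / gt.lm (all three names are just illustrative, not from this thread):

                # Build one LM with Witten-Bell and one with SRILM's default Good-Turing discounting
                ngram-count -wbdiscount -text train-text.txt -lm wb.lm
                ngram-count -text train-text.txt -lm gt.lm

                # Compare perplexity on the held-out text (lower is better)
                ngram -lm wb.lm -ppl test-text.txt
                ngram -lm gt.lm -ppl test-text.txt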

                 
                • kk_huk

                  kk_huk - 2016-07-21

                  Thanks a lot Arseniy, you are the man :)

                   

