Bias LM towards bi- and tri- grams?

  • John Watkinson

    John Watkinson - 2010-06-21

    I am using a small N-gram language model. It is working fairly well, but I
    would like to bias the model further towards matching bi-grams and especially
    tri-grams. I experimented a bit by adjusting upwards the log-transformed
    probabilities in the ARPA file for bi- and tri-grams but it didn't seem to
    have much effect. Are there any suggestions for how best to do this kind of
    tuning?
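
    For reference, the parts of the ARPA file I have been editing look roughly
    like this (the first column is the log10 probability, the trailing column on
    the bigram lines is the backoff weight; the numbers are illustrative only):

    \2-grams:
    -0.4771 left and        -0.3010
    \3-grams:
    -0.1761 left and right
    \end\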

    Regards,

    John

     
  • Nickolay V. Shmyrev

    Hello John

    1. What decoder are you using?
    2. What exactly do you mean by "bias", do you want to make all bigrams more probable or just part of them?
    3. If you need all trigrams, it's probably better to lower backoff weights, isn't it?
     
  • John Watkinson

    John Watkinson - 2010-06-21

    Thanks for the reply:

    1. What decoder are you using?

    I'm using PocketSphinx on iPhoneOS, with code forked from the VocalKit setup
    provided by Brian King. The hub4wsj_sc_8k HMM is being used for acoustic
    modeling.

    2. What exactly do you mean by "bias", do you want to make all bigrams more
      probable or just part of them?

    Yes, I mean that all bigrams/trigrams should be more probable than two- or
    three-word groups that are not bi- or tri-grams in my model. For example, if I
    have the trigram "left and right" in my model, but no "left and white", then
    the probability of arriving at "left and white" as the best hypothesis should
    be very low relative to "left and right". Of course, the N-gram model is
    already doing this to an extent, but I would like to tune it to make it more
    extreme.
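
    To make that concrete, here is a minimal sketch of how an ARPA-style backoff
    model would score the two phrases (the log10 values are made up and further
    backoff to unigrams is ignored; this is just an illustration, not code from
    my app):

    # Hypothetical ARPA-style entries (log10 values invented for illustration).
    log_p_trigram = {("left", "and", "right"): -0.30}                  # seen trigram
    log_p_bigram = {("and", "right"): -0.70, ("and", "white"): -1.50}  # lower order
    backoff_bigram = {("left", "and"): -0.40}                          # backoff weight of "left and"

    def trigram_logprob(w1, w2, w3):
        """Score log10 P(w3 | w1 w2): use the trigram if present, else back off."""
        if (w1, w2, w3) in log_p_trigram:
            return log_p_trigram[(w1, w2, w3)]
        # Back off: add the backoff weight of (w1, w2) to the lower-order estimate.
        return backoff_bigram.get((w1, w2), 0.0) + log_p_bigram[(w2, w3)]

    print(trigram_logprob("left", "and", "right"))  # -0.3  seen trigram
    print(trigram_logprob("left", "and", "white"))  # -1.9  backed off, penalised

    Making the backoff weight of "left and" more negative (as suggested above)
    would push the unseen "left and white" even further down, which is the
    effect I am after.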

    3. If you need all trigrams, it's probably better to lower backoff weights,
      isn't it?

    This sounds encouraging, but I'm not sure how to go about it. Is there a
    parameter I can control upon the creation of my language model, or should I
    post-process the ARPA file to achieve the effect?

    Regards,

    John

     
  • Nickolay V. Shmyrev

    Is there a parameter I can control upon the creation of my language model,
    or should I post-process the ARPA file to achieve the effect?

    Well, it depends on your goal. The discounting process used during LM
    estimation tries to achieve several goals at once, and the parameters that
    control it aren't really straightforward ones. Cmuclmtk is not flexible in
    this area.

    If you are using SRILM, you can split out the discount parameter estimation
    as a separate step: instead of just

    ngram-count -text a.text -lm a.lm
    

    you can dump the parameters first

    ngram-count -kndiscount -text a.text -kn kn.param
    

    then edit the parameters (smaller discount values usually mean more
    probability for seen n-grams and less for unseen ones), and then build the
    language model

    ngram-count -text a.text -kn kn.param -lm a.lm
    

    If you want to use a simple strategy like absolute discounting, then just
    passing -cdiscount 0.001 to ngram-count will do the job.
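
    If you prefer the other route that was mentioned, i.e. lowering backoff
    weights by post-processing the ARPA file, a small script along these lines
    could be used (a hypothetical sketch, not part of SRILM or cmuclmtk; it makes
    every bigram backoff weight more negative by a fixed amount, and the result
    is no longer a properly normalised model, so treat it as a quick experiment
    only):

    # lower_backoff.py (hypothetical): subtract a fixed log10 penalty from every
    # bigram backoff weight in an ARPA file, so word triples that are not
    # trigrams in the model score even lower.
    import sys

    PENALTY = 0.5  # log10 amount to subtract; tune experimentally

    in_bigrams = False
    for line in open(sys.argv[1]):
        line = line.rstrip("\n")
        if line.startswith("\\2-grams:"):
            in_bigrams = True
        elif line.startswith("\\"):
            in_bigrams = False
        elif in_bigrams and line.strip():
            fields = line.split()
            # Bigram lines are: logprob w1 w2 [backoff]; only touch the backoff.
            if len(fields) == 4:
                fields[3] = "%.4f" % (float(fields[3]) - PENALTY)
                line = "\t".join(fields)
        print(line)

    Usage would be something like

    python lower_backoff.py a.lm > a_biased.lm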

     
  • John Watkinson

    John Watkinson - 2010-06-24

    Thanks, the -cdiscount parameter already seemed to help; I'll try the more
    detailed approaches next.

     
