I am using a small N-gram language model. It is working fairly well, but I
would like to bias the model further towards matching bi-grams and especially
tri-grams. I experimented a bit by adjusting upwards the log-transformed
probabilities in the ARPA file for bi- and tri-grams but it didn't seem to
have much effect. Are there any suggestions for how best to do this kind of
tuning?
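To be concrete, the entries I was nudging upward are in the trigram section of the ARPA file, which (with made-up values) looks roughly like this:
\3-grams:
-0.4771 left and right
-0.9031 turn left and
where the first column is the log10 probability of the trigram.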
Regards,
John
I'm using PocketSphinx on iPhoneOS, with code forked from the VocalKit setup
provided by Brian King. The hub4wsj_sc_8k HMM is being used for acoustic
modeling.
What exactly do you mean by "bias"? Do you want to make all bigrams more probable, or just some of them?
Yes, I mean that all bigrams/trigrams in my model should be more probable than two- or three-word groups that are not. For example, if I have the trigram "left and right" in my model, but no "left and white", then the probability of arriving at "left and white" as the best hypothesis should be very low relative to "left and right". Of course, the N-gram model is already doing this to an extent, but I would like to tune it to make the effect more extreme.
If you need this for all trigrams, it's probably better to lower the backoff weights, isn't it?
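In the ARPA file the backoff weight is the last number on a lower-order entry. With made-up values, a bigram line looks something like:
\2-grams:
-0.3010 left and   -0.1761
The final column is the log10 backoff weight attached to the history "left and". Making it more negative adds a penalty every time the decoder has to back off to score a trigram that is not in the model, such as "left and white".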
This sounds encouraging, but I'm not sure how to go about it. Is there a
parameter I can control upon the creation of my language model, or should I
post-process the ARPA file to achieve the effect?
Regards,
John
Is there a parameter I can control upon the creation of my language model,
or should I post-process the ARPA file to achieve the effect?
Well, it depends on your goal. The discounting process used during LM estimation tries to achieve several goals at once, and the parameters it uses aren't really straightforward. CMUCLMTK is not flexible in this area.
If you are using SRILM, you can split out the discount parameter estimation step. Instead of just
ngram-count -text a.text -lm a.lm
you can dump the discount parameters first:
ngram-count -kndiscount -text a.text -kn kn.param
then edit the parameters (smaller discount values usually mean more probability for seen n-grams and less for unseen ones), and then build the language model:
ngram-count -text a.text -kn kn.param -lm a.lm
If you want a simple strategy like absolute discounting, then just passing -cdiscount 0.001 to ngram-count will do the job.
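If you prefer the post-processing route instead, the "lower the backoff weights" idea can also be applied directly to the ARPA file by subtracting a constant from every backoff weight. A rough Python sketch of that (the file names, the penalty value, and the output formatting here are just assumptions, not a tested tool):
import re

PENALTY = 1.0   # log10 amount subtracted from every backoff weight (made-up value)
order = 0       # n-gram order of the section we are currently inside, 0 = none

with open("a.lm") as src, open("a_penalized.lm", "w") as dst:
    for line in src:
        stripped = line.strip()
        m = re.match(r"\\(\d+)-grams:", stripped)
        if m:
            # entering a new n-gram section, e.g. \2-grams:
            order = int(m.group(1))
        elif stripped.startswith("\\"):
            # \data\, \end\ and similar markers end the current section
            order = 0
        elif order and stripped:
            fields = stripped.split()
            # an entry is: log10 prob, then n words, then an optional log10 backoff weight
            if len(fields) == order + 2:
                new_bow = float(fields[-1]) - PENALTY
                line = "%s\t%s\t%.4f\n" % (fields[0], " ".join(fields[1:-1]), new_bow)
        dst.write(line)
You would then point the decoder at a_penalized.lm instead of a.lm. Note that the resulting model is no longer properly normalized, so treat it as a hack for biasing the decoder rather than a clean language model.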
Thanks, the -cdiscount parameter already seemed to help; I'll try the more detailed approaches next.