I am using a small N-gram language model. It is working fairly well, but I
would like to bias the model further towards matching bi-grams and especially
tri-grams. I experimented a bit by adjusting upwards the log-transformed
probabilities in the ARPA file for bi- and tri-grams but it didn't seem to
have much effect. Are there any suggestions for how best to do this kind of
tuning?
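To be concrete, the entries I was nudging upward are in the trigram section of the ARPA file, which (with made-up values) looks roughly like this:
\3-grams:
-0.4771 left and right
-0.9031 turn left and
where the first column is the log10 probability of the trigram.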
Regards,
John
I'm using PocketSphinx on iPhoneOS, with code forked from the VocalKit setup
provided by Brian King. The hub4wsj_sc_8k HMM is being used for acoustic
modeling.
What exactly do you mean by "bias"? Do you want to make all bigrams more probable, or just some of them?
Yes, I mean that all bigrams/trigrams in my model should be more probable than two- or three-word groups that are not. For example, if I have the trigram "left and right" in my model, but no "left and white", then the probability of arriving at "left and white" as the best hypothesis should be very low relative to "left and right". Of course, the N-gram model is already doing this to an extent, but I would like to tune it to make the effect more extreme.
If you need this for all trigrams, it's probably better to lower the backoff weights, isn't it?
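In the ARPA file the backoff weight is the last number on a lower-order entry. With made-up values, a bigram line looks something like:
\2-grams:
-0.3010 left and   -0.1761
The final column is the log10 backoff weight attached to the history "left and". Making it more negative adds a penalty every time the decoder has to back off to score a trigram that is not in the model, such as "left and white".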
This sounds encouraging, but I'm not sure how to go about it. Is there a
parameter I can control upon the creation of my language model, or should I
post-process the ARPA file to achieve the effect?
Regards,
John
Is there a parameter I can control upon the creation of my language model,
or should I post-process the ARPA file to achieve the effect?
Well, it depends on your goal. The discounting process used during LM estimation tries to achieve several goals at once, and the parameters it uses aren't really straightforward. CMUCLMTK is not flexible in this area.
If you are using SRILM, you can split out the discount parameter estimation step. Instead of just
ngram-count -text a.text -lm a.lm
you can dump the discount parameters first:
ngram-count -kndiscount -text a.text -kn kn.param
then edit the parameters (smaller discount values usually mean more probability for seen n-grams and less for unseen ones), and then build the language model:
ngram-count -text a.text -kn kn.param -lm a.lm
If you want a simple strategy like absolute discounting, then just passing -cdiscount 0.001 to ngram-count will do the job.
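If you prefer the post-processing route instead, the "lower the backoff weights" idea can also be applied directly to the ARPA file by subtracting a constant from every backoff weight. A rough Python sketch of that (the file names, the penalty value, and the output formatting here are just assumptions, not a tested tool):
import re

PENALTY = 1.0   # log10 amount subtracted from every backoff weight (made-up value)
order = 0       # n-gram order of the section we are currently inside, 0 = none

with open("a.lm") as src, open("a_penalized.lm", "w") as dst:
    for line in src:
        stripped = line.strip()
        m = re.match(r"\\(\d+)-grams:", stripped)
        if m:
            # entering a new n-gram section, e.g. \2-grams:
            order = int(m.group(1))
        elif stripped.startswith("\\"):
            # \data\, \end\ and similar markers end the current section
            order = 0
        elif order and stripped:
            fields = stripped.split()
            # an entry is: log10 prob, then n words, then an optional log10 backoff weight
            if len(fields) == order + 2:
                new_bow = float(fields[-1]) - PENALTY
                line = "%s\t%s\t%.4f\n" % (fields[0], " ".join(fields[1:-1]), new_bow)
        dst.write(line)
You would then point the decoder at a_penalized.lm instead of a.lm. Note that the resulting model is no longer properly normalized, so treat it as a hack for biasing the decoder rather than a clean language model.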
Thanks, the -cdiscount parameter already seemed to help; I'll try the more detailed approaches next.