Kaldi / Discussion / Help: Language Model Interpolation

Tony Esposito - 2014-12-16

Hi!

I have two different sets of text, one from my domain and the other generic English, and I want to interpolate them, in the sense that I want a weighted sum of the two, e.g. 0.8 * my_domain + 0.2 * Generic_English.

Does kaldi_lm support that? If not, where would be a good place for me to go into the algorithm and change it to get what I want?

Thank you!

- Nagendra Kumar Goel - 2014-12-16
  
  You can use SRILM or MITLM.
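For reference, the fixed-weight interpolation described in the question can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the SRILM or MITLM implementation: the function name `interpolate`, the dictionaries, and all the probabilities are made up for the example, and real toolkits also handle backoff and unseen histories. SRILM's `ngram` tool has `-mix-lm` and `-lambda` options for this kind of mixing; see its manual page for exact usage.

```python
def interpolate(p_domain, p_generic, lam=0.8):
    """Return lam * p_domain + (1 - lam) * p_generic for each word.

    Both arguments are toy next-word distributions for ONE history,
    given as {word: probability}. Words missing from a model get 0.0.
    """
    vocab = set(p_domain) | set(p_generic)
    return {
        w: lam * p_domain.get(w, 0.0) + (1.0 - lam) * p_generic.get(w, 0.0)
        for w in vocab
    }


# Made-up distributions for a single history state.
p_domain = {"patient": 0.6, "doctor": 0.4}
p_generic = {"patient": 0.1, "doctor": 0.1, "weather": 0.8}

mixed = interpolate(p_domain, p_generic, lam=0.8)
for word in sorted(mixed):
    print(word, round(mixed[word], 3))
```

Note that the same weight (0.8 here) is applied no matter how much data each source has for the history in question.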

Tony Esposito - 2014-12-17

Thank you, this is a possible option.

I still wonder, though: does kaldi_lm support that, and if not, where would be a good place for me to go into the algorithm and change it to get what I want?

- Daniel Povey - 2014-12-17
  
  Hi,
  Actually, the whole reason I created kaldi_lm was that I wanted a
  better way to merge different LM sources, but I never got around to
  implementing this at the script level for any example script.

  The problem I was trying to solve is this: when you merge two
  different LM sources by interpolation, you use the same weight in
  every n-gram history state. That doesn't take into account the fact
  that if source A has many more examples of a history state than
  source B, you probably want to give source A a higher weight in that
  history state.

  The way to do the merging in kaldi_lm is to merge the two sources at
  the "ngrams_disc.gz" stage, where you have discounted n-gram counts,
  before turning them into an ARPA file. To merge these files, cat
  them, sort them, and pipe them through merge_ngrams. At this stage
  you can also weight one of the sources if you want to apply corpus
  weights, but bear in mind that these are counts, not probabilities,
  so you should take into account how large the two data sources were.
  Weighting can be done in awk; I think the count field is the first or
  last field.

  The reason kaldi_lm's primary format for LMs is discounted n-gram
  counts rather than probabilities is to make operations like LM
  merging more natural. However, as I said, I never got around to doing
  anything with this.

  After you have the merged ngrams_disc.gz, you can continue through the
  rest of the train_lm.sh script. All of this only makes sense if the
  "word_map" (the map from words to short-form words) is the same in
  both sources; you'll have to make sure of this at the calling-script
  level.
  Dan
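
The count-level merging described above can be illustrated with a small Python sketch. It is not kaldi_lm code and does not read ngrams_disc.gz: the in-memory layout `{(history, word): count}`, the helper names `scale_counts`, `merge_counts` and `history_share`, and all the numbers are hypothetical. It also assumes that merging amounts to summing the counts of identical n-grams, which the thread does not spell out. The point is only to show why merged counts give each source a different effective weight in each history state.

```python
from collections import defaultdict


def scale_counts(ngram_counts, weight):
    """Multiply every count by a corpus weight (the step suggested for awk)."""
    return {ngram: count * weight for ngram, count in ngram_counts.items()}


def merge_counts(*sources):
    """Sum counts of identical n-grams across sources (assumed merge behaviour)."""
    merged = defaultdict(float)
    for source in sources:
        for ngram, count in source.items():
            merged[ngram] += count
    return dict(merged)


def history_share(source, merged, history):
    """Fraction of the merged count mass for `history` that comes from `source`."""
    source_mass = sum(c for (h, _), c in source.items() if h == history)
    total_mass = sum(c for (h, _), c in merged.items() if h == history)
    return source_mass / total_mass


# Made-up bigram counts keyed by (history, word).
domain = {("blood", "pressure"): 90, ("blood", "test"): 10, ("the", "patient"): 5}
generic = {("blood", "pressure"): 2, ("blood", "bank"): 8, ("the", "weather"): 95}

# Unweighted merge: the domain source dominates "blood" but not "the".
merged = merge_counts(domain, generic)
share_blood = history_share(domain, merged, "blood")
share_the = history_share(domain, merged, "the")

# Doubling the domain counts shifts every history towards the domain source.
domain_x2 = scale_counts(domain, 2.0)
merged_x2 = merge_counts(domain_x2, generic)
share_the_x2 = history_share(domain_x2, merged_x2, "the")

print(round(share_blood, 3), round(share_the, 3), round(share_the_x2, 3))
```

Because the weights act on counts, a factor of 2.0 does not mean "twice the probability"; its effect depends on how large each source is, which is the caveat raised in the reply.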
  
