
Language Model Interpolation

  • Tony Esposito - 2014-12-16

    Hi!

    I have two different sets of text, one from my domain and the other generic English, and I want to interpolate them, meaning I want a weighted sum of the two models, e.g. 0.8 * my_domain + 0.2 * Generic_English.

    Does kaldi_lm support that? If not, where would be a good place for me to go into the algorithm and change it to get what I want?

    Thank you!

  • Tony Esposito - 2014-12-17

    Thank you, this is a possible option.

    I still wonder, though: does kaldi_lm support that, and if not, where would be a good place for me to go into the algorithm and change it to get what I want?

    • Daniel Povey - 2014-12-17

      Hi,

      Actually, the whole reason I created kaldi_lm was that I wanted a better way to merge different LM sources, but I never got around to implementing this at the script level for any example script.

      The problem I was trying to solve is this: when you merge two different LM sources by standard interpolation, you use the same weight in every n-gram history state. That doesn't take into account the fact that if source A has many more examples of a history state than source B, you probably want to give source A a higher weight in that history state.
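
      Concretely (just sketching the idea, with lambda a fixed interpolation weight and c_A(h), c_B(h) the discounted counts of history h in the two sources): standard interpolation computes

          P(w \mid h) = \lambda \, P_A(w \mid h) + (1 - \lambda) \, P_B(w \mid h)

      with the same lambda for every history h, whereas merging the counts behaves roughly like a per-history weight

          \lambda_h \approx \frac{c_A(h)}{c_A(h) + c_B(h)}

      so a history that source A has seen far more often automatically leans more on A's distribution.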
      The way to do the merging in kaldi_lm is to merge the two sources at the "ngrams_disc.gz" stage, where you have discounted n-gram counts, before turning them into an ARPA file. The way to merge these files is to cat them, sort them, and pipe them through merge_ngrams. At this stage you can also weight one of the sources if you want to apply corpus weights, but bear in mind that these are counts, not probabilities, so you should take into account how large the two data sources were. The weighting can be done in awk; I think the count field is the first or last field.
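
      A rough sketch of what that pipeline could look like (untested; the directory names, the count being the last field, and merge_ngrams reading sorted counts on stdin are all assumptions to verify against your own build):

          # Merge discounted n-gram counts from two kaldi_lm builds,
          # scaling source B's counts by an example corpus weight.
          # Assumptions: the count is the last whitespace-separated field
          # (as noted above, it may be the first instead), and merge_ngrams
          # reads sorted lines on stdin and writes merged lines to stdout.
          w=0.25   # example weight for source B
          ( gunzip -c dirA/ngrams_disc.gz
            gunzip -c dirB/ngrams_disc.gz | awk -v w=$w '{ $NF = $NF * w; print }'
          ) | LC_ALL=C sort | merge_ngrams | gzip -c > merged/ngrams_disc.gz
          # If the file is tab-separated, add -F'\t' -v OFS='\t' to awk so
          # the field separators are preserved.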
      The reason kaldi_lm's primary format for LMs is discounted n-gram counts rather than probabilities is to make operations like LM merging more natural. However, as I said, I never got around to doing anything with this.
      After you have the merged ngrams_disc.gz, you can continue through the rest of the train_lm.sh script. All of this only makes sense if the "word_map" (the map that maps from words to short-form words) is the same in both sources; you'll have to make sure of this at the calling-script level.
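
      A quick sanity check (assuming each build directory keeps its word map at the top level as word_map):

          cmp dirA/word_map dirB/word_map && echo "word maps match"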
      Dan
