I have two different sets of text, one from my domain and the other generic English, and I want to interpolate them, in the sense of a weighted sum, e.g. 0.8 * my_domain + 0.2 * Generic_English.
Does kaldi_lm support that? If not, where would be a good place for me to go into the algorithm and change it to get what I want?
Thank you!
You can use srilm or mitlm.
On Dec 16, 2014 6:19 AM, "Tony Esposito" antonioesposito@users.sf.net
wrote:
Thank you, this is a possible option.
I still wonder, though: does kaldi_lm support that? If not, where would be a good place for me to go into the algorithm and change it to get what I want?
Hi,
Actually, the whole reason I created kaldi_lm was because I wanted a
better way to merge different LM sources, but I never got around to
implementing this at the script level for any example script.
The problem I was trying to solve is the following: when you
merge two different LM sources in the usual way, you interpolate with
the same weight in every n-gram history state. That doesn't take into
account the fact that if source A has many more examples of a
history-state than source B, you probably want to give source A a
higher weight in that history-state.
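To make the difference concrete, here is a minimal Python sketch. It is not kaldi_lm code: discounting and backoff are ignored, and the toy counts and function names are made up for illustration. It compares fixed-weight interpolation with simply adding counts, and shows that adding counts amounts to interpolating with a weight that depends on how often each source saw the history.

```python
from collections import Counter

def cond_prob(counts, word):
    """Maximum-likelihood P(word | history) from raw counts for one history state."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def fixed_interp(counts_a, counts_b, word, lam=0.8):
    """Conventional interpolation: the same weight lam in every history state."""
    return lam * cond_prob(counts_a, word) + (1 - lam) * cond_prob(counts_b, word)

def merged_prob(counts_a, counts_b, word):
    """Count merging: add the counts first, then normalize."""
    return cond_prob(counts_a + counts_b, word)

def effective_weight_a(counts_a, counts_b):
    """Weight that count merging implicitly gives source A in this history state."""
    na, nb = sum(counts_a.values()), sum(counts_b.values())
    return na / (na + nb)

# Toy counts for a single history state (hypothetical numbers).
# Source A saw this history 90 times, source B only 10 times.
a = Counter({"has": 60, "is": 30})
b = Counter({"has": 2, "is": 8})

lam_h = effective_weight_a(a, b)      # 90 / (90 + 10)
p_merge = merged_prob(a, b, "has")    # (60 + 2) / 100
# The same value, written as an interpolation with the per-history weight:
p_check = lam_h * cond_prob(a, "has") + (1 - lam_h) * cond_prob(b, "has")
print(lam_h, p_merge, p_check, fixed_interp(a, b, "has"))
```

In a history that source B saw far more often than source A, the same code would give source B the larger weight, which is the behaviour a single global weight cannot provide.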
The way to do the merging in kaldi_lm is to merge the two sources at
the "ngrams_disc.gz" stage, where you have discounted n-gram counts,
before turning them into an ARPA. To merge these files, cat them,
sort them, and pipe them through merge_ngrams. At this stage you can
also weight one of the sources if you want to apply corpus weights,
but bear in mind that these are counts, not probabilities, so you
should take into account how large the two data sources were.
Weighting can be done in awk; I think the count field is the first or
last field.
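As a rough illustration of why corpus size matters when weighting counts, the Python sketch below computes a scale factor for one source's counts and applies it to a line. Everything here is an assumption for illustration: the function names are hypothetical, the line layout (whitespace-separated fields with the count last) has not been checked against real ngrams_disc.gz files, and the arithmetic treats the weight as a share of total count mass, ignoring discounting.

```python
def count_scale_for_target(n_a, n_b, target_weight_a):
    """Scale factor s for source B's counts so that source A carries
    target_weight_a of the total count mass: n_a / (n_a + s * n_b)."""
    if not 0.0 < target_weight_a < 1.0:
        raise ValueError("target_weight_a must be strictly between 0 and 1")
    return n_a * (1.0 - target_weight_a) / (target_weight_a * n_b)

def scale_count(line, scale, count_field=-1):
    """Multiply the count field of one whitespace-separated line by scale.

    count_field=-1 assumes the count is the LAST field (unverified);
    pass 0 if it turns out to be the first. Fields are re-joined with
    single spaces, so the original separator is not preserved.
    """
    fields = line.split()
    fields[count_field] = repr(float(fields[count_field]) * scale)
    return " ".join(fields)

# Hypothetical sizes: 1M tokens of domain text, 100M tokens of generic text.
# For an overall 0.8 / 0.2 split, the generic counts must be scaled down a lot.
s = count_scale_for_target(1_000_000, 100_000_000, 0.8)
print(s)                                  # roughly 0.0025
print(scale_count("a b c 4", 0.5))        # count in last field is halved
```

The point of the example is that leaving the generic counts unscaled would let the larger corpus dominate, which is the opposite of a 0.8 / 0.2 split in favour of the domain data.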
The reason kaldi_lm's primary format for LMs is discounted n-gram
counts rather than probabilities is to make operations like LM
merging more natural. However, as I said, I never got around to doing
anything with this.
After you have the merged ngrams_disc.gz, you can continue through the
rest of the train_lm.sh script. All of this only makes sense if the
"word_map" (the map that maps from words to short-form words) is the
same in both sources; you'll have to make sure of this at the
calling-script level.
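A calling script could check the word_map condition before merging. The Python sketch below is only an illustration: it assumes each word_map line has the form "word short-form" separated by whitespace, which has not been verified against kaldi_lm's actual files, and the function names are hypothetical.

```python
def parse_word_map(lines):
    """Parse lines of the assumed form '<word> <short-form>' into a dict."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, short = parts
        mapping[word] = short
    return mapping

def word_map_conflicts(map_a, map_b):
    """Return the words that are missing from one map or mapped differently."""
    all_words = set(map_a) | set(map_b)
    return sorted(w for w in all_words if map_a.get(w) != map_b.get(w))

# Toy maps (made-up contents) standing in for the two sources' word_map files.
domain_map = parse_word_map(["the A", "patient B", "has C"])
generic_map = parse_word_map(["the A", "patient D", "is E"])

conflicts = word_map_conflicts(domain_map, generic_map)
print(conflicts)  # any non-empty result means the counts should not be merged as-is
```

If the check reports conflicts, the two sets of discounted counts refer to different short-form symbols, so merging them would mix up unrelated words.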
Dan
On Wed, Dec 17, 2014 at 1:35 AM, Tony Esposito
antonioesposito@users.sf.net wrote: