Using ARPA N-grams both in n-best sentence rescoring and const-arpa lattice rescoring
Hello everyone,
I've been using modified versions of the 'rnnlmrescore'/'rnnlm_compute_scores' scripts for rescoring via n-best probability re-estimation. Those modifications were made so I could handle generic language models in the rescoring step.
I discovered that the lattice-to-nbest tool has parameters that greatly affect this task, namely the AM/LM weights and the n-best list length. To check the validity of my implementation, I used an SRILM-trained LM both as a const-arpa for the 'lattice-lmrescore-const-arpa' tool and as a generic language model in the modified rescoring scripts, then compared the results of both rescoring recipes in terms of WER.
However, on some validation datasets, there is a big difference between the WER reduction from full-lattice const-arpa rescoring and from n-best rescoring, even with optimal parameters (a large enough n-best length, the best AM/LM weight estimated on the first-pass evaluation). In online decoding, for instance, n-best rescoring doesn't reduce the WER at all; it increases it instead.
Is this expected behavior? Is it in any way similar to the rnnlm rescoring behavior seen by those who used the original scripts?
Thanks,
Last edit: Akira Miasato 2015-05-19
Doing language model rescoring by rescoring n-best lists is always
going to be approximate because the n-best list (for any reasonable n)
can only represent a tiny portion of the variety in the lattice. It's
important to generate it using an lmwt/acwt ratio similar to the
optimal one.
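As a rough sketch of what that means with the standard binaries (the
paths, the n-best size, and the acoustic scale below are placeholder
values, not taken from any particular recipe):

  # Generate the n-best list at the scaling that was optimal at decode time,
  # e.g. acoustic-scale=0.1 if lmwt=10 scored best on the first pass.
  lattice-to-nbest --acoustic-scale=0.1 --n=100 \
    "ark:gunzip -c exp/decode/lat.1.gz |" ark:exp/decode/nbest.1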
At some point Guoguo Chen (cc'd) is going to work on an RNNLM
rescoring strategy based on lattice rescoring rather than n-best
lists, which will be much closer to exact.
Dan
Thanks for the informative answer.
On another note, are there any big differences between lattices generated by online and offline recipes? It seems to me that it is very difficult to beat the 1st-pass in online decoding using n-best rescoring.
The format of the lattices is the same. Make sure you are using the
correct LM-scale (normally the inverse of the acoustic scale you
decoded with) to get the n-best list. In this case the 1-best from
the lattice, before LM rescoring, should be the same as the decoded
output; it's a good idea to verify this. And because it will be
difficult to replicate the acoustic computations that go on in the
online decoding, make sure you don't use any of the modes of language
model rescoring that touch the acoustic scores. (If you don't know
what this means, it means basically don't use any program with 'gmm'
or 'nnet' in its name; probably you're not anyway so just ignore this
if you find it confusing).
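A quick sanity check along those lines (paths and the scale are
placeholders; use the acoustic scale you actually decoded and scored
with):

  # The best path through the un-rescored lattice at the decode-time scale
  # should reproduce the first-pass output word for word.
  lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/decode/lat.1.gz |" ark,t:- | \
    utils/int2sym.pl -f 2- data/lang/words.txt > onebest_check.txt
  # Compare onebest_check.txt against the first-pass hypotheses.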
There are a lot of things that can go wrong in LM rescoring, and there
are different strategies of LM rescoring demonstrated in the scripts.
Kaldi lattices have two costs per arc: the "acoustic" cost, and the
"graph" cost. The graph cost contains transition probabilities,
lexicon costs and language model scores.
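You can see the two costs directly by dumping a lattice in text form
(paths here are placeholders):

  # Each arc line reads: from-state to-state word-id graph-cost,acoustic-cost,transition-ids
  lattice-copy "ark:gunzip -c exp/decode/lat.1.gz |" ark,t:- | head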
One strategy is to completely remove the "graph" part of the score,
and reconstruct it by adding transition-model scores, and composing
with the lexicon and then the LM. Another strategy is to subtract the
score from the "old" LM (the one that you decoded with) and then add
in the score from the "new" LM (the one that you want to use). Of
course if you want interpolation you can just subtract a constant
times the "old" LM score and add in another constant times the new LM
score.
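A minimal sketch of the second strategy, roughly the way the standard
const-arpa rescoring script does it (the paths are placeholders and
assume an old G.fst and a new G.carpa are already built):

  # Subtract the old LM scores (lm-scale=-1.0), then add the new const-arpa
  # LM scores; change the two --lm-scale values (e.g. -0.5 and 0.5) to
  # subtract/add only a constant times each score, i.e. to interpolate.
  lattice-lmrescore --lm-scale=-1.0 \
    "ark:gunzip -c exp/decode/lat.1.gz |" \
    "fstproject --project_output=true data/lang/G.fst |" ark:- | \
  lattice-lmrescore-const-arpa --lm-scale=1.0 \
    ark:- data/lang_rescore/G.carpa "ark:|gzip -c > exp/decode_rescore/lat.1.gz"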
Dan