From: Joan P. <joa...@gm...> - 2015-05-06 17:01:51
Hi,

I implemented a simple version of the forward and backward algorithms, in order to support Baum-Welch training in Kaldi. There is still plenty of work to do, mostly in terms of speed, but I managed to replicate HTK's F/B results (average likelihood per frame and HMM-state occupancy statistics) on both toy examples and real handwritten samples. I apologize for the length of this email, but I wanted to explain the details of my implementation and the ideas I have to improve it.

1. High-level description
=========================

a) Compile the training graphs, as you would for Viterbi training:

$ compile-train-graphs --self-loop-scale=0 --transition-scale=0 \
    tree 0.mdl lex.fst ark:train.transc.ark ark:train.fsts.ark

b) Get the transition-id posteriors for your training data:

$ gmm-fb-compiled --self-loop-scale=1 --transition-scale=1 \
    0.mdl ark:train.fsts.ark ark:train.features.ark ark:0.posteriors.ark

Source code: https://github.com/jpuigcerver/kaldi/tree/em/trunk/src/gmmbin/gmm-fb-compiled.cc

This program is similar to gmm-align-compiled; the idea is that instead of obtaining a list of transition-ids for each training sample, you get a list of posteriors over the transition-ids for each training sample. This is the only thing you need for the MLE re-estimation.

c) Accumulate statistics from the posteriors, using gmm-acc-stats as usual:

$ gmm-acc-stats 0.mdl ark:train.features.ark ark:0.posteriors.ark 0.acc

d) MLE re-estimation:

$ gmm-est 0.mdl 0.acc 1.mdl

2. Implementation details
=========================

gmm-fb-compiled uses the forward/backward algorithm to compute the transition-id posteriors. The forward and backward passes are implemented, for now, in two separate classes that can be found in: https://github.com/jpuigcerver/kaldi/tree/em/trunk/src/fb

Both the SimpleForward and SimpleBackward classes are very similar to SimpleDecoder, so they should not be difficult to read: basically, I keep a list of unordered_maps (one per timestep) of active Tokens, similar to the one used in the Viterbi algorithm. But instead of storing the cost of the minimum-cost path to a given state, I store the log-sum-exp over all paths reaching that state (see the first sketch below).

The most significant difference is in the ProcessNonemitting() method of both classes: the Viterbi algorithm works in the tropical semiring, whose "plus" operation (min) is idempotent, whereas forward/backward work in the log semiring, whose "plus" operation (log-sum-exp) is not. As a result, some simplifications that are valid for the Viterbi algorithm do not apply to forward/backward. To ensure that the forward/backward scores are computed correctly even when the input WFST contains epsilon transitions or epsilon cycles, I implemented the same algorithm that OpenFst's fstshortestdistance uses (see the second sketch below); see [1] for more details about the algorithm.
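To make the difference with SimpleDecoder concrete, here is a minimal sketch of the kind of emitting-arc update I mean, in the log semiring. All the names (Token, Propagate, the maps) are illustrative simplifications, not the actual code in fb/; Kaldi already provides an equivalent LogAdd() in base/kaldi-math.h.

#include <cmath>
#include <limits>
#include <unordered_map>
#include <utility>

const double kLogZero = -std::numeric_limits<double>::infinity();

// log(e^a + e^b), computed stably.
double LogAdd(double a, double b) {
  if (a < b) std::swap(a, b);  // ensure a >= b
  if (b == kLogZero) return a;
  return a + std::log1p(std::exp(b - a));
}

struct Token {
  double log_alpha;  // log-sum-exp of the scores of all paths into a state
};

// One emitting-arc update of the forward pass: where Viterbi keeps only the
// best incoming path (min cost), the forward pass accumulates all of them.
void Propagate(int next_state, double arc_loglike, const Token &tok,
               std::unordered_map<int, Token> *next_toks) {
  double new_alpha = tok.log_alpha + arc_loglike;        // semiring "times"
  auto ins = next_toks->emplace(next_state, Token{new_alpha});
  if (!ins.second)  // state already active: sum, do not take the best
    ins.first->second.log_alpha =
        LogAdd(ins.first->second.log_alpha, new_alpha);  // semiring "plus"
}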
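And this is a rough sketch of the generic shortest-distance algorithm of [1] restricted to the epsilon sub-graph, which is the problem that ProcessNonemitting() has to solve. Because the log semiring is not idempotent, states are re-queued until the weight still waiting to be propagated (the "residual") contributes less than a small delta; this is what makes epsilon cycles terminate. Again, every name here is illustrative, not the actual fb/ code.

#include <cmath>
#include <deque>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

const double kLogZero = -std::numeric_limits<double>::infinity();

double LogAdd(double a, double b) {  // log(e^a + e^b), as in the sketch above
  if (a < b) std::swap(a, b);
  return b == kLogZero ? a : a + std::log1p(std::exp(b - a));
}

struct EpsArc { int next_state; double log_weight; };

// dist: on input, the log scores of the states active before the closure;
// on output, the same scores with all epsilon paths (including cycles)
// summed in. eps_arcs[s] lists the epsilon arcs leaving state s.
void EpsilonClosure(const std::vector<std::vector<EpsArc>> &eps_arcs,
                    std::unordered_map<int, double> *dist,
                    double delta = 1e-8) {
  std::unordered_map<int, double> resid(*dist);  // weight not yet propagated
  std::deque<int> queue;
  std::unordered_set<int> queued;
  for (const auto &kv : *dist) {
    queue.push_back(kv.first);
    queued.insert(kv.first);
  }
  while (!queue.empty()) {
    int s = queue.front(); queue.pop_front(); queued.erase(s);
    double rr = resid[s];
    resid[s] = kLogZero;
    for (const EpsArc &arc : eps_arcs[s]) {
      double add = rr + arc.log_weight;  // semiring "times"
      auto it = dist->emplace(arc.next_state, kLogZero).first;
      double new_d = LogAdd(it->second, add);
      // Re-propagate only if this arc changed the distance significantly.
      if (it->second == kLogZero ? add != kLogZero
                                 : new_d - it->second > delta) {
        it->second = new_d;
        auto rit = resid.emplace(arc.next_state, kLogZero).first;
        rit->second = LogAdd(rit->second, add);
        if (queued.insert(arc.next_state).second)
          queue.push_back(arc.next_state);
      }
    }
  }
}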
3. Experiments
==============

I did some experiments to check my implementation. On toy examples, I used OpenFst's fstshortestdistance tool as a reference for the forward/backward pass. These toy examples include some not-so-trivial WFSTs with epsilon transitions, and even epsilon loops and cycles; WFSTs of this kind cannot be handled by HTK. I also compared the occupancy statistics produced by HTK with the ones obtained by gmm-acc-stats after the first iteration of my Baum-Welch recipe, using real handwritten text recognition data. Additionally, I implemented several toy tests that can be run with fb/simple-forward-test, fb/simple-backward-test and fb/simple-posteriors-test.

You can find the files required to run the comparison with HTK in:

https://www.prhlt.upv.es/~jpuigcerver/kaldi_bentham.tar.gz
https://www.prhlt.upv.es/~jpuigcerver/htk_bentham.tar.gz

The log-likelihood of each training sample, and the average per frame, are in this file:

https://www.prhlt.upv.es/~jpuigcerver/kaldi_vs_htk_likelihood_comparison.txt

And the summaries of the HMM-state occupancies of both systems are here:

https://www.prhlt.upv.es/~jpuigcerver/kaldi_hmmstate_occupancy.txt
https://www.prhlt.upv.es/~jpuigcerver/htk_hmmstate_occupancy.txt

As you can see, both results are very similar. I assume the differences are due to implementation details.

4. Improvement ideas
====================

a) The current implementation is about 20-30 times slower than HTK's. Possible reasons:

a.1) The backward pass is most likely the bottleneck, and beam pruning alone would not help with it. The problem is OpenFst's traversal of the arcs: there is no efficient way to traverse the incoming arcs of a particular state (I mean in O(k), where k is the number of arcs entering that state); ArcIterator only traverses the outgoing arcs of a state. This is a problem in the backward pass, since I basically need to traverse all arcs in the WFST to detect those which enter a particular active state (see the ProcessEmitting/ProcessNonemitting code in SimpleBackward.cc). A possible solution would be to work with the transpose of the WFST in the backward pass, similar to what [2] does; unfortunately, this would increase the memory cost. Another option, which would not increase the memory cost so dramatically, would be to implement a custom WFST class that stores both the incoming and the outgoing arcs of each state, but this would certainly require more programming effort (a precomputed index of incoming arcs, as in the first sketch below, would be a cheap middle ground).

a.2) HTK always does beam pruning in the forward pass: it first does the backward pass; then, while doing the forward pass, at a given time t and state s, one can check whether beta(t, s) == 0. If so, even if alpha(t, s) > 0, there is no point in keeping this token alive, since we know that it will not lead to any final state (see the second sketch below). Does anyone know why HTK does the backward pass first? AFAIK, the same pruning could be done if the forward pass were done first: one would just prune the backward pass instead.

a.3) This is more a question than an observation: does DecodableAmDiagGmm cache the GMM log-likelihood computations? If not, caching them would probably also improve the running time of the F/B passes, since both passes query the same likelihoods (see the third sketch below).

b) Memory usage can also be improved: the implementation consumes 3-4 times more memory than HTK. This is not as critical as the speed, but improving it would certainly be appreciated. Right now I am keeping two copies of the trellis in RAM: one for the forward pass and one for the backward pass. I think only one copy is needed (say, for the forward pass), plus a few extra rows of the trellis (say, for t and t - 1). If the forward scores are available, one can simply compute alpha(s, t) * beta(s, t) on the fly during the backward pass to obtain the state posteriors, or alpha(s, t) * beta(q, t + 1), together with the weight and acoustic likelihood of the arc from s to q, to obtain the edge posteriors (see the last sketch below).
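Regarding a.1, this is a minimal sketch of such a precomputed incoming-arc index, built in a single O(V + E) pass with the standard OpenFst iterators; it costs one extra copy of the arcs, and is essentially the "arcs in both directions" idea kept outside the FST. IncomingArc and BuildReverseIndex are hypothetical names, and I assume a Kaldi-style StdVectorFst training graph.

#include <vector>
#include <fst/fstlib.h>

// One incoming arc of a state, as seen from the backward pass.
// This helper type is not part of OpenFst.
struct IncomingArc {
  int prev_state;    // source state of the original arc
  int ilabel;        // transition-id, in a Kaldi training graph
  float log_weight;  // -arc.weight.Value(): graph cost as a log-probability
};

// For every state of the graph, collect the arcs that enter it, so that the
// backward pass can visit the predecessors of an active state directly
// instead of scanning every arc of the WFST.
std::vector<std::vector<IncomingArc>> BuildReverseIndex(
    const fst::StdVectorFst &fst) {
  std::vector<std::vector<IncomingArc>> incoming(fst.NumStates());
  for (fst::StateIterator<fst::StdVectorFst> siter(fst); !siter.Done();
       siter.Next()) {
    const int s = siter.Value();
    for (fst::ArcIterator<fst::StdVectorFst> aiter(fst, s); !aiter.Done();
         aiter.Next()) {
      const fst::StdArc &arc = aiter.Value();
      incoming[arc.nextstate].push_back(
          {s, static_cast<int>(arc.ilabel), -arc.weight.Value()});
    }
  }
  return incoming;
}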
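Regarding a.2, the pruning check itself is trivial once one of the two passes has been stored; a minimal sketch, assuming the backward scores are kept as one map per frame, with absent states meaning beta(t, s) == 0 (all names illustrative):

#include <unordered_map>
#include <vector>

typedef std::unordered_map<int, double> LogRow;  // state -> log score

// Returns true if a forward token for (t, state) is worth keeping, i.e. if
// some path from this state still reaches a final state according to beta.
bool KeepForwardToken(const std::vector<LogRow> &beta, int t, int state) {
  return beta[t].count(state) > 0;
}

Note that the check is symmetric, which matches the observation above: whichever pass runs first simply defines the set of states the second pass is allowed to keep alive.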
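Regarding a.3, if it turned out that the likelihoods are not cached, a memoizing wrapper would be easy to add. This is a hypothetical sketch, not existing Kaldi code, and it assumes the usual DecodableInterface methods (LogLikelihood(frame, index), IsLastFrame(), NumIndices()):

#include <unordered_map>
#include "itf/decodable-itf.h"

namespace kaldi {

// Memoizes the log-likelihoods of any wrapped decodable, so that repeated
// queries for the same (frame, index) pair, e.g. one from the forward pass
// and one from the backward pass, compute the GMM likelihood only once.
class CachingDecodable : public DecodableInterface {
 public:
  explicit CachingDecodable(DecodableInterface *base) : base_(base) {}

  virtual BaseFloat LogLikelihood(int32 frame, int32 index) {
    int64 key = (static_cast<int64>(frame) << 32) | static_cast<uint32>(index);
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;
    BaseFloat ans = base_->LogLikelihood(frame, index);
    cache_[key] = ans;
    return ans;
  }
  virtual bool IsLastFrame(int32 frame) const {
    return base_->IsLastFrame(frame);
  }
  virtual int32 NumIndices() const { return base_->NumIndices(); }

 private:
  DecodableInterface *base_;                    // not owned
  std::unordered_map<int64, BaseFloat> cache_;  // (frame, index) -> loglike
};

}  // namespace kaldi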
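Finally, regarding b), this is a sketch of the single-trellis scheme in the log domain: keep the full forward trellis, but only the current row of the backward trellis, and emit log gamma(t, s) = alpha(t, s) + beta(t, s) - log P(X) while sweeping t backwards. BackwardStep stands for the per-frame update of SimpleBackward (beta(t, .) -> beta(t - 1, .)); all the names are illustrative, and the posteriors could equally be streamed out frame by frame instead of stored.

#include <functional>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<int, double> LogRow;  // state -> log score

void PosteriorsOnTheFly(
    const std::vector<LogRow> &alpha,  // full forward trellis, rows 0..T
    LogRow beta,                       // beta(T, .): log final weights
    double total_loglike,              // log P(X), from the forward pass
    const std::function<LogRow(const LogRow &, int)> &BackwardStep,
    std::vector<LogRow> *gamma) {
  const int T = static_cast<int>(alpha.size()) - 1;
  gamma->assign(alpha.size(), LogRow());
  for (int t = T; t >= 0; --t) {
    for (const auto &kv : alpha[t]) {
      auto it = beta.find(kv.first);
      if (it == beta.end()) continue;  // beta(t, s) == 0: dead state
      (*gamma)[t][kv.first] = kv.second + it->second - total_loglike;
    }
    if (t > 0) beta = BackwardStep(beta, t);  // roll back one frame
  }
}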
I would like to hear comments and suggestions regarding the current implementation and, especially, the ideas I have to improve its speed and memory usage. I would be very glad to hear from Kaldi's main developers, who probably have much more experience implementing Baum-Welch training than I do, and know many implementation tricks. Also, since I aspire to include my code in the Kaldi main branch at some point, it should comply with Kaldi's code standards.

Thank you.

[1] Mohri, Mehryar. "Semiring frameworks and algorithms for shortest-distance problems." Journal of Automata, Languages and Combinatorics 7.3 (2002): 321-350.

[2] Hannemann, Mirko, Daniel Povey, and Geoffrey Zweig. "Combining forward and backward search in decoding." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.

2015-04-01 23:30 GMT+02:00 Daniel Povey <dp...@gm...>:
>> However, I am not so sure how to implement the backward algorithm, since I
>> must traverse the edges in the FST backwards (to do the backward pass in
>> O(T * (V + E))), and OpenFst does not support this AFAIK. Also, I am not
>> sure if simply transposing the FST would work, since I would have many
>> initial states... Any suggestion on that?
>
> It should be possible to handle this by arranging your code in the right
> way, e.g. first iterate over the start-state of the arc.
>
>> > This would create a difficulty for converting alignments though (this
>> > happens when bootstrapping later systems, e.g. starting tri2 from tri1).
>> > You would probably have to just do Viterbi for that one stage.
>>
>> I am not sure what you mean. Could you extend your explanation or point
>> to a recipe where you had to overcome this difficulty?
>
> I mean the program convert-ali wouldn't work unless you had alignments.
> Dan
>
>> Many thanks for your help and advice,
>>
>> Joan Puigcerver.