From: Joan P. <joa...@gm...> - 2015-05-06 17:01:51
Hi,

I implemented a simple version of the forward and backward algorithms, in order to support Baum-Welch training in Kaldi. There is still plenty of work to do, mostly in terms of speed, but I managed to replicate HTK's F/B results (average likelihood per frame and HMM-state occupancy statistics) on both toy examples and real handwritten samples. I apologize for the length of this email, but I wanted to explain the details of my implementation and the ideas I have to improve it.

1. High-level description
=========================

a) Compile the training graphs, as you would for Viterbi training:

$ compile-train-graphs --self-loop-scale=0 --transition-scale=0 \
    tree 0.mdl lex.fst ark:train.transc.ark ark:train.fsts.ark

b) Get the transition-id posteriors for your training data:

$ gmm-fb-compiled --self-loop-scale=1 --transition-scale=1 \
    0.mdl ark:train.fsts.ark ark:train.features.ark ark:0.posteriors.ark

Source code: https://github.com/jpuigcerver/kaldi/tree/em/trunk/src/gmmbin/gmm-fb-compiled.cc

This program is similar to gmm-align-compiled; the idea is that instead of obtaining a list of transition-ids for each training sample, you get a list of posteriors over the transition-ids for each training sample. This is the only thing you need for the MLE re-estimation.

c) Accumulate statistics from the posteriors, using gmm-acc-stats as usual:

$ gmm-acc-stats 0.mdl ark:train.features.ark ark:0.posteriors.ark 0.acc

d) MLE re-estimation:

$ gmm-est 0.mdl 0.acc 1.mdl

2. Implementation details
=========================

gmm-fb-compiled uses the forward/backward algorithm to compute the transition-id posteriors. The forward and backward passes are implemented, for now, in two separate classes that can be found in: https://github.com/jpuigcerver/kaldi/tree/em/trunk/src/fb

Both the SimpleForward and SimpleBackward classes are very similar to SimpleDecoder, so they should not be difficult to read: basically, I keep a list of unordered_maps (one per timestep) of active Tokens, similar to the one used in the Viterbi algorithm. But instead of storing the cost of the minimum-cost path to a given state, I store the log-sum-exp over all paths reaching that state (see the first sketch below).

The most significant difference is in the ProcessNonemitting() method of both classes: the Viterbi algorithm works in the tropical semiring, whose "plus" operation (min) is idempotent, whereas forward/backward work in the log semiring, whose "plus" operation (log-sum-exp) is not. As a result, some simplifications that are valid for the Viterbi algorithm do not apply to forward/backward. To ensure that the forward/backward scores are computed correctly even when the input WFST contains epsilon transitions or epsilon cycles, I implemented the same algorithm that OpenFst's fstshortestdistance uses (see the second sketch below); see [1] for more details about the algorithm.
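To make the difference with SimpleDecoder concrete, here is a minimal sketch of the kind of emitting-arc update I mean, in the log semiring. All the names (Token, Propagate, the maps) are illustrative simplifications, not the actual code in fb/; Kaldi already provides an equivalent LogAdd() in base/kaldi-math.h.

#include <cmath>
#include <limits>
#include <unordered_map>
#include <utility>

const double kLogZero = -std::numeric_limits<double>::infinity();

// log(e^a + e^b), computed stably.
double LogAdd(double a, double b) {
  if (a < b) std::swap(a, b);  // ensure a >= b
  if (b == kLogZero) return a;
  return a + std::log1p(std::exp(b - a));
}

struct Token {
  double log_alpha;  // log-sum-exp of the scores of all paths into a state
};

// One emitting-arc update of the forward pass: where Viterbi keeps only the
// best incoming path (min cost), the forward pass accumulates all of them.
void Propagate(int next_state, double arc_loglike, const Token &tok,
               std::unordered_map<int, Token> *next_toks) {
  double new_alpha = tok.log_alpha + arc_loglike;        // semiring "times"
  auto ins = next_toks->emplace(next_state, Token{new_alpha});
  if (!ins.second)  // state already active: sum, do not take the best
    ins.first->second.log_alpha =
        LogAdd(ins.first->second.log_alpha, new_alpha);  // semiring "plus"
}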
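And this is a rough sketch of the generic shortest-distance algorithm of [1] restricted to the epsilon sub-graph, which is the problem that ProcessNonemitting() has to solve. Because the log semiring is not idempotent, states are re-queued until the weight still waiting to be propagated (the "residual") contributes less than a small delta; this is what makes epsilon cycles terminate. Again, every name here is illustrative, not the actual fb/ code.

#include <cmath>
#include <deque>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

const double kLogZero = -std::numeric_limits<double>::infinity();

double LogAdd(double a, double b) {  // log(e^a + e^b), as in the sketch above
  if (a < b) std::swap(a, b);
  return b == kLogZero ? a : a + std::log1p(std::exp(b - a));
}

struct EpsArc { int next_state; double log_weight; };

// dist: on input, the log scores of the states active before the closure;
// on output, the same scores with all epsilon paths (including cycles)
// summed in. eps_arcs[s] lists the epsilon arcs leaving state s.
void EpsilonClosure(const std::vector<std::vector<EpsArc>> &eps_arcs,
                    std::unordered_map<int, double> *dist,
                    double delta = 1e-8) {
  std::unordered_map<int, double> resid(*dist);  // weight not yet propagated
  std::deque<int> queue;
  std::unordered_set<int> queued;
  for (const auto &kv : *dist) {
    queue.push_back(kv.first);
    queued.insert(kv.first);
  }
  while (!queue.empty()) {
    int s = queue.front(); queue.pop_front(); queued.erase(s);
    double rr = resid[s];
    resid[s] = kLogZero;
    for (const EpsArc &arc : eps_arcs[s]) {
      double add = rr + arc.log_weight;  // semiring "times"
      auto it = dist->emplace(arc.next_state, kLogZero).first;
      double new_d = LogAdd(it->second, add);
      // Re-propagate only if this arc changed the distance significantly.
      if (it->second == kLogZero ? add != kLogZero
                                 : new_d - it->second > delta) {
        it->second = new_d;
        auto rit = resid.emplace(arc.next_state, kLogZero).first;
        rit->second = LogAdd(rit->second, add);
        if (queued.insert(arc.next_state).second)
          queue.push_back(arc.next_state);
      }
    }
  }
}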
3. Experiments
==============

I did some experiments to check my implementation. On toy examples, I used OpenFst's fstshortestdistance tool as a reference for the forward/backward pass. These toy examples include some not-so-trivial WFSTs with epsilon transitions, and even epsilon loops and cycles; WFSTs of this kind cannot be handled by HTK. I also compared the occupancy statistics produced by HTK with the ones obtained by gmm-acc-stats after the first iteration of my Baum-Welch recipe, using real handwritten text recognition data. Additionally, I implemented several toy tests that can be run with fb/simple-forward-test, fb/simple-backward-test and fb/simple-posteriors-test.

You can find the files required to run the comparison with HTK in:

https://www.prhlt.upv.es/~jpuigcerver/kaldi_bentham.tar.gz
https://www.prhlt.upv.es/~jpuigcerver/htk_bentham.tar.gz

The log-likelihood of each training sample, and the average per frame, are in this file:

https://www.prhlt.upv.es/~jpuigcerver/kaldi_vs_htk_likelihood_comparison.txt

And the summaries of the HMM-state occupancies of both systems are here:

https://www.prhlt.upv.es/~jpuigcerver/kaldi_hmmstate_occupancy.txt
https://www.prhlt.upv.es/~jpuigcerver/htk_hmmstate_occupancy.txt

As you can see, both results are very similar. I assume the differences are due to implementation details.

4. Improvement ideas
====================

a) The current implementation is about 20-30 times slower than HTK's. Possible reasons:

a.1) The backward pass is most likely the bottleneck, and beam pruning alone would not help with it. The problem is OpenFst's traversal of the arcs: there is no efficient way to traverse the incoming arcs of a particular state (I mean in O(k), where k is the number of arcs entering that state); ArcIterator only traverses the outgoing arcs of a state. This is a problem in the backward pass, since I basically need to traverse all arcs in the WFST to detect those which enter a particular active state (see the ProcessEmitting/ProcessNonemitting code in SimpleBackward.cc). A possible solution would be to work with the transpose of the WFST in the backward pass, similar to what [2] does; unfortunately, this would increase the memory cost. Another option, which would not increase the memory cost so dramatically, would be to implement a custom WFST class that stores both the incoming and the outgoing arcs of each state, but this would certainly require more programming effort (a precomputed index of incoming arcs, as in the first sketch below, would be a cheap middle ground).

a.2) HTK always does beam pruning in the forward pass: it first does the backward pass; then, while doing the forward pass, at a given time t and state s, one can check whether beta(t, s) == 0. If so, even if alpha(t, s) > 0, there is no point in keeping this token alive, since we know that it will not lead to any final state (see the second sketch below). Does anyone know why HTK does the backward pass first? AFAIK, the same pruning could be done if the forward pass were done first: one would just prune the backward pass instead.

a.3) This is more a question than an observation: does DecodableAmDiagGmm cache the GMM log-likelihood computations? If not, caching them would probably also improve the running time of the F/B passes, since both passes query the same likelihoods (see the third sketch below).

b) Memory usage can also be improved: the implementation consumes 3-4 times more memory than HTK. This is not as critical as the speed, but improving it would certainly be appreciated. Right now I am keeping two copies of the trellis in RAM: one for the forward pass and one for the backward pass. I think only one copy is needed (say, for the forward pass), plus a few extra rows of the trellis (say, for t and t - 1). If the forward scores are available, one can simply compute alpha(s, t) * beta(s, t) on the fly during the backward pass to obtain the state posteriors, or alpha(s, t) * beta(q, t + 1), together with the weight and acoustic likelihood of the arc from s to q, to obtain the edge posteriors (see the last sketch below).
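Regarding a.1, this is a minimal sketch of such a precomputed incoming-arc index, built in a single O(V + E) pass with the standard OpenFst iterators; it costs one extra copy of the arcs, and is essentially the "arcs in both directions" idea kept outside the FST. IncomingArc and BuildReverseIndex are hypothetical names, and I assume a Kaldi-style StdVectorFst training graph.

#include <vector>
#include <fst/fstlib.h>

// One incoming arc of a state, as seen from the backward pass.
// This helper type is not part of OpenFst.
struct IncomingArc {
  int prev_state;    // source state of the original arc
  int ilabel;        // transition-id, in a Kaldi training graph
  float log_weight;  // -arc.weight.Value(): graph cost as a log-probability
};

// For every state of the graph, collect the arcs that enter it, so that the
// backward pass can visit the predecessors of an active state directly
// instead of scanning every arc of the WFST.
std::vector<std::vector<IncomingArc>> BuildReverseIndex(
    const fst::StdVectorFst &fst) {
  std::vector<std::vector<IncomingArc>> incoming(fst.NumStates());
  for (fst::StateIterator<fst::StdVectorFst> siter(fst); !siter.Done();
       siter.Next()) {
    const int s = siter.Value();
    for (fst::ArcIterator<fst::StdVectorFst> aiter(fst, s); !aiter.Done();
         aiter.Next()) {
      const fst::StdArc &arc = aiter.Value();
      incoming[arc.nextstate].push_back(
          {s, static_cast<int>(arc.ilabel), -arc.weight.Value()});
    }
  }
  return incoming;
}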
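Regarding a.2, the pruning check itself is trivial once one of the two passes has been stored; a minimal sketch, assuming the backward scores are kept as one map per frame, with absent states meaning beta(t, s) == 0 (all names illustrative):

#include <unordered_map>
#include <vector>

typedef std::unordered_map<int, double> LogRow;  // state -> log score

// Returns true if a forward token for (t, state) is worth keeping, i.e. if
// some path from this state still reaches a final state according to beta.
bool KeepForwardToken(const std::vector<LogRow> &beta, int t, int state) {
  return beta[t].count(state) > 0;
}

Note that the check is symmetric, which matches the observation above: whichever pass runs first simply defines the set of states the second pass is allowed to keep alive.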
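Regarding a.3, if it turned out that the likelihoods are not cached, a memoizing wrapper would be easy to add. This is a hypothetical sketch, not existing Kaldi code, and it assumes the usual DecodableInterface methods (LogLikelihood(frame, index), IsLastFrame(), NumIndices()):

#include <unordered_map>
#include "itf/decodable-itf.h"

namespace kaldi {

// Memoizes the log-likelihoods of any wrapped decodable, so that repeated
// queries for the same (frame, index) pair, e.g. one from the forward pass
// and one from the backward pass, compute the GMM likelihood only once.
class CachingDecodable : public DecodableInterface {
 public:
  explicit CachingDecodable(DecodableInterface *base) : base_(base) {}

  virtual BaseFloat LogLikelihood(int32 frame, int32 index) {
    int64 key = (static_cast<int64>(frame) << 32) | static_cast<uint32>(index);
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;
    BaseFloat ans = base_->LogLikelihood(frame, index);
    cache_[key] = ans;
    return ans;
  }
  virtual bool IsLastFrame(int32 frame) const {
    return base_->IsLastFrame(frame);
  }
  virtual int32 NumIndices() const { return base_->NumIndices(); }

 private:
  DecodableInterface *base_;                    // not owned
  std::unordered_map<int64, BaseFloat> cache_;  // (frame, index) -> loglike
};

}  // namespace kaldi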
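Finally, regarding b), this is a sketch of the single-trellis scheme in the log domain: keep the full forward trellis, but only the current row of the backward trellis, and emit log gamma(t, s) = alpha(t, s) + beta(t, s) - log P(X) while sweeping t backwards. BackwardStep stands for the per-frame update of SimpleBackward (beta(t, .) -> beta(t - 1, .)); all the names are illustrative, and the posteriors could equally be streamed out frame by frame instead of stored.

#include <functional>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<int, double> LogRow;  // state -> log score

void PosteriorsOnTheFly(
    const std::vector<LogRow> &alpha,  // full forward trellis, rows 0..T
    LogRow beta,                       // beta(T, .): log final weights
    double total_loglike,              // log P(X), from the forward pass
    const std::function<LogRow(const LogRow &, int)> &BackwardStep,
    std::vector<LogRow> *gamma) {
  const int T = static_cast<int>(alpha.size()) - 1;
  gamma->assign(alpha.size(), LogRow());
  for (int t = T; t >= 0; --t) {
    for (const auto &kv : alpha[t]) {
      auto it = beta.find(kv.first);
      if (it == beta.end()) continue;  // beta(t, s) == 0: dead state
      (*gamma)[t][kv.first] = kv.second + it->second - total_loglike;
    }
    if (t > 0) beta = BackwardStep(beta, t);  // roll back one frame
  }
}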
I would like to hear comments and suggestions regarding the current implementation and, especially, the ideas I have to improve its speed and memory usage. I would be very glad to hear from Kaldi's main developers, who probably have much more experience implementing Baum-Welch training than I do, and know many implementation tricks. Also, since I aspire to include my code in the Kaldi main branch at some point, it should comply with Kaldi's code standards.

Thank you.

[1] Mohri, Mehryar. "Semiring frameworks and algorithms for shortest-distance problems." Journal of Automata, Languages and Combinatorics 7.3 (2002): 321-350.

[2] Hannemann, Mirko, Daniel Povey, and Geoffrey Zweig. "Combining forward and backward search in decoding." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.

2015-04-01 23:30 GMT+02:00 Daniel Povey <dp...@gm...>:
>> However, I am not so sure how to implement the backward algorithm, since I
>> must traverse the edges in the FST backwards (to do the backward pass in
>> O(T * (V + E))), and OpenFst does not support this AFAIK. Also, I am not
>> sure if simply transposing the FST would work, since I would have many
>> initial states... Any suggestion on that?
>
> It should be possible to handle this by arranging your code in the right
> way, e.g. first iterate over the start-state of the arc.
>
>> > This would create a difficulty for converting alignments though (this
>> > happens when bootstrapping later systems, e.g. starting tri2 from tri1).
>> > You would probably have to just do Viterbi for that one stage.
>>
>> I am not sure what you mean. Could you extend your explanation or point
>> to a recipe where you had to overcome this difficulty?
>
> I mean the program convert-ali wouldn't work unless you had alignments.
> Dan
>
>> Many thanks for your help and advice,
>>
>> Joan Puigcerver.