From: dophist <do...@gm...> - 2015-01-22 16:11:40
Hi Daniel & Kaldi developers,

I saw there was a thread about "WER of LSTM & DNN" on the Kaldi SourceForge forum; I'm the author of the LSTM code. This morning the thread creator, Alim, emailed me asking if I'd like to share my LSTM implementation with the Kaldi community, and my answer is of course a definite "yes".

The code is on GitHub: https://github.com/dophist/kaldi-lstm

1. The implementation is under Karel's nnet1 framework. The whole LSTM architecture is condensed into a single configurable component. So regarding the forum thread where Daniel asked about the "external tool" Alim used: it's actually "internal", and all Kaldi users will find it easy to compile and use.

2. There are two versions of my implementation, "standard" and "google". The "standard" version can be seen as a general-purpose LSTM tool with epoch-wise BPTT; you can even adapt it to train an LSTM-LM if you want, but currently I use it only for sequence training and for decoding (nnet-forward). The "google" version is primarily used for cross-entropy training in my experiments. There are docs in my GitHub repo with detailed descriptions.

3. Testing. The code has been tested on an industry-scale speech corpus of 4000+ hours that is not publicly available; my experiments reproduced Google's results, and their conclusions are solid. In the last few months I have received feedback from the Siri group, the Cambridge lab, and many others, and I suppose they have already obtained similar results.

4. Legal stuff. Although I'm now working at Baidu, the coding was done in my personal spare time, so I have the freedom to open-source it under Kaldi's license.

Known issues:

1. Gradient explosion. Gradient explosion is far from solved in RNN training. Gradient clipping seems to be the best practice in my experience; it is implemented in the "standard" version, but tuning the clipping threshold can be painful across different tasks (a minimal sketch of the clipping idea follows after this message). The "google" version is less likely to explode because it limits the BPTT expansion to 20 steps, but explosion still occurs in certain cases.

2. Training speed. Training LSTMs is slow, especially since most institutions don't have huge infrastructure like Google's DistBelief. My current implementation is based on nnet1, so it only uses one GPU (or the CPU), and training might take months to converge on an industrial-size dataset. Multiple GPU cards in a single host won't scale as datasets keep getting larger, and parallelizing SGD on a GPU cluster is still an open issue: most GPU-cluster solutions I know of require an InfiniBand network. Yann LeCun's group's EA-SGD seems most promising to me, but I don't have time to try it. Daniel's nnet2 averaging strategy could be another promising option (also sketched below), but I can't be sure it will work for LSTMs.

These remaining issues (particularly training speedup) might require great effort to solve, and I'm not sure I have enough time to do it. At least I hope my LSTM implementation can be a quick starting point towards RNN acoustic modeling for the Kaldi community.

If anyone has questions about the code, feel free to email me: jer...@gm...

And since the Chinese government occasionally blocks Gmail, my backup email address is: jer...@qq...

Best,
Jerry (Jiayu DU)
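For readers who haven't met it before, the gradient clipping mentioned in known issue 1 simply bounds the gradient before the parameter update. The sketch below is not the kaldi-lstm code; it is a minimal, self-contained C++ illustration, and the threshold values and function names are made up for the example. It shows the two common variants: rescaling by the global norm and element-wise clamping.

    // Minimal illustration of gradient clipping (not the actual kaldi-lstm code).
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    // Variant 1: rescale the whole gradient so its L2 norm never exceeds max_norm.
    void ClipByGlobalNorm(std::vector<double> *grad, double max_norm) {
      double sum_sq = 0.0;
      for (double g : *grad) sum_sq += g * g;
      double norm = std::sqrt(sum_sq);
      if (norm > max_norm) {
        double scale = max_norm / norm;
        for (double &g : *grad) g *= scale;
      }
    }

    // Variant 2: clamp each element into [-limit, +limit] independently.
    void ClipElementwise(std::vector<double> *grad, double limit) {
      for (double &g : *grad)
        g = std::max(-limit, std::min(limit, g));
    }

    int main() {
      std::vector<double> grad = {30.0, -40.0};  // an "exploded" gradient, norm = 50
      ClipByGlobalNorm(&grad, 5.0);              // rescales to {3, -4}, norm = 5
      std::cout << grad[0] << " " << grad[1] << std::endl;

      std::vector<double> grad2 = {30.0, -40.0};
      ClipElementwise(&grad2, 5.0);              // clamps to {5, -5}
      std::cout << grad2[0] << " " << grad2[1] << std::endl;
      return 0;
    }

As Jerry notes, the painful part is choosing the threshold (5.0 above is arbitrary): too small and learning slows down, too large and the occasional exploding batch still derails training.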
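The nnet2 "averaging strategy" Jerry refers to runs several SGD jobs in parallel on different data shards and periodically averages their parameters to form the next model. The sketch below is only a toy illustration of that averaging step in plain C++ (it is not Kaldi's nnet2 code, and the shapes and names are invented); whether this works as well for LSTMs as for feed-forward DNNs is exactly the open question raised above.

    // Toy illustration of nnet2-style periodic model averaging (not Kaldi code).
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Each "model" here is just a flat parameter vector; in practice it would be
    // the concatenation of all weight matrices and biases of the network.
    using Model = std::vector<double>;

    // Average the parameters produced by N parallel SGD jobs after one outer
    // iteration; the result becomes the starting point for the next iteration.
    Model AverageModels(const std::vector<Model> &jobs) {
      Model avg(jobs[0].size(), 0.0);
      for (const Model &m : jobs)
        for (std::size_t i = 0; i < m.size(); ++i)
          avg[i] += m[i];
      for (double &v : avg)
        v /= static_cast<double>(jobs.size());
      return avg;
    }

    int main() {
      // Two workers ended the iteration with slightly different parameters.
      std::vector<Model> jobs = {{1.0, 2.0, 3.0}, {3.0, 4.0, 5.0}};
      Model next = AverageModels(jobs);
      for (double v : next) std::cout << v << " ";  // prints 2 3 4
      std::cout << std::endl;
      return 0;
    }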
From: Daniel P. <dp...@gm...> - 2015-01-22 17:59:18
Karel, since Jerry is offering that we can use his nnet1 LSTM code in Kaldi, how do you feel about doing a code review on it right now, since it's part of the nnet1 framework? If you don't have time right now, though, I could find someone else.

Dan
From: Vesely K. <ve...@gm...> - 2015-01-22 20:01:42
Hi Jerry,

Yes, yes, that's a great idea. I'll happily look at it, and thanks for the detailed description!

Karel