From: Xingyu Na <asr...@gm...> - 2015-06-06 05:51:30
Hi,

I trained a network with 2 LSTM layers and 2 hidden layers on top of them. The decoding performance on eval92 is reasonable. Now I want to do RBM pre-training. The straightforward way is to remove the hidden layers and use the LSTM layers as a feature transform, just as in Karel's CNN pre-training recipe. However, no matter how small the learning rate is, the first RBM does not seem to converge. The log is pasted below:

================================================
LOG (rbm-train-cd1-frmshuff:Init():nnet-randomizer.cc:31) Seeding by srand with : 777
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:138) RBM TRAINING STARTED
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:141) Iteration 1/2
LOG (rbm-train-cd1-frmshuff:PropagateFnc():nnet/nnet-lstm-projected-streams.h:303) Running nnet-forward with per-utterance LSTM-state reset
LOG (rbm-train-cd1-frmshuff:PropagateFnc():nnet/nnet-lstm-projected-streams.h:303) Running nnet-forward with per-utterance LSTM-state reset
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.5e-06 after processing 0.000277778h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 1h]: 218.955 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.45e-06 after processing 1.38889h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 2h]: 222.583 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.4e-06 after processing 2.77778h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 3h]: 220.827 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 4h]: 221.531 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.35e-06 after processing 4.16667h
.......
================================================

The Mse does not decrease. However, after 1.rbm is trained and concatenated with the LSTM (so the transform becomes LSTM+RBM), the training of 2.rbm does seem to converge:
================================================
LOG (rbm-train-cd1-frmshuff:Init():nnet-randomizer.cc:31) Seeding by srand with : 777
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:138) RBM TRAINING STARTED
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:141) Iteration 1/2
LOG (rbm-train-cd1-frmshuff:PropagateFnc():nnet/nnet-lstm-projected-streams.h:303) Running nnet-forward with per-utterance LSTM-state reset
LOG (rbm-train-cd1-frmshuff:PropagateFnc():nnet/nnet-lstm-projected-streams.h:303) Running nnet-forward with per-utterance LSTM-state reset
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.5e-06 after processing 0.000277778h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 1h]: 56.9416 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.45e-06 after processing 1.38889h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 2h]: 39.1901 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.4e-06 after processing 2.77778h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 3h]: 34.2891 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 4h]: 30.5311 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.35e-06 after processing 4.16667h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:213) ProgressLoss[last 1h of 5h]: 29.2614 (Mse)
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:235) Setting momentum 0.9 and learning rate 2.3e-06 after processing 5.55556h
.......
===============================================

I am quite confused by this. I believe that further fine-tuning of the weights based on these RBMs does not make sense. What am I missing?

Best,
Xingyu
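P.S. In case it matters, what I am running is roughly the sketch below, modelled on steps/nnet/pretrain_dbn.sh (written from memory, so treat the option names, paths, and values as placeholders rather than my exact script). lstm_transform.nnet is my LSTM nnet with the two hidden layers stripped off, prepared beforehand; "$feats" is my usual feature pipeline rspecifier.

  # train the first RBM on top of the LSTM output; the learning rate matches
  # the 2.5e-06 shown in the log above
  rbm-train-cd1-frmshuff --learn-rate=0.0000025 \
    --feature-transform=exp/pretrain/lstm_transform.nnet \
    exp/pretrain/1.rbm.init "$feats" exp/pretrain/1.rbm

  # convert the trained RBM to <AffineTransform>+<Sigmoid> and append it to
  # the transform, so the second RBM is trained on LSTM+RBM features
  rbm-convert-to-nnet exp/pretrain/1.rbm - | \
    nnet-concat exp/pretrain/lstm_transform.nnet - exp/pretrain/lstm_rbm1_transform.nnet

The second RBM is then trained the same way, with lstm_rbm1_transform.nnet passed as the --feature-transform.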