I'm using currennt for a binary classification task with 15-dimensional real-valued inputs. When I use parallel_sequences, training keeps running, but optimization fails, leading to NaN weights and an infinite training error.
I've seen this behavior in a variety of networks, for instance two BLSTM layers with 100 nodes each and a feedforward_logistic output layer.
shuffle_fractions = true
shuffle_sequences = false
truncate_seq = 5000
stochastic = true
learning_rate = 1e-6
momentum = 0.9
Please let me know if you have any tips.
I've really enjoyed this toolkit. Thank you for releasing it and I hope you continue to update it.
Thank you!
parallel_sequences is likely a red herring at this point. The issue was weight initialization. Switching from Uniform[-1,1] to Normal(0, 0.1) random initialization has fixed the random convergence failures.
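For reference, the switch amounts to roughly the following options in the network config (just a sketch, using the option names that appear elsewhere in this thread, with the values I mentioned above):
weights_dist = normal
weights_normal_mean = 0
weights_normal_sigma = 0.1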
Hi Scott,
It also has to do with the parallel_sequences option: the more parallel sequences you have, the larger the derivatives w.r.t. the weights become, because these are simply summed over all timesteps in the mini-batch. Intuitively, the optimizer tends to take larger steps when the direction is based on more data. My first suggestion when experiencing convergence issues is to decrease the learning rate. The weight initialization parameters are also very important for RNNs, especially if you have long sequences. You can also try to decrease the truncate_seq value further; typically 100-200 gives more stable results and is also much faster (more parallelism).
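For example (purely an illustrative sketch, not a tuned recommendation), with the settings you posted above that would mean something like:
truncate_seq = 100
learning_rate = 1e-6
i.e. a much shorter truncation than 5000, combined with a conservative learning rate that you can lower further if training still diverges.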
Thanks,
Felix
Thanks for the explanation and intuition, glad to know they are related. I haven't experienced problems since switching to Gaussian initialization; however, CV performance degrades by about 5% relative compared to uniform initialization (which I've also noticed all the latest LSTM papers use).
I'll try my luck with the learning rate.
Thank you for your great toolkit. I have used it for several projects. However, I recently ran into the same problem on a project: the training and validation errors decrease and the system begins to converge, but after 5-7 epochs the error (of both training and validation) suddenly jumps to 100%. Here is my config information:
learning_rate = 1e-4 (I also tried 1e-6)
weights_dist = normal (I also tried the uniform distribution)
weights_normal_sigma = 0.1
weights_normal_mean = 0
stochastic = true
parallel_sequences = 50
input_noise_sigma = 0.6
shuffle_fractions = true
shuffle_sequences = false
What could possibly be the cause of this problem?
I would highly appreciate your help.