From: Daniel P. <dp...@gm...> - 2013-09-02 22:55:13
I think the underlying cause is instability in the training, which makes the
derivatives too large. This commonly happens in neural net training, and the
usual solution is to decrease the learning rate. What nonlinearity type are
you using? And do the log-probs printed in train.*.log or compute_prob_*.log
get very negative? Unbounded nonlinearities such as ReLUs are more
susceptible to this instability.

Dan

On Mon, Sep 2, 2013 at 6:50 PM, Ben Jiang <be...@ne...> wrote:
> I see. Thanks for the fast response, Dan.
>
> So any idea on this "random" error I am stuck with at pass 27? I have
> pasted the stacktrace below. This error doesn't always happen, even after
> I removed the randomness introduced in the input mdl and shuffled egs
> (e.g., by saving the input mdl and shuffled egs to files and re-running the
> failed nnet-train-parallel from those files in the debugger). The re-run
> would sometimes fail and sometimes succeed.
>
> Anyway, I was able to catch the error in my debugger and examine the
> variables. I think the reason is that the deriv variable in
> NnetUpdater::Backprop() contains some "bad" value, such as 1.50931703e+20.
> This caused the trace of the matrix to become infinite, which in turn
> caused p_trace to become 0 and the assert to fail. I probably need more
> time to see how this value got in there, but since the exact re-run
> sometimes passes, it's kind of hard to debug.
>
> Any idea?
>
> Here's the stacktrace:
> ===============================
> KALDI_ASSERT: at nnet-train-parallel:PreconditionDirectionsAlphaRescaled:nnet-precondition.cc:128, failed: p_trace != 0.0
> Stack trace is:
> kaldi::KaldiGetStackTrace()
> kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
> kaldi::nnet2::PreconditionDirectionsAlphaRescaled(kaldi::MatrixBase<float> const&, double, kaldi::MatrixBase<float>*)
> kaldi::nnet2::AffineComponentPreconditioned::Update(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&)
> kaldi::nnet2::AffineComponent::Backprop(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, int, kaldi::nnet2::Component*, kaldi::Matrix<float>*) const
> kaldi::nnet2::NnetUpdater::Backprop(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::Matrix<float>*)
> kaldi::nnet2::NnetUpdater::ComputeForMinibatch(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&)
> kaldi::nnet2::DoBackprop(kaldi::nnet2::Nnet const&, std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::nnet2::Nnet*)
> kaldi::nnet2::DoBackpropParallelClass::operator()()
> kaldi::MultiThreadable::run(void*)
>
> Ben
>
>
> On Mon, Sep 2, 2013 at 6:25 PM, Daniel Povey <dp...@gm...> wrote:
>>
>> That's how it's supposed to be-- AFAIK that's basically the point of
>> Hogwild, that you allow these kinds of updates and accept the
>> possibility that due to race conditions you will occasionally lose a
>> bit of data. The parameters only change slightly on the timescales
>> on which these different threads access them.
>> Dan
>>
>>
>> On Mon, Sep 2, 2013 at 6:01 PM, Ben Jiang <be...@ne...> wrote:
>> > Hi all,
>> >
>> > While hunting a random error from nnet-train-parallel, I noticed that
>> > nnet_to_update is shared among the threads, but there are no
>> > synchronization checks when the threads update the components. I
>> > haven't gone too deep into the code yet, but should there be
>> > synchronization checks?
>> >
>> > For example, the deriv variable in NnetUpdater::Backprop() is updated
>> > and passed between the components. Could this be an issue if the
>> > components are being updated by other threads?
>> >
>> > Or am I missing something totally?
>> >
>> > --
>> > Thanks
>> > Ben
>> >
>> > _______________________________________________
>> > Kaldi-developers mailing list
>> > Kal...@li...
>> > https://lists.sourceforge.net/lists/listinfo/kaldi-developers
>
> --
> Thanks
> Ben Jiang
>
> Co-Founder/Principal/CTO
> Nexiwave.com
> Tel: 226-975-2172 / 617-245-0916
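
As a rough illustration of the failure mode discussed in this thread: a single derivative entry as large as 1.50931703e+20 is enough to make a single-precision sum of squares overflow to infinity, and anything finite divided by that infinity collapses to zero. The sketch below is a toy reproduction of that arithmetic only. It is not Kaldi's code, and SumOfSquares and ScaleByInverse are hypothetical helpers invented for this example. It assumes IEEE 754 single-precision floats and no fast-math compiler flags.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not a Kaldi function): sum of squared entries,
// accumulated in single precision, standing in for a trace of M * M^T.
float SumOfSquares(const std::vector<float> &values) {
  float sum = 0.0f;
  for (float x : values) {
    // Once x * x exceeds the largest finite float (about 3.4e+38),
    // the stored sum becomes +infinity.
    sum += x * x;
  }
  return sum;
}

// Hypothetical helper: divide every entry by denom and return the result.
// With denom == +infinity, every finite entry becomes (signed) zero.
std::vector<float> ScaleByInverse(const std::vector<float> &values,
                                  float denom) {
  std::vector<float> out;
  out.reserve(values.size());
  for (float x : values) out.push_back(x / denom);
  return out;
}

// True if the sum of squares is still a usable finite number.
bool HasFiniteSumOfSquares(const std::vector<float> &values) {
  return std::isfinite(SumOfSquares(values));
}
```

Calling SumOfSquares on a vector such as {0.01f, -0.02f, 1.50931703e+20f} should give an infinite result, and the sum of squares of ScaleByInverse(values, that_result) is then exactly zero, which is the kind of value a check like p_trace != 0.0 rejects. A test like HasFiniteSumOfSquares on the derivatives, made before the update is applied, is one possible way to catch the blow-up earlier. It does not address the instability itself, which is what lowering the learning rate is meant to do.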