From: Ben J. <be...@ne...> - 2013-09-02 22:51:01
I see. Thanks for the fast response, Dan.

So any idea on this "random" error I am stuck with at pass 27? I have pasted the stacktrace below.

This error doesn't always happen, even after I removed the randomness introduced in the input mdl and shuffled egs (i.e., I saved the input mdl and shuffled egs to files and re-ran the failed nnet-train-parallel from those files in a debugger). The re-run would sometimes fail and sometimes succeed.

Anyway, I was able to catch the error in my debugger and examine the variables. I think the reason is that the deriv variable in NnetUpdater::Backprop() contains some "bad" value, such as 1.50931703e+20. This caused the trace of the matrix to become infinite, which in turn caused p_trace to become 0 and failed the assert.

I probably need more time to see how this value got in there, but again, since the exact re-run would pass sometimes, it's kind of hard to debug.

Any idea?

Here's the stacktrace:
===============================
KALDI_ASSERT: at nnet-train-parallel:PreconditionDirectionsAlphaRescaled:nnet-precondition.cc:128, failed: p_trace != 0.0
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::nnet2::PreconditionDirectionsAlphaRescaled(kaldi::MatrixBase<float> const&, double, kaldi::MatrixBase<float>*)
kaldi::nnet2::AffineComponentPreconditioned::Update(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&)
kaldi::nnet2::AffineComponent::Backprop(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, int, kaldi::nnet2::Component*, kaldi::Matrix<float>*) const
kaldi::nnet2::NnetUpdater::Backprop(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::Matrix<float>*)
kaldi::nnet2::NnetUpdater::ComputeForMinibatch(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&)
kaldi::nnet2::DoBackprop(kaldi::nnet2::Nnet const&, std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::nnet2::Nnet*)
kaldi::nnet2::DoBackpropParallelClass::operator()()
kaldi::MultiThreadable::run(void*)

Ben

On Mon, Sep 2, 2013 at 6:25 PM, Daniel Povey <dp...@gm...> wrote:

> That's how it's supposed to be-- AFAIK that's basically the point of
> Hogwild, that you allow these kinds of updates and accept the
> possibility that due to race conditions you will occasionally lose a
> bit of data. The parameters only change slightly on the timescales
> that these different threads access them.
> Dan
>
>
> On Mon, Sep 2, 2013 at 6:01 PM, Ben Jiang <be...@ne...> wrote:
> > Hi all,
> >
> > While hunting some random error from nnet-train-parallel, I noticed the
> > nnet_to_update is shared among the threads, but there are no synchronization
> > checks when updating the components in the threads. I still haven't gone
> > too deep in the code yet, but should there be synchronization checks?
> >
> > For example, the deriv variable in NnetUpdater::Backprop() is updated and
> > passed between the components. Could this be an issue if the components are
> > being updated by other threads?
> >
> >
> > Or am I missing something totally?
> >
> >
> > --
> > Thanks
> > Ben
> >
> > _______________________________________________
> > Kaldi-developers mailing list
> > Kal...@li...
> > https://lists.sourceforge.net/lists/listinfo/kaldi-developers
> >
>

--
Thanks
Ben Jiang

Co-Founder/Principal/CTO
Nexiwave.com
Tel: 226-975-2172 / 617-245-0916
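
A note on the numbers in the message above: a single value around 1.5e+20 is enough to make a single-precision sum of squares overflow, because its square (about 2.3e+40) is larger than the largest finite float (about 3.4e+38). The sketch below is a minimal, standalone C++ illustration of that effect and of a simple scan for suspect values. It is not Kaldi code: SumSquaresFloat and FirstSuspectIndex are hypothetical helpers, the matrix is assumed to be stored as a flat vector, and the threshold passed as limit is an arbitrary choice. Whether nnet-precondition.cc reaches p_trace == 0 through exactly this route is not verified here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Kaldi): sum of squared elements,
// accumulated in single precision. For a matrix M stored flat, this
// equals trace(M * M^T).
float SumSquaresFloat(const std::vector<float> &m) {
  float sum = 0.0f;
  for (float v : m) {
    sum += v * v;  // overflows to +inf once v * v exceeds FLT_MAX
  }
  return sum;
}

// Hypothetical helper: returns the index of the first element that is
// NaN, infinite, or larger in magnitude than `limit`; returns -1 if
// every element looks sane. `limit` is an assumed, arbitrary threshold.
std::ptrdiff_t FirstSuspectIndex(const std::vector<float> &m, float limit) {
  assert(limit > 0.0f);
  for (std::size_t i = 0; i < m.size(); ++i) {
    if (!std::isfinite(m[i]) || std::fabs(m[i]) > limit) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  return -1;
}
```

Calling FirstSuspectIndex on the derivative matrix right before the preconditioning step, in a debug build, would flag the minibatch on which the large value first appears. Note also that any later scaling by the reciprocal of an infinite trace gives exactly 0 under IEEE arithmetic, which is one plausible way an infinite trace could end up as a zero downstream.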