From: Daniel P. <dp...@gm...> - 2013-09-03 00:07:07
Sorry, in rm/s5 it's local/run_nnet2.sh; in wsj/s5 it's local/run_nnet_cpu.sh.

Dan

On Mon, Sep 2, 2013 at 8:04 PM, Ben Jiang <be...@ne...> wrote:
> Ok, got it. Let me try 200k first.
>
> I just updated the trunk, but couldn't find run_nnet2.sh. Is it supposed to be in wsj/s5/local/?
>
> Thanks
> Ben
>
> On Mon, Sep 2, 2013 at 7:52 PM, Daniel Povey <dp...@gm...> wrote:
>> That log-prob per frame of -7.31 is too low; it should be something like -2, no lower -- maybe -3 on the 1st iteration. The size of your training data does not matter; what matters is the #samples you process per iteration. Maybe try reducing it from 400k (the default, I think) to 200k. Or use the newer example scripts, where I think that is the default (if you update the trunk and look at the example script run_nnet2.sh, you'll see what I mean).
>>
>> But definitely something is wrong here.
>>
>> Dan
>>
>> On Mon, Sep 2, 2013 at 7:47 PM, Ben Jiang <be...@ne...> wrote:
>>> The nonlinearity type should be the default in train_nnet_cpu.sh, which should be tanh. The log-prob doesn't look too bad. Below is the output from a run that actually succeeded:
>>>
>>> LOG (nnet-train-parallel:DoBackpropParallel():nnet-update-parallel.cc:179) Did backprop on 399889 examples, average log-prob per frame is -7.31817
>>>
>>> The learning rates are 0.01 initial and 0.001 final. I kind of used the values from swbd, but maybe my training data is quite a bit bigger than swbd's. I previously tried 0.001 and 0.0001, which also failed, with an error of "Cannot invert: matrix is singular", but I didn't have debug on back then, so it's probably the same issue. Maybe I should try even smaller values, such as 0.0001 and 0.00001?
>>>
>>> Ben
>>>
>>> On Mon, Sep 2, 2013 at 6:55 PM, Daniel Povey <dp...@gm...> wrote:
>>>> I think the underlying cause is instability in the training, causing the derivatives to become too large. This is something that commonly happens in neural net training, and the solution is generally to decrease the learning rate. What nonlinearity type are you using? And do the log-probs printed out in train.*.log or compute_prob_*.log get very negative?
>>>>
>>>> Unbounded nonlinearities such as ReLUs are more susceptible to this instability.
>>>>
>>>> Dan
>>>>
>>>> On Mon, Sep 2, 2013 at 6:50 PM, Ben Jiang <be...@ne...> wrote:
>>>>> I see. Thanks for the fast response, Dan.
>>>>>
>>>>> So any idea on this "random" error I am stuck with at pass 27? I have pasted the stack trace below. This error doesn't always happen, even after I removed the randomness introduced in the input mdl and shuffled egs (e.g., by saving the input mdl and shuffled egs to files and re-running the failed nnet-train-parallel from those files in the debugger). The re-run would sometimes fail and sometimes succeed.
>>>>>
>>>>> Anyway, I was able to catch the error in my debugger and examine the variables. I think the reason is that the deriv variable in NnetUpdater::Backprop() contains some "bad" value, such as 1.50931703e+20. This caused the trace of the matrix to become infinite, which in turn caused the p_trace to become 0 and fail the assert. I probably need more time to see how this value got in there, but again, since the exact re-run sometimes passes, it's kind of hard to debug.
>>>>>
>>>>> Any idea?
>>>>>
>>>>> Here's the stack trace:
>>>>> ===============================
>>>>> KALDI_ASSERT: at nnet-train-parallel:PreconditionDirectionsAlphaRescaled:nnet-precondition.cc:128, failed: p_trace != 0.0
>>>>> Stack trace is:
>>>>> kaldi::KaldiGetStackTrace()
>>>>> kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
>>>>> kaldi::nnet2::PreconditionDirectionsAlphaRescaled(kaldi::MatrixBase<float> const&, double, kaldi::MatrixBase<float>*)
>>>>> kaldi::nnet2::AffineComponentPreconditioned::Update(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&)
>>>>> kaldi::nnet2::AffineComponent::Backprop(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, int, kaldi::nnet2::Component*, kaldi::Matrix<float>*) const
>>>>> kaldi::nnet2::NnetUpdater::Backprop(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::Matrix<float>*)
>>>>> kaldi::nnet2::NnetUpdater::ComputeForMinibatch(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&)
>>>>> kaldi::nnet2::DoBackprop(kaldi::nnet2::Nnet const&, std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::nnet2::Nnet*)
>>>>> kaldi::nnet2::DoBackpropParallelClass::operator()()
>>>>> kaldi::MultiThreadable::run(void*)
>>>>>
>>>>> Ben
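Below is a minimal standalone sketch of the failure mode Ben describes: one huge derivative entry makes a single-precision trace overflow to +inf, and anything later rescaled by 1/trace collapses to exactly 0, which is what an assertion like p_trace != 0.0 catches. This is illustrative C++ only, not Kaldi's PreconditionDirectionsAlphaRescaled; the loop and the number of entries are made up around the 1.50931703e+20 value Ben reports.

// Standalone sketch of the overflow path described above (illustrative only,
// not Kaldi code).
#include <cstdio>

int main() {
  const float bad = 1.50931703e+20f;   // the "bad" derivative value Ben observed
  float trace = 0.0f;
  for (int i = 0; i < 4; i++)          // pretend these are diagonal entries of M*M^T
    trace += bad * bad;                // ~2.3e+40 overflows float (max ~3.4e+38) -> inf
  std::printf("trace   = %g\n", trace);    // prints: inf

  // If a later step rescales the matrix by 1/trace, every entry becomes
  // finite_value * 0 == 0, so the trace of the rescaled matrix is exactly 0.
  float scale = 1.0f / trace;              // 1/inf == 0
  float p_trace = (bad * scale) * (bad * scale);
  std::printf("p_trace = %g\n", p_trace);  // prints: 0 -- an assert p_trace != 0.0 would fail
  return 0;
}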
>>>>> On Mon, Sep 2, 2013 at 6:25 PM, Daniel Povey <dp...@gm...> wrote:
>>>>>> That's how it's supposed to be -- AFAIK that's basically the point of Hogwild: you allow these kinds of updates and accept the possibility that due to race conditions you will occasionally lose a bit of data. The parameters only change slightly on the timescale over which these different threads access them.
>>>>>>
>>>>>> Dan
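The following toy sketch illustrates the Hogwild-style scheme Dan is describing: several plain C++ threads apply small SGD-like updates to one shared parameter vector with no locking. It is deliberately racy and is not Kaldi's nnet2 code; the fake "gradient" and the LCG used to pick coordinates are placeholders. A race can drop an occasional update, but each update is tiny, so the shared parameters barely move on the timescale of any single race.

// Toy Hogwild-style illustration (not Kaldi code). Build with: g++ -std=c++11 -pthread
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<float> params(1000, 0.0f);   // shared, deliberately unsynchronized
  const float lr = 0.001f;

  auto worker = [&params, lr](unsigned int state) {
    for (int step = 0; step < 100000; step++) {
      state = state * 1664525u + 1013904223u;        // cheap LCG to pick a coordinate
      int i = static_cast<int>(state % params.size());
      float grad = 1.0f;                             // stand-in for a real gradient
      params[i] -= lr * grad;                        // no lock: updates may race
    }
  };

  std::vector<std::thread> threads;
  for (unsigned int t = 1; t <= 4; t++) threads.emplace_back(worker, t);
  for (auto &th : threads) th.join();

  double sum = 0.0;
  for (float p : params) sum += p;
  // With perfect synchronization this would be exactly -4 * 100000 * lr = -400;
  // updates lost to races make it slightly smaller in magnitude.
  std::printf("sum of params = %f\n", sum);
  return 0;
}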
>>>>>> On Mon, Sep 2, 2013 at 6:01 PM, Ben Jiang <be...@ne...> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> While hunting some random error from nnet-train-parallel, I noticed that nnet_to_update is shared among the threads, but there are no synchronization checks when updating the components in the threads. I still haven't gone too deep into the code yet, but should there be synchronization checks?
>>>>>>>
>>>>>>> For example, the deriv variable in NnetUpdater::Backprop() is updated and passed between the components. Could this be an issue if the components are being updated by other threads?
>>>>>>>
>>>>>>> Or am I missing something totally?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Ben
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Kaldi-developers mailing list
>>>>>>> Kal...@li...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-developers

--
Thanks
Ben Jiang

Co-Founder/Principal/CTO
Nexiwave.com
Tel: 226-975-2172 / 617-245-0916
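As a closing illustration of Dan's earlier point about instability and learning rates, here is a generic, non-Kaldi sketch of gradient descent on a one-dimensional quadratic. With too large a step size the iterates oscillate and grow without bound -- the regime in which a real network's derivatives eventually blow up -- while a smaller learning rate gives smooth decay.

// Generic illustration (not Kaldi code): gradient descent on f(w) = 0.5*a*w^2.
// The update w -= lr * a * w is stable only if lr < 2/a; above that threshold
// |w| grows every step.
#include <cstdio>

static float run(float lr, int steps) {
  const float a = 100.0f;   // curvature; stability requires lr < 2/a = 0.02
  float w = 1.0f;
  for (int i = 0; i < steps; i++)
    w -= lr * a * w;        // gradient of 0.5*a*w^2 is a*w
  return w;
}

int main() {
  std::printf("lr=0.03  -> w after 50 steps: %g (diverges)\n", run(0.03f, 50));
  std::printf("lr=0.001 -> w after 50 steps: %g (decays)\n", run(0.001f, 50));
  return 0;
}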