Re: [Kaldi-developers] dnn-cpu train issue


Ok, got it.  Let me try 200k first.

I just updated the trunk, but couldn't find run_nnet2.sh.  Is it supposed
to be in wsj/s5/local/?

Thanks
Ben

On Mon, Sep 2, 2013 at 7:52 PM, Daniel Povey <dp...@gm...> wrote:

> That log-prob per frame of -7.31 is too low, it should be something
> like -2, no lower-- maybe -3 on the 1st iteration.  The size of your
> training data does not matter, what matters is the #samples you
> process per iteration.  Maybe try reducing it from 400k (the default,
> I think) to 200k.  Or use the newer example scripts where I think that
> is the default.  (if you update the trunk and look at the example
> scripts run_nnet2.sh, you'll see what I mean).
>
> But definitely something is wrong here.
>
> Dan
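
To put the log-prob figures quoted above in perspective: a per-frame log-prob is the mean log-probability assigned to the correct class, so exponentiating it gives a geometric-mean probability. The following is only illustrative arithmetic, assuming natural logarithms; the function names are hypothetical helpers, not Kaldi APIs.

```python
import math


def avg_log_prob_per_frame(correct_class_probs):
    """Mean natural-log probability of the correct class, one entry per frame."""
    return sum(math.log(p) for p in correct_class_probs) / len(correct_class_probs)


def equivalent_uniform_classes(avg_log_prob):
    """Number of classes a uniform guesser would need to score the same average."""
    return math.exp(-avg_log_prob)


reported = -7.31817  # value from the log line quoted in this thread
suggested = -2.0     # rough level suggested in the reply above

for name, value in (("reported", reported), ("suggested", suggested)):
    print(name,
          "geometric-mean probability:", math.exp(value),
          "equivalent uniform classes:", equivalent_uniform_classes(value))
```

Under that assumption, -7.3 corresponds to guessing roughly uniformly among about 1,500 classes, whereas -2 corresponds to fewer than 10, which is why the reported value points to a model that has learned very little.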
>
>
> On Mon, Sep 2, 2013 at 7:47 PM, Ben Jiang <be...@ne...> wrote:
> > The nonlinearity type should be the default in train_nnet_cpu.sh, which
> > should be tanh. The log-prob doesn't look too bad. Below is the output from
> > a run that actually succeeded:
> > LOG (nnet-train-parallel:DoBackpropParallel():nnet-update-parallel.cc:179)
> > Did backprop on 399889 examples, average log-prob per frame is -7.31817
> >
> > The learning rates are 0.01 initial and 0.001 final. I more or less took the
> > values from swbd, but maybe my training data is quite a bit bigger than swbd.
> > I previously tried 0.001 and 0.0001, which also failed with the error
> > "Cannot invert: matrix is singular", but I didn't have debug on back then,
> > so it's probably the same issue.  Maybe I should try even smaller, such as
> > 0.0001 and 0.00001?
> >
> >
> > Ben
> >
> >
> >
> > On Mon, Sep 2, 2013 at 6:55 PM, Daniel Povey <dp...@gm...> wrote:
> >>
> >> I think the underlying cause is instability in the training, causing
> >> the derivatives to become too large.  This is something that commonly
> >> happens in neural net training, and the solution is generally to
> >> decrease the learning rate.  What nonlinearity type are you using?
> >> And do the log-probs printed out in train.*.log or compute_prob_*.log
> >> get very negative?
> >>
> >> Unbounded nonlinearities such as ReLUs are more susceptible to this
> >> instability.
> >> Dan
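
The link between learning rate and instability described above can be seen on a toy problem. This is a minimal sketch, not the nnet2 training code: it runs plain gradient descent on a one-dimensional quadratic, and the curvature value of 250 is an arbitrary assumption chosen so that the two learning rates mentioned in this thread fall on opposite sides of the stability limit.

```python
import math


def run_gradient_descent(learning_rate, curvature=250.0, steps=50, w0=1.0):
    """Minimise 0.5 * curvature * w**2 and return the final w.

    Each step multiplies w by (1 - learning_rate * curvature), so the
    iteration shrinks w only when that factor has magnitude below 1.
    """
    w = w0
    for _ in range(steps):
        grad = curvature * w
        w -= learning_rate * grad
    return w


def relu(x):
    """Unbounded above, unlike tanh, so large inputs stay large."""
    return max(0.0, x)


print("lr=0.01 :", run_gradient_descent(0.01))    # factor -1.5: grows every step
print("lr=0.001:", run_gradient_descent(0.001))   # factor 0.75: shrinks every step
print("tanh(50) =", math.tanh(50.0), " relu(50) =", relu(50.0))
```

In a real network the effective curvature is unknown and changes during training, so the practical remedy is the one given above: lower the learning rate until the log-probs stop blowing up.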
> >>
> >>
> >> On Mon, Sep 2, 2013 at 6:50 PM, Ben Jiang <be...@ne...> wrote:
> >> > I see.  Thanks for the fast response, Dan.
> >> >
> >> > So any idea on this "random" error I am stuck with at pass 27?  I have
> >> > pasted the stacktrace below.   This error doesn't always happen, even
> >> > after I removed the randomness introduced in the input mdl and shuffled
> >> > egs (e.g., by saving the input mdl and shuffled egs to files and re-running
> >> > the failed nnet-train-parallel from those files in the debugger).  The
> >> > re-run would sometimes fail and sometimes succeed.
> >> >
> >> > Anyway, I was able to catch the error in my debugger and examine the
> >> > variables.  I think the reason is that the deriv variable in
> >> > NnetUpdater::Backprop() contains some "bad" value, such as 1.50931703e+20.
> >> > This caused the trace of the matrix to become infinite, which in turn
> >> > caused p_trace to become 0 and fail the assert.  I probably need more
> >> > time to see how this value got in there, but again, since the exact
> >> > re-run would pass sometimes, it's kind of hard to debug.
> >> >
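
The arithmetic behind that observation can be reproduced in isolation. The sketch below is not the Kaldi code path, and the thread does not show how p_trace is actually computed; it only illustrates, under the assumption that a sum of squares is accumulated in single precision, how one value of that magnitude overflows to infinity and how a quantity divided by that infinity collapses to exactly zero. The helper names are hypothetical.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the float32 range; single-precision
        # arithmetic would produce +/-inf in that case.
        return math.copysign(math.inf, x)


def sum_of_squares_float32(values):
    """Accumulate sum(v * v) with every intermediate rounded to float32."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v * v))
    return total


derivs = [0.01, -0.02, 1.50931703e+20]   # last entry is the value seen in the debugger
trace = sum_of_squares_float32(derivs)

print(trace)                    # inf
print(to_float32(1.0 / trace))  # 0.0
```

The squared value is about 2.3e40, well past the single-precision maximum of roughly 3.4e38, so a single bad derivative is enough to trip a check of the form `x != 0.0` further down the line.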
> >> > Any idea?
> >> >
> >> > Here's the stacktrace:
> >> > ===============================
> >> > KALDI_ASSERT: at nnet-train-parallel:PreconditionDirectionsAlphaRescaled:nnet-precondition.cc:128, failed: p_trace != 0.0
> >> > Stack trace is:
> >> > kaldi::KaldiGetStackTrace()
> >> > kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
> >> > kaldi::nnet2::PreconditionDirectionsAlphaRescaled(kaldi::MatrixBase<float> const&, double, kaldi::MatrixBase<float>*)
> >> > kaldi::nnet2::AffineComponentPreconditioned::Update(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&)
> >> > kaldi::nnet2::AffineComponent::Backprop(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, int, kaldi::nnet2::Component*, kaldi::Matrix<float>*) const
> >> > kaldi::nnet2::NnetUpdater::Backprop(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::Matrix<float>*)
> >> > kaldi::nnet2::NnetUpdater::ComputeForMinibatch(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&)
> >> > kaldi::nnet2::DoBackprop(kaldi::nnet2::Nnet const&, std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::nnet2::Nnet*)
> >> > kaldi::nnet2::DoBackpropParallelClass::operator()()
> >> > kaldi::MultiThreadable::run(void*)
> >> >
> >> > Ben
> >> >
> >> >
> >> > On Mon, Sep 2, 2013 at 6:25 PM, Daniel Povey <dp...@gm...> wrote:
> >> >>
> >> >> That's how it's supposed to be-- AFAIK that's basically the point of
> >> >> Hogwild, that you allow these kinds of updates and accept the
> >> >> possibility that due to race conditions you will occasionally lose a
> >> >> bit of data.  The parameters only change slightly on the timescales
> >> >> that these different threads access them.
> >> >> Dan
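
The claim that unsynchronised updates are tolerable because parameters change only slightly between accesses can be illustrated without real threads. The following is a deterministic, single-threaded simulation and not Kaldi's implementation: each hypothetical worker computes its gradient from a copy of the parameter that is a few updates out of date, then applies it to the shared value. The worker count, step count and learning rate are arbitrary assumptions.

```python
def simulate_stale_updates(num_workers=4, steps=400, learning_rate=0.05,
                           w0=1.0, target=0.0):
    """Round-robin workers minimising 0.5 * (w - target)**2 with stale reads.

    Each worker uses the snapshot it took after its own previous update, so
    its gradient ignores the updates other workers made in between.
    """
    w = w0
    snapshots = [w0] * num_workers
    for step in range(steps):
        worker = step % num_workers
        grad = snapshots[worker] - target   # gradient at the stale copy
        w -= learning_rate * grad           # applied to the shared parameter
        snapshots[worker] = w               # worker refreshes its view
    return w


def simulate_synchronised(steps=400, learning_rate=0.05, w0=1.0, target=0.0):
    """Same problem, but every gradient uses the current parameter value."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * (w - target)
    return w


print("stale reads :", simulate_stale_updates())
print("synchronised:", simulate_synchronised())
```

Both runs end very close to the target because each individual step is small relative to the distance travelled. With a much larger learning rate the stale version becomes unstable before the synchronised one does, which is one reason lock-free training is sensitive to the step size.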
> >> >>
> >> >>
> >> >> On Mon, Sep 2, 2013 at 6:01 PM, Ben Jiang <be...@ne...> wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > While hunting some random error from nnet-train-parallel, I noticed
> >> >> > that nnet_to_update is shared among the threads, but there are no
> >> >> > synchronization checks when the threads update the components.  I
> >> >> > still haven't gone too deep into the code yet, but should there be
> >> >> > synchronization checks?
> >> >> >
> >> >> > For example, the deriv variable in NnetUpdater::Backprop() is updated
> >> >> > and passed between the components.  Could this be an issue if the
> >> >> > components are being updated by other threads?
> >> >> >
> >> >> >
> >> >> > Or am I missing something totally?
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Thanks
> >> >> > Ben
> >> >> >
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > --
> >> > Thanks
> >> > Ben Jiang
> >> >
> >> > Co-Founder/Principal/CTO
> >> > Nexiwave.com
> >> > Tel: 226-975-2172 / 617-245-0916
> >> > "Confidential & Privileged: This email message is for the sole use of
> >> > the intended recipient(s) and may contain confidential and privileged
> >> > information. Any unauthorized review, use, disclosure or distribution
> >> > is prohibited. if you are not the intended recipient, please contact
> >> > the sender by reply email and destroy all copies of the original
> >> > message.”
>
