From: Daniel P. <dp...@gm...> - 2013-09-03 00:07:07
Sorry, in rm/s5 it's local/run_nnet2.sh; in wsj/s5 it's local/run_nnet_cpu.sh.

Dan

On Mon, Sep 2, 2013 at 8:04 PM, Ben Jiang <be...@ne...> wrote:
> Ok, got it. Let me try 200k first.
>
> I just updated the trunk, but couldn't find run_nnet2.sh. Is it supposed to be in wsj/s5/local/?
>
> Thanks
> Ben
>
> On Mon, Sep 2, 2013 at 7:52 PM, Daniel Povey <dp...@gm...> wrote:
>> That log-prob per frame of -7.31 is too low; it should be something like -2, no lower -- maybe -3 on the 1st iteration. The size of your training data does not matter; what matters is the #samples you process per iteration. Maybe try reducing it from 400k (the default, I think) to 200k. Or use the newer example scripts, where I think that is the default (if you update the trunk and look at the example script run_nnet2.sh, you'll see what I mean).
>>
>> But definitely something is wrong here.
>>
>> Dan
>>
>> On Mon, Sep 2, 2013 at 7:47 PM, Ben Jiang <be...@ne...> wrote:
>>> The nonlinearity type should be the default in train_nnet_cpu.sh, which should be tanh. The log-prob doesn't look too bad. Below is the output from a run that actually succeeded:
>>>
>>> LOG (nnet-train-parallel:DoBackpropParallel():nnet-update-parallel.cc:179) Did backprop on 399889 examples, average log-prob per frame is -7.31817
>>>
>>> The learning rates are 0.01 initial and 0.001 final. I kind of used the values from swbd, but maybe my training data is quite a bit bigger than swbd's. I previously tried 0.001 and 0.0001, which also failed, with an error of "Cannot invert: matrix is singular", but I didn't have debug on back then, so it's probably the same issue. Maybe I should try even smaller values, such as 0.0001 and 0.00001?
>>>
>>> Ben
>>>
>>> On Mon, Sep 2, 2013 at 6:55 PM, Daniel Povey <dp...@gm...> wrote:
>>>> I think the underlying cause is instability in the training, causing the derivatives to become too large. This is something that commonly happens in neural net training, and the solution is generally to decrease the learning rate. What nonlinearity type are you using? And do the log-probs printed out in train.*.log or compute_prob_*.log get very negative?
>>>>
>>>> Unbounded nonlinearities such as ReLUs are more susceptible to this instability.
>>>>
>>>> Dan
>>>>
>>>> On Mon, Sep 2, 2013 at 6:50 PM, Ben Jiang <be...@ne...> wrote:
>>>>> I see. Thanks for the fast response, Dan.
>>>>>
>>>>> So any idea on this "random" error I am stuck with at pass 27? I have pasted the stack trace below. This error doesn't always happen, even after I removed the randomness introduced in the input mdl and shuffled egs (e.g., by saving the input mdl and shuffled egs to files and re-running the failed nnet-train-parallel from those files in the debugger). The re-run would sometimes fail and sometimes succeed.
>>>>>
>>>>> Anyway, I was able to catch the error in my debugger and examine the variables. I think the reason is that the deriv variable in NnetUpdater::Backprop() contains some "bad" value, such as 1.50931703e+20. This caused the trace of the matrix to become infinite, which in turn caused the p_trace to become 0 and fail the assert. I probably need more time to see how this value got in there, but again, since the exact re-run sometimes passes, it's kind of hard to debug.
>>>>>
>>>>> Any idea?
>>>>>
>>>>> Here's the stack trace:
>>>>> ===============================
>>>>> KALDI_ASSERT: at nnet-train-parallel:PreconditionDirectionsAlphaRescaled:nnet-precondition.cc:128, failed: p_trace != 0.0
>>>>> Stack trace is:
>>>>> kaldi::KaldiGetStackTrace()
>>>>> kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
>>>>> kaldi::nnet2::PreconditionDirectionsAlphaRescaled(kaldi::MatrixBase<float> const&, double, kaldi::MatrixBase<float>*)
>>>>> kaldi::nnet2::AffineComponentPreconditioned::Update(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&)
>>>>> kaldi::nnet2::AffineComponent::Backprop(kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, kaldi::MatrixBase<float> const&, int, kaldi::nnet2::Component*, kaldi::Matrix<float>*) const
>>>>> kaldi::nnet2::NnetUpdater::Backprop(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::Matrix<float>*)
>>>>> kaldi::nnet2::NnetUpdater::ComputeForMinibatch(std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&)
>>>>> kaldi::nnet2::DoBackprop(kaldi::nnet2::Nnet const&, std::vector<kaldi::nnet2::NnetTrainingExample, std::allocator<kaldi::nnet2::NnetTrainingExample> > const&, kaldi::nnet2::Nnet*)
>>>>> kaldi::nnet2::DoBackpropParallelClass::operator()()
>>>>> kaldi::MultiThreadable::run(void*)
>>>>>
>>>>> Ben
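Below is a minimal standalone sketch of the failure mode Ben describes: one huge derivative entry makes a single-precision trace overflow to +inf, and anything later rescaled by 1/trace collapses to exactly 0, which is what an assertion like p_trace != 0.0 catches. This is illustrative C++ only, not Kaldi's PreconditionDirectionsAlphaRescaled; the loop and the number of entries are made up around the 1.50931703e+20 value Ben reports.

// Standalone sketch of the overflow path described above (illustrative only,
// not Kaldi code).
#include <cstdio>

int main() {
  const float bad = 1.50931703e+20f;   // the "bad" derivative value Ben observed
  float trace = 0.0f;
  for (int i = 0; i < 4; i++)          // pretend these are diagonal entries of M*M^T
    trace += bad * bad;                // ~2.3e+40 overflows float (max ~3.4e+38) -> inf
  std::printf("trace   = %g\n", trace);    // prints: inf

  // If a later step rescales the matrix by 1/trace, every entry becomes
  // finite_value * 0 == 0, so the trace of the rescaled matrix is exactly 0.
  float scale = 1.0f / trace;              // 1/inf == 0
  float p_trace = (bad * scale) * (bad * scale);
  std::printf("p_trace = %g\n", p_trace);  // prints: 0 -- an assert p_trace != 0.0 would fail
  return 0;
}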
>>>>> On Mon, Sep 2, 2013 at 6:25 PM, Daniel Povey <dp...@gm...> wrote:
>>>>>> That's how it's supposed to be -- AFAIK that's basically the point of Hogwild: you allow these kinds of updates and accept the possibility that due to race conditions you will occasionally lose a bit of data. The parameters only change slightly on the timescale over which these different threads access them.
>>>>>>
>>>>>> Dan
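The following toy sketch illustrates the Hogwild-style scheme Dan is describing: several plain C++ threads apply small SGD-like updates to one shared parameter vector with no locking. It is deliberately racy and is not Kaldi's nnet2 code; the fake "gradient" and the LCG used to pick coordinates are placeholders. A race can drop an occasional update, but each update is tiny, so the shared parameters barely move on the timescale of any single race.

// Toy Hogwild-style illustration (not Kaldi code). Build with: g++ -std=c++11 -pthread
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<float> params(1000, 0.0f);   // shared, deliberately unsynchronized
  const float lr = 0.001f;

  auto worker = [&params, lr](unsigned int state) {
    for (int step = 0; step < 100000; step++) {
      state = state * 1664525u + 1013904223u;        // cheap LCG to pick a coordinate
      int i = static_cast<int>(state % params.size());
      float grad = 1.0f;                             // stand-in for a real gradient
      params[i] -= lr * grad;                        // no lock: updates may race
    }
  };

  std::vector<std::thread> threads;
  for (unsigned int t = 1; t <= 4; t++) threads.emplace_back(worker, t);
  for (auto &th : threads) th.join();

  double sum = 0.0;
  for (float p : params) sum += p;
  // With perfect synchronization this would be exactly -4 * 100000 * lr = -400;
  // updates lost to races make it slightly smaller in magnitude.
  std::printf("sum of params = %f\n", sum);
  return 0;
}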
>>>>>> On Mon, Sep 2, 2013 at 6:01 PM, Ben Jiang <be...@ne...> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> While hunting some random error from nnet-train-parallel, I noticed that nnet_to_update is shared among the threads, but there are no synchronization checks when updating the components in the threads. I still haven't gone too deep into the code yet, but should there be synchronization checks?
>>>>>>>
>>>>>>> For example, the deriv variable in NnetUpdater::Backprop() is updated and passed between the components. Could this be an issue if the components are being updated by other threads?
>>>>>>>
>>>>>>> Or am I missing something totally?
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Ben
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Kaldi-developers mailing list
>>>>>>> Kal...@li...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-developers

--
Thanks
Ben Jiang

Co-Founder/Principal/CTO
Nexiwave.com
Tel: 226-975-2172 / 617-245-0916
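As a closing illustration of Dan's earlier point about instability and learning rates, here is a generic, non-Kaldi sketch of gradient descent on a one-dimensional quadratic. With too large a step size the iterates oscillate and grow without bound -- the regime in which a real network's derivatives eventually blow up -- while a smaller learning rate gives smooth decay.

// Generic illustration (not Kaldi code): gradient descent on f(w) = 0.5*a*w^2.
// The update w -= lr * a * w is stable only if lr < 2/a; above that threshold
// |w| grows every step.
#include <cstdio>

static float run(float lr, int steps) {
  const float a = 100.0f;   // curvature; stability requires lr < 2/a = 0.02
  float w = 1.0f;
  for (int i = 0; i < steps; i++)
    w -= lr * a * w;        // gradient of 0.5*a*w^2 is a*w
  return w;
}

int main() {
  std::printf("lr=0.03  -> w after 50 steps: %g (diverges)\n", run(0.03f, 50));
  std::printf("lr=0.001 -> w after 50 steps: %g (decays)\n", run(0.001f, 50));
  return 0;
}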