From: Dong-Hyun K. <daw...@gm...> - 2014-10-30 07:37:40
Hi kaldi-developers, my name is Dong-Hyun Kim and I have a problem using Kaldi. My system is a 10-node cluster with four GTX 760 cards per node, so I run 40 GPU jobs with 40 egs archives. When I run "nnet-train-simple", I get a shrink.log like the one below:

----------------------------------------------------------------------
nnet-subset-egs --n=2000 --randomize-order=true --srand=50 ark:data_work/data_FB40_base/train_141002/nnet-5block/egs/train_diagnostic.egs ark:-
nnet-combine-fast --num-threads=1 --verbose=3 --minibatch-size=2000 data_work/data_FB40_base/train_141002/nnet-5block/51.mdl ark:- data_work/data_FB40_base/train_141002/nnet-5block/51.mdl
LOG (nnet-combine-fast:IsComputeExclusive():cu-device.cc:209) CUDA setup operating under Compute Exclusive Mode.
LOG (nnet-combine-fast:FinalizeActiveGpu():cu-device.cc:174) The active GPU is [0]: GeForce GTX 760 free:1994M, used:53M, total:2047M, free/total:0.974084 version 3.0
LOG (nnet-combine-fast:PrintMemoryUsage():cu-device.cc:314) Memory used: 0 bytes.
LOG (nnet-subset-egs:main():nnet-subset-egs.cc:88) Selected a subset of 2000 out of 40000 neural-network training examples
LOG (nnet-combine-fast:main():nnet-combine-fast.cc:107) Read 2000 examples from the validation set.
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 0 for this minibatch is 70.0758
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 1 for this minibatch is 70.0758
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 2 for this minibatch is 0.0614423
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 3 for this minibatch is 4.40091
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 4 for this minibatch is 0.630933
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 5 for this minibatch is inf
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 6 for this minibatch is 0.692641
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 7 for this minibatch is inf
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 8 for this minibatch is 0.760484
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 9 for this minibatch is 5.29073
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 10 for this minibatch is 0.756328
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 11 for this minibatch is 3.84917
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 12 for this minibatch is 0.704473
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 13 for this minibatch is 9.91905
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 14 for this minibatch is 0.766127
VLOG[3] (nnet-combine-fast:Propagate():nnet-update.cc:82) Stddev of data for component 15 for this minibatch is 10.4979
LOG (nnet-combine-fast:GetInitialModel():combine-nnet-fast.cc:402) Objective functions for the source neural nets are [ -1.4428 ]
----------------------------------------------------------------------

Note that the stddev of the data is inf for components 5 and 7. Training then stops with the following message:
----------------------------------------------------------------------
nnet-shuffle-egs --buffer-size=5000 --srand=144 ark:data_work/data_FB40_comEnv2/train_comEnv2/nnet-5block/egs/egs.26.42.ark ark:-
LOG (main():nnet-train-simple.cc:62) nnet-train-simple --minibatch-size=512 --srand=144 data_work/data_FB40_comEnv2/train_comEnv2/nnet-5block/144.mdl ark:- data_work/data_FB40_comEnv2/train_comEnv2/nnet-5block/145.26.mdl
LOG (nnet-train-simple:main():nnet-train-simple.cc:72) !!Cuda!!: CuDevice::Instantiate().SelectGpuId(use_gpu);
LOG (nnet-train-simple:IsComputeExclusive():cu-device.cc:209) CUDA setup operating under Compute Exclusive Mode.
LOG (nnet-train-simple:FinalizeActiveGpu():cu-device.cc:174) The active GPU is [3]: GeForce GTX 760 free:1993M, used:53M, total:2047M, free/total:0.973956 version 3.0
LOG (nnet-train-simple:PrintMemoryUsage():cu-device.cc:314) Memory used: 0 bytes.
LOG (nnet-train-simple:BeginNewPhase():train-nnet.cc:59) Training objective function (this phase) is -1.94988 over 25600 frames.
KALDI_ASSERT: at nnet-train-simple:PreconditionDirectionsAlphaRescaled:nnet-precondition.cc:160, failed: p_trace != 0.0
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::nnet2::PreconditionDirectionsAlphaRescaled(kaldi::CuMatrixBase<float> const&, double, kaldi::CuMatrixBase<float>*)
kaldi::nnet2::BlockAffineComponentPreconditioned::Update(kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)
kaldi::nnet2::BlockAffineComponent::Backprop(kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, int, kaldi::nnet2::Component*, kaldi::CuMatrix<float>*) const
.
.
kaldi::nnet2::NnetSimpleTrainer::TrainOneMinibatch()
kaldi::nnet2::NnetSimpleTrainer::TrainOnExample(kaldi::nnet2::NnetExample const&)
nnet-train-simple(main+0x905) [0x57d549]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x386ba1ed1d]
nnet-train-simple() [0x57cb89]
bash: line 1: 30731 Broken pipe             nnet-shuffle-egs --buffer-size=5000 --srand=144 ark:data_work/data_FB40_comEnv2/train_comEnv2/nnet-5block/egs/egs.26.42.ark ark:-
                    30733 Aborted           (core dumped) | nnet-train-simple --minibatch-size=512 --srand=144 data_work/data_FB40_comEnv2/train_comEnv2/nnet-5block/144.mdl ark:- data_work/data_FB40_comEnv2/train_comEnv2/nnet-5block/145.26.mdl
# Accounting: time=37 threads=1
----------------------------------------------------------------------

While debugging, I found that the output_deriv matrix in NnetUpdater::Backprop() contains inf values. How can I solve this problem? Thank you.
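In case it is useful, below is the kind of check I mean (a minimal standalone sketch; CheckFinite is a name I made up, not part of Kaldi). It copies a CuMatrix back to the host and warns about the first non-finite entry, so one could call it on each component's output inside NnetUpdater::Propagate() and on each derivative matrix inside Backprop() to find where the inf first appears:

----------------------------------------------------------------------
// Hypothetical debugging helper, not part of Kaldi.
#include <cmath>
#include "base/kaldi-common.h"
#include "cudamatrix/cu-matrix.h"
#include "matrix/kaldi-matrix.h"

namespace kaldi {

// Returns true if every element of cu_mat is finite; otherwise warns
// about the first inf/NaN entry it finds and returns false.
bool CheckFinite(const CuMatrixBase<BaseFloat> &cu_mat, const char *name) {
  Matrix<BaseFloat> mat(cu_mat.NumRows(), cu_mat.NumCols(), kUndefined);
  cu_mat.CopyToMat(&mat);  // copy the data off the GPU for element access.
  for (MatrixIndexT r = 0; r < mat.NumRows(); r++)
    for (MatrixIndexT c = 0; c < mat.NumCols(); c++)
      if (!std::isfinite(mat(r, c))) {
        KALDI_WARN << name << ": non-finite value " << mat(r, c)
                   << " at row " << r << ", column " << c;
        return false;
      }
  return true;
}

}  // namespace kaldi
----------------------------------------------------------------------

Since the shrink.log above already shows an inf stddev at components 5 and 7 during the forward pass, I suspect the non-finite values appear before the preconditioning step that triggers the assert.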