From: Daniel P. <dp...@gm...> - 2014-10-24 04:11:37
Cc'ing Karel, who may be able to help you, although I think he could be behind on his email. I'm afraid I don't know how to fix this. If you can figure out the full command that's being run, it might be possible to run it in a debugger, e.g.

  gdb --args program arg1 arg2 ...

and then break into it and get a stack trace to find where it's stuck.

Dan
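P.S. Since the process is already running, attaching to it is probably easier than reconstructing the whole pipeline (your log shows nnet-forward feeding compute-cmvn-stats). Something along these lines might work; a rough sketch I haven't tested, where <PID> stands for the process ID you find in the first step:

  ps aux | grep nnet-forward   # find the PID of the stuck binary
  gdb -p <PID>                 # attach; this pauses the process
  (gdb) thread apply all bt    # print a stack trace for every thread
  (gdb) detach                 # let the process continue
  (gdb) quit

If the backtraces show the threads waiting inside a CUDA call, that would at least narrow down where it's stuck.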
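P.P.S. The WARNING in your log ("Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode") may also be worth acting on. I don't know whether it's related to the hang, but with 4 GPUs, compute exclusive mode helps each job claim its own free device. Roughly, and again untested on my side (needs root):

  sudo nvidia-smi -c 1                  # set compute exclusive mode on all GPUs (-i <id> targets one)
  nvidia-smi -q | grep "Compute Mode"   # verify the mode took effect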
On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:

> Thank you Dan.
> I compiled with CUDA. kaldi.mk is like this:
>
>> # Next section enables CUDA for compilation
>> CUDA = true
>> CUDATKDIR = /usr/local/cuda-5.5
>> CUDA_INCLUDE = -I$(CUDATKDIR)/include
>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>> CUDA_LDLIBS += -lcublas -lcudart  # LDLIBS: the libs are loaded later than static libs in the implicit rule
>
> The 'make' process does not give any errors, so I can claim that the tools are compiled with CUDA successfully, right?
> The problem is that although the log stops updating, I can see 'nnet-forward' running on GPU 2.
> The log in the exp dir is cmvn_glob_fwd.log, and it shows:
>
>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>
> and no more. I have 4 GPU cards installed, all the same model.
> BTW, my configure command was:
>
> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5
>
> Am I doing something wrong? Why is 'nnet-forward' running on the GPU while the log has stopped updating?
>
> Thank you and best regards,
> Xingyu
>
>
> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>
> Possibly you did not compile for CUDA. The logs should say which GPU you are using (look in the dir for *.log). If the configure script does not see nvcc, it will not use CUDA. Grep for CUDA in kaldi.mk to check.
>
> Dan
>
>
> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>
>> Hi, I'm new to this community.
>> I am running the TIMIT example s5, all the way to the DNN Hybrid Training & Decoding part.
>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and is still running.
>> I checked the script and found that it is stuck at calling nnet-forward for "Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>> The program has been running for more than 24 hours.
>> 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m...
>> How long does this normally take? Is something going wrong?
>> Please help.
>>
>> The log is posted below.
>> Thank you
>> Xingyu
>>
>> ============================================================================
>>                 DNN Hybrid Training & Decoding (Karel's recipe)
>> ============================================================================
>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>> # Started at Wed Oct 22 16:11:09 CST 2014
>> #
>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>> # INFO
>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
>> dir       : exp/dnn4_pretrain-dbn
>> Train-set : data-fmllr-tri3/train
>>
>> # PREPARING FEATURES
>> Preparing train/cv lists
>> 3696 exp/dnn4_pretrain-dbn/train.scp
>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>> apply_cmvn disabled (per speaker norm. on input features)
>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>> 40
>> Using splice ± 5 , step 1
>> Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>> compute-cmvn-stats ark:- -
>> cmvn-to-nnet - -
>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -