From: Daniel P. <dp...@gm...> - 2014-10-24 04:15:51
I'm running the same thing at JHU to see if I can replicate your problem.

Dan

On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
> cc'ing Karel who may be able to help you, although I think he could be
> behind on his email.
> I'm afraid I don't know how to fix this.
> If you can figure out the full command that's being run then it might be
> possible to get it in a debugger, e.g. gdb --args program arg1 arg2 ...,
> and break into it and get a stack trace to find where it's stuck.
>
> Dan
>
> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>
>> Thank you Dan.
>> I compiled with CUDA. kaldi.mk is like this:
>>
>> #Next section enables CUDA for compilation
>> CUDA = true
>> CUDATKDIR = /usr/local/cuda-5.5
>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule
>>
>> The 'make' process does not give any error, so I can claim that the tools
>> are compiled with CUDA successfully, right?
>> The problem is, although the log stops updating, I can see 'nnet-forward'
>> is running on GPU-2.
>> The log in the exp dir is cmvn_glob_fwd.log and it displays:
>>
>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>
>> and no more. I have 4 GPU cards installed, all the same model.
>> BTW, my configure command is:
>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5
>>
>> Am I doing something wrong? Why is 'nnet-forward' running on the GPU while
>> the log stops updating?
>>
>> Thank you and best regards,
>> Xingyu
>>
>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>
>> Possibly you did not compile for CUDA. The logs should say which GPU you
>> are using (look in the dir, for *.log). If the configure script does not
>> see nvcc on the command line, it will not use CUDA. Grep for CUDA in
>> kaldi.mk to see.
>>
>> Dan
>>
>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>
>>> Hi, I'm new in this community.
>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>> Training & Decoding part.
>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and is
>>> still running.
>>> I checked the script and found that it is stuck at calling nnet-forward
>>> for "Renormalizing MLP input features into
>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>> The program has been running for more than 24 hours.
>>> 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m...
>>> How long does it normally take? Is there something going wrong?
>>> Please help.
>>>
>>> The log is posted below.
>>> Thank you
>>> Xingyu
>>>
>>> ============================================================================
>>>             DNN Hybrid Training & Decoding (Karel's recipe)
>>> ============================================================================
>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>> #
>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>> # INFO
>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
>>> dir : exp/dnn4_pretrain-dbn
>>> Train-set : data-fmllr-tri3/train
>>>
>>> # PREPARING FEATURES
>>> Preparing train/cv lists
>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>> apply_cmvn disabled (per speaker norm. on input features)
>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>>> 40
>>> Using splice ± 5 , step 1
>>> Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>> compute-cmvn-stats ark:- -
>>> cmvn-to-nnet - -
>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -
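
[Editor's note] For reference, a minimal sketch of the debugger approach Dan describes above, assuming the stuck binary is the nnet-forward process visible in nvidia-smi; the PID shown is hypothetical, and the re-run command is copied from the quoted cmvn_glob_fwd.log:

    # Find the PID of the stuck process (output format varies; PID below is made up).
    pgrep -fl nnet-forward

    # Either attach to the already-running process ...
    gdb -p 12345
    # ... or re-run the logged command under gdb from egs/timit/s5, as Dan suggests:
    # gdb --args nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
    #     'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-

    # At the (gdb) prompt, dump a backtrace of every thread to see where it is stuck,
    # then detach without killing the job:
    (gdb) thread apply all bt
    (gdb) detach
    (gdb) quit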
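
[Editor's note] Regarding the "compute exclusive mode" WARNING in the quoted log: a hedged sketch of two generic nvidia-smi / CUDA knobs that are sometimes used in this situation. These are not Kaldi-specific, the mode numbers depend on the driver generation, and the GPU index 2 is only an example:

    # Put the cards in compute-exclusive mode, as the Kaldi warning suggests
    # (needs root; on drivers of that era mode 1 was EXCLUSIVE_THREAD, newer
    # drivers use mode 3 = EXCLUSIVE_PROCESS):
    sudo nvidia-smi -c 1

    # Alternatively, restrict the job to a single card via the standard CUDA
    # environment variable, so SelectGpuIdAuto only ever probes that device:
    CUDA_VISIBLE_DEVICES=2 steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 \
        data-fmllr-tri3/train exp/dnn4_pretrain-dbn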