From: Vesely K. <ive...@fi...> - 2014-10-24 10:32:31

Hi,
The reason is the GPU "compute mode", which has the following behavior
with Kaldi:
- default : the OS selects the GPU with GPU-ID '0' by default (i.e. more
  processes use the same GPU, which is slow) [BAD]
- process/thread exclusive : the OS selects a free GPU that is not locked
  by another process, or raises an error [RECOMMENDED]
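Exclusive mode is set with nvidia-smi; a minimal sketch (it needs root,
and the numeric codes depend on the driver version: on older drivers
'-c 1' is EXCLUSIVE_THREAD, on newer ones '-c 3' is EXCLUSIVE_PROCESS,
see 'nvidia-smi -h' on your machine):

  # set compute-exclusive mode on every GPU (run as root)
  sudo nvidia-smi -c 3
  # or per card, e.g. for the 4 GPUs mentioned in this thread:
  for i in 0 1 2 3; do sudo nvidia-smi -i $i -c 3; done
  # verify the current setting:
  nvidia-smi -q | grep "Compute Mode"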
Best regards,
Karel

On 10/24/2014 09:54 AM, Xingyu Na wrote:
> Thank you Dan and Alex.
> It turns out that I needed to set 'nvidia-smi -c 1' to continue
> (I don't know why...).
> Now I understand how that pipelined command works.
> Sorry for saying "Is there a bug" in the previous email...
>
> Regards,
> Xingyu
>
> On 10/24/2014 03:46 PM, Alexander Solovets wrote:
>> Hi Xingyu,
>>
>> If you are wondering whether the process has hung, you can look at
>> the output of `ps <PID>`, where <PID> is the process id. If you see
>> 'S' in the STAT field, like
>>
>> PID TTY STAT TIME COMMAND
>> 11891 pts/5 S+ 0:00 cat
>>
>> then the process is sleeping. Otherwise you should see 'R', like:
>>
>> PID TTY STAT TIME COMMAND
>> 11909 pts/5 R+ 0:01 cat
>>
>> On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote:
>>> Thank you so much Dan.
>>> The script which causes the halting is:
>>>
>>> nnet-forward --use-gpu=yes \
>>>   $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
>>>   ark:- 2>$dir/log/cmvn_glob_fwd.log |\
>>> compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
>>> nnet-concat --binary=false $feature_transform_old - $feature_transform
>>>
>>> and the command that is running is:
>>>
>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>   ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-
>>>
>>> If I understand it correctly, nnet-forward is piping its output to
>>> compute-cmvn-stats (although apply_cmvn is false), followed by
>>> cmvn-to-nnet and nnet-concat.
>>> The problem, I think, is that there is an extra '| ark:-'. It means
>>> that the output of nnet-forward is being piped into 'ark:-', which
>>> is not an executable.
>>> Is there a bug here?
>>>
>>> Regards,
>>> Xingyu
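(On the '| ark:-' question above: there is no extra shell pipe. The ps
listing just prints the argument without its quotes. In the script it
is a single quoted Kaldi rspecifier whose trailing '|' tells
nnet-forward to spawn the inner command itself and read features from
its stdout, while the final bare ark:- writes the output to stdout.
Roughly, using the paths from the log:

  # the quoted rspecifier is ONE argument; nnet-forward runs
  # "copy-feats ... ark:-" internally and reads from that pipe,
  # then writes the forwarded features to its own stdout (ark:-),
  # where compute-cmvn-stats picks them up
  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
    'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' \
    ark:- | compute-cmvn-stats ark:- -

so the pipeline itself is fine.)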
>>>
>>> On 10/24/2014 12:15 PM, Daniel Povey wrote:
>>>
>>> I'm running the same thing at JHU to see if I can replicate your
>>> problem.
>>> Dan
>>>
>>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
>>>> cc'ing Karel, who may be able to help you, although I think he
>>>> could be behind on his email.
>>>> I'm afraid I don't know how to fix this.
>>>> If you can figure out the full command that's being run, then it
>>>> might be possible to get it into a debugger, e.g.
>>>> gdb --args program arg1 arg2 ..., and break into it and get a
>>>> stack trace to find where it's stuck.
>>>>
>>>> Dan
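(Since the nnet-forward process is already running, you can also
attach to it instead of restarting it under gdb; a minimal sketch,
assuming gdb is installed and you are allowed to ptrace the process:

  # attach to the stuck process by PID and dump its stack
  gdb -p $(pgrep -f nnet-forward)
  (gdb) bt          # backtrace: shows where it is blocked
  (gdb) detach      # let the process continue
  (gdb) quit

The backtrace usually makes it obvious whether it is blocked in a
CUDA call or waiting on a pipe.)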
>>>>
>>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>>>>> Thank you Dan.
>>>>> I compiled with CUDA. kaldi.mk is like this:
>>>>>>> #Next section enables CUDA for compilation
>>>>>>> CUDA = true
>>>>>>> CUDATKDIR = /usr/local/cuda-5.5
>>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded
>>>>>>>   later than static libs in implicit rule
>>>>> The 'make' process does not give any errors, so I can claim that
>>>>> the tools are compiled with CUDA successfully, right?
>>>>> The problem is, although the log stops updating, I can see
>>>>> 'nnet-forward' running on GPU-2.
>>>>> The log in the exp dir is cmvn_glob_fwd.log and it shows:
>>>>>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>>>>   'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>>>>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion:
>>>>>>>   use 'nvidia-smi -c 1' to set compute exclusive mode
>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting
>>>>>>>   from 4 GPUs
>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257)
>>>>>>>   cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M,
>>>>>>>   free/total:0.983228
>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257)
>>>>>>>   cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M,
>>>>>>>   free/total:0.983228
>>>>> and no more. I have 4 GPU cards installed, all the same model.
>>>>> BTW, my configure command is:
>>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes
>>>>>   --cudatk-dir=/usr/local/cuda-5.5
>>>>>
>>>>> Am I doing something wrong? Why is 'nnet-forward' running on the
>>>>> GPU while the log stops updating?
>>>>>
>>>>> Thank you and best regards,
>>>>> Xingyu
>>>>>
>>>>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>>>>
>>>>> Possibly you did not compile for CUDA. The logs should say which
>>>>> GPU you are using (look in the dir for *.log). If the configure
>>>>> script does not see nvcc on the command line, it will not use
>>>>> CUDA. Grep for CUDA in kaldi.mk to see.
>>>>>
>>>>> Dan
>>>>>
>>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>>>>> Hi, I'm new to this community.
>>>>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>>>>> Training & Decoding part.
>>>>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and
>>>>>> is still running.
>>>>>> I checked the script and found that it is stuck calling
>>>>>> nnet-forward for "Renormalizing MLP input features into
>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>>>>> The program has been running for more than 24 hours.
>>>>>> 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m...
>>>>>> How long does it normally take? Is something going wrong?
>>>>>> Please help.
>>>>>>
>>>>>> The log is posted below.
>>>>>> Thank you,
>>>>>> Xingyu
>>>>>>
>>>>>> ============================================================================
>>>>>>           DNN Hybrid Training & Decoding (Karel's recipe)
>>>>>> ============================================================================
>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>>>   exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3
>>>>>>   data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test -->
>>>>>>   data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans
>>>>>>   exp/tri3/decode_test
>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>>>   exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3
>>>>>>   data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev -->
>>>>>>   data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans
>>>>>>   exp/tri3/decode_dev
>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>>>   exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3
>>>>>>   data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train -->
>>>>>>   data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans
>>>>>>   exp/tri3_ali
>>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train
>>>>>>   data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh:
>>>>>>   reducing #utt from 3696 to 3320
>>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh:
>>>>>>   reducing #utt from 3696 to 376
>>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20
>>>>>>   data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>>>>> #
>>>>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20
>>>>>>   data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>>> # INFO
>>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a
>>>>>>   stack of RBMs
>>>>>> dir : exp/dnn4_pretrain-dbn
>>>>>> Train-set : data-fmllr-tri3/train
>>>>>>
>>>>>> # PREPARING FEATURES
>>>>>> Preparing train/cv lists
>>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local
>>>>>>   ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>>>>> apply_cmvn disabled (per speaker norm. on input features)
>>>>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp
>>>>>>   ark:-
>>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats
>>>>>>   scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return
>>>>>>   status 13
>>>>>> 40
>>>>>> Using splice ± 5 , step 1
>>>>>> Renormalizing MLP input features into
>>>>>>   exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>>> compute-cmvn-stats ark:- -
>>>>>> cmvn-to-nnet - -
>>>>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet -
>>>>>>   exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading
>>>>>>   exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300