From: Vesely K. <ve...@gm...> - 2014-10-31 10:31:18
If the log was saying it is using the GPU, it is running on a GPU. The CPU is surely not a bottleneck here. If it halts, there was a problem finishing one of the CUDA kernels and syncing; the possible reasons are below.
K.

On 10/31/2014 11:19 AM, Xingyu Na wrote:
> Yep, there are too many variables having an impact on this. It's really hard to debug this kind of behaviour, since it may be running really, really slowly and the CPU thought the GPU was sleeping :-) Anyway, it's working properly now, so I'll just move on. Thanks to all you guys for helping.
>
> Best,
> Xingyu
>
> On 10/31/2014 06:13 PM, Vesely Karel wrote:
>> Hi Xingyu,
>> hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a problem with the CUDA driver + kernel module match, or Kaldi was compiled with insufficient compute capability (it is okay in current trunk), or because of simple GPU overheating.
>> Best,
>> Karel.
>>
>> On 10/30/2014 03:27 AM, Xingyu Na wrote:
>>> Hi Karel,
>>> when the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards. Is it possible that it was running on the CPU but showed up as a running job in nvidia-smi? And in the meantime, when I did 'top', it showed 'nnet-forward' with an 'S', not an 'R'....
>>>
>>> Xingyu
>>>
>>> On 10/29/2014 09:28 PM, Vesely Karel wrote:
>>>> Hi,
>>>> the TIMIT DNN training is running, and it is very slow. I'll add a script check there to stop training if CUDA is not compiled in. (Assuming that typically everybody wants to train on a GPU.)
>>>> K.
>>>>
>>>> On 10/27/2014 11:39 AM, Vesely Karel wrote:
>>>>> Dan,
>>>>> I'll check it by running the TIMIT recipe without the GPU code compiled. Need to figure out what could have happened...
>>>>> K.
>>>>>
>>>>> On 10/24/2014 07:03 PM, Daniel Povey wrote:
>>>>>> Karel,
>>>>>> Is there something which we need to fix here? Why was it hanging? Was it using the CPU instead of the GPU? Was it waiting for some kind of reply from the GPU? Had it crashed?
>>>>>> Dan
>>>>>>
>>>>>> On Fri, Oct 24, 2014 at 6:44 AM, Vesely Karel <ive...@fi...> wrote:
>>>>>> It is a 'must' on multi-GPU machines and 'recommended' for single-GPU machines.
>>>>>> It is an OS-level setting, which is assumed to be done. It is good that one does not need to specify a gpu-id in the scripts and manually track which GPUs are being used.
>>>>>> Karel.
>>>>>>
>>>>>> On 10/24/2014 12:39 PM, Xingyu Na wrote:
>>>>>> > Thank you Karel.
>>>>>> > Is that a 'must' for all CUDA-based Kaldi executables?
>>>>>> >
>>>>>> > Regards,
>>>>>> > Xingyu
>>>>>> >
>>>>>> > On 10/24/2014 06:12 PM, Vesely Karel wrote:
>>>>>> >> Hi,
>>>>>> >> the reason is the "compute mode", which has the following behavior with Kaldi:
>>>>>> >> - default : the OS selects the GPU with GPU-ID '0' by default (i.e. several processes use the same GPU, which is slow) [BAD]
>>>>>> >> - process/thread exclusive : the OS selects a free GPU which is not locked by another process, or raises an error [RECOMMENDED]
>>>>>> >> Best regards,
>>>>>> >> Karel
>>>>>> >>
>>>>>> >> On 10/24/2014 09:54 AM, Xingyu Na wrote:
>>>>>> >>> Thank you Dan and Alex.
>>>>>> >>> It turns out that I need to set 'nvidia-smi -c 1' to continue here (don't know why....). Now I understand how that pipelined command works. Sorry for saying "Is there a bug" in the previous email....
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> Xingyu
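For reference, the compute mode discussed above can be inspected and changed with nvidia-smi. A minimal sketch (the mode codes follow nvidia-smi's own help text; changing the mode normally requires root):

    # query the current compute mode of every GPU
    nvidia-smi -q -d COMPUTE | grep "Compute Mode"

    # set it globally;
    # 0 = DEFAULT, 1 = EXCLUSIVE_THREAD, 2 = PROHIBITED, 3 = EXCLUSIVE_PROCESS
    sudo nvidia-smi -c 1

    # or limit the change to one card, e.g. GPU 2
    sudo nvidia-smi -i 2 -c 1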
>>>>>> >>> On 10/24/2014 03:46 PM, Alexander Solovets wrote:
>>>>>> >>>> Hi Xingyu,
>>>>>> >>>>
>>>>>> >>>> If you are concerned about whether the process hung up or not, you can look at the output of `ps <PID>`, where <PID> is the process id. If you see 'S' in the STAT field, like
>>>>>> >>>>
>>>>>> >>>>   PID TTY      STAT   TIME COMMAND
>>>>>> >>>> 11891 pts/5    S+     0:00 cat
>>>>>> >>>>
>>>>>> >>>> then the process is sleeping. Otherwise you should see 'R', like:
>>>>>> >>>>
>>>>>> >>>>   PID TTY      STAT   TIME COMMAND
>>>>>> >>>> 11909 pts/5    R+     0:01 cat
>>>>>> >>>>
>>>>>> >>>> On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote:
>>>>>> >>>>> Thank you so much Dan.
>>>>>> >>>>> The script which causes the halting is:
>>>>>> >>>>>
>>>>>> >>>>>   nnet-forward --use-gpu=yes \
>>>>>> >>>>>     $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
>>>>>> >>>>>     ark:- 2>$dir/log/cmvn_glob_fwd.log |\
>>>>>> >>>>>   compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
>>>>>> >>>>>   nnet-concat --binary=false $feature_transform_old - $feature_transform
>>>>>> >>>>>
>>>>>> >>>>> and the command that is running is:
>>>>>> >>>>>
>>>>>> >>>>>   nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-
>>>>>> >>>>>
>>>>>> >>>>> If I understand it correctly, nnet-forward is piping its output to compute-cmvn-stats (although apply_cmvn is false), followed by cmvn-to-nnet and nnet-concat. The problem, I think, is that there is an extra '| ark:-'. It means that the output of nnet-forward is being piped into 'ark:-', which is not an executable. Is there a bug here?
>>>>>> >>>>>
>>>>>> >>>>> Regards,
>>>>>> >>>>> Xingyu
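To spell out how that command line is parsed (the ps listing above shows it with the shell quotes stripped, which is what makes it look like an extra pipe): nnet-forward takes a model, a feature rspecifier and a feature wspecifier; an rspecifier ending in '|' is an input pipe whose stdout Kaldi reads, and the final 'ark:-' simply writes to stdout, which the surrounding shell pipeline feeds to compute-cmvn-stats. A sketch of the expanded pipeline, with the variables filled in by hand from the paths quoted later in the thread:

    nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
        'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' \
        ark:- 2>exp/dnn4_pretrain-dbn/log/cmvn_glob_fwd.log \
      | compute-cmvn-stats ark:- - \
      | cmvn-to-nnet - - \
      | nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - \
          exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet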
>>>>>> >>>>> On 10/24/2014 12:15 PM, Daniel Povey wrote:
>>>>>> >>>>> I'm running the same thing at JHU to see if I can replicate your problem.
>>>>>> >>>>> Dan
>>>>>> >>>>>
>>>>>> >>>>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
>>>>>> >>>>>> cc'ing Karel who may be able to help you, although I think he could be behind on his email. I'm afraid I don't know how to fix this.
>>>>>> >>>>>> If you can figure out the full command that's being run then it might be possible to get it in a debugger, e.g. gdb --args program arg1 arg2 ..., and break into it and get a stack trace to find where it's stuck.
>>>>>> >>>>>>
>>>>>> >>>>>> Dan
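Since the process in this case is already running (and hanging), the same stack-trace idea can be applied by attaching to it rather than starting it under the debugger. A minimal sketch, assuming gdb is available; the pgrep pattern is only an example, so substitute the actual PID if several processes match:

    # attach to the hung binary, dump the stacks of all threads, then detach
    gdb -batch -ex 'thread apply all bt' -p "$(pgrep -f nnet-forward | head -n 1)"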
>>>>>> >>>>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>>>>>> >>>>>>> Thank you Dan.
>>>>>> >>>>>>> I compiled with CUDA. kaldi.mk is like this:
>>>>>> >>>>>>>>> #Next section enables CUDA for compilation
>>>>>> >>>>>>>>> CUDA = true
>>>>>> >>>>>>>>> CUDATKDIR = /usr/local/cuda-5.5
>>>>>> >>>>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>>>>>> >>>>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>>>>>> >>>>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>>>>>> >>>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>>>>>> >>>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>>>>>> >>>>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule
>>>>>> >>>>>>> The 'make' process does not give any error, so I can claim that the tools are compiled with CUDA successfully, right?
>>>>>> >>>>>>> Problem is, although the log stops updating, I can see 'nnet-forward' is running on GPU-2.
>>>>>> >>>>>>> The log in the exp dir is cmvn_glob_fwd.log and it displays:
>>>>>> >>>>>>>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>>>>>> >>>>>>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
>>>>>> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
>>>>>> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>>>>> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>>>>> >>>>>>> and no more. I have 4 GPU cards installed, all the same model.
>>>>>> >>>>>>> BTW, my configure command is:
>>>>>> >>>>>>>   ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5
>>>>>> >>>>>>>
>>>>>> >>>>>>> Am I doing something wrong? Why is 'nnet-forward' running on the GPU while the log stops updating?
>>>>>> >>>>>>>
>>>>>> >>>>>>> Thank you and best regards,
>>>>>> >>>>>>> Xingyu
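Two quick checks for the "is CUDA really compiled in?" question, as a sketch. Run from Kaldi's src/ directory; the nnetbin/ path is an assumption based on the standard source layout:

    # the CUDA section shown above should be present in kaldi.mk
    grep -A1 '^CUDA = true' kaldi.mk

    # a CUDA-enabled nnet-forward is dynamically linked against the CUDA libraries
    # listed in CUDA_LDLIBS (-lcublas -lcudart)
    ldd nnetbin/nnet-forward | grep -iE 'cudart|cublas'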
>>>>>> >>>>>>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>>>>> >>>>>>> Possibly you did not compile for CUDA. The logs should say which GPU you are using (look in the dir, for *.log). If the configure script does not see nvcc on the command line, it will not use CUDA. Grep for CUDA in kaldi.mk to see.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Dan
>>>>>> >>>>>>>
>>>>>> >>>>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>>>>> >>>>>>>> Hi, I'm new in this community.
>>>>>> >>>>>>>> I am running the TIMIT example s5, all the way to the DNN Hybrid Training & Decoding part. The script "steps/nnet/pretrain_dbn.sh" was called yesterday, and is still running. I checked the script and found that it is stuck at calling nnet-forward for "Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet". The program has been running for more than 24 hours. 'nvidia-smi' said 'nnet-forward' is still running on a Tesla K20m... How long does it normally take? Is there something going wrong? Please help.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> The log is posted below.
>>>>>> >>>>>>>> Thank you
>>>>>> >>>>>>>> Xingyu
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> ============================================================================
>>>>>> >>>>>>>>                 DNN Hybrid Training & Decoding (Karel's recipe)
>>>>>> >>>>>>>> ============================================================================
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
>>>>>> >>>>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>>>>> >>>>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
>>>>>> >>>>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
>>>>>> >>>>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>>> >>>>>>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>>>>> >>>>>>>> #
>>>>>> >>>>>>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>>> >>>>>>>> # INFO
>>>>>> >>>>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
>>>>>> >>>>>>>> dir       : exp/dnn4_pretrain-dbn
>>>>>> >>>>>>>> Train-set : data-fmllr-tri3/train
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> # PREPARING FEATURES
>>>>>> >>>>>>>> Preparing train/cv lists
>>>>>> >>>>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>>>>> >>>>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>>>>> >>>>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>>>>> >>>>>>>> apply_cmvn disabled (per speaker norm. on input features)
>>>>>> >>>>>>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
>>>>>> >>>>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>>>>>> >>>>>>>> 40
>>>>>> >>>>>>>> Using splice ± 5 , step 1
>>>>>> >>>>>>>> Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>>> >>>>>>>> compute-cmvn-stats ark:- -
>>>>>> >>>>>>>> cmvn-to-nnet - -
>>>>>> >>>>>>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>>> >>>>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>>> >>>>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -
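To tell whether a job like this is actually computing or merely holding the GPU, the checks used throughout the thread (nvidia-smi and top) can be watched side by side; a sketch, with the pgrep pattern only illustrative (use the exact PID if several jobs match):

    # GPU utilization and memory of all cards, refreshed every 5 seconds
    watch -n 5 nvidia-smi

    # process state ('R' running vs. a persistent 'S' sleeping) and CPU usage
    top -p "$(pgrep -f nnet-forward | head -n 1)"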
>>>>> --
>>>>> Karel Vesely, Brno University of Technology
>>>>> ive...@fi..., +420-54114-1300