From: Xingyu Na <asr...@gm...> - 2014-10-24 02:17:56
Hi, I'm new to this community.

I am running the TIMIT example s5, all the way to the DNN Hybrid Training & Decoding part. The script "steps/nnet/pretrain_dbn.sh" was started yesterday and is still running. I checked the script and found that it is stuck at the nnet-forward call for "Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet". The program has been running for more than 24 hours. 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m... How long does this step normally take? Is something going wrong? Please help.

The log is posted below.
Thank you
Xingyu

============================================================================
                DNN Hybrid Training & Decoding (Karel's recipe)
============================================================================
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
/nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
/nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
# steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
# Started at Wed Oct 22 16:11:09 CST 2014
#
steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
# INFO
steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
dir : exp/dnn4_pretrain-dbn
Train-set : data-fmllr-tri3/train

# PREPARING FEATURES
Preparing train/cv lists
3696 exp/dnn4_pretrain-dbn/train.scp
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
apply_cmvn disabled (per speaker norm. on input features)
Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
40
Using splice ± 5 , step 1
Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
compute-cmvn-stats ark:- -
cmvn-to-nnet - -
nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -
From: Daniel P. <dp...@gm...> - 2014-10-24 02:32:00
Possibly you did not compile for CUDA. The logs should say which GPU you are using (look in the dir for *.log). If the configure script does not see nvcc on the command line, it will not use CUDA. Grep for CUDA in kaldi.mk to see.

Dan

On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
> Hi, I'm new to this community.
> [...]
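For concreteness, a minimal sketch of those checks run from the recipe directory ($KALDI_ROOT is a placeholder for your Kaldi checkout; exp/dnn4_pretrain-dbn is the experiment directory used in this thread):

  # was the build configured with CUDA?
  grep CUDA "$KALDI_ROOT/src/kaldi.mk"
  # was nvcc visible when ./configure ran?
  which nvcc
  # the per-step logs of the pretraining live under the experiment directory
  ls -lt exp/dnn4_pretrain-dbn/log/ | head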
From: Xingyu Na <asr...@gm...> - 2014-10-24 04:05:34
Thank you Dan.
I compiled with CUDA. kaldi.mk looks like this:

  #Next section enables CUDA for compilation
  CUDA = true
  CUDATKDIR = /usr/local/cuda-5.5
  CUDA_INCLUDE= -I$(CUDATKDIR)/include
  CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
  CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
  CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
  CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
  CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule

The 'make' process does not give any error, so can I claim that the tools were compiled with CUDA successfully?
The problem is that, although the log has stopped updating, I can see 'nnet-forward' running on GPU 2.
The log in the exp dir is cmvn_glob_fwd.log and it shows:

  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
  WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
  LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
  LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
  LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228

and nothing more. I have 4 GPU cards installed, all the same model.
BTW, my configure command was:

  ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5

Am I doing something wrong? Why is 'nnet-forward' running on the GPU while the log has stopped updating?

Thank you and best regards,
Xingyu

On 10/24/2014 10:31 AM, Daniel Povey wrote:
> Possibly you did not compile for CUDA. The logs should say which GPU
> you are using (look in the dir for *.log). If the configure script
> does not see nvcc on the command line, it will not use CUDA. Grep for
> CUDA in kaldi.mk to see.
>
> Dan
>
> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
> > [...]
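Since the kaldi.mk above links -lcudart and -lcublas as shared libraries, one further sanity check is to inspect the binary's dynamic dependencies. A sketch only; the $KALDI_ROOT path is an assumed placeholder for your own tree:

  # if nothing is printed, the binary was most likely built without CUDA support
  ldd "$KALDI_ROOT/src/nnetbin/nnet-forward" | grep -iE 'cudart|cublas'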
From: Daniel P. <dp...@gm...> - 2014-10-24 04:11:37
cc'ing Karel, who may be able to help you, although I think he could be behind on his email.
I'm afraid I don't know how to fix this.
If you can figure out the full command that's being run, it might be possible to get it into a debugger, e.g. gdb --args program arg1 arg2 ..., break into it, and get a stack trace to find where it's stuck.

Dan

On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
> Thank you Dan.
> I compiled with CUDA. kaldi.mk looks like this:
> [...]
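Since the process is already running, attaching to it is an alternative to restarting it under the debugger. A minimal sketch, assuming gdb and pgrep are installed and only one nnet-forward process is alive:

  # grab the PID of the stuck binary and dump a stack trace of every thread
  pid=$(pgrep -f nnet-forward)
  # attaching may require sudo depending on the system's ptrace settings
  gdb -p "$pid" -batch -ex 'thread apply all bt'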
From: Daniel P. <dp...@gm...> - 2014-10-24 04:15:51
I'm running the same thing at JHU to see if I can replicate your problem.
Dan

On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
> cc'ing Karel, who may be able to help you, although I think he could be
> behind on his email.
> [...]
From: Xingyu Na <asr...@gm...> - 2014-10-24 07:19:06
Thank you so much Dan.
The part of the script where it halts is:

  nnet-forward --use-gpu=yes \
    $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
    ark:- 2>$dir/log/cmvn_glob_fwd.log |\
  compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
  nnet-concat --binary=false $feature_transform_old - $feature_transform

and the command that is running is:

  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-

If I understand it correctly, nnet-forward is piping its output to compute-cmvn-stats (although apply_cmvn is false), followed by cmvn-to-nnet and nnet-concat.
The problem, I think, is that there is an extra '| ark:-'. It means the output of nnet-forward is being piped into 'ark:-', which is not an executable.
Is there a bug here?

Regards,
Xingyu

On 10/24/2014 12:15 PM, Daniel Povey wrote:
> I'm running the same thing at JHU to see if I can replicate your problem.
> Dan
>
> [...]
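For reference, the pipeline written out with the values from this run looks roughly like the sketch below (assembled from the script fragment and the logs above, not the verbatim recipe). There is no stray pipe: the quoted 'ark:... |' argument is a single Kaldi read-specifier, so nnet-forward itself spawns copy-feats and reads its stdout, while the final ark:- is the write-specifier telling nnet-forward to write to its own stdout, which the shell then pipes onward:

  # input rspecifier: nnet-forward runs copy-feats internally and reads its output
  # output wspecifier ark:- : write the forwarded features to stdout for the shell pipe
  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
      'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:- \
      2> exp/dnn4_pretrain-dbn/log/cmvn_glob_fwd.log \
    | compute-cmvn-stats ark:- - \
    | cmvn-to-nnet - - \
    | nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - \
        exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet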
From: Alexander S. <aso...@gm...> - 2014-10-24 07:47:09
Hi Xingyu,

If you want to know whether the process has hung or not, you can look at the output of `ps <PID>`, where <PID> is the process id. If you see 'S' in the STAT field, like

    PID TTY      STAT   TIME COMMAND
  11891 pts/5    S+     0:00 cat

then the process is sleeping. Otherwise you should see 'R', like:

    PID TTY      STAT   TIME COMMAND
  11909 pts/5    R+     0:01 cat

On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote:
> Thank you so much Dan.
> The part of the script where it halts is:
> [...]

--
Sincerely,
Alexander
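A one-liner variant of the same check, looking the process up by name instead of by PID (a sketch; it assumes a single nnet-forward process is running):

  # STAT: R = running, S = sleeping, D = uninterruptible (usually I/O) wait
  ps -o pid,stat,etime,cmd -p "$(pgrep -f nnet-forward)"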
From: Xingyu Na <asr...@gm...> - 2014-10-24 07:55:04
Thank you Dan and Alex.
It turns out that I needed to run 'nvidia-smi -c 1' to get past this point (I don't know why...).
Now I understand how that pipelined command works.
Sorry for saying "Is there a bug" in the previous email...

Regards,
Xingyu

On 10/24/2014 03:46 PM, Alexander Solovets wrote:
> Hi Xingyu,
>
> If you want to know whether the process has hung or not, you can look
> at the output of `ps <PID>`, where <PID> is the process id.
> [...]
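For anyone hitting the same issue, a minimal sketch of setting that mode on all four cards (requires root; '-c 1' is the thread-exclusive mode suggested by the Kaldi warning on drivers of this vintage, while newer drivers typically use '-c 3', EXCLUSIVE_PROCESS, instead):

  # put each GPU into compute-exclusive mode so every Kaldi job gets its own card
  for i in 0 1 2 3; do
    sudo nvidia-smi -i "$i" -c 1
  done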
From: Vesely K. <ive...@fi...> - 2014-10-24 10:32:31
Hi, The reason is in the "computation mode", which has with Kaldi following behavior: - default : OS selects GPU with GPU-ID '0' by default (i.e. more processes use same GPU which is slow) [BAD] - process/thread exclusive : OS selects a free GPU which not locked to another process or raises error [RECOMMENDED] Best regards, Karel On 10/24/2014 09:54 AM, Xingyu Na wrote: > Thank you Dan and Alex. > It turns out that I need to set 'nvidia-smi -c 1' to continue here(don't > know why....). > Now I understand how that pipelined command works. > Sorry for saying "Is there a bug" in the previous email.... > > Regards, > Xingyu > > On 10/24/2014 03:46 PM, Alexander Solovets wrote: >> Hi Xingyu, >> >> If you are concerned whether the process hung up or not, you can see >> the output of `ps <PID>` where <PID> is the process id. If you see 'S' >> in STAT fields, like >> >> PID TTY STAT TIME COMMAND >> 11891 pts/5 S+ 0:00 cat >> >> Then the processing is sleeping. Otherwise you should see 'R' like: >> >> PID TTY STAT TIME COMMAND >> 11909 pts/5 R+ 0:01 cat >> >> On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote: >>> Thank you so much Dan. >>> The script which causes the halting is : >>> >>> nnet-forward --use-gpu=yes \ >>> $feature_transform_old "$(echo $feats | sed >>> 's|train.scp|train.scp.10k|')" \ >>> ark:- 2>$dir/log/cmvn_glob_fwd.log |\ >>> compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\ >>> nnet-concat --binary=false $feature_transform_old - $feature_transform >>> >>> and the command that is running is: >>> >>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet >>> ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:- >>> >>> If I understand it correctly, nnet-forward is piping its output to >>> compute-cmvn-stats (although apply_cmvn is false), and followed by >>> cmvn-to-nnet and nnet-concat. >>> The problem, I think, is that there is an extra '| ark:-'. It means that the >>> output of nnet-forward is being piped into 'ark:-', which is not a >>> executable. >>> Is there is bug here? >>> >>> Regards, >>> Xingyu >>> >>> >>> On 10/24/2014 12:15 PM, Daniel Povey wrote: >>> >>> I'm running the same thing at JHU to see if I can replicate your problem. >>> Dan >>> >>> >>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote: >>>> cc'ing Karel who may be able to help you, although I think he could be >>>> behind on his email. >>>> I'm afraid I don't know how to fix this. >>>> If you can figure out the full command that's being run then it might be >>>> possible to get it in a debugger, e.g. gdb --args program arg1 arg2 ..., and >>>> break into it and get a stack trace to find where it's stuck. >>>> >>>> Dan >>>> >>>> >>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> >>>> wrote: >>>>> Thank you Dan. >>>>> I compiled with CUDA. kaldi.mk is like this: >>>>>>> #Next section enables CUDA for compilation >>>>>>> CUDA = true >>>>>>> CUDATKDIR = /usr/local/cuda-5.5 >>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include >>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA >>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include >>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib >>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64 >>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later >>>>>>> than static libs in implicit rule >>>>> The 'make' process does not give any error so I can claim that the tools >>>>> are compiled with CUDA successfully, right? 
On 10/24/2014 09:54 AM, Xingyu Na wrote:
> Thank you Dan and Alex.
> It turns out that I need to set 'nvidia-smi -c 1' to continue here (don't
> know why....).

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300 |
From: Xingyu Na <asr...@gm...> - 2014-10-24 10:40:05
|
Thank you Karel.
Is that a 'must' for all CUDA-based Kaldi executables?

Regards,
Xingyu

On 10/24/2014 06:12 PM, Vesely Karel wrote:
> - process/thread exclusive : the OS selects a free GPU that is not locked by
> another process, or raises an error if none is available [RECOMMENDED]
|
From: Vesely K. <ive...@fi...> - 2014-10-24 10:44:27
|
It is a 'must' on multi-GPU machines and 'recommended' on single-GPU machines.
It is an OS-level setting which the scripts assume has already been done. The
benefit is that one does not need to specify a gpu-id in the scripts or track
manually which GPUs are in use. A sketch of checking the setting is below.
Karel.
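A small sketch of verifying the setting and reapplying it after a reboot; the query flags assume a reasonably recent nvidia-smi, and the rc.local line is only one illustrative, system-dependent way to persist it (not something the Kaldi scripts do):

  # show the compute mode currently set on each card
  nvidia-smi --query-gpu=index,name,compute_mode --format=csv
  # the mode is usually lost on reboot; one possible way to reapply it at boot
  echo 'nvidia-smi -c 1' >> /etc/rc.local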
On 10/24/2014 12:39 PM, Xingyu Na wrote:
> Thank you Karel.
> Is that a 'must' for all CUDA-based Kaldi executables?

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300 |
From: Daniel P. <dp...@gm...> - 2014-10-24 17:03:50
|
Karel,
Is there something which we need to fix here?
Why was it hanging? Was it using the CPU instead of the GPU? Was it
waiting for some kind of reply from the GPU? Had it crashed?
Dan

On Fri, Oct 24, 2014 at 6:44 AM, Vesely Karel <ive...@fi...> wrote:
> It is a 'must' on multi-GPU machines and 'recommended' on single-GPU
> machines. It is an OS-level setting which the scripts assume has already
> been done.
|
From: Vesely K. <ive...@fi...> - 2014-10-27 10:40:07
|
Dan,
I'll check it by running the TIMIT recipe without the GPU code compiled in
(a sketch of that rebuild is below). I need to figure out what could have
happened...
K.
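For reference, a sketch of reproducing the CPU-only build; the configure flags mirror the line quoted earlier in the thread, while KALDI_ROOT and the job count are assumptions about the local setup:

  cd $KALDI_ROOT/src
  ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=no
  make clean && make -j 8
  # then rerun the TIMIT s5 recipe from the DNN stage and watch the logs in exp/dnn4*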
On 10/24/2014 07:03 PM, Daniel Povey wrote:
> Karel,
> Is there something which we need to fix here?
> Why was it hanging? Was it using the CPU instead of the GPU? Was it
> waiting for some kind of reply from the GPU? Had it crashed?
> Dan

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300 |
From: Vesely K. <ve...@gm...> - 2014-10-29 13:28:16
|
Hi,
the TIMIT DNN training without CUDA is running, and it is very slow.
I'll add a script check that stops the training if CUDA is not compiled in,
assuming that typically everybody wants to train on a GPU. A sketch of such
a guard is below.
K.
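A minimal sketch of what such a guard could look like; it follows the advice earlier in the thread to grep kaldi.mk for CUDA, and the KALDI_ROOT variable and error wording are illustrative assumptions, not the actual check added to the scripts:

  # refuse to start GPU pre-training when Kaldi was built without CUDA
  if ! grep -q 'CUDA = true' $KALDI_ROOT/src/kaldi.mk; then
    echo "$0: CUDA is not compiled in (check src/kaldi.mk), refusing to train on CPU." >&2
    echo "$0: rerun src/configure with --use-cuda=yes and rebuild, or expect very slow training." >&2
    exit 1
  fi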
>> It means that the >> >>>>> output of nnet-forward is being piped into 'ark:-', which >> is not a >> >>>>> executable. >> >>>>> Is there is bug here? >> >>>>> >> >>>>> Regards, >> >>>>> Xingyu >> >>>>> >> >>>>> >> >>>>> On 10/24/2014 12:15 PM, Daniel Povey wrote: >> >>>>> >> >>>>> I'm running the same thing at JHU to see if I can replicate >> your problem. >> >>>>> Dan >> >>>>> >> >>>>> >> >>>>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey >> <dp...@gm... <mailto:dp...@gm...>> wrote: >> >>>>>> cc'ing Karel who may be able to help you, although I think >> he could be >> >>>>>> behind on his email. >> >>>>>> I'm afraid I don't know how to fix this. >> >>>>>> If you can figure out the full command that's being run >> then it might be >> >>>>>> possible to get it in a debugger, e.g. gdb --args program >> arg1 arg2 ..., and >> >>>>>> break into it and get a stack trace to find where it's stuck. >> >>>>>> >> >>>>>> Dan >> >>>>>> >> >>>>>> >> >>>>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na >> <asr...@gm... <mailto:asr...@gm...>> >> >>>>>> wrote: >> >>>>>>> Thank you Dan. >> >>>>>>> I compiled with CUDA. kaldi.mk <http://kaldi.mk> is like >> this: >> >>>>>>>>> #Next section enables CUDA for compilation >> >>>>>>>>> CUDA = true >> >>>>>>>>> CUDATKDIR = /usr/local/cuda-5.5 >> >>>>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include >> >>>>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 >> -DHAVE_CUDA >> >>>>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include >> >>>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib >> -Wl,-rpath,$(CUDATKDIR)/lib >> >>>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 >> -Wl,-rpath,$(CUDATKDIR)/lib64 >> >>>>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are >> loaded later >> >>>>>>>>> than static libs in implicit rule >> >>>>>>> The 'make' process does not give any error so I can claim >> that the tools >> >>>>>>> are compiled with CUDA successfully, right? >> >>>>>>> Problem is, although the log stops updating, I can see >> 'nnet-forward' is >> >>>>>>> running on GPU-2. >> >>>>>>> The log in the exp dir is cmvn_glob_fwd.log and it displays: >> >>>>>>>>> nnet-forward --use-gpu=yes >> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet >> >>>>>>>>> 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k >> ark:- |' ark:- >> >>>>>>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) >> Suggestion: use >> >>>>>>>>> 'nvidia-smi -c 1' to set compute exclusive mode >> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) >> Selecting from 4 >> >>>>>>>>> GPUs >> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) >> >>>>>>>>> cudaSetDevice(0): Tesla K20m free:4719M, used:80M, >> total:4799M, >> >>>>>>>>> free/total:0.983228 >> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) >> >>>>>>>>> cudaSetDevice(1): Tesla K20m free:4719M, used:80M, >> total:4799M, >> >>>>>>>>> free/total:0.983228 >> >>>>>>> and no more. I have 4 GPU cards installed, all same model. >> >>>>>>> BTW, my configure command is: >> >>>>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes >> >>>>>>> --cudatk-dir=/usr/local/cuda-5.5 >> >>>>>>> >> >>>>>>> Am I doing something wrong? Why 'nnet-forward' is running >> on GPU while >> >>>>>>> log stops updating? >> >>>>>>> >> >>>>>>> Thank you and best regards, >> >>>>>>> Xingyu >> >>>>>>> >> >>>>>>> >> >>>>>>> On 10/24/2014 10:31 AM, Daniel Povey wrote: >> >>>>>>> >> >>>>>>> Possibly you did not compile for CUDA. The logs should >> say which GPU you >> >>>>>>> are using (look in the dir, for *.log). 
If the configure >> script does not >> >>>>>>> see nvcc on the command line, it will not use CUDA. Grep >> for CUDA in >> >>>>>>> kaldi.mk <http://kaldi.mk> to see. >> >>>>>>> >> >>>>>>> Dan >> >>>>>>> >> >>>>>>> >> >>>>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na >> <asr...@gm... <mailto:asr...@gm...>> >> >>>>>>> wrote: >> >>>>>>>> Hi, I'm new in this community. >> >>>>>>>> I am running the TIMIT example s5, all the way to DNN >> Hybrid Training & >> >>>>>>>> Decoding part. >> >>>>>>>> The script "steps/nnet/pretrain_dbn.sh" was called >> yesterday, and still >> >>>>>>>> running. >> >>>>>>>> I checked the script and found that it stuck at calling >> nnet-forward for >> >>>>>>>> "Renormalizing MLP input features into >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet" >> >>>>>>>> The program has been running more then 24 hours. >> >>>>>>>> 'nvidia-smi' said 'nnet-forward' is still running on a >> Tesla K20m... >> >>>>>>>> How long does it normally take? Is there something going >> wrong? >> >>>>>>>> Please help. >> >>>>>>>> >> >>>>>>>> The log is posted below. >> >>>>>>>> Thank you >> >>>>>>>> Xingyu >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> ============================================================================ >> >>>>>>>> >> >>>>>>>> DNN Hybrid Training & Decoding (Karel's recipe) >> >>>>>>>> >> >>>>>>>> >> ============================================================================ >> >>>>>>>> >> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl >> <http://run.pl> --transform-dir >> >>>>>>>> exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 >> >>>>>>>> data-fmllr-tri3/test/log data-fmllr-tri3/test/data >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, >> data/test --> >> >>>>>>>> data-fmllr-tri3/test, using : raw-trans None, gmm >> exp/tri3, trans >> >>>>>>>> exp/tri3/decode_test >> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl >> <http://run.pl> --transform-dir >> >>>>>>>> exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 >> >>>>>>>> data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, >> data/dev --> >> >>>>>>>> data-fmllr-tri3/dev, using : raw-trans None, gmm >> exp/tri3, trans >> >>>>>>>> exp/tri3/decode_dev >> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl >> <http://run.pl> --transform-dir >> >>>>>>>> exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 >> >>>>>>>> data-fmllr-tri3/train/log data-fmllr-tri3/train/data >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, >> data/train --> >> >>>>>>>> data-fmllr-tri3/train, using : raw-trans None, gmm >> exp/tri3, trans >> >>>>>>>> exp/tri3_ali >> >>>>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train >> >>>>>>>> data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10 >> >>>>>>>> >> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: >> >>>>>>>> reducing #utt from 3696 to 3320 >> >>>>>>>> >> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: >> >>>>>>>> reducing #utt from 3696 to 376 >> >>>>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 >> >>>>>>>> data-fmllr-tri3/train exp/dnn4_pretrain-dbn >> >>>>>>>> # Started at Wed Oct 22 16:11:09 CST 2014 >> >>>>>>>> # >> >>>>>>>> steps/nnet/pretrain_dbn.sh 
--hid-dim 1024 --rbm-iter 20 >> >>>>>>>> data-fmllr-tri3/train exp/dnn4_pretrain-dbn >> >>>>>>>> # INFO >> >>>>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief >> Network as a stack >> >>>>>>>> of RBMs >> >>>>>>>> dir : exp/dnn4_pretrain-dbn >> >>>>>>>> Train-set : data-fmllr-tri3/train >> >>>>>>>> >> >>>>>>>> # PREPARING FEATURES >> >>>>>>>> Preparing train/cv lists >> >>>>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp >> >>>>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local >> >>>>>>>> >> ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp >> >>>>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 >> feature matrices. >> >>>>>>>> apply_cmvn disabled (per speaker norm. on input features) >> >>>>>>>> Getting feature dim : copy-feats >> scp:exp/dnn4_pretrain-dbn/train.scp >> >>>>>>>> ark:- >> >>>>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe >> copy-feats >> >>>>>>>> scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero >> return status 13 >> >>>>>>>> 40 >> >>>>>>>> Using splice ± 5 , step 1 >> >>>>>>>> Renormalizing MLP input features into >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet >> >>>>>>>> compute-cmvn-stats ark:- - >> >>>>>>>> cmvn-to-nnet - - >> >>>>>>>> nnet-concat --binary=false >> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet >> >>>>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet >> >>>>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating - >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> ------------------------------------------------------------------------------ >> >>>>>>>> _______________________________________________ >> >>>>>>>> Kaldi-users mailing list >> >>>>>>>> Kal...@li... >> <mailto:Kal...@li...> >> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >>>>> >> ------------------------------------------------------------------------------ >> >>>>> >> >>>>> _______________________________________________ >> >>>>> Kaldi-users mailing list >> >>>>> Kal...@li... >> <mailto:Kal...@li...> >> >>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >>>>> >> >>> >> ------------------------------------------------------------------------------ >> >>> _______________________________________________ >> >>> Kaldi-users mailing list >> >>> Kal...@li... >> <mailto:Kal...@li...> >> >>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> > >> > >> ------------------------------------------------------------------------------ >> > _______________________________________________ >> > Kaldi-users mailing list >> > Kal...@li... >> <mailto:Kal...@li...> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >> -- >> Karel Vesely, Brno University of Technology >> ive...@fi... <mailto:ive...@fi...>, >> +420-54114-1300 <tel:%2B420-54114-1300> >> >> >> ------------------------------------------------------------------------------ >> _______________________________________________ >> Kaldi-users mailing list >> Kal...@li... >> <mailto:Kal...@li...> >> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >> > > -- > Karel Vesely, Brno University of Technology > ive...@fi..., +420-54114-1300 |
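A minimal sketch of the kind of guard described above, for illustration only: it assumes the `cuda-compiled` helper binary (shipped with current Kaldi; older checkouts may not have it, in which case the kaldi.mk grep is a fallback) and a $KALDI_ROOT variable as set by the recipe's path.sh.

    # Abort GPU training early when Kaldi was built without CUDA support.
    # 'cuda-compiled' is assumed to exit non-zero if CUDA is not compiled in;
    # the grep on kaldi.mk is a fallback for trees that do not have it.
    if ! cuda-compiled 2>/dev/null && ! grep -q '^CUDA = true' $KALDI_ROOT/src/kaldi.mk; then
      echo "$0: Kaldi is not compiled with CUDA; pretraining would fall back to the CPU and be extremely slow."
      echo "$0: Recompile with ./configure --use-cuda=yes, or remove this check to force CPU training."
      exit 1
    fi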
From: Xingyu Na <asr...@gm...> - 2014-10-30 02:28:16
|
Hi Karel,
When the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards.
Is it possible that it was running on the CPU but still showed up as a running job in nvidia-smi?
And in the meantime, when I ran 'top', it showed 'nnet-forward' with an 'S', not an 'R'....

Xingyu

On 10/29/2014 09:28 PM, Vesely Karel wrote:
> Hi, the TIMIT DNN training is running, but it is very slow.
> I'll add a script check there to stop the training if CUDA is not compiled in.
> (The assumption is that typically everybody wants to train on a GPU.)
> K.
|
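Two quick ways to tell whether a job in this state is genuinely stuck or merely waiting on the GPU, along the lines of Dan's earlier gdb suggestion; the nvidia-smi query flags depend on the driver version, so treat this as a sketch rather than a guaranteed interface:

    # Attach to the running process and dump its stacks (may need root/ptrace
    # permission); frames inside the CUDA driver/runtime suggest it is blocked
    # waiting on the GPU rather than computing.
    pid=$(pidof nnet-forward)
    gdb -batch -ex 'thread apply all bt' -p "$pid"

    # Watch GPU utilization directly; a process can appear in nvidia-smi's
    # process list while the GPU itself sits at ~0% utilization.
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5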
From: Vesely K. <ve...@gm...> - 2014-10-31 10:13:17
|
Hi Xingyu,
hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a mismatch between the CUDA driver and the kernel module, or if Kaldi was compiled for an insufficient compute capability (this is okay in the current trunk), or simply because of GPU overheating.
Best,
Karel.

On 10/30/2014 03:27 AM, Xingyu Na wrote:
> Hi Karel,
> When the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards.
> Is it possible that it was running on the CPU but still showed up as a running job in nvidia-smi?
> And in the meantime, when I ran 'top', it showed 'nnet-forward' with an 'S', not an 'R'....
>
> Xingyu
|
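The three causes listed above can usually be narrowed down from the shell; the exact output fields and paths vary with the driver version and with where the CUDA samples were built, so the following is only a sketch:

    # 1) Driver vs. kernel module: the versions reported here should agree.
    nvidia-smi | head -n 3
    cat /proc/driver/nvidia/version

    # 2) Compute capability: deviceQuery is built from the CUDA samples (its
    #    path depends on where they were compiled); compare the reported
    #    capability with the -arch/-gencode flags used in the Kaldi CUDA build.
    deviceQuery | grep -i 'capability'

    # 3) Overheating: temperatures pinned near the card's limit during training.
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv -l 10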
From: Xingyu Na <asr...@gm...> - 2014-10-31 10:19:50
|
Yep, there are too many variables affecting this. It's really hard to debug this kind of behaviour, since it may be running so slowly that the CPU thinks the GPU job is sleeping :-) Anyway, it's working properly now, so I'll just move on. Thanks to all of you for helping.
Best,
Xingyu

On 10/31/2014 06:13 PM, Vesely Karel wrote:
> Hi Xingyu,
> hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a mismatch between the CUDA driver and the kernel module, or if Kaldi was compiled for an insufficient compute capability (this is okay in the current trunk), or simply because of GPU overheating.
> Best,
> Karel.
|
From: Ondrej P. <ond...@gm...> - 2014-10-29 16:17:38
|
Hi, may I ask how to force Kaldi to use one GPU (the Tesla) over the other (a Quadro)? I am running it locally (using run.pl, njobs=10) and I want to use the much stronger Tesla GPU. Unfortunately, it selects the GPUs somewhat randomly, and quite often it computes on the Quadro. Ondra On 29 October 2014 14:28, Vesely Karel <ve...@gm...> wrote: > Hi, > the TIMIT DNN training is running, and it is very slow. > I'll add there a script-check to stop training if cuda is not compiled-in. > (Assuming that typically everybody wants to train on a GPU.) > K. |
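Regarding the script-check Karel mentions above: a minimal sketch of how such a guard could look, assuming KALDI_ROOT points at the Kaldi checkout. This is purely illustrative and not necessarily the check that was actually added; it keys off the "CUDA = true" line that ./configure writes into src/kaldi.mk when CUDA is enabled:

  # abort GPU training early when Kaldi was built without CUDA support
  if ! grep -q "CUDA = true" "$KALDI_ROOT/src/kaldi.mk"; then
    echo "$0: Kaldi is not compiled with CUDA; DNN pre-training on the CPU would be extremely slow." >&2
    exit 1
  fi

A guard like this fails fast instead of silently falling back to a CPU run that looks like a hang.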
From: Jan T. <af...@ce...> - 2014-10-29 16:22:21
|
Ondrej, you can play with the CUDA_VISIBLE_DEVICES environment variable to mask out the GPUs you don't want to use. y. On Wed, Oct 29, 2014 at 5:17 PM, Ondrej Platek <ond...@gm...> wrote: > Hi, > > may I ask how to force Kaldi to use one GPU (the Tesla) over the other (a Quadro)? I am running it locally (using run.pl, njobs=10) and I want to use the much stronger Tesla GPU. > > Unfortunately, it selects the GPUs somewhat randomly, and quite often it computes on the Quadro. > > Ondra |
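A minimal sketch of the CUDA_VISIBLE_DEVICES approach, assuming the Tesla is listed as device 0 by nvidia-smi (the index may differ on your machine, so check it first):

  # list the installed GPUs and their indices
  nvidia-smi -L
  # expose only the Tesla to this shell and to everything started from it
  export CUDA_VISIBLE_DEVICES=0
  steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn

With the variable set, CUDA renumbers the visible devices from 0, so Kaldi's automatic GPU selection can only ever pick the Tesla; the Quadro is invisible to the job.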
From: Vesely K. <ve...@gm...> - 2014-10-31 10:31:18
|
If the log was saying it is using the GPU, it is running on a GPU. The CPU is surely not a bottleneck here. If it halts, there was a problem finishing one of the CUDA kernels and syncing; the possible reasons are below. K. On 10/31/2014 11:19 AM, Xingyu Na wrote: > Yep, there are too many variables having an impact on this. It's really hard to debug this kind of behaviour, since it may be running so slowly that the CPU thought the GPU was sleeping :-) > Anyway, it's working properly now, so I'll just move on. Thank all you guys for helping. > Best, > Xingyu > On 10/31/2014 06:13 PM, Vesely Karel wrote: >> Hi Xingyu, >> hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a problem with the cuda driver + kernel module match, or if Kaldi was compiled with an insufficient CUDA compute capability setting (it is okay in the current trunk), or because of simple GPU overheating. >> Best, >> Karel. >> On 10/30/2014 03:27 AM, Xingyu Na wrote: >>> Hi Karel, >>> When the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards. >>> Is it possible that it was running on the CPU but still shows as a running job in nvidia-smi? >>> And in the meantime, when I did 'top', it showed 'nnet-forward' with an 'S', not an 'R'.... >>> Xingyu |
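A small diagnostic sketch along the lines discussed in this exchange, with 12345 standing in for the nnet-forward PID reported by nvidia-smi (the PID and the particular query fields are only an example of how one might check):

  # STAT 'R' means actually running; 'S'/'D' means sleeping or blocked (e.g. on a pipe or a GPU sync)
  ps -o pid,stat,time,cmd -p 12345
  # see whether the card the job grabbed is doing any work, and whether it is overheating
  nvidia-smi --query-gpu=index,name,utilization.gpu,temperature.gpu --format=csv

If the process sits in 'S' for hours while GPU utilization stays near zero, the driver/kernel-module mismatch or overheating scenarios Karel lists become more plausible than a merely slow computation.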
From: Daniel P. <dp...@gm...> - 2014-10-31 17:02:07
|
BTW, something you can do in situations like this is the following (assuming you are debugging nnet-train-simple, but it could be another program):

gdb $(which nnet-train-simple)
(gdb) attach 9541
(gdb) bt

where 9541 is an example process id. $(which nnet-train-simple) gives you the full pathname of the program, which (IIRC) gdb requires. Dan On Fri, Oct 31, 2014 at 6:31 AM, Vesely Karel <ve...@gm...> wrote: > If the log was saying it is using the GPU, it is running on a GPU. The CPU is surely not a bottleneck here. > If it halts, there was a problem finishing one of the CUDA kernels and syncing. > K. |
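When the job was started by run.pl and you only have a PID from nvidia-smi, a non-interactive variant of the same idea can be convenient; this is just a sketch, with nnet-forward and 9541 as placeholders:

  # print a one-shot backtrace of the hung process (may need sudo or relaxed ptrace settings)
  gdb -batch -ex bt -p 9541 $(which nnet-forward)

The backtrace usually makes it obvious whether the process is stuck inside a CUDA call or simply blocked reading from or writing to one of the pipes in the command.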