From: Xingyu Na <asr...@gm...> - 2014-10-24 07:19:06
Thank you so much, Dan. The script which causes the halting is:

    nnet-forward --use-gpu=yes \
      $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
      ark:- 2>$dir/log/cmvn_glob_fwd.log |\
    compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
    nnet-concat --binary=false $feature_transform_old - $feature_transform

and the command that is running is:

    nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
      ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-

If I understand it correctly, nnet-forward is piping its output to
compute-cmvn-stats (although apply_cmvn is false), and then through
cmvn-to-nnet and nnet-concat. The problem, I think, is that there is an
extra '| ark:-'. It means that the output of nnet-forward is being piped
into 'ark:-', which is not an executable. Is there a bug here?
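For reference, this is how I read the fully expanded pipeline, put
together from the script variables and the paths in cmvn_glob_fwd.log
below. This is only a sketch of my understanding, not the exact command
line the script built; note that in the log the copy-feats sub-command
appears single-quoted, as one rspecifier argument to nnet-forward:

    # Reconstructed pipeline (my reading): nnet-forward reads features from
    # the quoted copy-feats sub-command, writes the transformed features to
    # stdout (ark:-), and the shell pipes that into compute-cmvn-stats,
    # cmvn-to-nnet and finally nnet-concat.
    nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
        'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:- \
        2>exp/dnn4_pretrain-dbn/log/cmvn_glob_fwd.log |\
      compute-cmvn-stats ark:- - |\
      cmvn-to-nnet - - |\
      nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - \
        exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet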
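In the meantime I will try the gdb suggestion from Dan below, attaching
to the already-running process to get a stack trace. A minimal sketch of
what I plan to run (the PID 12345 is just a placeholder; I would take the
real one from ps or nvidia-smi):

    $ gdb -p 12345      # attach to the running nnet-forward
    (gdb) bt            # print a backtrace to see where it is stuck
    (gdb) detach        # leave the process running
    (gdb) quit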
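I also notice the WARNING about compute exclusive mode in the log below;
if that is relevant, the fix would just be the command the warning itself
suggests (needs root; adding '-i <gpu-id>' would restrict it to one card):

    $ sudo nvidia-smi -c 1    # set compute exclusive mode, as the WARNING suggests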
Regards,
Xingyu

On 10/24/2014 12:15 PM, Daniel Povey wrote:
> I'm running the same thing at JHU to see if I can replicate your problem.
> Dan
>
> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
>> cc'ing Karel, who may be able to help you, although I think he could
>> be behind on his email.
>> I'm afraid I don't know how to fix this.
>> If you can figure out the full command that's being run, then it might
>> be possible to get it in a debugger, e.g. gdb --args program arg1
>> arg2 ..., and break into it and get a stack trace to find where it's
>> stuck.
>>
>> Dan
>>
>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>>> Thank you Dan.
>>> I compiled with CUDA. kaldi.mk is like this:
>>>
>>>> #Next section enables CUDA for compilation
>>>> CUDA = true
>>>> CUDATKDIR = /usr/local/cuda-5.5
>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule
>>>
>>> The 'make' process does not give any error, so I can claim that the
>>> tools are compiled with CUDA successfully, right?
>>> The problem is, although the log stops updating, I can see that
>>> 'nnet-forward' is running on GPU 2.
>>> The log in the exp dir is cmvn_glob_fwd.log and it displays:
>>>
>>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>>
>>> and no more. I have 4 GPU cards installed, all the same model.
>>> BTW, my configure command is:
>>>
>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5
>>>
>>> Am I doing something wrong? Why is 'nnet-forward' running on the GPU
>>> while the log has stopped updating?
>>>
>>> Thank you and best regards,
>>> Xingyu
>>>
>>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>>> Possibly you did not compile for CUDA. The logs should say which
>>>> GPU you are using (look in the dir, for *.log). If the configure
>>>> script does not see nvcc on the command line, it will not use CUDA.
>>>> Grep for CUDA in kaldi.mk to see.
>>>>
>>>> Dan
>>>>
>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>>>> Hi, I'm new in this community.
>>>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>>>> Training & Decoding part.
>>>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and is
>>>>> still running.
>>>>> I checked the script and found that it is stuck at calling
>>>>> nnet-forward for "Renormalizing MLP input features into
>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>>>> The program has been running for more than 24 hours. 'nvidia-smi'
>>>>> said 'nnet-forward' is still running on a Tesla K20m...
>>>>> How long does it normally take? Is there something going wrong?
>>>>> Please help.
>>>>>
>>>>> The log is posted below.
>>>>> Thank you
>>>>> Xingyu
>>>>>
>>>>> ============================================================================
>>>>>             DNN Hybrid Training & Decoding (Karel's recipe)
>>>>> ============================================================================
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>>>> #
>>>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>> # INFO
>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
>>>>> dir       : exp/dnn4_pretrain-dbn
>>>>> Train-set : data-fmllr-tri3/train
>>>>>
>>>>> # PREPARING FEATURES
>>>>> Preparing train/cv lists
>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>>>> apply_cmvn disabled (per speaker norm. on input features)
>>>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>>>>> 40
>>>>> Using splice ± 5 , step 1
>>>>> Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>> compute-cmvn-stats ark:- -
>>>>> cmvn-to-nnet - -
>>>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -