From: Vesely K. <ive...@fi...> - 2014-10-24 10:32:31

Hi,
The reason is the GPU "compute mode", which has the following behavior
with Kaldi:
- default : the OS selects the GPU with GPU-ID '0' by default (i.e. more
  processes use the same GPU, which is slow) [BAD]
- process/thread exclusive : the OS selects a free GPU that is not locked
  by another process, or raises an error [RECOMMENDED]
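Exclusive mode is set with nvidia-smi; a minimal sketch (it needs root,
and the numeric codes depend on the driver version: on older drivers
'-c 1' is EXCLUSIVE_THREAD, on newer ones '-c 3' is EXCLUSIVE_PROCESS,
see 'nvidia-smi -h' on your machine):

  # set compute-exclusive mode on every GPU (run as root)
  sudo nvidia-smi -c 3
  # or per card, e.g. for the 4 GPUs mentioned in this thread:
  for i in 0 1 2 3; do sudo nvidia-smi -i $i -c 3; done
  # verify the current setting:
  nvidia-smi -q | grep "Compute Mode"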
Best regards,
Karel

On 10/24/2014 09:54 AM, Xingyu Na wrote:
> Thank you Dan and Alex.
> It turns out that I needed to set 'nvidia-smi -c 1' to continue
> (I don't know why...).
> Now I understand how that pipelined command works.
> Sorry for saying "Is there a bug" in the previous email...
>
> Regards,
> Xingyu
>
> On 10/24/2014 03:46 PM, Alexander Solovets wrote:
>> Hi Xingyu,
>>
>> If you are wondering whether the process has hung, you can look at
>> the output of `ps <PID>`, where <PID> is the process id. If you see
>> 'S' in the STAT field, like
>>
>> PID TTY STAT TIME COMMAND
>> 11891 pts/5 S+ 0:00 cat
>>
>> then the process is sleeping. Otherwise you should see 'R', like:
>>
>> PID TTY STAT TIME COMMAND
>> 11909 pts/5 R+ 0:01 cat
>>
>> On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote:
>>> Thank you so much Dan.
>>> The script which causes the halting is:
>>>
>>> nnet-forward --use-gpu=yes \
>>>   $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
>>>   ark:- 2>$dir/log/cmvn_glob_fwd.log |\
>>> compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
>>> nnet-concat --binary=false $feature_transform_old - $feature_transform
>>>
>>> and the command that is running is:
>>>
>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>   ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-
>>>
>>> If I understand it correctly, nnet-forward is piping its output to
>>> compute-cmvn-stats (although apply_cmvn is false), followed by
>>> cmvn-to-nnet and nnet-concat.
>>> The problem, I think, is that there is an extra '| ark:-'. It means
>>> that the output of nnet-forward is being piped into 'ark:-', which
>>> is not an executable.
>>> Is there a bug here?
>>>
>>> Regards,
>>> Xingyu
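(On the '| ark:-' question above: there is no extra shell pipe. The ps
listing just prints the argument without its quotes. In the script it
is a single quoted Kaldi rspecifier whose trailing '|' tells
nnet-forward to spawn the inner command itself and read features from
its stdout, while the final bare ark:- writes the output to stdout.
Roughly, using the paths from the log:

  # the quoted rspecifier is ONE argument; nnet-forward runs
  # "copy-feats ... ark:-" internally and reads from that pipe,
  # then writes the forwarded features to its own stdout (ark:-),
  # where compute-cmvn-stats picks them up
  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
    'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' \
    ark:- | compute-cmvn-stats ark:- -

so the pipeline itself is fine.)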
>>>
>>> On 10/24/2014 12:15 PM, Daniel Povey wrote:
>>>
>>> I'm running the same thing at JHU to see if I can replicate your
>>> problem.
>>> Dan
>>>
>>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
>>>> cc'ing Karel, who may be able to help you, although I think he
>>>> could be behind on his email.
>>>> I'm afraid I don't know how to fix this.
>>>> If you can figure out the full command that's being run, then it
>>>> might be possible to get it into a debugger, e.g.
>>>> gdb --args program arg1 arg2 ..., and break into it and get a
>>>> stack trace to find where it's stuck.
>>>>
>>>> Dan
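(Since the nnet-forward process is already running, you can also
attach to it instead of restarting it under gdb; a minimal sketch,
assuming gdb is installed and you are allowed to ptrace the process:

  # attach to the stuck process by PID and dump its stack
  gdb -p $(pgrep -f nnet-forward)
  (gdb) bt          # backtrace: shows where it is blocked
  (gdb) detach      # let the process continue
  (gdb) quit

The backtrace usually makes it obvious whether it is blocked in a
CUDA call or waiting on a pipe.)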
>>>>
>>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>>>>> Thank you Dan.
>>>>> I compiled with CUDA. kaldi.mk is like this:
>>>>>>> #Next section enables CUDA for compilation
>>>>>>> CUDA = true
>>>>>>> CUDATKDIR = /usr/local/cuda-5.5
>>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded
>>>>>>>   later than static libs in implicit rule
>>>>> The 'make' process does not give any errors, so I can claim that
>>>>> the tools are compiled with CUDA successfully, right?
>>>>> The problem is, although the log stops updating, I can see
>>>>> 'nnet-forward' running on GPU-2.
>>>>> The log in the exp dir is cmvn_glob_fwd.log and it shows:
>>>>>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>>>>   'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>>>>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion:
>>>>>>>   use 'nvidia-smi -c 1' to set compute exclusive mode
>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting
>>>>>>>   from 4 GPUs
>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257)
>>>>>>>   cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M,
>>>>>>>   free/total:0.983228
>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257)
>>>>>>>   cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M,
>>>>>>>   free/total:0.983228
>>>>> and no more. I have 4 GPU cards installed, all the same model.
>>>>> BTW, my configure command is:
>>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes
>>>>>   --cudatk-dir=/usr/local/cuda-5.5
>>>>>
>>>>> Am I doing something wrong? Why is 'nnet-forward' running on the
>>>>> GPU while the log stops updating?
>>>>>
>>>>> Thank you and best regards,
>>>>> Xingyu
>>>>>
>>>>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>>>>
>>>>> Possibly you did not compile for CUDA. The logs should say which
>>>>> GPU you are using (look in the dir for *.log). If the configure
>>>>> script does not see nvcc on the command line, it will not use
>>>>> CUDA. Grep for CUDA in kaldi.mk to see.
>>>>>
>>>>> Dan
>>>>>
>>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>>>>> Hi, I'm new to this community.
>>>>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>>>>> Training & Decoding part.
>>>>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and
>>>>>> is still running.
>>>>>> I checked the script and found that it is stuck calling
>>>>>> nnet-forward for "Renormalizing MLP input features into
>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>>>>> The program has been running for more than 24 hours.
>>>>>> 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m...
>>>>>> How long does it normally take? Is something going wrong?
>>>>>> Please help.
>>>>>>
>>>>>> The log is posted below.
>>>>>> Thank you,
>>>>>> Xingyu
>>>>>>
>>>>>> ============================================================================
>>>>>>           DNN Hybrid Training & Decoding (Karel's recipe)
>>>>>> ============================================================================
>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>>>   exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3
>>>>>>   data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test -->
>>>>>>   data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans
>>>>>>   exp/tri3/decode_test
>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>>>   exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3
>>>>>>   data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev -->
>>>>>>   data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans
>>>>>>   exp/tri3/decode_dev
>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>>>   exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3
>>>>>>   data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train -->
>>>>>>   data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans
>>>>>>   exp/tri3_ali
>>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train
>>>>>>   data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh:
>>>>>>   reducing #utt from 3696 to 3320
>>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh:
>>>>>>   reducing #utt from 3696 to 376
>>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20
>>>>>>   data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>>>>> #
>>>>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20
>>>>>>   data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>>> # INFO
>>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a
>>>>>>   stack of RBMs
>>>>>> dir : exp/dnn4_pretrain-dbn
>>>>>> Train-set : data-fmllr-tri3/train
>>>>>>
>>>>>> # PREPARING FEATURES
>>>>>> Preparing train/cv lists
>>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local
>>>>>>   ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>>>>> apply_cmvn disabled (per speaker norm. on input features)
>>>>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp
>>>>>>   ark:-
>>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats
>>>>>>   scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return
>>>>>>   status 13
>>>>>> 40
>>>>>> Using splice ± 5 , step 1
>>>>>> Renormalizing MLP input features into
>>>>>>   exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>>> compute-cmvn-stats ark:- -
>>>>>> cmvn-to-nnet - -
>>>>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet -
>>>>>>   exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading
>>>>>>   exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300