From: Xingyu Na <asr...@gm...> - 2014-10-24 07:55:04
Thank you Dan and Alex. It turns out that I needed to set 'nvidia-smi -c 1'
to continue here (I don't know why...). Now I understand how that pipelined
command works. Sorry for saying "Is there a bug" in the previous email...
(A few sketches expanding on the commands discussed below, including the
pipe mechanics and the compute-mode setting, are appended after the quoted
thread.)

Regards,
Xingyu

On 10/24/2014 03:46 PM, Alexander Solovets wrote:
> Hi Xingyu,
>
> If you are concerned about whether the process has hung or not, you can
> check the output of `ps <PID>`, where <PID> is the process id. If you see
> 'S' in the STAT field, like
>
>   PID TTY   STAT  TIME COMMAND
> 11891 pts/5 S+    0:00 cat
>
> then the process is sleeping. Otherwise you should see 'R', like:
>
>   PID TTY   STAT  TIME COMMAND
> 11909 pts/5 R+    0:01 cat
>
> On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote:
>> Thank you so much Dan.
>> The script which causes the halting is:
>>
>> nnet-forward --use-gpu=yes \
>>   $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
>>   ark:- 2>$dir/log/cmvn_glob_fwd.log |\
>> compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
>> nnet-concat --binary=false $feature_transform_old - $feature_transform
>>
>> and the command that is running is:
>>
>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>> ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-
>>
>> If I understand it correctly, nnet-forward is piping its output to
>> compute-cmvn-stats (although apply_cmvn is false), followed by
>> cmvn-to-nnet and nnet-concat.
>> The problem, I think, is that there is an extra '| ark:-'. It means that
>> the output of nnet-forward is being piped into 'ark:-', which is not an
>> executable.
>> Is there a bug here?
>>
>> Regards,
>> Xingyu
>>
>> On 10/24/2014 12:15 PM, Daniel Povey wrote:
>>
>> I'm running the same thing at JHU to see if I can replicate your problem.
>> Dan
>>
>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
>>> cc'ing Karel, who may be able to help you, although I think he could be
>>> behind on his email.
>>> I'm afraid I don't know how to fix this.
>>> If you can figure out the full command that's being run, then it might
>>> be possible to get it in a debugger, e.g. gdb --args program arg1 arg2
>>> ..., and break into it and get a stack trace to find where it's stuck.
>>>
>>> Dan
>>>
>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>>>> Thank you Dan.
>>>> I compiled with CUDA. kaldi.mk is like this:
>>>>>> # Next section enables CUDA for compilation
>>>>>> CUDA = true
>>>>>> CUDATKDIR = /usr/local/cuda-5.5
>>>>>> CUDA_INCLUDE = -I$(CUDATKDIR)/include
>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>>>>>> CUDA_LDLIBS += -lcublas -lcudart # LDLIBS: these libs are loaded
>>>>>> later than static libs in the implicit rule
>>>> The 'make' process does not give any error, so I can claim that the
>>>> tools are compiled with CUDA successfully, right?
>>>> The problem is, although the log stops updating, I can see
>>>> 'nnet-forward' running on GPU-2.
>>>> The log in the exp dir is cmvn_glob_fwd.log and it displays:
>>>>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>>> 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>>>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use
>>>>>> 'nvidia-smi -c 1' to set compute exclusive mode
>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from
>>>>>> 4 GPUs
>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257)
>>>>>> cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M,
>>>>>> free/total:0.983228
>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257)
>>>>>> cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M,
>>>>>> free/total:0.983228
>>>> and no more. I have 4 GPU cards installed, all the same model.
>>>> BTW, my configure command is:
>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes
>>>> --cudatk-dir=/usr/local/cuda-5.5
>>>>
>>>> Am I doing something wrong? Why is 'nnet-forward' running on the GPU
>>>> while the log has stopped updating?
>>>>
>>>> Thank you and best regards,
>>>> Xingyu
>>>>
>>>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>>>
>>>> Possibly you did not compile for CUDA. The logs should say which GPU
>>>> you are using (look in the dir for *.log). If the configure script does
>>>> not find nvcc, it will not use CUDA. Grep for CUDA in kaldi.mk to see.
>>>>
>>>> Dan
>>>>
>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>>>> Hi, I'm new to this community.
>>>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>>>> Training & Decoding part.
>>>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and is
>>>>> still running.
>>>>> I checked the script and found that it is stuck calling nnet-forward
>>>>> for "Renormalizing MLP input features into
>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>>>> The program has been running for more than 24 hours.
>>>>> 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m...
>>>>> How long does it normally take? Is there something going wrong?
>>>>> Please help.
>>>>>
>>>>> The log is posted below.
>>>>> Thank you
>>>>> Xingyu
>>>>>
>>>>> ============================================================================
>>>>> DNN Hybrid Training & Decoding (Karel's recipe)
>>>>> ============================================================================
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>> exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3
>>>>> data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test -->
>>>>> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans
>>>>> exp/tri3/decode_test
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>> exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3
>>>>> data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev -->
>>>>> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans
>>>>> exp/tri3/decode_dev
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir
>>>>> exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3
>>>>> data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train -->
>>>>> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans
>>>>> exp/tri3_ali
>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train
>>>>> data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh:
>>>>> reducing #utt from 3696 to 3320
>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh:
>>>>> reducing #utt from 3696 to 376
>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20
>>>>> data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>>>> #
>>>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20
>>>>> data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>> # INFO
>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a
>>>>> stack of RBMs
>>>>> dir : exp/dnn4_pretrain-dbn
>>>>> Train-set : data-fmllr-tri3/train
>>>>>
>>>>> # PREPARING FEATURES
>>>>> Preparing train/cv lists
>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local
>>>>> ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>>>> apply_cmvn disabled (per speaker norm. on input features)
>>>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp
>>>>> ark:-
>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats
>>>>> scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>>>>> 40
>>>>> Using splice ± 5, step 1
>>>>> Renormalizing MLP input features into
>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>> compute-cmvn-stats ark:- -
>>>>> cmvn-to-nnet - -
>>>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet -
>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading
>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -
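----------------------------------------------------------------------

What the "extra '| ark:-'" actually is: Kaldi's extended filenames treat a
quoted argument ending in '|' as an input pipe that the binary opens itself,
and a bare 'ark:-' as stdin or stdout. So the command in the log has no
stray shell pipe; 'ark:copy-feats ... ark:- |' is a single rspecifier
argument (ps and nvidia-smi do not show the quotes, which is why it looked
malformed). A minimal sketch of the equivalent plain-shell pipeline, built
only from the commands quoted in this thread (illustrative, not the literal
code of pretrain_dbn.sh):

  # copy-feats supplies the features that nnet-forward would otherwise read
  # through its 'ark:...|' input-pipe rspecifier; nnet-forward's final bare
  # 'ark:-' means "write the output archive to stdout".
  copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | \
  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet ark:- ark:- | \
  compute-cmvn-stats ark:- - | \
  cmvn-to-nnet - - | \
  nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - \
      exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet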
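On the 'nvidia-smi -c 1' fix: the WARNING in cmvn_glob_fwd.log suggests
exactly this. The thread does not pin down why the default (shared) compute
mode made the job stall, but compute-exclusive mode guarantees each process
its own GPU. A sketch of setting and verifying it (requires root; the
numeric mode values are driver-dependent, so check 'nvidia-smi -h' on your
system):

  # Put the GPUs into compute-exclusive mode, as the Kaldi log suggests,
  # then confirm the mode actually changed.
  sudo nvidia-smi -c 1
  nvidia-smi -q | grep -i 'compute mode'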
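Alex's 'ps' check can also be turned into a simple poll, which helps
distinguish a slow job from a stuck one. A hypothetical one-liner along
those lines (note the PID list is captured once, when watch starts):

  # Re-print the state of the pipeline's processes every 5 seconds;
  # a genuinely stuck pipeline sits in 'S' (sleeping) indefinitely.
  watch -n 5 "ps -o pid,stat,time,cmd -p $(pgrep -d, -f 'nnet-forward|compute-cmvn-stats')"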
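Dan's debugger suggestion also works on a process that is already running:
instead of launching under 'gdb --args', attach by PID and dump the stacks.
A minimal sketch using standard gdb options (depending on the system's
ptrace settings you may need root; assumes a single matching process):

  # Attach non-interactively, print a backtrace of every thread to see
  # where the binary is blocked; -batch detaches and exits afterwards,
  # so the process keeps running.
  pid=$(pgrep -n -f nnet-forward)
  gdb -p "$pid" -batch -ex 'thread apply all bt'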
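Finally, to double-check Dan's point about whether the tools were actually
built with CUDA, two quick checks work; the binary path below assumes the
standard Kaldi source layout (src/nnetbin/):

  # 'CUDA = true' and CUDATKDIR should appear in kaldi.mk, and the binary
  # should link against the CUDA runtime libraries.
  grep CUDA kaldi.mk
  ldd src/nnetbin/nnet-forward | grep -i 'cublas\|cudart'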