From: Daniel P. <dp...@gm...> - 2014-10-24 04:15:51
I'm running the same thing at JHU to see if I can replicate your problem.

Dan

On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
> cc'ing Karel who may be able to help you, although I think he could be
> behind on his email.
> I'm afraid I don't know how to fix this.
> If you can figure out the full command that's being run then it might be
> possible to get it in a debugger, e.g. gdb --args program arg1 arg2 ...,
> and break into it and get a stack trace to find where it's stuck.
>
> Dan
>
> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>
>> Thank you Dan.
>> I compiled with CUDA. kaldi.mk is like this:
>>
>> #Next section enables CUDA for compilation
>> CUDA = true
>> CUDATKDIR = /usr/local/cuda-5.5
>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule
>>
>> The 'make' process does not give any error, so I can claim that the tools
>> are compiled with CUDA successfully, right?
>> The problem is, although the log stops updating, I can see 'nnet-forward'
>> is running on GPU-2.
>> The log in the exp dir is cmvn_glob_fwd.log and it displays:
>>
>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>
>> and no more. I have 4 GPU cards installed, all the same model.
>> BTW, my configure command is:
>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5
>>
>> Am I doing something wrong? Why is 'nnet-forward' running on the GPU while
>> the log stops updating?
>>
>> Thank you and best regards,
>> Xingyu
>>
>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>
>> Possibly you did not compile for CUDA. The logs should say which GPU you
>> are using (look in the dir, for *.log). If the configure script does not
>> see nvcc on the command line, it will not use CUDA. Grep for CUDA in
>> kaldi.mk to see.
>>
>> Dan
>>
>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>
>>> Hi, I'm new in this community.
>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>> Training & Decoding part.
>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and is
>>> still running.
>>> I checked the script and found that it is stuck at calling nnet-forward
>>> for "Renormalizing MLP input features into
>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>> The program has been running for more than 24 hours.
>>> 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m...
>>> How long does it normally take? Is there something going wrong?
>>> Please help.
>>>
>>> The log is posted below.
>>> Thank you
>>> Xingyu
>>>
>>> ============================================================================
>>>             DNN Hybrid Training & Decoding (Karel's recipe)
>>> ============================================================================
>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>> #
>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>> # INFO
>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
>>> dir : exp/dnn4_pretrain-dbn
>>> Train-set : data-fmllr-tri3/train
>>>
>>> # PREPARING FEATURES
>>> Preparing train/cv lists
>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>> apply_cmvn disabled (per speaker norm. on input features)
>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>>> 40
>>> Using splice ± 5 , step 1
>>> Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>> compute-cmvn-stats ark:- -
>>> cmvn-to-nnet - -
>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -
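
[Editor's note] For reference, a minimal sketch of the debugger approach Dan describes above, assuming the stuck binary is the nnet-forward process visible in nvidia-smi; the PID shown is hypothetical, and the re-run command is copied from the quoted cmvn_glob_fwd.log:

    # Find the PID of the stuck process (output format varies; PID below is made up).
    pgrep -fl nnet-forward

    # Either attach to the already-running process ...
    gdb -p 12345
    # ... or re-run the logged command under gdb from egs/timit/s5, as Dan suggests:
    # gdb --args nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
    #     'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-

    # At the (gdb) prompt, dump a backtrace of every thread to see where it is stuck,
    # then detach without killing the job:
    (gdb) thread apply all bt
    (gdb) detach
    (gdb) quit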
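
[Editor's note] Regarding the "compute exclusive mode" WARNING in the quoted log: a hedged sketch of two generic nvidia-smi / CUDA knobs that are sometimes used in this situation. These are not Kaldi-specific, the mode numbers depend on the driver generation, and the GPU index 2 is only an example:

    # Put the cards in compute-exclusive mode, as the Kaldi warning suggests
    # (needs root; on drivers of that era mode 1 was EXCLUSIVE_THREAD, newer
    # drivers use mode 3 = EXCLUSIVE_PROCESS):
    sudo nvidia-smi -c 1

    # Alternatively, restrict the job to a single card via the standard CUDA
    # environment variable, so SelectGpuIdAuto only ever probes that device:
    CUDA_VISIBLE_DEVICES=2 steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 \
        data-fmllr-tri3/train exp/dnn4_pretrain-dbn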