From: Xingyu Na <asr...@gm...> - 2014-10-24 07:19:06
Thank you so much, Dan. The script which causes the halting is:

    nnet-forward --use-gpu=yes \
      $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
      ark:- 2>$dir/log/cmvn_glob_fwd.log |\
    compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
    nnet-concat --binary=false $feature_transform_old - $feature_transform

and the command that is running is:

    nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
      ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-

If I understand it correctly, nnet-forward is piping its output to
compute-cmvn-stats (although apply_cmvn is false), and then through
cmvn-to-nnet and nnet-concat. The problem, I think, is that there is an
extra '| ark:-'. It means that the output of nnet-forward is being piped
into 'ark:-', which is not an executable. Is there a bug here?
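For reference, this is how I read the fully expanded pipeline, put
together from the script variables and the paths in cmvn_glob_fwd.log
below. This is only a sketch of my understanding, not the exact command
line the script built; note that in the log the copy-feats sub-command
appears single-quoted, as one rspecifier argument to nnet-forward:

    # Reconstructed pipeline (my reading): nnet-forward reads features from
    # the quoted copy-feats sub-command, writes the transformed features to
    # stdout (ark:-), and the shell pipes that into compute-cmvn-stats,
    # cmvn-to-nnet and finally nnet-concat.
    nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
        'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:- \
        2>exp/dnn4_pretrain-dbn/log/cmvn_glob_fwd.log |\
      compute-cmvn-stats ark:- - |\
      cmvn-to-nnet - - |\
      nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - \
        exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet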
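In the meantime I will try the gdb suggestion from Dan below, attaching
to the already-running process to get a stack trace. A minimal sketch of
what I plan to run (the PID 12345 is just a placeholder; I would take the
real one from ps or nvidia-smi):

    $ gdb -p 12345      # attach to the running nnet-forward
    (gdb) bt            # print a backtrace to see where it is stuck
    (gdb) detach        # leave the process running
    (gdb) quit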
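I also notice the WARNING about compute exclusive mode in the log below;
if that is relevant, the fix would just be the command the warning itself
suggests (needs root; adding '-i <gpu-id>' would restrict it to one card):

    $ sudo nvidia-smi -c 1    # set compute exclusive mode, as the WARNING suggests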
Regards,
Xingyu

On 10/24/2014 12:15 PM, Daniel Povey wrote:
> I'm running the same thing at JHU to see if I can replicate your problem.
> Dan
>
> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
>> cc'ing Karel, who may be able to help you, although I think he could
>> be behind on his email.
>> I'm afraid I don't know how to fix this.
>> If you can figure out the full command that's being run, then it might
>> be possible to get it in a debugger, e.g. gdb --args program arg1
>> arg2 ..., and break into it and get a stack trace to find where it's
>> stuck.
>>
>> Dan
>>
>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
>>> Thank you Dan.
>>> I compiled with CUDA. kaldi.mk is like this:
>>>
>>>> #Next section enables CUDA for compilation
>>>> CUDA = true
>>>> CUDATKDIR = /usr/local/cuda-5.5
>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include
>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule
>>>
>>> The 'make' process does not give any error, so I can claim that the
>>> tools are compiled with CUDA successfully, right?
>>> The problem is, although the log stops updating, I can see that
>>> 'nnet-forward' is running on GPU 2.
>>> The log in the exp dir is cmvn_glob_fwd.log and it displays:
>>>
>>>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
>>>
>>> and no more. I have 4 GPU cards installed, all the same model.
>>> BTW, my configure command is:
>>>
>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5
>>>
>>> Am I doing something wrong? Why is 'nnet-forward' running on the GPU
>>> while the log has stopped updating?
>>>
>>> Thank you and best regards,
>>> Xingyu
>>>
>>> On 10/24/2014 10:31 AM, Daniel Povey wrote:
>>>> Possibly you did not compile for CUDA. The logs should say which
>>>> GPU you are using (look in the dir, for *.log). If the configure
>>>> script does not see nvcc on the command line, it will not use CUDA.
>>>> Grep for CUDA in kaldi.mk to see.
>>>>
>>>> Dan
>>>>
>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
>>>>> Hi, I'm new in this community.
>>>>> I am running the TIMIT example s5, all the way to the DNN Hybrid
>>>>> Training & Decoding part.
>>>>> The script "steps/nnet/pretrain_dbn.sh" was called yesterday and is
>>>>> still running.
>>>>> I checked the script and found that it is stuck at calling
>>>>> nnet-forward for "Renormalizing MLP input features into
>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet".
>>>>> The program has been running for more than 24 hours. 'nvidia-smi'
>>>>> said 'nnet-forward' is still running on a Tesla K20m...
>>>>> How long does it normally take? Is there something going wrong?
>>>>> Please help.
>>>>>
>>>>> The log is posted below.
>>>>> Thank you
>>>>> Xingyu
>>>>>
>>>>> ============================================================================
>>>>>             DNN Hybrid Training & Decoding (Karel's recipe)
>>>>> ============================================================================
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
>>>>> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>> # Started at Wed Oct 22 16:11:09 CST 2014
>>>>> #
>>>>> steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
>>>>> # INFO
>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
>>>>> dir       : exp/dnn4_pretrain-dbn
>>>>> Train-set : data-fmllr-tri3/train
>>>>>
>>>>> # PREPARING FEATURES
>>>>> Preparing train/cv lists
>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp
>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
>>>>> apply_cmvn disabled (per speaker norm. on input features)
>>>>> Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
>>>>> 40
>>>>> Using splice ± 5 , step 1
>>>>> Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>> compute-cmvn-stats ark:- -
>>>>> cmvn-to-nnet - -
>>>>> nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -