From: Xingyu Na <asr...@gm...> - 2014-10-24 02:17:56
Hi, I'm new to this community.

I am running the TIMIT example s5, all the way to the DNN Hybrid Training & Decoding part. The script "steps/nnet/pretrain_dbn.sh" was started yesterday and is still running. I checked the script and found that it is stuck at the nnet-forward call for "Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet". The program has been running for more than 24 hours. 'nvidia-smi' says 'nnet-forward' is still running on a Tesla K20m... How long does this step normally take? Is something going wrong? Please help.

The log is posted below.
Thank you
Xingyu

============================================================================
                DNN Hybrid Training & Decoding (Karel's recipe)
============================================================================
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 data-fmllr-tri3/test/log data-fmllr-tri3/test/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/test --> data-fmllr-tri3/test, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_test
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/dev --> data-fmllr-tri3/dev, using : raw-trans None, gmm exp/tri3, trans exp/tri3/decode_dev
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl --transform-dir exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 data-fmllr-tri3/train/log data-fmllr-tri3/train/data
steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr
steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, data/train --> data-fmllr-tri3/train, using : raw-trans None, gmm exp/tri3, trans exp/tri3_ali
utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10
/nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 3320
/nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: reducing #utt from 3696 to 376
# steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
# Started at Wed Oct 22 16:11:09 CST 2014
#
steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn
# INFO
steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief Network as a stack of RBMs
dir : exp/dnn4_pretrain-dbn
Train-set : data-fmllr-tri3/train

# PREPARING FEATURES
Preparing train/cv lists
3696 exp/dnn4_pretrain-dbn/train.scp
copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp
LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 feature matrices.
apply_cmvn disabled (per speaker norm. on input features)
Getting feature dim : copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:-
WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero return status 13
40
Using splice ± 5 , step 1
Renormalizing MLP input features into exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
compute-cmvn-stats ark:- -
cmvn-to-nnet - -
nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet
LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp/dnn4_pretrain-dbn/tr_splice5-1.nnet
LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating -
From: Daniel P. <dp...@gm...> - 2014-10-24 02:32:00
Possibly you did not compile for CUDA. The logs should say which GPU you are using (look in the dir for *.log). If the configure script does not see nvcc on the command line, it will not use CUDA. Grep for CUDA in kaldi.mk to see.

Dan

On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
> Hi, I'm new to this community.
> [...]
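For concreteness, a minimal sketch of those checks run from the recipe directory ($KALDI_ROOT is a placeholder for your Kaldi checkout; exp/dnn4_pretrain-dbn is the experiment directory used in this thread):

  # was the build configured with CUDA?
  grep CUDA "$KALDI_ROOT/src/kaldi.mk"
  # was nvcc visible when ./configure ran?
  which nvcc
  # the per-step logs of the pretraining live under the experiment directory
  ls -lt exp/dnn4_pretrain-dbn/log/ | head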
From: Xingyu Na <asr...@gm...> - 2014-10-24 04:05:34
Thank you Dan.
I compiled with CUDA. kaldi.mk looks like this:

  #Next section enables CUDA for compilation
  CUDA = true
  CUDATKDIR = /usr/local/cuda-5.5
  CUDA_INCLUDE= -I$(CUDATKDIR)/include
  CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA
  CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include
  CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib
  CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64
  CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later than static libs in implicit rule

The 'make' process does not give any error, so can I claim that the tools were compiled with CUDA successfully?
The problem is that, although the log has stopped updating, I can see 'nnet-forward' running on GPU 2.
The log in the exp dir is cmvn_glob_fwd.log and it shows:

  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:-
  WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) Suggestion: use 'nvidia-smi -c 1' to set compute exclusive mode
  LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) Selecting from 4 GPUs
  LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228
  LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(1): Tesla K20m free:4719M, used:80M, total:4799M, free/total:0.983228

and nothing more. I have 4 GPU cards installed, all the same model.
BTW, my configure command was:

  ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes --cudatk-dir=/usr/local/cuda-5.5

Am I doing something wrong? Why is 'nnet-forward' running on the GPU while the log has stopped updating?

Thank you and best regards,
Xingyu

On 10/24/2014 10:31 AM, Daniel Povey wrote:
> Possibly you did not compile for CUDA. The logs should say which GPU
> you are using (look in the dir for *.log). If the configure script
> does not see nvcc on the command line, it will not use CUDA. Grep for
> CUDA in kaldi.mk to see.
>
> Dan
>
> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na <asr...@gm...> wrote:
> > [...]
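Since the kaldi.mk above links -lcudart and -lcublas as shared libraries, one further sanity check is to inspect the binary's dynamic dependencies. A sketch only; the $KALDI_ROOT path is an assumed placeholder for your own tree:

  # if nothing is printed, the binary was most likely built without CUDA support
  ldd "$KALDI_ROOT/src/nnetbin/nnet-forward" | grep -iE 'cudart|cublas'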
From: Daniel P. <dp...@gm...> - 2014-10-24 04:11:37
cc'ing Karel, who may be able to help you, although I think he could be behind on his email.
I'm afraid I don't know how to fix this.
If you can figure out the full command that's being run, it might be possible to get it into a debugger, e.g. gdb --args program arg1 arg2 ..., break into it, and get a stack trace to find where it's stuck.

Dan

On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> wrote:
> Thank you Dan.
> I compiled with CUDA. kaldi.mk looks like this:
> [...]
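Since the process is already running, attaching to it is an alternative to restarting it under the debugger. A minimal sketch, assuming gdb and pgrep are installed and only one nnet-forward process is alive:

  # grab the PID of the stuck binary and dump a stack trace of every thread
  pid=$(pgrep -f nnet-forward)
  # attaching may require sudo depending on the system's ptrace settings
  gdb -p "$pid" -batch -ex 'thread apply all bt'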
From: Daniel P. <dp...@gm...> - 2014-10-24 04:15:51
I'm running the same thing at JHU to see if I can replicate your problem.
Dan

On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote:
> cc'ing Karel, who may be able to help you, although I think he could be
> behind on his email.
> [...]
From: Xingyu Na <asr...@gm...> - 2014-10-24 07:19:06
Thank you so much Dan.
The part of the script where it halts is:

  nnet-forward --use-gpu=yes \
    $feature_transform_old "$(echo $feats | sed 's|train.scp|train.scp.10k|')" \
    ark:- 2>$dir/log/cmvn_glob_fwd.log |\
  compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\
  nnet-concat --binary=false $feature_transform_old - $feature_transform

and the command that is running is:

  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:-

If I understand it correctly, nnet-forward is piping its output to compute-cmvn-stats (although apply_cmvn is false), followed by cmvn-to-nnet and nnet-concat.
The problem, I think, is that there is an extra '| ark:-'. It means the output of nnet-forward is being piped into 'ark:-', which is not an executable.
Is there a bug here?

Regards,
Xingyu

On 10/24/2014 12:15 PM, Daniel Povey wrote:
> I'm running the same thing at JHU to see if I can replicate your problem.
> Dan
>
> [...]
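For reference, the pipeline written out with the values from this run looks roughly like the sketch below (assembled from the script fragment and the logs above, not the verbatim recipe). There is no stray pipe: the quoted 'ark:... |' argument is a single Kaldi read-specifier, so nnet-forward itself spawns copy-feats and reads its stdout, while the final ark:- is the write-specifier telling nnet-forward to write to its own stdout, which the shell then pipes onward:

  # input rspecifier: nnet-forward runs copy-feats internally and reads its output
  # output wspecifier ark:- : write the forwarded features to stdout for the shell pipe
  nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet \
      'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- |' ark:- \
      2> exp/dnn4_pretrain-dbn/log/cmvn_glob_fwd.log \
    | compute-cmvn-stats ark:- - \
    | cmvn-to-nnet - - \
    | nnet-concat --binary=false exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - \
        exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet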
From: Alexander S. <aso...@gm...> - 2014-10-24 07:47:09
Hi Xingyu,

If you want to know whether the process has hung or not, you can look at the output of `ps <PID>`, where <PID> is the process id. If you see 'S' in the STAT field, like

    PID TTY      STAT   TIME COMMAND
  11891 pts/5    S+     0:00 cat

then the process is sleeping. Otherwise you should see 'R', like:

    PID TTY      STAT   TIME COMMAND
  11909 pts/5    R+     0:01 cat

On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote:
> Thank you so much Dan.
> The part of the script where it halts is:
> [...]

--
Sincerely,
Alexander
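A one-liner variant of the same check, looking the process up by name instead of by PID (a sketch; it assumes a single nnet-forward process is running):

  # STAT: R = running, S = sleeping, D = uninterruptible (usually I/O) wait
  ps -o pid,stat,etime,cmd -p "$(pgrep -f nnet-forward)"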
From: Xingyu Na <asr...@gm...> - 2014-10-24 07:55:04
Thank you Dan and Alex.
It turns out that I needed to run 'nvidia-smi -c 1' to get past this point (I don't know why...).
Now I understand how that pipelined command works.
Sorry for saying "Is there a bug" in the previous email...

Regards,
Xingyu

On 10/24/2014 03:46 PM, Alexander Solovets wrote:
> Hi Xingyu,
>
> If you want to know whether the process has hung or not, you can look
> at the output of `ps <PID>`, where <PID> is the process id.
> [...]
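For anyone hitting the same issue, a minimal sketch of setting that mode on all four cards (requires root; '-c 1' is the thread-exclusive mode suggested by the Kaldi warning on drivers of this vintage, while newer drivers typically use '-c 3', EXCLUSIVE_PROCESS, instead):

  # put each GPU into compute-exclusive mode so every Kaldi job gets its own card
  for i in 0 1 2 3; do
    sudo nvidia-smi -i "$i" -c 1
  done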
From: Vesely K. <ive...@fi...> - 2014-10-24 10:32:31
Hi, The reason is in the "computation mode", which has with Kaldi following behavior: - default : OS selects GPU with GPU-ID '0' by default (i.e. more processes use same GPU which is slow) [BAD] - process/thread exclusive : OS selects a free GPU which not locked to another process or raises error [RECOMMENDED] Best regards, Karel On 10/24/2014 09:54 AM, Xingyu Na wrote: > Thank you Dan and Alex. > It turns out that I need to set 'nvidia-smi -c 1' to continue here(don't > know why....). > Now I understand how that pipelined command works. > Sorry for saying "Is there a bug" in the previous email.... > > Regards, > Xingyu > > On 10/24/2014 03:46 PM, Alexander Solovets wrote: >> Hi Xingyu, >> >> If you are concerned whether the process hung up or not, you can see >> the output of `ps <PID>` where <PID> is the process id. If you see 'S' >> in STAT fields, like >> >> PID TTY STAT TIME COMMAND >> 11891 pts/5 S+ 0:00 cat >> >> Then the processing is sleeping. Otherwise you should see 'R' like: >> >> PID TTY STAT TIME COMMAND >> 11909 pts/5 R+ 0:01 cat >> >> On Fri, Oct 24, 2014 at 6:18 PM, Xingyu Na <asr...@gm...> wrote: >>> Thank you so much Dan. >>> The script which causes the halting is : >>> >>> nnet-forward --use-gpu=yes \ >>> $feature_transform_old "$(echo $feats | sed >>> 's|train.scp|train.scp.10k|')" \ >>> ark:- 2>$dir/log/cmvn_glob_fwd.log |\ >>> compute-cmvn-stats ark:- - | cmvn-to-nnet - - |\ >>> nnet-concat --binary=false $feature_transform_old - $feature_transform >>> >>> and the command that is running is: >>> >>> nnet-forward --use-gpu=yes exp/dnn4_pretrain-dbn/tr_splice5-1.nnet >>> ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k ark:- | ark:- >>> >>> If I understand it correctly, nnet-forward is piping its output to >>> compute-cmvn-stats (although apply_cmvn is false), and followed by >>> cmvn-to-nnet and nnet-concat. >>> The problem, I think, is that there is an extra '| ark:-'. It means that the >>> output of nnet-forward is being piped into 'ark:-', which is not a >>> executable. >>> Is there is bug here? >>> >>> Regards, >>> Xingyu >>> >>> >>> On 10/24/2014 12:15 PM, Daniel Povey wrote: >>> >>> I'm running the same thing at JHU to see if I can replicate your problem. >>> Dan >>> >>> >>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey <dp...@gm...> wrote: >>>> cc'ing Karel who may be able to help you, although I think he could be >>>> behind on his email. >>>> I'm afraid I don't know how to fix this. >>>> If you can figure out the full command that's being run then it might be >>>> possible to get it in a debugger, e.g. gdb --args program arg1 arg2 ..., and >>>> break into it and get a stack trace to find where it's stuck. >>>> >>>> Dan >>>> >>>> >>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na <asr...@gm...> >>>> wrote: >>>>> Thank you Dan. >>>>> I compiled with CUDA. kaldi.mk is like this: >>>>>>> #Next section enables CUDA for compilation >>>>>>> CUDA = true >>>>>>> CUDATKDIR = /usr/local/cuda-5.5 >>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include >>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 -DHAVE_CUDA >>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include >>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib -Wl,-rpath,$(CUDATKDIR)/lib >>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 -Wl,-rpath,$(CUDATKDIR)/lib64 >>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are loaded later >>>>>>> than static libs in implicit rule >>>>> The 'make' process does not give any error so I can claim that the tools >>>>> are compiled with CUDA successfully, right? 
On 10/24/2014 09:54 AM, Xingyu Na wrote:
> Thank you Dan and Alex.
> It turns out that I need to set 'nvidia-smi -c 1' to continue here (don't
> know why....).

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300 |
From: Xingyu Na <asr...@gm...> - 2014-10-24 10:40:05
|
Thank you Karel.
Is that a 'must' for all CUDA-based Kaldi executables?

Regards,
Xingyu

On 10/24/2014 06:12 PM, Vesely Karel wrote:
> - process/thread exclusive : the OS selects a free GPU that is not locked by
> another process, or raises an error if none is available [RECOMMENDED]
|
From: Vesely K. <ive...@fi...> - 2014-10-24 10:44:27
|
It is a 'must' on multi-GPU machines and 'recommended' on single-GPU machines.
It is an OS-level setting which the scripts assume has already been done. The
benefit is that one does not need to specify a gpu-id in the scripts or track
manually which GPUs are in use. A sketch of checking the setting is below.
Karel.
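A small sketch of verifying the setting and reapplying it after a reboot; the query flags assume a reasonably recent nvidia-smi, and the rc.local line is only one illustrative, system-dependent way to persist it (not something the Kaldi scripts do):

  # show the compute mode currently set on each card
  nvidia-smi --query-gpu=index,name,compute_mode --format=csv
  # the mode is usually lost on reboot; one possible way to reapply it at boot
  echo 'nvidia-smi -c 1' >> /etc/rc.local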
On 10/24/2014 12:39 PM, Xingyu Na wrote:
> Thank you Karel.
> Is that a 'must' for all CUDA-based Kaldi executables?

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300 |
From: Daniel P. <dp...@gm...> - 2014-10-24 17:03:50
|
Karel,
Is there something which we need to fix here?
Why was it hanging? Was it using the CPU instead of the GPU? Was it
waiting for some kind of reply from the GPU? Had it crashed?
Dan

On Fri, Oct 24, 2014 at 6:44 AM, Vesely Karel <ive...@fi...> wrote:
> It is a 'must' on multi-GPU machines and 'recommended' on single-GPU
> machines. It is an OS-level setting which the scripts assume has already
> been done.
|
From: Vesely K. <ive...@fi...> - 2014-10-27 10:40:07
|
Dan,
I'll check it by running the TIMIT recipe without the GPU code compiled in
(a sketch of that rebuild is below). I need to figure out what could have
happened...
K.
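For reference, a sketch of reproducing the CPU-only build; the configure flags mirror the line quoted earlier in the thread, while KALDI_ROOT and the job count are assumptions about the local setup:

  cd $KALDI_ROOT/src
  ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=no
  make clean && make -j 8
  # then rerun the TIMIT s5 recipe from the DNN stage and watch the logs in exp/dnn4*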
On 10/24/2014 07:03 PM, Daniel Povey wrote:
> Karel,
> Is there something which we need to fix here?
> Why was it hanging? Was it using the CPU instead of the GPU? Was it
> waiting for some kind of reply from the GPU? Had it crashed?
> Dan

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300 |
From: Vesely K. <ve...@gm...> - 2014-10-29 13:28:16
|
Hi,
the TIMIT DNN training without CUDA is running, and it is very slow.
I'll add a script check that stops the training if CUDA is not compiled in,
assuming that typically everybody wants to train on a GPU. A sketch of such
a guard is below.
K.
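A minimal sketch of what such a guard could look like; it follows the advice earlier in the thread to grep kaldi.mk for CUDA, and the KALDI_ROOT variable and error wording are illustrative assumptions, not the actual check added to the scripts:

  # refuse to start GPU pre-training when Kaldi was built without CUDA
  if ! grep -q 'CUDA = true' $KALDI_ROOT/src/kaldi.mk; then
    echo "$0: CUDA is not compiled in (check src/kaldi.mk), refusing to train on CPU." >&2
    echo "$0: rerun src/configure with --use-cuda=yes and rebuild, or expect very slow training." >&2
    exit 1
  fi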
>> It means that the >> >>>>> output of nnet-forward is being piped into 'ark:-', which >> is not a >> >>>>> executable. >> >>>>> Is there is bug here? >> >>>>> >> >>>>> Regards, >> >>>>> Xingyu >> >>>>> >> >>>>> >> >>>>> On 10/24/2014 12:15 PM, Daniel Povey wrote: >> >>>>> >> >>>>> I'm running the same thing at JHU to see if I can replicate >> your problem. >> >>>>> Dan >> >>>>> >> >>>>> >> >>>>> On Fri, Oct 24, 2014 at 12:11 AM, Daniel Povey >> <dp...@gm... <mailto:dp...@gm...>> wrote: >> >>>>>> cc'ing Karel who may be able to help you, although I think >> he could be >> >>>>>> behind on his email. >> >>>>>> I'm afraid I don't know how to fix this. >> >>>>>> If you can figure out the full command that's being run >> then it might be >> >>>>>> possible to get it in a debugger, e.g. gdb --args program >> arg1 arg2 ..., and >> >>>>>> break into it and get a stack trace to find where it's stuck. >> >>>>>> >> >>>>>> Dan >> >>>>>> >> >>>>>> >> >>>>>> On Fri, Oct 24, 2014 at 12:05 AM, Xingyu Na >> <asr...@gm... <mailto:asr...@gm...>> >> >>>>>> wrote: >> >>>>>>> Thank you Dan. >> >>>>>>> I compiled with CUDA. kaldi.mk <http://kaldi.mk> is like >> this: >> >>>>>>>>> #Next section enables CUDA for compilation >> >>>>>>>>> CUDA = true >> >>>>>>>>> CUDATKDIR = /usr/local/cuda-5.5 >> >>>>>>>>> CUDA_INCLUDE= -I$(CUDATKDIR)/include >> >>>>>>>>> CUDA_FLAGS = -g -Xcompiler -fPIC --verbose --machine 64 >> -DHAVE_CUDA >> >>>>>>>>> CXXFLAGS += -DHAVE_CUDA -I$(CUDATKDIR)/include >> >>>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib >> -Wl,-rpath,$(CUDATKDIR)/lib >> >>>>>>>>> CUDA_LDFLAGS += -L$(CUDATKDIR)/lib64 >> -Wl,-rpath,$(CUDATKDIR)/lib64 >> >>>>>>>>> CUDA_LDLIBS += -lcublas -lcudart #LDLIBS : The libs are >> loaded later >> >>>>>>>>> than static libs in implicit rule >> >>>>>>> The 'make' process does not give any error so I can claim >> that the tools >> >>>>>>> are compiled with CUDA successfully, right? >> >>>>>>> Problem is, although the log stops updating, I can see >> 'nnet-forward' is >> >>>>>>> running on GPU-2. >> >>>>>>> The log in the exp dir is cmvn_glob_fwd.log and it displays: >> >>>>>>>>> nnet-forward --use-gpu=yes >> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet >> >>>>>>>>> 'ark:copy-feats scp:exp/dnn4_pretrain-dbn/train.scp.10k >> ark:- |' ark:- >> >>>>>>>>> WARNING (nnet-forward:SelectGpuId():cu-device.cc:130) >> Suggestion: use >> >>>>>>>>> 'nvidia-smi -c 1' to set compute exclusive mode >> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:242) >> Selecting from 4 >> >>>>>>>>> GPUs >> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) >> >>>>>>>>> cudaSetDevice(0): Tesla K20m free:4719M, used:80M, >> total:4799M, >> >>>>>>>>> free/total:0.983228 >> >>>>>>>>> LOG (nnet-forward:SelectGpuIdAuto():cu-device.cc:257) >> >>>>>>>>> cudaSetDevice(1): Tesla K20m free:4719M, used:80M, >> total:4799M, >> >>>>>>>>> free/total:0.983228 >> >>>>>>> and no more. I have 4 GPU cards installed, all same model. >> >>>>>>> BTW, my configure command is: >> >>>>>>> ./configure --atlas-root=/usr/lib/atlas-base --use-cuda=yes >> >>>>>>> --cudatk-dir=/usr/local/cuda-5.5 >> >>>>>>> >> >>>>>>> Am I doing something wrong? Why 'nnet-forward' is running >> on GPU while >> >>>>>>> log stops updating? >> >>>>>>> >> >>>>>>> Thank you and best regards, >> >>>>>>> Xingyu >> >>>>>>> >> >>>>>>> >> >>>>>>> On 10/24/2014 10:31 AM, Daniel Povey wrote: >> >>>>>>> >> >>>>>>> Possibly you did not compile for CUDA. The logs should >> say which GPU you >> >>>>>>> are using (look in the dir, for *.log). 
If the configure >> script does not >> >>>>>>> see nvcc on the command line, it will not use CUDA. Grep >> for CUDA in >> >>>>>>> kaldi.mk <http://kaldi.mk> to see. >> >>>>>>> >> >>>>>>> Dan >> >>>>>>> >> >>>>>>> >> >>>>>>> On Thu, Oct 23, 2014 at 10:17 PM, Xingyu Na >> <asr...@gm... <mailto:asr...@gm...>> >> >>>>>>> wrote: >> >>>>>>>> Hi, I'm new in this community. >> >>>>>>>> I am running the TIMIT example s5, all the way to DNN >> Hybrid Training & >> >>>>>>>> Decoding part. >> >>>>>>>> The script "steps/nnet/pretrain_dbn.sh" was called >> yesterday, and still >> >>>>>>>> running. >> >>>>>>>> I checked the script and found that it stuck at calling >> nnet-forward for >> >>>>>>>> "Renormalizing MLP input features into >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet" >> >>>>>>>> The program has been running more then 24 hours. >> >>>>>>>> 'nvidia-smi' said 'nnet-forward' is still running on a >> Tesla K20m... >> >>>>>>>> How long does it normally take? Is there something going >> wrong? >> >>>>>>>> Please help. >> >>>>>>>> >> >>>>>>>> The log is posted below. >> >>>>>>>> Thank you >> >>>>>>>> Xingyu >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> ============================================================================ >> >>>>>>>> >> >>>>>>>> DNN Hybrid Training & Decoding (Karel's recipe) >> >>>>>>>> >> >>>>>>>> >> ============================================================================ >> >>>>>>>> >> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl >> <http://run.pl> --transform-dir >> >>>>>>>> exp/tri3/decode_test data-fmllr-tri3/test data/test exp/tri3 >> >>>>>>>> data-fmllr-tri3/test/log data-fmllr-tri3/test/data >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, >> data/test --> >> >>>>>>>> data-fmllr-tri3/test, using : raw-trans None, gmm >> exp/tri3, trans >> >>>>>>>> exp/tri3/decode_test >> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl >> <http://run.pl> --transform-dir >> >>>>>>>> exp/tri3/decode_dev data-fmllr-tri3/dev data/dev exp/tri3 >> >>>>>>>> data-fmllr-tri3/dev/log data-fmllr-tri3/dev/data >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, >> data/dev --> >> >>>>>>>> data-fmllr-tri3/dev, using : raw-trans None, gmm >> exp/tri3, trans >> >>>>>>>> exp/tri3/decode_dev >> >>>>>>>> steps/nnet/make_fmllr_feats.sh --nj 10 --cmd run.pl >> <http://run.pl> --transform-dir >> >>>>>>>> exp/tri3_ali data-fmllr-tri3/train data/train exp/tri3 >> >>>>>>>> data-fmllr-tri3/train/log data-fmllr-tri3/train/data >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: feature type is lda_fmllr >> >>>>>>>> steps/nnet/make_fmllr_feats.sh: Done!, type lda_fmllr, >> data/train --> >> >>>>>>>> data-fmllr-tri3/train, using : raw-trans None, gmm >> exp/tri3, trans >> >>>>>>>> exp/tri3_ali >> >>>>>>>> utils/subset_data_dir_tr_cv.sh data-fmllr-tri3/train >> >>>>>>>> data-fmllr-tri3/train_tr90 data-fmllr-tri3/train_cv10 >> >>>>>>>> >> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: >> >>>>>>>> reducing #utt from 3696 to 3320 >> >>>>>>>> >> /nobackup/s1/asr/naxingyu/exps/kaldi/egs/timit/utils/subset_data_dir.sh: >> >>>>>>>> reducing #utt from 3696 to 376 >> >>>>>>>> # steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 >> >>>>>>>> data-fmllr-tri3/train exp/dnn4_pretrain-dbn >> >>>>>>>> # Started at Wed Oct 22 16:11:09 CST 2014 >> >>>>>>>> # >> >>>>>>>> steps/nnet/pretrain_dbn.sh 
--hid-dim 1024 --rbm-iter 20 >> >>>>>>>> data-fmllr-tri3/train exp/dnn4_pretrain-dbn >> >>>>>>>> # INFO >> >>>>>>>> steps/nnet/pretrain_dbn.sh : Pre-training Deep Belief >> Network as a stack >> >>>>>>>> of RBMs >> >>>>>>>> dir : exp/dnn4_pretrain-dbn >> >>>>>>>> Train-set : data-fmllr-tri3/train >> >>>>>>>> >> >>>>>>>> # PREPARING FEATURES >> >>>>>>>> Preparing train/cv lists >> >>>>>>>> 3696 exp/dnn4_pretrain-dbn/train.scp >> >>>>>>>> copy-feats scp:exp/dnn4_pretrain-dbn/train.scp_non_local >> >>>>>>>> >> ark,scp:/tmp/tmp.3ctodczOzO/train.ark,exp/dnn4_pretrain-dbn/train.scp >> >>>>>>>> LOG (copy-feats:main():copy-feats.cc:100) Copied 3696 >> feature matrices. >> >>>>>>>> apply_cmvn disabled (per speaker norm. on input features) >> >>>>>>>> Getting feature dim : copy-feats >> scp:exp/dnn4_pretrain-dbn/train.scp >> >>>>>>>> ark:- >> >>>>>>>> WARNING (feat-to-dim:Close():kaldi-io.cc:446) Pipe >> copy-feats >> >>>>>>>> scp:exp/dnn4_pretrain-dbn/train.scp ark:- | had nonzero >> return status 13 >> >>>>>>>> 40 >> >>>>>>>> Using splice ± 5 , step 1 >> >>>>>>>> Renormalizing MLP input features into >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet >> >>>>>>>> compute-cmvn-stats ark:- - >> >>>>>>>> cmvn-to-nnet - - >> >>>>>>>> nnet-concat --binary=false >> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet - >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1_cmvn-g.nnet >> >>>>>>>> LOG (nnet-concat:main():nnet-concat.cc:53) Reading >> >>>>>>>> exp/dnn4_pretrain-dbn/tr_splice5-1.nnet >> >>>>>>>> LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating - >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> ------------------------------------------------------------------------------ >> >>>>>>>> _______________________________________________ >> >>>>>>>> Kaldi-users mailing list >> >>>>>>>> Kal...@li... >> <mailto:Kal...@li...> >> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >>>>> >> ------------------------------------------------------------------------------ >> >>>>> >> >>>>> _______________________________________________ >> >>>>> Kaldi-users mailing list >> >>>>> Kal...@li... >> <mailto:Kal...@li...> >> >>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >>>>> >> >>> >> ------------------------------------------------------------------------------ >> >>> _______________________________________________ >> >>> Kaldi-users mailing list >> >>> Kal...@li... >> <mailto:Kal...@li...> >> >>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> > >> > >> ------------------------------------------------------------------------------ >> > _______________________________________________ >> > Kaldi-users mailing list >> > Kal...@li... >> <mailto:Kal...@li...> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >> -- >> Karel Vesely, Brno University of Technology >> ive...@fi... <mailto:ive...@fi...>, >> +420-54114-1300 <tel:%2B420-54114-1300> >> >> >> ------------------------------------------------------------------------------ >> _______________________________________________ >> Kaldi-users mailing list >> Kal...@li... >> <mailto:Kal...@li...> >> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >> > > -- > Karel Vesely, Brno University of Technology > ive...@fi..., +420-54114-1300 |
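A minimal sketch of the kind of guard described above, for illustration only: it assumes the `cuda-compiled` helper binary (shipped with current Kaldi; older checkouts may not have it, in which case the kaldi.mk grep is a fallback) and a $KALDI_ROOT variable as set by the recipe's path.sh.

    # Abort GPU training early when Kaldi was built without CUDA support.
    # 'cuda-compiled' is assumed to exit non-zero if CUDA is not compiled in;
    # the grep on kaldi.mk is a fallback for trees that do not have it.
    if ! cuda-compiled 2>/dev/null && ! grep -q '^CUDA = true' $KALDI_ROOT/src/kaldi.mk; then
      echo "$0: Kaldi is not compiled with CUDA; pretraining would fall back to the CPU and be extremely slow."
      echo "$0: Recompile with ./configure --use-cuda=yes, or remove this check to force CPU training."
      exit 1
    fi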
From: Xingyu Na <asr...@gm...> - 2014-10-30 02:28:16
|
Hi Karel,
When the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards.
Is it possible that it was running on the CPU but still showed up as a running job in nvidia-smi?
And in the meantime, when I ran 'top', it showed 'nnet-forward' with an 'S', not an 'R'....

Xingyu

On 10/29/2014 09:28 PM, Vesely Karel wrote:
> Hi, the TIMIT DNN training is running, but it is very slow.
> I'll add a script check there to stop the training if CUDA is not compiled in.
> (The assumption is that typically everybody wants to train on a GPU.)
> K.
|
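Two quick ways to tell whether a job in this state is genuinely stuck or merely waiting on the GPU, along the lines of Dan's earlier gdb suggestion; the nvidia-smi query flags depend on the driver version, so treat this as a sketch rather than a guaranteed interface:

    # Attach to the running process and dump its stacks (may need root/ptrace
    # permission); frames inside the CUDA driver/runtime suggest it is blocked
    # waiting on the GPU rather than computing.
    pid=$(pidof nnet-forward)
    gdb -batch -ex 'thread apply all bt' -p "$pid"

    # Watch GPU utilization directly; a process can appear in nvidia-smi's
    # process list while the GPU itself sits at ~0% utilization.
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5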
From: Vesely K. <ve...@gm...> - 2014-10-31 10:13:17
|
Hi Xingyu,
hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a mismatch between the CUDA driver and the kernel module, or if Kaldi was compiled for an insufficient compute capability (this is okay in the current trunk), or simply because of GPU overheating.
Best,
Karel.

On 10/30/2014 03:27 AM, Xingyu Na wrote:
> Hi Karel,
> When the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards.
> Is it possible that it was running on the CPU but still showed up as a running job in nvidia-smi?
> And in the meantime, when I ran 'top', it showed 'nnet-forward' with an 'S', not an 'R'....
>
> Xingyu
|
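The three causes listed above can usually be narrowed down from the shell; the exact output fields and paths vary with the driver version and with where the CUDA samples were built, so the following is only a sketch:

    # 1) Driver vs. kernel module: the versions reported here should agree.
    nvidia-smi | head -n 3
    cat /proc/driver/nvidia/version

    # 2) Compute capability: deviceQuery is built from the CUDA samples (its
    #    path depends on where they were compiled); compare the reported
    #    capability with the -arch/-gencode flags used in the Kaldi CUDA build.
    deviceQuery | grep -i 'capability'

    # 3) Overheating: temperatures pinned near the card's limit during training.
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv -l 10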
From: Xingyu Na <asr...@gm...> - 2014-10-31 10:19:50
|
Yep, there are too many variables affecting this. It's really hard to debug this kind of behaviour, since it may be running so slowly that the CPU thinks the GPU job is sleeping :-) Anyway, it's working properly now, so I'll just move on. Thanks to all of you for helping.
Best,
Xingyu

On 10/31/2014 06:13 PM, Vesely Karel wrote:
> Hi Xingyu,
> hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a mismatch between the CUDA driver and the kernel module, or if Kaldi was compiled for an insufficient compute capability (this is okay in the current trunk), or simply because of GPU overheating.
> Best,
> Karel.
|
From: Ondrej P. <ond...@gm...> - 2014-10-29 16:17:38
|
Hi, may I ask how to force Kaldi to use one GPU (the Tesla) over the other (a Quadro)? I am running it locally (using run.pl, njobs=10) and I want to use the much stronger Tesla GPU. Unfortunately, it selects the GPUs somewhat randomly, and quite often it computes on the Quadro. Ondra On 29 October 2014 14:28, Vesely Karel <ve...@gm...> wrote: > Hi, > the TIMIT DNN training is running, and it is very slow. > I'll add there a script-check to stop training if cuda is not compiled-in. > (Assuming that typically everybody wants to train on a GPU.) > K. |
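Regarding the script-check Karel mentions above: a minimal sketch of how such a guard could look, assuming KALDI_ROOT points at the Kaldi checkout. This is purely illustrative and not necessarily the check that was actually added; it keys off the "CUDA = true" line that ./configure writes into src/kaldi.mk when CUDA is enabled:

  # abort GPU training early when Kaldi was built without CUDA support
  if ! grep -q "CUDA = true" "$KALDI_ROOT/src/kaldi.mk"; then
    echo "$0: Kaldi is not compiled with CUDA; DNN pre-training on the CPU would be extremely slow." >&2
    exit 1
  fi

A guard like this fails fast instead of silently falling back to a CPU run that looks like a hang.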
From: Jan T. <af...@ce...> - 2014-10-29 16:22:21
|
Ondrej, you can play with the CUDA_VISIBLE_DEVICES environment variable to mask out the GPUs you don't want to use. y. On Wed, Oct 29, 2014 at 5:17 PM, Ondrej Platek <ond...@gm...> wrote: > Hi, > > may I ask how to force Kaldi to use one GPU (the Tesla) over the other (a Quadro)? I am running it locally (using run.pl, njobs=10) and I want to use the much stronger Tesla GPU. > > Unfortunately, it selects the GPUs somewhat randomly, and quite often it computes on the Quadro. > > Ondra |
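A minimal sketch of the CUDA_VISIBLE_DEVICES approach, assuming the Tesla is listed as device 0 by nvidia-smi (the index may differ on your machine, so check it first):

  # list the installed GPUs and their indices
  nvidia-smi -L
  # expose only the Tesla to this shell and to everything started from it
  export CUDA_VISIBLE_DEVICES=0
  steps/nnet/pretrain_dbn.sh --hid-dim 1024 --rbm-iter 20 data-fmllr-tri3/train exp/dnn4_pretrain-dbn

With the variable set, CUDA renumbers the visible devices from 0, so Kaldi's automatic GPU selection can only ever pick the Tesla; the Quadro is invisible to the job.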
From: Vesely K. <ve...@gm...> - 2014-10-31 10:31:18
|
If the log was saying it is using the GPU, it is running on a GPU. The CPU is surely not a bottleneck here. If it halts, there was a problem finishing one of the CUDA kernels and syncing; the possible reasons are below. K. On 10/31/2014 11:19 AM, Xingyu Na wrote: > Yep, there are too many variables having an impact on this. It's really hard to debug this kind of behaviour, since it may be running so slowly that the CPU thought the GPU was sleeping :-) > Anyway, it's working properly now, so I'll just move on. Thank all you guys for helping. > Best, > Xingyu > On 10/31/2014 06:13 PM, Vesely Karel wrote: >> Hi Xingyu, >> hmm, I'm afraid I cannot explain this with certainty. Sometimes the binaries may behave strangely if there is a problem with the cuda driver + kernel module match, or if Kaldi was compiled with an insufficient CUDA compute capability setting (it is okay in the current trunk), or because of simple GPU overheating. >> Best, >> Karel. >> On 10/30/2014 03:27 AM, Xingyu Na wrote: >>> Hi Karel, >>> When the script froze on my station (before I forced the compute mode), 'nvidia-smi' showed that 'nnet-forward' was actually running on one of the GPU cards. >>> Is it possible that it was running on the CPU but still shows as a running job in nvidia-smi? >>> And in the meantime, when I did 'top', it showed 'nnet-forward' with an 'S', not an 'R'.... >>> Xingyu |
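A small diagnostic sketch along the lines discussed in this exchange, with 12345 standing in for the nnet-forward PID reported by nvidia-smi (the PID and the particular query fields are only an example of how one might check):

  # STAT 'R' means actually running; 'S'/'D' means sleeping or blocked (e.g. on a pipe or a GPU sync)
  ps -o pid,stat,time,cmd -p 12345
  # see whether the card the job grabbed is doing any work, and whether it is overheating
  nvidia-smi --query-gpu=index,name,utilization.gpu,temperature.gpu --format=csv

If the process sits in 'S' for hours while GPU utilization stays near zero, the driver/kernel-module mismatch or overheating scenarios Karel lists become more plausible than a merely slow computation.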
From: Daniel P. <dp...@gm...> - 2014-10-31 17:02:07
|
BTW, something you can do in situations like this is the following (assuming you are debugging nnet-train-simple, but it could be another program):

gdb $(which nnet-train-simple)
(gdb) attach 9541
(gdb) bt

where 9541 is an example process id. $(which nnet-train-simple) gives you the full pathname of the program, which (IIRC) gdb requires. Dan On Fri, Oct 31, 2014 at 6:31 AM, Vesely Karel <ve...@gm...> wrote: > If the log was saying it is using the GPU, it is running on a GPU. The CPU is surely not a bottleneck here. > If it halts, there was a problem finishing one of the CUDA kernels and syncing. > K. |
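When the job was started by run.pl and you only have a PID from nvidia-smi, a non-interactive variant of the same idea can be convenient; this is just a sketch, with nnet-forward and 9541 as placeholders:

  # print a one-shot backtrace of the hung process (may need sudo or relaxed ptrace settings)
  gdb -batch -ex bt -p 9541 $(which nnet-forward)

The backtrace usually makes it obvious whether the process is stuck inside a CUDA call or simply blocked reading from or writing to one of the pipes in the command.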