From: Kirill K. <kir...@sm...> - 2015-06-17 17:53:33
> From: David Warde-Farley [mailto:d.w...@gm...]
> Sent: 2015-06-17 0028
> Subject: Re: [Kaldi-users] non-cluster usage of Librispeech s5 recipe?
>
> Many thanks for the pointers. On your setup, how long does the entire
> recipe take without decoding?

A few hours to train up to the tri5 model (10 to 15 hours, I'd guess, on a
6-core CPU), then maybe 4-5 days to train the nnet2 on the 460-hour data
set on the GeForce 980 board. I did not go any further than that. I would
guess it takes at least twice that time to process the 1000-hour set.

> For the life of me I can't figure out where num_jobs_nnet is being set
> (it's being written in the egs_dir as 4, I've changed it everywhere I
> could find it.)

I did not have to change anything in this regard, except for the
number-of-jobs argument to train_multisplice_accel2 in run_nnet2_ms.sh.
Which file was the number of jobs saved into? Some steps rely on the
number of jobs used in previous steps, and sometimes the value sticks in a
file that is not recreated. It may be easier to start clean.
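If it helps, something along these lines can show where a stale job count
is coming from. The directory names below are only what I remember the
librispeech script using by default, and I have not tested this, so adjust
to your setup:

  # look at what the egs generation step recorded (path is a guess):
  grep -r . exp/nnet2_online/nnet_ms_a/egs/info 2>/dev/null
  # and look for any cached job-count files under the experiment tree:
  find exp/nnet2_online -name 'num_jobs*' 2>/dev/null
  # if nothing obvious turns up, the simplest way to start clean is to
  # remove the egs directory and let the script regenerate it, e.g.:
  #   rm -rf exp/nnet2_online/nnet_ms_a/egs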
Do you run the discriminative training script (run_nnet2_ms_disc.sh)? I
did not.

 -kkm

> On Fri, Jun 12, 2015 at 7:00 PM, Kirill Katsnelson
> <kir...@sm...> wrote:
> >> From: David Warde-Farley [mailto:d.w...@gm...]
> >> Subject: [Kaldi-users] non-cluster usage of Librispeech s5 recipe?
> >>
> >> I'm trying to use the s5 recipe for LibriSpeech on a single machine
> >> with a single GPU. I've modified cmd.sh to use run.pl.
> >
> > I ran it on a single machine; it requires a few modifications. Note
> > that it took almost a week on a 6-core 4.1 GHz overclocked i7-5930K
> > CPU and a GeForce 980 to train on the 500-hour set.
> >
> >> After about a day, I see a lot of background processes like
> >> gmm-latgen-faster, lattice-add-penalty, lattice-scale, etc. that
> >> have been launched in the background (the terminal is actually free,
> >> which suggests the run.sh script has terminated...). I'm not totally
> >> sure what's going on, or how to find out.
> >
> > In librispeech/s5/run.sh, look for decode commands in subshells, like
> >
> >   (
> >     utils/mkgraph.sh data/lang_nosp_test_tgsmall \
> >       exp/tri4b exp/tri4b/graph_nosp_tgsmall || exit 1;
> >     for test in test_clean test_other dev_clean dev_other; do
> >       steps/decode_fmllr.sh --nj 20 --cmd "$decode_cmd" \
> >       . . .
> >   )&
> >
> > These decodes are quite slow when run on a single machine, slower than
> > the rest of the script. Because they are left running in the
> > background, they accumulate, eat CPU, and eventually run out of
> > memory. They are not essential for NN training, except possibly for
> > the mkgraph step. The results are useful to check that you are getting
> > the expected WER, but really not essential. You may either disable
> > these decode blocks completely (except the mkgraph invocations) or
> > remove the '&' at the end to run them synchronously. Note that they
> > will take most of the preparation time before the NN training step. I
> > don't know about your machine, but give it an extra couple of days to
> > complete with these.
> >
> >> One thing I noticed earlier is that the script was trying to spawn
> >> multiple GPU jobs, but this GPU is configured (by administrators) to
> >> permit at most one CUDA process, and so I saw "3 of 4 jobs failed"
> >> messages. Would these jobs have been retried?
> >
> > They will not, but you can restart NN training from the last step.
> > Modify local/online/run_nnet2_ms.sh so that
> > steps/nnet2/train_multisplice_accel2.sh is invoked with the switches
> > "--num-jobs-initial 1 --num-jobs-final 1" (the defaults are larger).
> > When running local/online/run_nnet2_ms.sh, pass it "--stage 7" (this
> > is the default) and "--train_stage N", where N is the number of the
> > iteration you are restarting from.
> >
> > Even without the one-job limit, you probably would not benefit from
> > running more than one GPU job at a time.
> >
> >  -kkm
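P.S. For completeness, the restart described above would look roughly like
this. It is from memory and untested, so check the option names against
the scripts in your checkout:

  # 1. In local/online/run_nnet2_ms.sh, add to the
  #    steps/nnet2/train_multisplice_accel2.sh call:
  #      --num-jobs-initial 1 --num-jobs-final 1
  # 2. Then restart, skipping the stages that already completed:
  local/online/run_nnet2_ms.sh --stage 7 --train_stage N
  # where N is the number of the iteration you are restarting from.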