Hi all!
Recently I have become interested in running Kaldi on a 'GPU' cluster.
I successfully ran Kaldi on a 'CPU' cluster. Since I don't know how to use 'queue.pl' (FYI, I have Sun Grid Engine installed on my cluster), I just modified 'run.pl' so that each node accesses the other nodes via rsh and runs the program.
So here are my questions.
Q1. Is there an example in Kaldi of how to use 'queue.pl' (i.e. which options do I need to pass?) in a cluster environment?
Q2. How do I run Kaldi on a 'GPU' cluster?
I saw the postings from Michale Farjon (https://sourceforge.net/p/kaldi/discussion/1355347/thread/7d982dee/), but I just want to clarify: do I need to install Kluster on my cluster?
Thank you.
Ken Kim
Everything is explained at
http://kaldi.sourceforge.net/queue.html
It should be very easy to use queue.pl if you have GridEngine installed on your cluster. The web page I just mentioned explains how to configure it.
Dan
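[Editor's note: as an illustration, the usual place to point a Kaldi recipe at GridEngine is the command variables in cmd.sh. A minimal sketch, assuming a GPU queue named gpu.q (queue names are site-specific):]

```shell
# cmd.sh (sketch): switch a Kaldi recipe from local execution to GridEngine.
# run.pl runs every job on the local machine; queue.pl submits jobs via qsub.
export train_cmd="queue.pl"
export decode_cmd="queue.pl"
# GPU jobs go to a dedicated queue; 'gpu.q' is an example name.
export cuda_cmd="queue.pl -q gpu.q"
```

Extra arguments to queue.pl (such as -q gpu.q above, or -l resource requests) are passed through to qsub, so anything you can request from SGE on the command line can be requested here.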
On Sat, Jan 17, 2015 at 8:38 AM, Ken Kim kenkim@users.sf.net wrote:
Thank you for your comments!
I didn't know the homepage already had this detailed documentation.
I am looking through the whole document, but I can't find how to set which hosts to use.
Maybe this is because I am not really familiar with Sun Grid Engine.
Let's say the hostnames are
master = "master"
node1 = "node1"
node2 = "node2"
and I want to distribute 2 jobs to those nodes (i.e. $num_jobs_nnet=2).
My specific question is...
When we train a DNN with the parallel option, such as

  queue.pl $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \
    nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \
    ark:$dir/$egsdir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \| \
    nnet-train$parallel_suffix$perturb_suffix $parallel_train_opts $perturb_opts \
    --minibatch-size=$this_minibatch_size --srand=$x --use-gpu=$use_gpu "$mdl" \
    ark:- $dir/$[$x+1].JOB.mdl

where do I need to write the hostnames of the nodes to distribute the JOBs?
You don't write the hostnames there, you set them up when you install
GridEngine. There is a section in the documentation on how to install
GridEngine. If you read that whole page, it will answer most of your
questions.
Dan
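[Editor's note: for readers with the same question, the host list lives in GridEngine's own configuration, not in the Kaldi command. A rough sketch of registering two compute nodes with qconf, run on the qmaster as the SGE admin (the hostnames and queue name are examples):]

```shell
# Register node1 and node2 with an existing GridEngine installation.
qconf -ah node1          # add node1 as an administrative host
qconf -ah node2
qconf -ae node1          # add node1 as an execution host (opens an editor)
qconf -ae node2
# Attach both hosts to the default queue, all.q:
qconf -aattr queue hostlist node1 all.q
qconf -aattr queue hostlist node2 all.q
qconf -sel               # verify: list registered execution hosts
```

Once the hosts are in the queue's hostlist, a command like queue.pl JOB=1:2 ... lets SGE place the two jobs on whichever nodes have free slots.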
On Sat, Jan 17, 2015 at 10:55 PM, Ken Kim kenkim@users.sf.net wrote:
Hi Dan, I have a similar question.
I had a machine with 1 GPU (a K20) in the cluster before, so I just did "qlogin" to treat it as local (used 'run.pl' in cmd.sh) to run DNN-related experiments.
Now I have had one more GPU machine added. The 2 GPU boxes are identical in hardware, and both are under gpu.q. I changed the 'run.pl' in cmd.sh to "queue.pl -q gpu.q".
When I run ./local/run_nnet2.sh, it showed:
    cuda-compiled: error while loading shared libraries: libcublas.so.6.0: cannot open shared object file: No such file or directory
    This script is intended to be used with GPUs but you have not compiled Kaldi with CUDA
    If you want to use GPUs (and have them), go to src/, and configure and make on a machine
    where "nvcc" is installed.
The Linux machine I run the command on doesn't have a GPU. Please let me know the correct steps to resolve the problem and to submit the 2 parallel jobs to the 2 GPUs.
The problem is that even the commands that don't need GPUs sometimes link against CUDA libraries. What I do here at Hopkins is copy those libraries to a directory that's reachable via NFS from all machines, and add that directory to my LD_LIBRARY_PATH in my .bashrc. Alternatively, you could also install the NVidia CUDA toolkit on the machines that lack GPUs.
Dan
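[Editor's note: a sketch of that workaround with placeholder paths; the NFS mount point and CUDA version here are assumptions, so match them to your own setup:]

```shell
# On the GPU machine: stage the CUDA runtime libraries on an NFS share
# that every node mounts, so CPU-only nodes can satisfy the dynamic linker.
mkdir -p /nfs/share/cuda-libs
cp /usr/local/cuda-6.0/lib64/libcublas.so.6.0 /nfs/share/cuda-libs/
cp /usr/local/cuda-6.0/lib64/libcudart.so.6.0 /nfs/share/cuda-libs/

# In ~/.bashrc on all machines (including the CPU-only ones):
export LD_LIBRARY_PATH=/nfs/share/cuda-libs:$LD_LIBRARY_PATH
```

Depending on how Kaldi was configured, other CUDA libraries may need the same treatment; running `ldd cuda-compiled` on a GPU node shows exactly which shared objects the binaries require.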
On Thu, Mar 12, 2015 at 1:56 PM, Lawrence vjdtao@users.sf.net wrote: