Hi all!
Recently I have become interested in running Kaldi on a 'GPU' cluster.
I successfully ran Kaldi on a 'CPU' cluster. Since I don't know how to use 'queue.pl' (FYI, I have Sun Grid Engine installed on my cluster), I just modified 'run.pl' so that each node accesses the other nodes via rsh and runs the program.
So here are my questions.
Q1. Is there an example in Kaldi of how to use 'queue.pl' (i.e. which options do I need to pass?) in a cluster environment?
Q2. How do I run Kaldi on a 'GPU' cluster?
I saw the postings from Michale Farjon (https://sourceforge.net/p/kaldi/discussion/1355347/thread/7d982dee/), but I just want to clarify: do I need to install Kluster on my cluster?
Thank you.
Ken Kim
Everything is explained at
http://kaldi.sourceforge.net/queue.html
It should be very easy to use queue.pl if you have GridEngine installed on your cluster. The web page I just mentioned explains how to configure it.
Dan
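[Editor's note: as an illustration, the usual place to point a Kaldi recipe at GridEngine is the command variables in cmd.sh. A minimal sketch, assuming a GPU queue named gpu.q (queue names are site-specific):]

```shell
# cmd.sh (sketch): switch a Kaldi recipe from local execution to GridEngine.
# run.pl runs every job on the local machine; queue.pl submits jobs via qsub.
export train_cmd="queue.pl"
export decode_cmd="queue.pl"
# GPU jobs go to a dedicated queue; 'gpu.q' is an example name.
export cuda_cmd="queue.pl -q gpu.q"
```

Extra arguments to queue.pl (such as -q gpu.q above, or -l resource requests) are passed through to qsub, so anything you can request from SGE on the command line can be requested here.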
On Sat, Jan 17, 2015 at 8:38 AM, Ken Kim kenkim@users.sf.net wrote:
Thank you for your comments!
I didn't know the homepage already had this detailed documentation.
I am looking through the whole document, but I can't find how to set which hosts to use.
Maybe this is because I am not really familiar with Sun Grid Engine.
Let's say the hostnames are
master = "master"
node1 = "node1"
node2 = "node2"
and I want to distribute 2 jobs to those nodes (i.e. $num_jobs_nnet=2).
My specific question is...
When we train a DNN with the parallel option, such as

  queue.pl $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \
    nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \
    ark:$dir/$egsdir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \| \
    nnet-train$parallel_suffix$perturb_suffix $parallel_train_opts $perturb_opts \
    --minibatch-size=$this_minibatch_size --srand=$x --use-gpu=$use_gpu "$mdl" \
    ark:- $dir/$[$x+1].JOB.mdl

where do I need to write the hostnames of the nodes to distribute the JOBs?
You don't write the hostnames there, you set them up when you install
GridEngine. There is a section in the documentation on how to install
GridEngine. If you read that whole page, it will answer most of your
questions.
Dan
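[Editor's note: for readers with the same question, the host list lives in GridEngine's own configuration, not in the Kaldi command. A rough sketch of registering two compute nodes with qconf, run on the qmaster as the SGE admin (the hostnames and queue name are examples):]

```shell
# Register node1 and node2 with an existing GridEngine installation.
qconf -ah node1          # add node1 as an administrative host
qconf -ah node2
qconf -ae node1          # add node1 as an execution host (opens an editor)
qconf -ae node2
# Attach both hosts to the default queue, all.q:
qconf -aattr queue hostlist node1 all.q
qconf -aattr queue hostlist node2 all.q
qconf -sel               # verify: list registered execution hosts
```

Once the hosts are in the queue's hostlist, a command like queue.pl JOB=1:2 ... lets SGE place the two jobs on whichever nodes have free slots.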
On Sat, Jan 17, 2015 at 10:55 PM, Ken Kim kenkim@users.sf.net wrote:
Hi Dan, I have a similar question.
I had a machine with 1 GPU (a K20) in the cluster before, so I just did "qlogin" to treat it as local (used 'run.pl' in cmd.sh) to run DNN-related experiments.
Now I have had one more GPU machine added. The 2 GPU boxes are identical in hardware, and both are under gpu.q. I changed the 'run.pl' in cmd.sh to "queue.pl -q gpu.q".
When I run ./local/run_nnet2.sh, it showed:
    cuda-compiled: error while loading shared libraries: libcublas.so.6.0: cannot open shared object file: No such file or directory
    This script is intended to be used with GPUs but you have not compiled Kaldi with CUDA
    If you want to use GPUs (and have them), go to src/, and configure and make on a machine
    where "nvcc" is installed.
The Linux machine I run the command on doesn't have a GPU. Please let me know the correct steps to resolve the problem and to submit the 2 parallel jobs to the 2 GPUs.
The problem is that even the commands that don't need GPUs sometimes link against CUDA libraries. What I do here at Hopkins is copy those libraries to a directory that's reachable via NFS from all machines, and add that directory to my LD_LIBRARY_PATH in my .bashrc. Alternatively, you could also install the NVidia CUDA toolkit on the machines that lack GPUs.
Dan
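[Editor's note: a sketch of that workaround with placeholder paths; the NFS mount point and CUDA version here are assumptions, so match them to your own setup:]

```shell
# On the GPU machine: stage the CUDA runtime libraries on an NFS share
# that every node mounts, so CPU-only nodes can satisfy the dynamic linker.
mkdir -p /nfs/share/cuda-libs
cp /usr/local/cuda-6.0/lib64/libcublas.so.6.0 /nfs/share/cuda-libs/
cp /usr/local/cuda-6.0/lib64/libcudart.so.6.0 /nfs/share/cuda-libs/

# In ~/.bashrc on all machines (including the CPU-only ones):
export LD_LIBRARY_PATH=/nfs/share/cuda-libs:$LD_LIBRARY_PATH
```

Depending on how Kaldi was configured, other CUDA libraries may need the same treatment; running `ldd cuda-compiled` on a GPU node shows exactly which shared objects the binaries require.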
On Thu, Mar 12, 2015 at 1:56 PM, Lawrence vjdtao@users.sf.net wrote: