How can I run KALDI on a GPU cluster?

Ken Kim
2015-01-17
2015-03-12
  • Ken Kim

    Ken Kim - 2015-01-17

    Hi all !

    Recently I have become interested in running KALDI on a 'GPU' cluster.

    I successfully ran KALDI on a 'CPU' cluster. Since I don't know how to use 'queue.pl' (F.Y.I. I have Sun Grid Engine installed on my cluster), I just modified 'run.pl' so that each node accesses the other nodes via
    rsh and runs the program.

    So here are my questions.

    Q1. Is there any example in KALDI of how to use 'queue.pl' (i.e. which options do I need to set?) in a cluster environment?

    Q2. How can I run KALDI on a 'GPU' cluster?
    I saw postings from Michale Farjon (https://sourceforge.net/p/kaldi/discussion/1355347/thread/7d982dee/), but I just want to clarify whether I need to install Kluster on my cluster.

    Thank you.

    Ken Kim

     
    • Daniel Povey

      Daniel Povey - 2015-01-17

      Everything is explained at
      http://kaldi.sourceforge.net/queue.html
      It should be very easy to use queue.pl if you have GridEngine installed on
      your cluster. The web page I just mentioned explains how to configure it.
      Dan
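      For concreteness, configuring queue.pl typically comes down to editing cmd.sh in the example directory. The sketch below assumes a queue.pl that accepts the --mem/--gpu options described on that page; the resource values are illustrative assumptions, so adjust them to your own GridEngine setup:

      ```shell
      # cmd.sh -- minimal sketch; the option values are illustrative assumptions
      export train_cmd="queue.pl --mem 2G"     # CPU training jobs
      export decode_cmd="queue.pl --mem 4G"    # decoding jobs
      export cuda_cmd="queue.pl --gpu 1"       # jobs that need one GPU
      ```

      The training scripts then run their commands through $train_cmd, $decode_cmd, or $cuda_cmd, and queue.pl translates the options into GridEngine resource requests.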

       
      • Ken Kim

        Ken Kim - 2015-01-18

        Thank you for your comments!

        I didn't know the homepage already had this detailed documentation.

        I am looking through the whole document, but I can't find how to set which host to use.

        Maybe this is because I am not really familiar with Sun Grid Engine.

        Let's say the hostnames are

        master = "master"
        node1 = "node1"
        node2 = "node2"

        and I want to distribute 2 jobs to those nodes (i.e. $num_jobs_nnet=2).

        My specific question is...
        When we train a DNN with a parallel option such as

        queue.pl $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \
          nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \
          ark:$dir/$egsdir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \| \
          nnet-train$parallel_suffix$perturb_suffix $parallel_train_opts $perturb_opts \
          --minibatch-size=$this_minibatch_size --srand=$x --use-gpu=$use_gpu "$mdl" \
          ark:- $dir/$[$x+1].JOB.mdl

        where do I need to write the hostnames of the nodes to distribute the JOBs?

         
        • Daniel Povey

          Daniel Povey - 2015-01-18

          You don't write the hostnames there; you set them up when you install
          GridEngine. There is a section in the documentation on how to install
          GridEngine. If you read that whole page, it will answer most of your
          questions.
          Dan
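          As a hedged illustration only (the exact steps differ across GridEngine versions), registering the nodes from your example on the master might look like:

          ```shell
          # Run on the GridEngine master host; node1/node2 are the hostnames
          # from the question above.  This is a sketch, not a complete recipe.
          qconf -ah node1      # register node1 as an administrative host
          qconf -ah node2
          qconf -ae node1      # add node1 as an execution host (opens an editor)
          qconf -ae node2
          # put both nodes in the default host group so queues schedule onto them
          qconf -aattr hostgroup hostlist node1 @allhosts
          qconf -aattr hostgroup hostlist node2 @allhosts
          ```

          After that, queue.pl just submits jobs to GridEngine, and GridEngine chooses which node each job runs on.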

           
  • Jake

    Jake - 2015-03-12

    Hi Dan, I have a similar question.

    I had a 1-GPU machine (K20) in the cluster before, so I just did "qlogin" to treat it as local (using 'run.pl' in cmd.sh) to run DNN-related experiments.

    Now I have one more GPU machine added. The 2 GPU boxes are identical in hardware, and both are under gpu.q. I changed 'run.pl' in cmd.sh to "queue.pl -q gpu.q".

    When I run ./local/run_nnet2.sh, it shows:
    cuda-compiled: error while loading shared libraries: libcublas.so.6.0: cannot open shared object file: No such file or directory
    This script is intended to be used with GPUs but you have not compiled Kaldi with CUDA
    If you want to use GPUs (and have them), go to src/, and configure and make on a machine
    where "nvcc" is installed.

    The Linux machine I run the command on doesn't have a GPU. Please let me know the correct steps to resolve the problem and to submit the 2 parallel jobs to the 2 GPUs.

     
    • Daniel Povey

      Daniel Povey - 2015-03-12

      The problem is that even the commands that don't need GPUs sometimes link
      against CUDA libraries. What I do here at Hopkins is copy those
      libraries to a directory that's reachable via NFS from all machines, and
      add that directory to my LD_LIBRARY_PATH in my .bashrc. Alternatively you
      could also install the NVIDIA CUDA toolkit on your machines that lack GPUs.
      Dan
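      A sketch of that workaround, where the NFS directory and the CUDA install path are assumptions for illustration:

      ```shell
      # One-time: copy the CUDA runtime libraries to a directory visible over NFS.
      # /export/shared/cuda-libs and the CUDA 6.0 paths are illustrative assumptions.
      mkdir -p /export/shared/cuda-libs
      cp /usr/local/cuda-6.0/lib64/libcublas.so.6.0 \
         /usr/local/cuda-6.0/lib64/libcudart.so.6.0 \
         /export/shared/cuda-libs/

      # Then add this line to ~/.bashrc on every machine, GPU or not:
      export LD_LIBRARY_PATH=/export/shared/cuda-libs:$LD_LIBRARY_PATH
      ```

      With this in place, binaries linked against libcublas.so.6.0 can at least load on the non-GPU machines, even though only the gpu.q hosts actually run the GPU jobs.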
