
SLURM on kaldi

2015-07-14
  • Angel Castro

    Angel Castro - 2015-07-14

    Hi everyone,

    I have been trying to set up Kaldi on our cluster and replace run.pl with slurm.pl. So far my cmd.sh looks like this:

    export train_cmd="slurm.pl -N2 -c 12 -t 48:00:00"
    export decode_cmd="slurm.pl -N1 -c 12 -t 48:00:00"
    export cuda_cmd="slurm.pl -N1 --gres=gpu:2 -c 12 -t 24:00:00"
    export mkgraph_cmd="slurm.pl -c 12"

    So far it has been working fairly well, except when I use cuda_cmd. The job gets allocated correctly (so there is no problem with the slurm.conf or gres.conf files); however, srun fails and salloc relinquishes the job allocation. I checked the logs and something like this comes up:

    slurmd[tux05]: execve(): cuda-complied: No such file or directory

    I do have the cuda-compiled binary, and I am able to run it every time since I added kaldi-trunk to my PATH in my .bashrc file.

    My first guess was that, whenever a new salloc is called inside the sbatch, the node where the job is allocated doesn't have permission to access Kaldi, so I changed the permissions to make the whole folder accessible (chmod -R 777 kaldi-trunk). That didn't solve the issue, but then again train_cmd works flawlessly and it also needs access to kaldi-trunk/src.

    Is anyone experiencing similar issues? Any thoughts?

    Cheers,
    Angel
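
    As a quick sanity check for this kind of failure, one can ask an allocated node directly whether it sees the binary and the shared Kaldi directory; a minimal sketch, where the kaldi-trunk path is a placeholder for the actual install location:

    # From a login node: which host did we get, is cuda-compiled on its PATH,
    # and can that node read the shared Kaldi checkout at all?
    srun --nodes=1 --ntasks=1 bash -c 'hostname; which cuda-compiled; ls /path/to/kaldi-trunk/src'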

     
    • Daniel Povey

      Daniel Povey - 2015-07-14

      It's probably some local issue, maybe path or directory or
      permission-related, but who knows.

      I should mention, though, that this approach will only work for the
      nnet1 scripts. The nnet2 scripts add the GPU flags directly, so your
      approach wouldn't work. You should use queue.pl itself and configure it
      using a suitable conf/queue.conf file that calls SLURM with the
      appropriate arguments. See http://kaldi.sourceforge.net/queue.html

      Dan
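
      Concretely, a conf/queue.conf along those lines might look something like
      the sketch below, written in queue.pl's "command"/"option"/"default"
      config syntax. The partition names and exact sbatch flags here are
      placeholders rather than a tested configuration, and (as noted in the
      next reply) some options such as -tc do not map one-to-one onto SLURM:

      command sbatch --export=PATH --ntasks-per-node=1
      option time=* --time $0
      option mem=* --mem-per-cpu $0
      option mem=0
      option num_threads=* --cpus-per-task $0
      option num_threads=1 --cpus-per-task 1
      default gpu=0
      option gpu=0 -p cpu
      option gpu=* -p gpu --gres=gpu:$0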


       
      • Jan "yenda" Trmal

        Jan "yenda" Trmal - 2015-07-14

        The modification of the queue.pl conf file wouldn't work -- SLURM has a
        completely different style of options (for example, there is no
        equivalent of the -tc parameter; the number of concurrent jobs has to
        be appended to the job-array specification using '%') and different
        names for the environment (meta)variables we use for substitution.

        But yes, the new slurm.pl does support the new-style options, and as
        far as I can tell it is stable and ready for use.

        y.
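
        For example, where SGE uses -tc to cap concurrency, SLURM folds the
        limit into the array specification itself; a small sketch of the '%'
        syntax referred to above (script name and sizes are placeholders):

        # at most 10 of the 40 array tasks run at the same time
        sbatch --array=1-40%10 my_job.sh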


         
      • Angel Castro

        Angel Castro - 2015-07-14

        Hi Dan,
        Thanks for your quick reply. You were right: apparently I hadn't
        configured the nodes to access the folder where Kaldi is installed.
        Also, switching cuda_cmd from slurm.pl to queue.pl worked like a charm;
        I just added the local configuration file for calling SLURM.

        Cheers,
        Angel

         
    • Jan "yenda" Trmal

      Jan "yenda" Trmal - 2015-07-14

      First of all, how old is your Kaldi? It seems like you are using
      old-style options (which are not supported anymore). Can you please
      update your Kaldi and use the new-style options (--gpu=1, --mem=4G and so
      on)? The new slurm.pl does not use salloc, and I'm running GPU training
      on a daily basis.
      y.
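
      With the new-style options, the cmd.sh entries would look something like
      the following sketch (the memory and GPU values are placeholders):

      export train_cmd="slurm.pl --mem=4G"
      export decode_cmd="slurm.pl --mem=4G"
      export cuda_cmd="slurm.pl --gpu=1 --mem=8G"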


       
      • Daniel Povey

        Daniel Povey - 2015-07-14

        Yenda, so does slurm.pl support those new-style options? I didn't
        realize that; I thought the plan was to support SLURM in queue.pl using
        a config file.
        Dan
