Hi everyone,
I have been trying to set up kaldi on our cluster and replace the run.pl command with slurm.pl. So far my cmd.sh looks like this:
export train_cmd="slurm.pl -N2 -c 12 -t 48:00:00"
export decode_cmd="slurm.pl -N1 -c 12 -t 48:00:00"
export cuda_cmd="slurm.pl -N1 --gres=gpu:2 -c 12 -t 24:00:00"
export mkgraph_cmd="slurm.pl -c 12"
So far it has been working fairly well, except when I use cuda_cmd. Then it allocates the job appropriately (so there is no problem with the slurm.conf or gres.conf files); however, the srun fails and the salloc relinquishes the job allocation. I checked the logs and something like this comes up:
slurmd[tux05]: execve(): cuda-compiled: No such file or directory
I do have the file cuda-compiled, and I am able to run it every time since I added kaldi-trunk to my PATH in the .bashrc file.
My first guess was that whenever a new salloc is called inside the sbatch, the node where the job is allocated doesn't have permission to access kaldi, so I went and changed the permissions so that the whole folder was accessible (chmod -R 777 kaldi-trunk). That didn't solve the issue, but then again train_cmd works flawlessly and also needs some access to kaldi-trunk/src.
Is anyone experiencing similar issues? Any thoughts?
Cheers,
Angel
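One way to narrow down an execve() failure like the one above is to check whether the compute node itself can resolve the binary, since PATH changes made in .bashrc on the submit host do not necessarily reach the job environment. A hedged sketch (the kaldi-trunk path is a placeholder for your actual install location):

```shell
# Ask a compute node whether it can resolve cuda-compiled at all.
# If 'which' fails here, the PATH set in .bashrc is not reaching the job.
srun -N1 bash -lc 'which cuda-compiled || echo "cuda-compiled not on PATH on $(hostname)"'

# Also confirm the kaldi tree is actually visible (mounted/readable) on the node:
srun -N1 ls /path/to/kaldi-trunk/src   # placeholder path
```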
It's probably some local issue, maybe path-, directory-, or permission-related, but who knows.
I should mention, though, that that approach will only work for the
nnet1 scripts. The nnet2 scripts add the gpu flags directly, so your
approach wouldn't work. You should use queue.pl itself and
configure it using a suitable conf/queue.conf file to call slurm with
appropriate arguments. See here:
http://kaldi.sourceforge.net/queue.html
Dan
On Tue, Jul 14, 2015 at 2:25 PM, Angel Castro angel-castro@users.sf.net wrote:
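The kind of conf file Dan is describing might look roughly like this. This is a sketch modeled on the queue.pl config format; the partition names and flag mappings are illustrative for a Slurm cluster, not prescriptive:

```
# Hypothetical conf/queue.conf telling queue.pl to submit via sbatch.
# 'command' is the submit command; each 'option' line maps a generic
# queue.pl flag to a scheduler-specific argument, with $0 replaced by
# the value the caller passed.
command sbatch --export=PATH --ntasks-per-node=1
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0                                 # emit nothing when mem is unset
option num_threads=* --cpus-per-task $0
option num_threads=1
default gpu=0
option gpu=0
option gpu=* --partition=gpu --gres=gpu:$0   # 'gpu' partition name is a placeholder
```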
The modification of the queue.pl conf file wouldn't work -- slurm has a
completely different style of options (for example, there isn't an
equivalent to the parameter -tc -- the number of concurrent jobs has to be
appended to the job-array spec using '%') and different names for the
environment (meta)variables we use for substitution.
But yes, the new slurm.pl does support the new style of options and, as far
as I can tell, it is stable and ready for use.
y.
On Tue, Jul 14, 2015 at 5:28 PM, Daniel Povey danielpovey@users.sf.net
wrote:
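To illustrate the -tc difference Yenda mentions (scheduler submit commands, shown side by side for comparison only):

```
# SGE: a 100-task job array, throttled to 10 concurrent tasks via -tc
qsub -t 1-100 -tc 10 job.sh

# Slurm: the throttle is part of the --array spec itself, after '%'
sbatch --array=1-100%10 job.sh
```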
Hi Dan,
Thanks for your quick reply. You were right: apparently I hadn't configured the nodes to access the folder where kaldi was installed. Also, changing slurm.pl to queue.pl for cuda_cmd worked like a charm; I just added the local configuration file for calling slurm.
Cheers,
Angel
First of all, how old is your kaldi? Seems like you are using old-style
options (which are not supported anymore).
Can you please update your kaldi and use the new-style options (--gpu=1
--mem=4G and so on)? The new slurm.pl does not use salloc, and I'm running
GPU training on a daily basis.
y.
On Tue, Jul 14, 2015 at 5:25 PM, Angel Castro angel-castro@users.sf.net
wrote:
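In cmd.sh terms, the new-style options Yenda refers to would make the original exports look something like this. A sketch only: the memory sizes and thread counts are illustrative values, not recommendations.

```shell
# Hypothetical cmd.sh using the generic new-style options, which the
# wrapper script translates into scheduler-specific flags.
export train_cmd="slurm.pl --mem 4G --num-threads 12"
export decode_cmd="slurm.pl --mem 4G --num-threads 12"
export cuda_cmd="slurm.pl --gpu 1 --mem 8G"
export mkgraph_cmd="slurm.pl --mem 8G"
```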
Yenda, so does the slurm.pl support those new-style options? I didn't
realize that; I thought the plan was to support slurm in queue.pl using
a config file.
Dan
On Tue, Jul 14, 2015 at 2:30 PM, Jan jtrmal@users.sf.net wrote: