Hi everyone,
I have been trying to set up kaldi on our cluster and replace the run.pl command with slurm.pl. So far my cmd.sh looks like this:
export train_cmd="slurm.pl -N2 -c 12 -t 48:00:00"
export decode_cmd="slurm.pl -N1 -c 12 -t 48:00:00"
export cuda_cmd="slurm.pl -N1 --gres=gpu:2 -c 12 -t 24:00:00"
export mkgraph_cmd="slurm.pl -c 12"
So far it has been working fairly well, except when I use cuda_cmd. Then it allocates the job appropriately (so there is no problem with the slurm.conf or gres.conf files); however, the srun fails and the salloc relinquishes the job allocation. I checked the logs and something like this comes up:
slurmd[tux05]: execve(): cuda-compiled: No such file or directory
I do have the file cuda-compiled, and I am able to run it every time since I added kaldi-trunk to my PATH in the .bashrc file.
My first guess was that whenever a new salloc is called inside the sbatch, the node where the job is allocated doesn't have permission to access kaldi, so I went and changed the permissions so that the whole folder was accessible (chmod -R 777 kaldi-trunk). That didn't solve the issue, but then again train_cmd works flawlessly and also needs some access to kaldi-trunk/src.
Is anyone experiencing similar issues? Any thoughts?
Cheers,
Angel
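One way to narrow down an execve() failure like the one above is to check whether the compute node itself can resolve the binary, since PATH changes made in .bashrc on the submit host do not necessarily reach the job environment. A hedged sketch (the kaldi-trunk path is a placeholder for your actual install location):

```shell
# Ask a compute node whether it can resolve cuda-compiled at all.
# If 'which' fails here, the PATH set in .bashrc is not reaching the job.
srun -N1 bash -lc 'which cuda-compiled || echo "cuda-compiled not on PATH on $(hostname)"'

# Also confirm the kaldi tree is actually visible (mounted/readable) on the node:
srun -N1 ls /path/to/kaldi-trunk/src   # placeholder path
```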
It's probably some local issue, maybe path-, directory-, or permission-related, but who knows.
I should mention, though, that that approach will only work for the
nnet1 scripts. The nnet2 scripts add the gpu flags directly, so your
approach wouldn't work. You should use queue.pl itself and
configure it using a suitable conf/queue.conf file to call slurm with
appropriate arguments. See here:
http://kaldi.sourceforge.net/queue.html
Dan
On Tue, Jul 14, 2015 at 2:25 PM, Angel Castro angel-castro@users.sf.net wrote:
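The kind of conf file Dan is describing might look roughly like this. This is a sketch modeled on the queue.pl config format; the partition names and flag mappings are illustrative for a Slurm cluster, not prescriptive:

```
# Hypothetical conf/queue.conf telling queue.pl to submit via sbatch.
# 'command' is the submit command; each 'option' line maps a generic
# queue.pl flag to a scheduler-specific argument, with $0 replaced by
# the value the caller passed.
command sbatch --export=PATH --ntasks-per-node=1
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0                                 # emit nothing when mem is unset
option num_threads=* --cpus-per-task $0
option num_threads=1
default gpu=0
option gpu=0
option gpu=* --partition=gpu --gres=gpu:$0   # 'gpu' partition name is a placeholder
```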
The modification of the queue.pl conf file wouldn't work -- slurm has a
completely different style of options (for example, there isn't an
equivalent to the parameter -tc -- the number of concurrent jobs has to be
appended to the job-array spec using '%') and different names for the
environment (meta)variables we use for substitution.
But yes, the new slurm.pl does support the new style of options and, as far
as I can tell, it is stable and ready for use.
y.
On Tue, Jul 14, 2015 at 5:28 PM, Daniel Povey danielpovey@users.sf.net
wrote:
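To illustrate the -tc difference Yenda mentions (scheduler submit commands, shown side by side for comparison only):

```
# SGE: a 100-task job array, throttled to 10 concurrent tasks via -tc
qsub -t 1-100 -tc 10 job.sh

# Slurm: the throttle is part of the --array spec itself, after '%'
sbatch --array=1-100%10 job.sh
```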
Hi Dan,
Thanks for your quick reply. You were right: apparently I hadn't configured the nodes to access the folder where kaldi was installed. Also, changing slurm.pl to queue.pl for cuda_cmd worked like a charm; I just added the local configuration file for calling slurm.
Cheers,
Angel
First of all, how old is your kaldi? Seems like you are using old-style
options (which are not supported anymore).
Can you please update your kaldi and use the new-style options (--gpu=1
--mem=4G and so on)? The new slurm.pl does not use salloc, and I'm running
GPU training on a daily basis.
y.
On Tue, Jul 14, 2015 at 5:25 PM, Angel Castro angel-castro@users.sf.net
wrote:
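In cmd.sh terms, the new-style options Yenda refers to would make the original exports look something like this. A sketch only: the memory sizes and thread counts are illustrative values, not recommendations.

```shell
# Hypothetical cmd.sh using the generic new-style options, which the
# wrapper script translates into scheduler-specific flags.
export train_cmd="slurm.pl --mem 4G --num-threads 12"
export decode_cmd="slurm.pl --mem 4G --num-threads 12"
export cuda_cmd="slurm.pl --gpu 1 --mem 8G"
export mkgraph_cmd="slurm.pl --mem 8G"
```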
Yenda, so does the slurm.pl support those new-style options? I didn't
realize that; I thought the plan was to support slurm in queue.pl using
a config file.
Dan
On Tue, Jul 14, 2015 at 2:30 PM, Jan jtrmal@users.sf.net wrote: