Hi,
I am trying to add 4 GPUs on one node using SLURM.
Here is what I did, according to slurm-roll.pdf:
I updated the gres.conf.1 file as follows.
Example for four Nvidia GPUs
Name=gpu Type=nvidia File=/dev/nvidia0 CPUs=0
Name=gpu Type=nvidia File=/dev/nvidia1 CPUs=1
Name=gpu Type=nvidia File=/dev/nvidia2 CPUs=2
Name=gpu Type=nvidia File=/dev/nvidia3 CPUs=3
rocks set host attr compute-0-1 slurm_gres_template value="gres.conf.1"
rocks set host attr compute-0-1 slurm_gres value="gpu"
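To double-check, both attributes can be listed back with the standard Rocks attribute command (a quick sanity check, assuming a normal frontend setup):

# show the slurm-related attributes for the node
rocks list host attr compute-0-1 | grep slurm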
Then, when I run the "rocks sync slurm" command, I get the following error:
compute-0-1: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
pdsh@jcluster: compute-0-1: ssh exited with exit code 1
Here is the error message on compute-0-1
[chris@compute-0-1 dev]$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2018-04-24 18:34:00 EDT; 15s ago
Process: 15289 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 14897 (code=exited, status=0/SUCCESS)
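For anyone hitting the same failure, the actual startup error usually shows up in the slurmd journal, or when running the daemon in the foreground (standard systemd and slurmd options, as a sketch):

# last slurmd messages from the systemd journal
journalctl -u slurmd --no-pager -n 50
# or run slurmd in the foreground with verbose logging
/usr/sbin/slurmd -D -vvv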
Any idea?
Thanks for any help.
Chris
Hi,
The device files (/dev/nvidia*) are not created on boot if you don't use X Windows.
To create the device files, log in as root on compute-0-1 and call nvidia-smi.
Then look at the device files; the permissions should be crw-rw-rw-.
Call rocks sync slurm again.
If this does not help, then look at the file /var/log/slurm/slurm.log.
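Put together, the suggested fix might look like this, run from the frontend (assuming root ssh access to the compute node):

# poke the driver once so the /dev/nvidia* device files get created
ssh compute-0-1 nvidia-smi
# confirm the device files exist with crw-rw-rw- permissions
ssh compute-0-1 'ls -l /dev/nvidia*'
# push the slurm configuration and restart the daemons
rocks sync slurm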
Best regards
Werner
Hi Christophe
As Werner says, slurmd will fail if the nvidia device files haven't been created. On our GPU nodes, the device files don't exist when slurm starts, so I have this script run after the node boots:
[[ -e /dev/nvidia0 ]] || nvidia-smi                        # create the /dev/nvidia* device files if missing
/sbin/service slurm status || /sbin/service slurm start    # start slurm if it is not already running
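One way to run this after boot, as a sketch, is to append it to /etc/rc.d/rc.local on the node (on CentOS 7 the file must be made executable; the service name may be slurm or slurmd depending on the roll version):

# add the check to rc.local so it runs at the end of every boot
cat >> /etc/rc.d/rc.local <<'EOF'
[[ -e /dev/nvidia0 ]] || nvidia-smi
/sbin/service slurm status || /sbin/service slurm start
EOF
chmod +x /etc/rc.d/rc.local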
--
Ian