Hi,
I am trying to add 4 GPUs on one node using SLURM.
Here is what I did, according to slurm-roll.pdf:
I updated the gres.conf.1 file as follows.
Example for four Nvidia GPUs
Name=gpu Type=nvidia File=/dev/nvidia0 CPUs=0
Name=gpu Type=nvidia File=/dev/nvidia1 CPUs=1
Name=gpu Type=nvidia File=/dev/nvidia2 CPUs=2
Name=gpu Type=nvidia File=/dev/nvidia3 CPUs=3
rocks set host attr compute-0-1 slurm_gres_template value="gres.conf.1"
rocks set host attr compute-0-1 slurm_gres value="gpu"
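To double-check, both attributes can be listed back with the standard Rocks attribute command (a quick sanity check, assuming a normal frontend setup):

# show the slurm-related attributes for the node
rocks list host attr compute-0-1 | grep slurm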
Then, when I run the "rocks sync slurm" command, I get the following error:
compute-0-1: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
pdsh@jcluster: compute-0-1: ssh exited with exit code 1
Here is the error message on compute-0-1
[chris@compute-0-1 dev]$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2018-04-24 18:34:00 EDT; 15s ago
Process: 15289 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 14897 (code=exited, status=0/SUCCESS)
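For anyone hitting the same failure, the actual startup error usually shows up in the slurmd journal, or when running the daemon in the foreground (standard systemd and slurmd options, as a sketch):

# last slurmd messages from the systemd journal
journalctl -u slurmd --no-pager -n 50
# or run slurmd in the foreground with verbose logging
/usr/sbin/slurmd -D -vvv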
Any idea?
Thanks for any help.
Chris
Hi,
The device files (/dev/nvidia*) are not created on boot if you don't use X Windows.
To create the device files, log in as root on compute-0-1 and call nvidia-smi.
Then look at the device files; the permissions should be crw-rw-rw-.
Call rocks sync slurm again.
If this does not help, then look at the file /var/log/slurm/slurm.log.
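Put together, the suggested fix might look like this, run from the frontend (assuming root ssh access to the compute node):

# poke the driver once so the /dev/nvidia* device files get created
ssh compute-0-1 nvidia-smi
# confirm the device files exist with crw-rw-rw- permissions
ssh compute-0-1 'ls -l /dev/nvidia*'
# push the slurm configuration and restart the daemons
rocks sync slurm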
Best regards
Werner
Hi Christophe
As Werner says, slurmd will fail if the nvidia device files haven't been created. On our GPU nodes, the device files don't exist when slurm starts, so I have this script run after the node boots:
[[ -e /dev/nvidia0 ]] || nvidia-smi                        # create the /dev/nvidia* device files if missing
/sbin/service slurm status || /sbin/service slurm start    # start slurm if it is not already running
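One way to run this after boot, as a sketch, is to append it to /etc/rc.d/rc.local on the node (on CentOS 7 the file must be made executable; the service name may be slurm or slurmd depending on the roll version):

# add the check to rc.local so it runs at the end of every boot
cat >> /etc/rc.d/rc.local <<'EOF'
[[ -e /dev/nvidia0 ]] || nvidia-smi
/sbin/service slurm status || /sbin/service slurm start
EOF
chmod +x /etc/rc.d/rc.local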
--
Ian