
Error trying to install 4 GPUs on compute-0-1

  • Christophe Guilbert - 2018-04-24

    Hi,
    I am trying to add 4 GPUs on one node using SLURM.
    Here is what I did, following slurm-roll.pdf.

    I updated the gres.conf.1 file as follows, extending the PDF's two-GPU example to four:

    Name=gpu Type=nvidia File=/dev/nvidia0 CPUs=0
    Name=gpu Type=nvidia File=/dev/nvidia1 CPUs=1
    Name=gpu Type=nvidia File=/dev/nvidia2 CPUs=2
    Name=gpu Type=nvidia File=/dev/nvidia3 CPUs=3

    rocks set host attr compute-0-1 slurm_gres_template value="gres.conf.1"
    rocks set host attr compute-0-1 slurm_gres value="gpu"
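
    For reference, the slurm.conf that the roll regenerates should end up with a matching node-level Gres entry, something along these lines (just a sketch of what I expect; I have not confirmed the exact generated output):

    # expected (hypothetical) excerpt of the generated slurm.conf
    GresTypes=gpu
    NodeName=compute-0-1 Gres=gpu:nvidia:4 ...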

    Then, when I run the "rocks sync slurm" command, I get the following error:
    compute-0-1: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
    pdsh@jcluster: compute-0-1: ssh exited with exit code 1

    Here is the error message on compute-0-1
    [chris@compute-0-1 dev]$ systemctl status slurmd.service
    ● slurmd.service - Slurm node daemon
    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
    Active: failed (Result: exit-code) since Tue 2018-04-24 18:34:00 EDT; 15s ago
    Process: 15289 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
    Main PID: 14897 (code=exited, status=0/SUCCESS)
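
    For reference, a couple of ways to pull more detail out of slurmd on the node (a sketch; I have not dug through this output yet):

    # run slurmd in the foreground with extra verbosity to see the real error
    /usr/sbin/slurmd -D -vvv
    # or check the journal for the failed unit
    journalctl -u slurmd.service -xe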

    Any idea?
    Thanks for any help.

    Chris

    • Werner Saar - 2018-04-25

      Hi,

      The device files (/dev/nvidia*) are not created at boot if you don't use X Windows.
      To create the device files, log in as root on compute-0-1 and run nvidia-smi.
      Then look at the device files; the permissions should be rw-rw-rw (world read and write).
      Call rocks sync slurm again.

      If this does not help, then look at the file /var/log/slurm/slurm.log.
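
      A minimal sketch of the steps above, assuming a standard setup (run on compute-0-1 as root, then sync from the frontend):

      nvidia-smi                 # creates the /dev/nvidia* device files if they are missing
      ls -l /dev/nvidia*         # each device should be world read/write
      # then, from the frontend:
      rocks sync slurm
      # if slurmd still fails to start, check the node's log:
      less /var/log/slurm/slurm.log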

      Best regards
      Werner

  • Ian Mortimer - 2018-04-25

    Hi Christophe

    As Werner says, slurm will fail if the nvidia device files haven't been created. On our GPU nodes the device files don't exist when slurm starts, so I have this script run after the node boots:

    [[ -e /dev/nvidia0 ]] || nvidia-smi
    /sbin/service slurm status || /sbin/service slurm start
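
    One simple way to hook a snippet like that into the boot sequence (a sketch, assuming a stock CentOS 7 node; any post-boot hook would do) is rc.local:

    # append the check to rc.local and make sure it is executable
    cat >> /etc/rc.d/rc.local <<'EOF'
    [[ -e /dev/nvidia0 ]] || nvidia-smi
    /sbin/service slurm status || /sbin/service slurm start
    EOF
    chmod +x /etc/rc.d/rc.local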

    --
    Ian

