
Batch job only runs on single core

2018-12-04 – 2018-12-05
  • David Chalmers

    David Chalmers - 2018-12-04

    Hi All,

    I have a new Rocks GPU cluster with the slurm roll installed.

    I am trying to submit Gromacs jobs. Gromacs uses hybrid MPI/OpenMP parallelisation.

    Gromacs jobs run correctly when run logged into the compute node.

    Command: /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0/bin/gmx mdrun -deffnm md6

    Top shows a single process using > 100% CPU

    Tasks: 472 total, 2 running, 470 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 96.9 us, 3.0 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 32340540 total, 23928132 free, 5520972 used, 2891436 buff/cache
    KiB Swap: 1048572 total, 1048572 free, 0 used. 25748636 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    10648 david 20 0 46.297g 5.184g 664920 R 4800 16.8 76:50.89 gmx
    10803 root 20 0 158116 2552 1484 R 5.6 0.0 0:00.03 top
    1 root 20 0 47036 7416 2504 S 0.0 0.0 0:03.42 systemd
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.06 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:14.59 ksoftirqd/0

    Jobs submitted through sbatch start to run, but stall after writing preliminary information to the log file. Top shows only 100% CPU usage.

    top - 17:32:34 up 1 day, 2:41, 2 users, load average: 0.27, 7.06, 7.83
    Tasks: 474 total, 2 running, 472 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.4 us, 1.7 sy, 0.0 ni, 97.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 32340540 total, 29322852 free, 519560 used, 2498128 buff/cache
    KiB Swap: 1048572 total, 1048572 free, 0 used. 31182300 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    10922 david 20 0 6779248 9036 6564 R 98.7 0.0 0:10.01 gmx
    2150 root -51 0 0 0 0 S 1.0 0.0 25:33.73 irq/64-nvidia
    10 root 20 0 0 0 0 S 0.7 0.0 1:51.39 rcu_sched
    3 root 20 0 0 0 0 S 0.3 0.0 0:14.61 ksoftirqd/0
    1464 root 20 0 224884 12124 6400 S 0.3 0.0 1:07.08 snmpd

    The Gromacs logfile is empty.

    The slurm.out file contains only the following:

                      :-) GROMACS - gmx mdrun, 2018.3 (-:
    
                            GROMACS is written by:
     Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
    Par Bjelkmar    Aldert van Buuren   Rudi van Drunen     Anton Feenstra
    Gerrit Groenhof Aleksei Iupinov Christoph Junghans Anca Hamuraru
    Vincent Hindriksen Dimitrios Karkoulis Peter Kasson Jiri Kraus
    Carsten Kutzner Per Larsson Justin A. Lemkul Viveca Lindahl
    Magnus Lundborg Pieter Meulenhoff Erik Marklund Teemu Murtola
    Szilard Pall Sander Pronk Roland Schulz Alexey Shvetsov
    Michael Shirts Alfons Sijbers Peter Tieleman Teemu Virolainen
    Christian Wennberg Maarten Wolf
    and the project leaders:
    Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

    Copyright (c) 1991-2000, University of Groningen, The Netherlands.
    Copyright (c) 2001-2017, The GROMACS development team at
    Uppsala University, Stockholm University and
    the Royal Institute of Technology, Sweden.
    check out http://www.gromacs.org for more information.

    GROMACS is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License
    as published by the Free Software Foundation; either version 2.1
    of the License, or (at your option) any later version.

    GROMACS: gmx mdrun, version 2018.3
    Executable: /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0/bin/gmx
    Data prefix: /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0
    Working dir: /scratch/david/sim135-5-3-2
    Command line:
    gmx mdrun -deffnm md6

    Back Off! I just backed up md6.log to ./#md6.log.28#
    slurmstepd: error: JOB 127 ON compute-0-0 CANCELLED AT 2018-12-04T17:30:42

    Any suggestions appreciated.

    David

     
    • Werner Saar

      Werner Saar - 2018-12-04

      Hi,

      Please send me the output of scontrol show node and scontrol show part, and the job file for sbatch.
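
      If it is easier, both outputs can be captured into one file and attached. A minimal sketch (the output file name is only an example):

      ~~~
      # run on the head node; the output file name is just an example
      scontrol show node > slurm-info.txt
      scontrol show part >> slurm-info.txt
      ~~~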

      But I don't have any experience with gromacs.

      You should also use the slurm-users discussion group.

      Best regards

      Werner


       
  • David Chalmers

    David Chalmers - 2018-12-05

    Hi Werner,

    scontrol show node gives:

    NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=12 
       CPUAlloc=0 CPUTot=48 CPULoad=0.01
       AvailableFeatures=rack-0,48CPUs
       ActiveFeatures=rack-0,48CPUs
       Gres=gpu:3
       NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=31582 AllocMem=0 FreeMem=28683 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=209402 Weight=20527900 Owner=N/A MCS_label=N/A
       Partitions=CLUSTER,WHEEL 
       BootTime=2018-12-03T14:50:52 SlurmdStartTime=2018-12-04T13:57:55
       CfgTRES=cpu=48,mem=31582M,billing=57,gres/gpu=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=12 
       CPUAlloc=0 CPUTot=48 CPULoad=0.01
       AvailableFeatures=rack-0,48CPUs
       ActiveFeatures=rack-0,48CPUs
       Gres=gpu:3
       NodeAddr=10.1.1.253 NodeHostName=compute-0-1 Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=23519 AllocMem=0 FreeMem=17429 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=209402 Weight=20527899 Owner=N/A MCS_label=N/A
       Partitions=CLUSTER,WHEEL 
       BootTime=2018-12-01T11:43:44 SlurmdStartTime=2018-12-04T13:57:55
       CfgTRES=cpu=48,mem=23519M,billing=55,gres/gpu=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    NodeName=compute-0-2 Arch=x86_64 CoresPerSocket=12 
       CPUAlloc=0 CPUTot=48 CPULoad=0.01
       AvailableFeatures=rack-0,48CPUs
       ActiveFeatures=rack-0,48CPUs
       Gres=gpu:3
       NodeAddr=10.1.1.252 NodeHostName=compute-0-2 Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=31583 AllocMem=0 FreeMem=26097 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=209402 Weight=20527898 Owner=N/A MCS_label=N/A
       Partitions=CLUSTER,WHEEL 
       BootTime=2018-12-01T11:43:44 SlurmdStartTime=2018-12-04T13:57:55
       CfgTRES=cpu=48,mem=31583M,billing=57,gres/gpu=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    NodeName=compute-0-3 Arch=x86_64 CoresPerSocket=12 
       CPUAlloc=0 CPUTot=48 CPULoad=0.01
       AvailableFeatures=rack-0,48CPUs
       ActiveFeatures=rack-0,48CPUs
       Gres=gpu:3
       NodeAddr=10.1.1.251 NodeHostName=compute-0-3 Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=31583 AllocMem=0 FreeMem=28424 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=209402 Weight=20527897 Owner=N/A MCS_label=N/A
       Partitions=CLUSTER,WHEEL 
       BootTime=2018-12-01T11:44:12 SlurmdStartTime=2018-12-04T13:57:55
       CfgTRES=cpu=48,mem=31583M,billing=57,gres/gpu=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    NodeName=compute-0-4 Arch=x86_64 CoresPerSocket=12 
       CPUAlloc=0 CPUTot=48 CPULoad=0.01
       AvailableFeatures=rack-0,48CPUs
       ActiveFeatures=rack-0,48CPUs
       Gres=gpu:3
       NodeAddr=10.1.1.250 NodeHostName=compute-0-4 Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=31583 AllocMem=0 FreeMem=26426 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=209402 Weight=20527896 Owner=N/A MCS_label=N/A
       Partitions=CLUSTER,WHEEL 
       BootTime=2018-12-01T11:43:44 SlurmdStartTime=2018-12-04T13:57:55
       CfgTRES=cpu=48,mem=31583M,billing=57,gres/gpu=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    NodeName=compute-0-5 Arch=x86_64 CoresPerSocket=12 
       CPUAlloc=0 CPUTot=48 CPULoad=0.01
       AvailableFeatures=rack-0,48CPUs
       ActiveFeatures=rack-0,48CPUs
       Gres=gpu:3
       NodeAddr=10.1.1.249 NodeHostName=compute-0-5 Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=31583 AllocMem=0 FreeMem=26087 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=209402 Weight=20527895 Owner=N/A MCS_label=N/A
       Partitions=CLUSTER,WHEEL 
       BootTime=2018-12-01T11:43:39 SlurmdStartTime=2018-12-04T13:57:55
       CfgTRES=cpu=48,mem=31583M,billing=57,gres/gpu=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    NodeName=oracle Arch=x86_64 CoresPerSocket=1 
       CPUAlloc=0 CPUTot=1 CPULoad=0.01
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=10.1.1.1 NodeHostName=oracle Version=18.08
       OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
       RealMemory=7372 AllocMem=0 FreeMem=3679 Sockets=1 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=50268 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=WHEEL 
       BootTime=2018-12-04T13:32:14 SlurmdStartTime=2018-12-04T16:15:11
       CfgTRES=cpu=1,mem=7372M,billing=1
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    

    scontrol show part:

    PartitionName=DEBUG
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=oracle Default=NO QoS=N/A
       DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=(null)
       PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=0 TotalNodes=0 SelectTypeParameters=NONE
       JobDefaults=(null)
       DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
    
    PartitionName=CLUSTER
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=oracle Default=YES QoS=N/A
       DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=compute-0-[0-5]
       PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=288 TotalNodes=6 SelectTypeParameters=NONE
       JobDefaults=(null)
       DefMemPerCPU=512 MaxMemPerNode=UNLIMITED
       TRESBillingWeights=CPU=1.0,Mem=0.25G,GRES/gpu=2.0
    
    PartitionName=WHEEL
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=oracle Default=NO QoS=N/A
       DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=compute-0-[0-5],oracle
       PriorityJobFactor=1000 PriorityTier=1000 RootOnly=YES ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=289 TotalNodes=7 SelectTypeParameters=NONE
       JobDefaults=(null)
       DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
    

    The sbatch input file is below, although I have tried changing many options with no result!

    #SBATCH --mem=8000
    
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:3
    
    /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0/bin/gmx mdrun -deffnm md6
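
    For comparison, a fuller script that requests the CPUs explicitly for a hybrid MPI/OpenMP run might look something like the sketch below. The task/thread counts and the OMP_NUM_THREADS line are only illustrative, not copied from any of my actual attempts:

    ~~~
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1               # one MPI rank (illustrative)
    #SBATCH --cpus-per-task=12       # cores for OpenMP threads (illustrative)
    #SBATCH --mem=8000
    #SBATCH --gres=gpu:3

    # pass the allocated core count on to GROMACS as OpenMP threads
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

    /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0/bin/gmx mdrun -deffnm md6 -ntomp $SLURM_CPUS_PER_TASK
    ~~~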
    

    Thanks

    David

     
    • Werner Saar

      Werner Saar - 2018-12-05

      Hi,

      You need to give more options. You only gave:

      --nodes=1
      --mem=8000
      --gres=gpu:3

      I would test the following: call salloc with these options:

      --time=4:00:00         (four hours of runtime)
      --nodes=1              (one node)
      --ntasks=1             (one task per node - or do you need more tasks per node?)
      --cpus-per-task=8      (eight CPUs per task)
      --mem-per-cpu=2048     (give each CPU 2 GB of memory)
      --gres=gpu:1           (try it first with one GPU)

      The complete line is:

      salloc --time=4:00:00 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=2048 --gres=gpu:1

      now run in the salloc session:

      srun --pty bash -i

      and run the application:

      /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0/bin/gmx mdrun -deffnm md6
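
      Put together, the whole test is one sequence; if gmx then runs on 8 CPUs, roughly the same options should also work as #SBATCH lines in the job file:

      ~~~
      # on the head node: request the allocation
      salloc --time=4:00:00 --nodes=1 --ntasks=1 --cpus-per-task=8 \
             --mem-per-cpu=2048 --gres=gpu:1
      # inside the allocation: get a shell on the compute node
      srun --pty bash -i
      # on the compute node: run the application and watch it with top
      /apps/linux/gromacs/gromacs-2018.3_Intel_CUDA8.0/bin/gmx mdrun -deffnm md6
      ~~~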

