
Performance Regression v8->v9

Elk Users
2024-02-07
2024-04-01
  • Andreas Fischer

    Andreas Fischer - 2024-02-07

    Hi,

    my users are reporting a performance regression when switching from v8.8.26 to v9.1.15 or v9.2.12. The issue becomes especially apparent when the number of k-points is equal to the number of MPI tasks.

    All versions were compiled with the Intel classic compiler (v2023.2.1) and MKL, using the following make.inc:

    MAKE = make
    AR = xiar
    F90 = mpiifort
    F90_OPTS = -O3 -march=core-avx2 -align array64byte -fma -ftz -fomit-frame-pointer -ipo -qopenmp -qmkl=parallel
    F90_LIB = -liomp5 -lpthread -lm -ldl
    SRC_MKL =
    SRC_OMP =
    SRC_MPI =
    SRC_OBLAS = oblas_stub.f90
    SRC_BLIS = blis_stub.f90
    LIB_LIBXC = libxcf90.a libxc.a
    SRC_LIBXC = libxcf90.f90 libxcifc.f90
    SRC_FFT = mkl_dfti.f90 zfftifc_mkl.f90 cfftifc_mkl.f90
    SRC_W90S =
    LIB_W90 = libwannier.a
    

    All elk versions are compiled with the same make.inc, compiler and libraries. The nodes are dual EPYC 7713 machines, which were otherwise empty during these benchmarks.

    Test case (based on B12 from examples/basic):
    elk.in:

    ! B12 ground state (Andrew Chizmeshya)
    
    tasks
      0
    
    vhighq
    .true.
    
    ngridk
    2 2 2
    
    scale
      9.6376071
    
    avec
      0.55522    0.00000    0.82027
     -0.27761    0.48083    0.82027
     -0.27761   -0.48083    0.82027
    
    sppath
      ''
    
    atoms
      1                                 : nspecies
      'B.in'                             : spfname
      12                                : natoms; atposl below
      0.77917   0.77917   0.36899
      0.36899   0.77917   0.77914
      0.77917   0.36899   0.77918
      0.22082   0.22082   0.63108
      0.22082   0.63100   0.22082
      0.63100   0.22082   0.22086
      0.98989   0.98989   0.34576
      0.34579   0.98989   0.98986
      0.98989   0.34579   0.98981
      0.01010   0.01010   0.65424
      0.01010   0.65420   0.01019
      0.65420   0.01010   0.01014
    

    B.in

       'B'                                        : spsymb
     'boron'                                    : spname
      -5.00000                                  : spzn
       19707.24741                              : spmass
      0.894427E-06    1.8000   47.7465   300    : rminsp, rmt, rmaxsp, nrmt
       3                                        : nstsp
       1   0   1   2.00000    T                 : nsp, lsp, ksp, occsp, spcore
       2   0   1   2.00000    F
       2   1   1  1.000000    F
       1                                        : apword
        0.1500   0  F                           : apwe0, apwdm, apwve
       0                                        : nlx
       2                                        : nlorb
       0   2                                    : lorbl, lorbord
        0.1500   0  F                           : lorbe0, lorbdm, lorbve
        0.1500   1  F
       1   2                                    : lorbl, lorbord
        0.1500   0  F                           : lorbe0, lorbdm, lorbve
        0.1500   1  F
    

    Slurm-Parameters:

    #!/bin/bash
    #SBATCH --job-name=8kpt
    #SBATCH --partition=epyc
    #SBATCH --ntasks-per-node=1
    #SBATCH --nodes=8
    #SBATCH --cpus-per-task=8
    #SBATCH --mem-per-cpu=4G
    #SBATCH --time=7-0
    
    ml purge
    ml elk/8.8.26
    #ml elk/9.1.15
    #ml elk/9.2.12
    
    
    srun elk
    

    This example has 8 k-points, and when this is run on 64 cores (8 MPI tasks, 8 OpenMP threads each) the runtime is as follows:

    • v8.8.26: 2m37s
    • v9.1.15: 5m01s
    • v9.2.12: 4m42s

    Our elk Lmod module sets essentially the same environment as elk.sh would, and sets OMP_NUM_THREADS according to the chosen Slurm value:

    export OMP_NUM_THREADS=8
    export OMP_PROC_BIND=false
    export OMP_STACKSIZE=256M
    ulimit -Ss unlimited
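
    For reference, a rough sketch of the equivalent settings done directly in the batch script rather than in the module, with the thread count taken from the Slurm allocation instead of hard-coded (SLURM_CPUS_PER_TASK is the variable Slurm exports when --cpus-per-task is given):

    # Sketch only, not our actual module: set the OpenMP environment in the
    # job script itself and derive the thread count from the Slurm allocation.
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    export OMP_PROC_BIND=false
    export OMP_STACKSIZE=256M
    ulimit -Ss unlimited
    srun elk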
    

    When observing the processes via htop, the following can be seen (a rough sketch of a sampling loop follows the list):
    * v8.8.26: CPU utilization per MPI task is around 800%, except when the cycle finishes
    * v9.1.15: CPU utilization jumps to 800% only for a small timeframe, then maxes out at 200% for the majority of each cycle
    * v9.2.12: same as v9.1.15
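
    A rough sketch of such a sampling loop, assuming the process is simply named elk, which records the per-process CPU utilization to a log file instead of requiring htop to be watched interactively:

    # Sketch only: sample the CPU usage of the elk process every 10 s in
    # batch mode and append the matching top lines to a log file.
    while true; do
      top -b -n 1 | grep -w elk >> cpu_usage.log
      sleep 10
    done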

    Can anybody reproduce this and/or explain the difference?

     
  • J. K. Dewhurst

    J. K. Dewhurst - 2024-02-07

    Hi Andreas,

    Thanks for discovering this! It was a fairly obscure bug: the threads used for calculating the linearisation energies were not being freed up, resulting in too few threads being available to MKL for the diagonalisation.

    You can fix the problem quite easily by adding the line call freethd(nthd) at the end of linengy.f90 here:

    ...
    end do
    !$OMP END PARALLEL
    call freethd(nthd)
    if (mp_mpi.and.(nnf > 0)) then
    ...
    

    Then version 9.2.12 will be at least as fast as 8.8.26.

    I'll release a fixed version with some additional optimisations next week.

    Thanks and regards,
    Kay.

     

    Last edit: J. K. Dewhurst 2024-02-07
  • Andreas Fischer

    Andreas Fischer - 2024-02-08

    Hi Kay,

    thanks for the quick fix, which I could successfully verify.

    Best,
    Andreas

     
  • J. K. Dewhurst

    J. K. Dewhurst - 2024-02-21

    Hi Andreas,

    Elk version 9.4.2 has been released with the fix.

    Thanks and regards,
    Kay.

     
  • Ronald Cohen

    Ronald Cohen - 2024-04-01

    I am having a similar problem to this, with the number of threads used dropping to 1 per process after a short time. I am running 9.5.1 though, so it is not this exact problem, but perhaps something similar somewhere else. The problem occurs during the very first k-point. Here is my input:

     
  • Ronald Cohen

    Ronald Cohen - 2024-04-01

    Actually I have the same problem with elk 9.4.2. Here is a log showing CPU usage on an exclusive node:

    /carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$ grep elk log.dat
    1382219 rcohen 20 0 245536 26620 12600 S 0.0 0.0 0:00.02 elk
    1382219 rcohen 20 0 20.5g 547880 35964 R 2307 0.2 3:51.64 elk
    1382219 rcohen 20 0 20.6g 620884 36120 R 3183 0.2 9:10.89 elk
    1382219 rcohen 20 0 20.8g 721564 37108 R 1701 0.3 12:01.70 elk
    1382219 rcohen 20 0 20.8g 721564 37108 R 99.7 0.3 12:11.71 elk
    1382219 rcohen 20 0 20.8g 721564 37108 R 99.6 0.3 12:21.71 elk
    1382219 rcohen 20 0 20.8g 721580 37124 R 99.6 0.3 12:31.71 elk
    1382219 rcohen 20 0 20.8g 721580 37124 R 99.7 0.3 12:41.72 elk
    1382219 rcohen 20 0 20.8g 721580 37124 R 99.6 0.3 12:51.72 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:01.72 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 13:11.73 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:21.73 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:31.73 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 13:41.74 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:51.74 elk
    1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 14:01.75 elk
    1382219 rcohen 20 0 20.8g 747364 37128 R 99.6 0.3 14:11.75 elk
    /carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$

    It very quickly goes to just a single thread.

    Ron

     
  • Ronald Cohen

    Ronald Cohen - 2024-04-01

    Actually I have this same problem with 8.8.26:

    grep elk log.dat
    1318993 rcohen 20 0 245608 28360 12288 S 0.0 0.0 0:00.00 elk
    1318993 rcohen 20 0 20.3g 440452 32700 R 2341 0.3 3:54.60 elk
    1318993 rcohen 20 0 20.4g 507720 32776 R 3187 0.4 9:13.97 elk
    1318993 rcohen 20 0 20.4g 557324 32776 R 2774 0.4 13:52.18 elk
    1318993 rcohen 20 0 20.6g 635700 33488 R 99.7 0.5 14:02.17 elk
    1318993 rcohen 20 0 20.6g 635700 33488 R 99.8 0.5 14:12.17 elk
    1318993 rcohen 20 0 20.6g 635700 33488 R 99.7 0.5 14:22.16 elk
    1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:32.15 elk
    1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:42.14 elk
    1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:52.13 elk

     
  • Ronald Cohen

    Ronald Cohen - 2024-04-01

    and 8.7.10. Very strange.

     
  • J. K. Dewhurst

    J. K. Dewhurst - 2024-04-01

    Try running on one node with

    wrtdsk
     .false.
    

    I suspect that the filesystem is holding up everything.

     
    • Ronald Cohen

      Ronald Cohen - 2024-04-01

      I added this as you suggested, but it still does the same thing:

      1328993 rcohen 20 0 245532 26640 12616 S 0.0 0.0 0:00.00 elk
      1328993 rcohen 20 0 20.3g 435384 33416 R 2338 0.3 3:54.26 elk
      1328993 rcohen 20 0 20.4g 505088 33416 R 3189 0.4 9:14.43 elk
      1328993 rcohen 20 0 20.6g 603700 34152 R 1361 0.5 11:30.97 elk
      1328993 rcohen 20 0 20.6g 603700 34152 R 99.7 0.5 11:40.96 elk
      1328993 rcohen 20 0 20.6g 603700 34152 R 99.6 0.5 11:50.94 elk
      1328993 rcohen 20 0 20.6g 603712 34164 R 99.5 0.5 12:00.92 elk
      1328993 rcohen 20 0 20.6g 603712 34164 R 99.5 0.5 12:10.89 elk

      wrtdsk
      .false.

      After a short while it drops to one thread.

      Ron


       
