Hi,
my users are reporting a performance regression when switching from v8.8.26 to v9.1.15 or v9.2.12. The issue becomes especially apparent when the number of k-points is equal to the number of MPI tasks.
All versions were compiled with the Intel classic compiler (v2023.2.1) and MKL, using the following make.inc:
All Elk versions were compiled with the same make.inc, compiler and libraries. The nodes are dual AMD Epyc 7713 machines, which were idle during these benchmarks.
Test case (based on B12 from examples/basic):
elk.in:
B.in:
Slurm parameters:
This example has 8 k-points, and when this is run on 64 cores (8 MPI tasks, 8 OpenMP threads each) the runtime is as follows:
Our Elk Lmod module sets essentially what elk.sh would set, and additionally sets OMP_NUM_THREADS according to the chosen Slurm variable:
When observing the process via htop, the following can be noticed:
* v8.8.26: CPU utilization per MPI task is around 800% (all 8 OpenMP threads busy), except when the cycle finishes
* v9.1.15: CPU utilization reaches 800% only briefly, then stays at no more than 200% (2 threads) for the majority of each cycle
* v9.2.12: same as v9.1.15
Can anybody reproduce this and/or explain the difference?
Hi Andreas,
Thanks for discovering this! It was a fairly obscure bug: the threads used for calculating the linearisation energies were not being freed up, resulting in too few threads being available for MKL to use for the diagonalisation.
You can fix the problem quite easily by adding the line
call freethd(nthd)
at the end of linengy.f90. Then version 9.2.12 will be at least as fast as 8.8.26.
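For context, here is a minimal sketch of the pattern involved, assuming linengy.f90 follows Elk's usual holdthd/freethd pairing; the loop bounds and body are placeholders for illustration, not copied from the actual source:

! reserve up to nthd OpenMP threads for this section
call holdthd(natmtot,nthd)
!$OMP PARALLEL DO DEFAULT(SHARED) NUM_THREADS(nthd)
do ias=1,natmtot
! ... determine the linearisation energies for atom ias ...
end do
!$OMP END PARALLEL DO
! the missing call: return the reserved threads to the pool,
! so that MKL can use them later for the diagonalisation
call freethd(nthd)

Without the final freethd, each cycle leaves the thread pool depleted, which matches the drop from 800% to 200% CPU seen in htop.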
I'll release a fixed version with some additional optimisations next week.
Thanks and regards,
Kay.
Hi Kay,
thanks for the quick fix, which I could successfully verify.
Best,
Andreas
Hi Andreas,
Elk version 9.4.2 has been released with the fix.
Thanks and regards,
Kay.
I am having a similar problem to this, with the number of threads in use dropping to 1 per process after a short time. I am running 9.5.1 though, so it is not this exact bug, but perhaps something similar elsewhere. The problem occurs during the very first k-point. Here is my input:
Actually I have the same problem with Elk 9.4.2. Here is a log showing CPU usage on an exclusive node:
/carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$ grep elk log.dat
1382219 rcohen 20 0 245536 26620 12600 S 0.0 0.0 0:00.02 elk
1382219 rcohen 20 0 20.5g 547880 35964 R 2307 0.2 3:51.64 elk
1382219 rcohen 20 0 20.6g 620884 36120 R 3183 0.2 9:10.89 elk
1382219 rcohen 20 0 20.8g 721564 37108 R 1701 0.3 12:01.70 elk
1382219 rcohen 20 0 20.8g 721564 37108 R 99.7 0.3 12:11.71 elk
1382219 rcohen 20 0 20.8g 721564 37108 R 99.6 0.3 12:21.71 elk
1382219 rcohen 20 0 20.8g 721580 37124 R 99.6 0.3 12:31.71 elk
1382219 rcohen 20 0 20.8g 721580 37124 R 99.7 0.3 12:41.72 elk
1382219 rcohen 20 0 20.8g 721580 37124 R 99.6 0.3 12:51.72 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:01.72 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 13:11.73 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:21.73 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:31.73 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 13:41.74 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:51.74 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 14:01.75 elk
1382219 rcohen 20 0 20.8g 747364 37128 R 99.6 0.3 14:11.75 elk
/carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$
It very quickly goes to just a single thread.
Ron
Actually I have this same problem with 8.8.26:
grep elk log.dat
1318993 rcohen 20 0 245608 28360 12288 S 0.0 0.0 0:00.00 elk
1318993 rcohen 20 0 20.3g 440452 32700 R 2341 0.3 3:54.60 elk
1318993 rcohen 20 0 20.4g 507720 32776 R 3187 0.4 9:13.97 elk
1318993 rcohen 20 0 20.4g 557324 32776 R 2774 0.4 13:52.18 elk
1318993 rcohen 20 0 20.6g 635700 33488 R 99.7 0.5 14:02.17 elk
1318993 rcohen 20 0 20.6g 635700 33488 R 99.8 0.5 14:12.17 elk
1318993 rcohen 20 0 20.6g 635700 33488 R 99.7 0.5 14:22.16 elk
1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:32.15 elk
1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:42.14 elk
1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:52.13 elk
and 8.7.10. Very strange.
Hi Ron,
Are you running on multiple nodes? If so then it could be a filesystem bottleneck.
This only happens when I try OEP! If I change:
xctype
-20
to
xctype
20
I get the expected performance:
/carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$ grep elk log.dat
1323034 rcohen 20 0 241676 26352 12400 S 0.0 0.0 0:00.00 elk
1323972 rcohen 20 0 241680 26204 12200 S 0.0 0.0 0:00.00 elk
1323972 rcohen 20 0 20.3g 448988 32644 R 2230 0.3 3:43.48 elk
1323972 rcohen 20 0 20.3g 500520 32648 R 2782 0.4 8:22.20 elk
1323972 rcohen 20 0 20.4g 551672 33500 R 2934 0.4 13:16.14 elk
1323972 rcohen 20 0 20.6g 657304 34308 R 382.5 0.5 13:54.50 elk
1323972 rcohen 20 0 20.6g 657304 34308 R 99.6 0.5 14:04.49 elk
1323972 rcohen 20 0 20.6g 657304 34308 R 99.7 0.5 14:14.48 elk
1324814 rcohen 20 0 20.3g 421764 33100 R 2230 0.3 3:43.48 elk
1324814 rcohen 20 0 20.3g 474996 33708 R 2757 0.4 8:20.33 elk
1324814 rcohen 20 0 20.4g 523088 33848 R 3092 0.4 13:30.19 elk
1324814 rcohen 20 0 20.4g 556728 33972 R 2609 0.4 17:51.85 elk
1324814 rcohen 20 0 20.4g 555020 33980 R 3182 0.4 23:11.36 elk
1324814 rcohen 20 0 20.4g 566024 33980 R 3011 0.4 28:13.36 elk
1324814 rcohen 20 0 20.4g 561176 33980 R 2786 0.4 32:52.56 elk
1324814 rcohen 20 0 20.4g 560004 33980 R 3186 0.4 38:12.16 elk
1324814 rcohen 20 0 20.4g 562452 33980 D 2613 0.4 42:34.25 elk
Ron
No, I am running this test on a single node.
I tried both ifx and ifort compilers.
It is going to exactly 100%--one thread--not a hair over.
Ron
It seems Intel will not use OpenMP parallelization for this loop in oepvcl.f90:
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd) SCHEDULE(DYNAMIC)
do ik=1,nkpt
! distribute among MPI processes
if (mod(ik-1,np_mpi) /= lp_mpi) cycle
!$OMP CRITICAL(oepvcl_)
write(*,'("Info(oepvcl): ",I6," of ",I6," k-points")') ik,nkpt
!$OMP END CRITICAL(oepvcl_)
call oepvclk(ik,vclcv(:,:,:,ik),vclvv(:,:,ik))
end do
!$OMP END PARALLEL DO
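If the directives were being ignored entirely, a standalone test would show it. A minimal sketch (my own, not part of Elk) that prints the team size from inside a parallel loop:

program check_omp
use omp_lib
implicit none
integer ik
!$OMP PARALLEL DO DEFAULT(SHARED) SCHEDULE(DYNAMIC)
do ik=1,8
!$OMP CRITICAL
write(*,'("iteration ",I2," on thread ",I3," of ",I3)') ik, &
 omp_get_thread_num(),omp_get_num_threads()
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
end program

Compiled with ifort -qopenmp (or ifx -qopenmp), this should report the full team size; if every iteration runs on thread 0 of 1, the !$OMP directives were compiled as plain comments.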
Scratch that--looking some more. It may be a compilation problem!
Ron
My apologies on this whole thread! The code is fine. It was the compilation that failed--a typo in my make.inc!
Ron
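For anyone landing here with the same symptom: with ifort/ifx, a misspelled OpenMP flag in make.inc typically produces only a command-line warning about an unknown option, after which all !$OMP directives are treated as ordinary comments and everything runs on a single thread. A hypothetical example, not Ron's actual typo:

F90_OPTS = -O3 -qopemp
# typo above; should be:
F90_OPTS = -O3 -qopenmp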
Try running on one node with

wrtdsk
.false.

I suspect that the filesystem is holding up everything.
I added this as you suggest, but it still does the same thing:
1328993 rcohen 20 0 245532 26640 12616 S 0.0 0.0 0:00.00 elk
1328993 rcohen 20 0 20.3g 435384 33416 R 2338 0.3 3:54.26 elk
1328993 rcohen 20 0 20.4g 505088 33416 R 3189 0.4 9:14.43 elk
1328993 rcohen 20 0 20.6g 603700 34152 R 1361 0.5 11:30.97 elk
1328993 rcohen 20 0 20.6g 603700 34152 R 99.7 0.5 11:40.96 elk
1328993 rcohen 20 0 20.6g 603700 34152 R 99.6 0.5 11:50.94 elk
1328993 rcohen 20 0 20.6g 603712 34164 R 99.5 0.5 12:00.92 elk
1328993 rcohen 20 0 20.6g 603712 34164 R 99.5 0.5 12:10.89 elk
With wrtdsk set to .false., it still drops to just one thread after a short time.
Ron