Hi,
my users are reporting a performance regression when switching from v8.8.26 to v9.1.15 or v9.2.12. The issue becomes especially apparent when the number of k-points is equal to the number of MPI tasks.
All versions were compiled with the Intel classic compiler (v2023.2.1) and MKL, using the following make.inc:
All Elk versions were compiled with the same make.inc, compiler and libraries. The nodes are dual AMD Epyc 7713 machines, which were idle during these benchmarks.
Test case (based on B12 from examples/basic):
elk.in:
B.in:
Slurm parameters:
This example has 8 k-points, and when this is run on 64 cores (8 MPI tasks, 8 OpenMP threads each) the runtime is as follows:
Our Elk Lmod module sets essentially what elk.sh would set, and additionally sets OMP_NUM_THREADS according to the chosen Slurm variable:
When observing the process via htop, the following can be noticed:
* v8.8.26: CPU utilization per MPI task is around 800% (all 8 OpenMP threads busy), except when the cycle finishes
* v9.1.15: CPU utilization reaches 800% only briefly, then stays at no more than 200% (2 threads) for the majority of each cycle
* v9.2.12: same as v9.1.15
Can anybody reproduce this and/or explain the difference?
Hi Andreas,
Thanks for discovering this! It was a fairly obscure bug: the threads used for calculating the linearisation energies were not being freed up, resulting in too few threads being available for MKL to use for the diagonalisation.
You can fix the problem quite easily by adding the line
call freethd(nthd)
at the end of linengy.f90. Then version 9.2.12 will be at least as fast as 8.8.26.
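For context, here is a minimal sketch of the pattern involved, assuming linengy.f90 follows Elk's usual holdthd/freethd pairing; the loop bounds and body are placeholders for illustration, not copied from the actual source:

! reserve up to nthd OpenMP threads for this section
call holdthd(natmtot,nthd)
!$OMP PARALLEL DO DEFAULT(SHARED) NUM_THREADS(nthd)
do ias=1,natmtot
! ... determine the linearisation energies for atom ias ...
end do
!$OMP END PARALLEL DO
! the missing call: return the reserved threads to the pool,
! so that MKL can use them later for the diagonalisation
call freethd(nthd)

Without the final freethd, each cycle leaves the thread pool depleted, which matches the drop from 800% to 200% CPU seen in htop.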
I'll release a fixed version with some additional optimisations next week.
Thanks and regards,
Kay.
Hi Kay,
thanks for the quick fix, which I could successfully verify.
Best,
Andreas
Hi Andreas,
Elk version 9.4.2 has been released with the fix.
Thanks and regards,
Kay.
I am having a similar problem to this, with the number of threads in use dropping to 1 per process after a short time. I am running 9.5.1 though, so it is not this exact bug, but perhaps something similar elsewhere. The problem occurs during the very first k-point. Here is my input:
Actually I have the same problem with Elk 9.4.2. Here is a log showing CPU usage on an exclusive node:
/carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$ grep elk log.dat
1382219 rcohen 20 0 245536 26620 12600 S 0.0 0.0 0:00.02 elk
1382219 rcohen 20 0 20.5g 547880 35964 R 2307 0.2 3:51.64 elk
1382219 rcohen 20 0 20.6g 620884 36120 R 3183 0.2 9:10.89 elk
1382219 rcohen 20 0 20.8g 721564 37108 R 1701 0.3 12:01.70 elk
1382219 rcohen 20 0 20.8g 721564 37108 R 99.7 0.3 12:11.71 elk
1382219 rcohen 20 0 20.8g 721564 37108 R 99.6 0.3 12:21.71 elk
1382219 rcohen 20 0 20.8g 721580 37124 R 99.6 0.3 12:31.71 elk
1382219 rcohen 20 0 20.8g 721580 37124 R 99.7 0.3 12:41.72 elk
1382219 rcohen 20 0 20.8g 721580 37124 R 99.6 0.3 12:51.72 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:01.72 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 13:11.73 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:21.73 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:31.73 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 13:41.74 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.6 0.3 13:51.74 elk
1382219 rcohen 20 0 20.8g 719512 37128 R 99.7 0.3 14:01.75 elk
1382219 rcohen 20 0 20.8g 747364 37128 R 99.6 0.3 14:11.75 elk
/carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$
It very quickly goes to just a single thread.
Ron
Actually I have this same problem with 8.8.26:
grep elk log.dat
1318993 rcohen 20 0 245608 28360 12288 S 0.0 0.0 0:00.00 elk
1318993 rcohen 20 0 20.3g 440452 32700 R 2341 0.3 3:54.60 elk
1318993 rcohen 20 0 20.4g 507720 32776 R 3187 0.4 9:13.97 elk
1318993 rcohen 20 0 20.4g 557324 32776 R 2774 0.4 13:52.18 elk
1318993 rcohen 20 0 20.6g 635700 33488 R 99.7 0.5 14:02.17 elk
1318993 rcohen 20 0 20.6g 635700 33488 R 99.8 0.5 14:12.17 elk
1318993 rcohen 20 0 20.6g 635700 33488 R 99.7 0.5 14:22.16 elk
1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:32.15 elk
1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:42.14 elk
1318993 rcohen 20 0 20.6g 635716 33504 R 99.7 0.5 14:52.13 elk
and 8.7.10. Very strange.
Hi Ron,
Are you running on multiple nodes? If so then it could be a filesystem bottleneck.
This only happens when I try OEP! If I change:
xctype
-20
to
xctype
20
I get the expected performance:
/carnegie/nobackup/users/rcohen/ELK/LuH2/Fluorite/OEP/LuH3/chgexs0.5$ grep elk log.dat
1323034 rcohen 20 0 241676 26352 12400 S 0.0 0.0 0:00.00 elk
1323972 rcohen 20 0 241680 26204 12200 S 0.0 0.0 0:00.00 elk
1323972 rcohen 20 0 20.3g 448988 32644 R 2230 0.3 3:43.48 elk
1323972 rcohen 20 0 20.3g 500520 32648 R 2782 0.4 8:22.20 elk
1323972 rcohen 20 0 20.4g 551672 33500 R 2934 0.4 13:16.14 elk
1323972 rcohen 20 0 20.6g 657304 34308 R 382.5 0.5 13:54.50 elk
1323972 rcohen 20 0 20.6g 657304 34308 R 99.6 0.5 14:04.49 elk
1323972 rcohen 20 0 20.6g 657304 34308 R 99.7 0.5 14:14.48 elk
1324814 rcohen 20 0 20.3g 421764 33100 R 2230 0.3 3:43.48 elk
1324814 rcohen 20 0 20.3g 474996 33708 R 2757 0.4 8:20.33 elk
1324814 rcohen 20 0 20.4g 523088 33848 R 3092 0.4 13:30.19 elk
1324814 rcohen 20 0 20.4g 556728 33972 R 2609 0.4 17:51.85 elk
1324814 rcohen 20 0 20.4g 555020 33980 R 3182 0.4 23:11.36 elk
1324814 rcohen 20 0 20.4g 566024 33980 R 3011 0.4 28:13.36 elk
1324814 rcohen 20 0 20.4g 561176 33980 R 2786 0.4 32:52.56 elk
1324814 rcohen 20 0 20.4g 560004 33980 R 3186 0.4 38:12.16 elk
1324814 rcohen 20 0 20.4g 562452 33980 D 2613 0.4 42:34.25 elk
Ron
No, I am running this test on a single node.
I tried both ifx and ifort compilers.
It is going to exactly 100%--one thread--not a hair over.
Ron
It seems Intel will not use OpenMP parallelization for this loop in oepvcl.f90:
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP NUM_THREADS(nthd) SCHEDULE(DYNAMIC)
do ik=1,nkpt
! distribute among MPI processes
if (mod(ik-1,np_mpi) /= lp_mpi) cycle
!$OMP CRITICAL(oepvcl_)
write(*,'("Info(oepvcl): ",I6," of ",I6," k-points")') ik,nkpt
!$OMP END CRITICAL(oepvcl_)
call oepvclk(ik,vclcv(:,:,:,ik),vclvv(:,:,ik))
end do
!$OMP END PARALLEL DO
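If the directives were being ignored entirely, a standalone test would show it. A minimal sketch (my own, not part of Elk) that prints the team size from inside a parallel loop:

program check_omp
use omp_lib
implicit none
integer ik
!$OMP PARALLEL DO DEFAULT(SHARED) SCHEDULE(DYNAMIC)
do ik=1,8
!$OMP CRITICAL
write(*,'("iteration ",I2," on thread ",I3," of ",I3)') ik, &
 omp_get_thread_num(),omp_get_num_threads()
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
end program

Compiled with ifort -qopenmp (or ifx -qopenmp), this should report the full team size; if every iteration runs on thread 0 of 1, the !$OMP directives were compiled as plain comments.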
Scratch that--looking some more. It may be a compilation problem!
Ron
My apologies on this whole thread! The code is fine. It was the compilation that failed--a typo in my make.inc!
Ron
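For anyone landing here with the same symptom: with ifort/ifx, a misspelled OpenMP flag in make.inc typically produces only a command-line warning about an unknown option, after which all !$OMP directives are treated as ordinary comments and everything runs on a single thread. A hypothetical example, not Ron's actual typo:

F90_OPTS = -O3 -qopemp
# typo above; should be:
F90_OPTS = -O3 -qopenmp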
Try running on one node with

wrtdsk
.false.

I suspect that the filesystem is holding up everything.
I added this as you suggest, but it still does the same thing:
1328993 rcohen 20 0 245532 26640 12616 S 0.0 0.0 0:00.00 elk
1328993 rcohen 20 0 20.3g 435384 33416 R 2338 0.3 3:54.26 elk
1328993 rcohen 20 0 20.4g 505088 33416 R 3189 0.4 9:14.43 elk
1328993 rcohen 20 0 20.6g 603700 34152 R 1361 0.5 11:30.97 elk
1328993 rcohen 20 0 20.6g 603700 34152 R 99.7 0.5 11:40.96 elk
1328993 rcohen 20 0 20.6g 603700 34152 R 99.6 0.5 11:50.94 elk
1328993 rcohen 20 0 20.6g 603712 34164 R 99.5 0.5 12:00.92 elk
1328993 rcohen 20 0 20.6g 603712 34164 R 99.5 0.5 12:10.89 elk
With wrtdsk set to .false., it still drops to just one thread after a short time.
Ron