Hello developers,
I've been trying this amazing package, but I found that during the solvated surface iterations the required memory keeps growing... Most of the time it exhausts the cluster's memory and the job halts with an "out of memory" error. The system I chose has 18 atoms in a supercell of about 10x5x28 bohr^3.
Is this normal?
I ran a Valgrind test but have not found any leaks yet.
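(For reference, a check of this kind looks roughly like the following; the flags and file names are only illustrative, and massif would be the better tool for steady heap growth as opposed to true leaks:)

valgrind --leak-check=full jdftx -c 1 -i surf.in -o surf-valgrind.out
valgrind --tool=massif jdftx -c 1 -i surf.in -o surf-massif.out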
Hi Feng,
This sounds unusual: that is a relatively small system which should run fine on any decent workstation.
Could you provide some more details, such as the input file, the jdftx svn revision number, the configuration of the system you are running on, and the number of threads/processes you specify?
Best,
Shankar
Hello Shankar,
Thanks for your quick reply. I think I have the latest version available as of August 5, 2015. The cluster I'm using has 16 cores and 6 GB of memory per node. I used 4 nodes (64 cores in total), each utilizing 4 of its cores (so 16 cores are running the job), and it looks like each of those cores started 4 threads (which perhaps brings the total number of threads back to 64).
As you said, this is a relatively small system. But during iterations like:
ElecMinimize: Iter: 78 Etot: -274.224406272195097 |grad|_K: 1.689e-06 alpha: 5.476e-02 linmin: 2.150e-03 cgtest: 4.822e-02
Linear fluid (dielectric constant: 7.6) occupying 0.278034 of unit cell: Completed after 7 iterations.
the memory used by jDFTx keeps growing.
Here are the input files:
MnO2 surface/vacuum: (the line breaks within the lattice vectors were added by hand in this post for readability)
lattice \
10.8368072510 0.0000000000 0.0000000000 \
0.0000000000 5.0157899857 0.0000000000 \
0.0000000000 0.0000000000 26.8700580597
ion-species Mn.fhi
ion-species O.fhi
coords-type cartesian
ion Mn 5.418403625 0.000000000 21.451654166 1
ion Mn 0.000000000 0.000000000 0.000000000 0
ion Mn 5.418403625 0.000000000 5.418403493 1
ion Mn 0.000000000 2.507894993 21.451654166 1
ion Mn 5.418403625 2.507894993 0.000000000 0
ion Mn 0.000000000 2.507894993 5.418403493 1
ion O 0.000000000 0.000000000 19.240462802 1
ion O 5.418402657 0.000000000 24.658865094 1
ion O 0.000000000 0.000000000 3.207209927 1
ion O 0.000000000 0.000000000 23.662848733 1
ion O 5.418403625 0.000000000 2.211193566 1
ion O 0.000000000 0.000000000 7.629597660 1
ion O 7.603240422 2.507894993 21.451654166 1
ion O 2.184836958 2.507894993 0.000000000 0
ion O 7.603240422 2.507894993 5.418403893 1
ion O 3.233566829 2.507894993 21.451654166 1
ion O 8.651970455 2.507894993 0.000000000 0
ion O 3.233566829 2.507894993 5.418403893 1
electronic-minimize \
    energyDiffThreshold 1e-07 \
    nEnergyDiff 10 \
    nIterations 2000
ionic-minimize \
    nIterations 2000 \
    energyDiffThreshold 1e-6 \
    knormThreshold 1e-4 #Threshold on RMS cartesian force
kpoint-folding 1 1 1 # 4x4x1 uniform k-mesh
elec-ex-corr lda
dump End State Ecomponents
=================== MnO2 surface/vacuum above ====
MnO2 surface/THF:
include surf.in <--- This is the previous file
initial-state surf-vac.$var
dump-name surf-solv.$var
fluid LinearPCM
fluid-solvent THF
chargeball O 1 0.4
====================== MnO2 surface/THF above =====
Last edit: feng zimin 2015-10-01
Hi Feng,
This is indeed strange. I tried your input on my workstation without any MPI: the vacuum calculation's memory usage saturated below 110 MB and the fluid calculation's below 170 MB.
Even with 4x4x1 k-points (9 reduced states), and switching to 4 MPI processes on a single node, the total memory consumption was ~ 1 GB for vacuum and ~ 1.2 GB for fluid.
I'd recommend trying this calculation on a single node as well and checking the memory with top. (For a quick test, you can reduce the number of iterations and not worry about convergence.) Use "jdftx -c 4 ..." with no mpirun to run with 4 threads and no MPI. Use "mpirun -n 4 jdftx -c 1 ..." to run with four processes and one thread per process. (If you don't specify -c, jdftx will by default launch as many threads as needed to occupy the full node.)
Also note that JDFTx currently does not parallelize plane waves or bands over MPI, so you should use at most as many MPI processes as there are reduced k-points in your calculation. That would be 1 in your 1x1x1 k-point configuration and 9 in the 4x4x1 case (check the nStates line in the initialization for the number of reduced k-points). Any processes beyond that would be wasted. Instead, use as many cores as you can via threads: if you were running 4x4x1, I would recommend "mpirun -n 4 jdftx -c 4 ..." to use 4 threads each on 4 nodes, but from 4 MPI processes instead of the 16 you presumably used initially.
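To spell those variants out, the launch lines would look roughly like this (the file names refer to the example input above and are only illustrative; adapt the paths and the mpirun invocation to your scheduler):

jdftx -c 4 -i surf.in -o surf-vac.out                 # 4 threads, no MPI: enough for 1x1x1 (1 reduced k-point)
mpirun -n 4 jdftx -c 1 -i surf.in -o surf-vac.out     # 4 MPI processes, 1 thread each: useful for isolating the MPI memory behavior
mpirun -n 4 jdftx -c 4 -i surf.in -o surf-vac.out     # 4 processes x 4 threads: only worthwhile with >= 4 reduced k-points (e.g. 4x4x1)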
Hope that helps!
Best,
Shankar
Hi Shankar,
Yes it helped a lot!
But there is definitely something wrong with the parallelization: I tried it without MPI or threads (a single bare process), and this time the memory usage stayed minimal and did NOT grow during the SCF iterations... When I ran the parallel calculation, EACH of the parallel processes consumed 200+ times more memory than this single process without parallelization.
Could there be something wrong with the compilation? May I ask what MPI/compiler you used? Is there a version of jDFTx newer than August 2015?
Thanks a lot!
Sincerely,
fzm
Hi Feng,
August 2015 should be recent enough; there haven't been any major updates since.
It does sound like an MPI-specific memory issue. My local copy of JDFTx is compiled with gcc 4.8.2 and OpenMPI 1.6.5 on Ubuntu, but we have routinely run it with other compiler/MPI combinations without issue, especially on supercomputing clusters, e.g. Cray compiler wrappers (Intel/gcc) with MVAPICH2, the Intel compilers with MPICH, etc.
What MPI and compiler do you have?
Shankar
Hi Shankar,
So it is now certain that the problem lies in the threads. With or without MPI, as long as I use more than one thread, the memory usage goes crazy.
I used both gcc 4.8 and 4.9. The MPI I used is openmpi 1.6.3.
Do you have any suggestions regarding the threads? Thanks!
fzm
Hmm, this is rather strange. Can you please send me an example output file for a case with threads that went crazy?
Thanks,
Shankar
Yes, it's here. This one is without MPI but with 16 threads.
The 'top' screen a few seconds before the crash looks like this. Note the 89.3% memory usage.
top - 13:14:21 up 7 days, 33 min, 1 user, load average: 12.01, 7.15, 3.58
Tasks: 375 total, 1 running, 374 sleeping, 0 stopped, 0 zombie
Cpu(s): 17.3%us, 46.2%sy, 0.0%ni, 36.3%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 66068796k total, 65789588k used, 279208k free, 675536k buffers
Swap: 32767992k total, 1536988k used, 31231004k free, 540412k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
826 xs8830 20 0 66.5g 56g 12m S 987.0 89.3 58:47.04 jdftx
25854 xs8830 20 0 95696 1768 852 S 0.0 0.0 0:00.02 sshd
25900 xs8830 20 0 107m 1872 1408 S 0.0 0.0 0:00.01 bash
30584 xs8830 20 0 15284 1468 936 R 0.0 0.0 0:01.24 top
64711 xs8830 20 0 105m 1432 1164 S 0.0 0.0 0:00.02 bash
============================
And the final part of the output file reads:
ElecMinimize: Iter: 31 Etot: -274.102426145761569 |grad|_K: 1.799e-06 alpha: 9.016e-01 linmin: -3.112e-07 cgtest: 1.663e-04
ElecMinimize: Iter: 32 Etot: -274.102426998682176 |grad|_K: 1.699e-06 alpha: 7.502e-01 linmin: 5.582e-06 cgtest: 4.682e-05
ElecMinimize: Iter: 33 Etot: -274.102427847729814 |grad|_K: 1.697e-06 alpha: 8.374e-01 linmin: 7.160e-06 cgtest: 8.522e-05
ElecMinimize: Iter: 34 Etot: -274.102428606806825 |grad|_K: 1.565e-06 alpha: 7.503e-01 linmin: 5.307e-06 cgtest: 7.898e-05
Stack trace:
0: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z10printStackb+0x21) [0x2b36b649d501]
1: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z14stackTraceExiti+0xd) [0x2b36b649d64d]
2: /lib64/libc.so.6() [0x38b50326a0]
3: /lib64/libc.so.6(gsignal+0x35) [0x38b5032625]
4: /lib64/libc.so.6(abort+0x175) [0x38b5033e05]
5: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(_ZN9gnu_cxx27verbose_terminate_handlerEv+0x15d) [0x2b36ba7aeb6d]
6: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0x8cbb6) [0x2b36ba7acbb6]
7: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0x8cc01) [0x2b36ba7acc01]
8: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0x8ce18) [0x2b36ba7ace18]
9: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(_ZSt20throw_system_errori+0x7f) [0x2b36ba7d9a5f]
10: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0xbc063) [0x2b36ba7dc063]
11: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(ZNSt6thread15_M_start_threadESt10shared_ptrINS_10_Impl_baseEE+0x3d) [0x2b36ba7dc0ad]
12: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z12threadLaunchIFvmmPFviiiPK7complexPS0_d7matrix3IdEPK7vector3IiES6_IdEdEmiS2_S3_dS5_PS7_SA_dEJSC_miS2_S3_dS5_SD_SA_dEEviPT_mDpT0+0x3a1) [0x2b36b666d271]
13: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(Z19precond_inv_kineticRK12ColumnBundled+0x1d8) [0x2b36b6659608]
14: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN8ElecVars18orthonormalizeGradEiRK10diagMatrixRK12ColumnBundleRS3_dPS3_P6matrixS9+0x2db) [0x2b36b664903b]
15: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(ZN8ElecVars17elecEnergyAndGradER8EnergiesP12ElecGradientS3_b+0x5f7) [0x2b36b664c567]
16: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN13ElecMinimizer7computeEP12ElecGradient+0x45) [0x2b36b662d265]
17: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN15MinimizePrivate10linminQuadI12ElecGradientEEbR11MinimizableIT_ERK14MinimizeParamsRKS3_dRdSB_RS3+0x2fe) [0x2b36b663015e]
18: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(Z12elecMinimizeR10Everything+0x3cf) [0x2b36b662dd2f]
19: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z17elecFluidMinimizeR10Everything+0xcb) [0x2b36b662e8fb]
20: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN14IonicMinimizer7computeEP13IonicGradient+0x46) [0x2b36b65b69c6]
21: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN15MinimizePrivate16linminCubicWolfeI13IonicGradientEEbR11MinimizableIT_ERK14MinimizeParamsRKS3_dRdSB_RS3+0xe0) [0x2b36b65b7460]
22: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN11MinimizableI13IonicGradientE5lBFGSERK14MinimizeParams+0x9bb) [0x2b36b65b91ab]
23: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN11MinimizableI13IonicGradientE8minimizeERK14MinimizeParams+0x6d1) [0x2b36b65b9e41]
24: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN14IonicMinimizer8minimizeERK14MinimizeParams+0xd) [0x2b36b65b6c9d]
25: /gpfsFS1/scratch/nobackup/projets/gc029/opt/jdftx_mpi/bin/jdftx(main+0xc7c) [0x41cf7c]
26: /lib64/libc.so.6(libc_start_main+0xfd) [0x38b501ed5d]
27: /gpfsFS1/scratch/nobackup/projets/gc029/opt/jdftx_mpi/bin/jdftx() [0x40ff89]
Writing 'jdftx-stacktrace' (for use with script printStackTrace):
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
done.
============================
Hmm, I'm unable to reproduce the issue. I tried your input with one and four threads (without MPI) and the memory consumption remained similar (~110 MB vacuum, ~170 MB fluid). Can you check whether there is a minimum number of threads at which this issue shows up?
Can you tell me more about your compilation? Are you using MKL or ATLAS for your LAPACK/BLAS? Are you linking to any non-standard thread libraries?
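One quick way to bracket the thread count would be something like the following, which records the peak memory for each run (a rough sketch: it assumes GNU time is available as /usr/bin/time, and that you first reduce nIterations in the input so each run finishes quickly):

for n in 1 2 4 8 16; do
    /usr/bin/time -v jdftx -c $n -i surf.in -o surf-c$n.out 2> time-c$n.log
done
grep 'Maximum resident set size' time-c*.log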
We have a computer technician here. I guess he compiled it with MKL.
I was told that when linking only to "standard" libraries, the memory usage remained reasonable.
So this issue is surely not related to jDFTx itself; we will keep working on it on our end. Thank you for all your help over the past days.
Hi Feng,
Thanks for narrowing it down!
I have used JDFTx with MKL previously without this issue. I will check if recent updates of MKL create this issue for some reason. It would be great to know the exact MKL version on your cluster to compare.
Best,
Shankar
Sure!
As soon as I get any further news from our technician I will let you know.
Hi Shankar,
The technician told me that the problematic version was compiled with the Intel 2015u3 compiler suite. Perhaps there should be a warning about that.
Hi Feng,
I also encountered this bug recently on NERSC after their latest Cray software update. This seems to be related to the memory leaks suggested here:
https://software.intel.com/en-us/node/528564
I have updated jdftx (svn revision 1204) to disable MKL's potentially problematic internal memory management. It seems to fix the issue for me, and it would be great to know if it does for you too!
Also, MKL could give you a 20-50% speedup depending on the relative time spent in BLAS3 ops, so it would be worthwhile to see if you can get the updated JDFTx to run fine with MKL.
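If you want to test with your existing (older) build before updating, my understanding is that MKL's memory manager can also be switched off from the environment, which should have a similar effect to the code change (assuming your MKL version supports this switch; the input/output names below are illustrative):

export MKL_DISABLE_FAST_MM=1                      # disable MKL's internal fast memory manager
mpirun -n 4 jdftx -c 4 -i surf.in -o test.out     # re-run whichever case was showing the growth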
Best,
Shankar