
memory management?

feng zimin
2015-10-01
2015-10-20
  • feng zimin

    feng zimin - 2015-10-01

    Hello developers,

    I've been trying this amazing package, but I found that during solvated surface iterations the required memory keeps growing... Most of the time it exhausts the cluster's memory and the job halts with an "out of memory" error. The system I chose has 18 atoms and a supercell of 10x5x28 bohr^3.

    Is this normal?

    I ran a Valgrind test but have not found any leaks yet.
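
    (For reference, a leak check of this kind can be run roughly as follows; the input and output file names below are just placeholders:)

    # Single-process leak check under Valgrind (slow; a few iterations are enough)
    valgrind --leak-check=full --log-file=valgrind-jdftx.log jdftx -i surf-solv.in -o valgrind-test.out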

     
  • Ravishankar Sundararaman

    Hi Feng,

    This sounds unusual: that is a relatively small system which should run fine on any decent workstation.

    Could you provide some more details, such as the input file, the jdftx SVN revision number, the configuration of the system you are running on, and the number of threads/processes you specify?

    Best,
    Shankar

     
  • feng zimin

    feng zimin - 2015-10-01

    Hello Shankar,

    Thanks for your quick reply. I think I have the latest version that was available on August 5, 2015. The cluster I'm using has 16 cores and 6 GB of memory per node. I used 4 nodes (64 cores in total), each utilizing 4 of its cores (so 16 cores are running the job), and it looks like each core started 4 threads (which perhaps brings the total number of threads back to 64).

    Like you said this is a relatively small system. But during the iterations like:

    ElecMinimize: Iter: 78 Etot: -274.224406272195097 |grad|K: 1.689e-06 alpha: 5.476e-02 linmin: 2.150e-03 cgtest: 4.822e-02
    Linear fluid (dielectric constant: 7.6) occupying 0.278034 of unit cell: Completed after 7 iterations.

    the memory used by jDFTx keeps growing.

    Here are the input files:

    MnO2 surface/vacuum:

    lattice \
        10.8368072510 0.0000000000 0.0000000000 \
        0.0000000000 5.0157899857 0.0000000000 \
        0.0000000000 0.0000000000 26.8700580597

    ion-species Mn.fhi
    ion-species O.fhi

    coords-type cartesian
    ion Mn 5.418403625 0.000000000 21.451654166 1
    ion Mn 0.000000000 0.000000000 0.000000000 0
    ion Mn 5.418403625 0.000000000 5.418403493 1
    ion Mn 0.000000000 2.507894993 21.451654166 1
    ion Mn 5.418403625 2.507894993 0.000000000 0
    ion Mn 0.000000000 2.507894993 5.418403493 1
    ion O 0.000000000 0.000000000 19.240462802 1
    ion O 5.418402657 0.000000000 24.658865094 1
    ion O 0.000000000 0.000000000 3.207209927 1
    ion O 0.000000000 0.000000000 23.662848733 1
    ion O 5.418403625 0.000000000 2.211193566 1
    ion O 0.000000000 0.000000000 7.629597660 1
    ion O 7.603240422 2.507894993 21.451654166 1
    ion O 2.184836958 2.507894993 0.000000000 0
    ion O 7.603240422 2.507894993 5.418403893 1
    ion O 3.233566829 2.507894993 21.451654166 1
    ion O 8.651970455 2.507894993 0.000000000 0
    ion O 3.233566829 2.507894993 5.418403893 1
    electronic-minimize \
        energyDiffThreshold 1e-07 \
        nEnergyDiff 10 \
        nIterations 2000
    ionic-minimize \
        nIterations 2000 \
        energyDiffThreshold 1e-6 \
        knormThreshold 1e-4   # threshold on RMS Cartesian force
    kpoint-folding 1 1 1 # 4x4x1 uniform k-mesh
    elec-ex-corr lda
    dump End State Ecomponents

    =================== MnO2 surface/vacuum above ====

    MnO2 surface/THF:


    include surf.in   # <--- this is the previous (vacuum) input file
    initial-state surf-vac.$var
    dump-name surf-solv.$var
    fluid LinearPCM
    fluid-solvent THF
    chargeball O 1 0.4
    ====================== MnO2 surface/THF above =====

     

    Last edit: feng zimin 2015-10-01
  • Ravishankar Sundararaman

    Hi Feng,

    This is indeed strange. I tried your input on my workstation without any MPI: the vacuum calculation's memory usage saturated below 110 MB, and the fluid calculation below 170 MB.

    Even with 4x4x1 k-points (9 reduced states), and switching to 4 MPI processes on a single node, the total memory consumption was ~ 1 GB for vacuum and ~ 1.2 GB for fluid.

    I'd recommend trying this calculation on a single node as well, and checking the memory with top. (To do this test quickly, you can reduce the number of iterations and not worry about convergence.) Use "jdftx -c 4 ..." with no mpirun to run with 4 threads and no MPI. You can do "mpirun -n 4 jdftx -c 1 ..." to run with four processes and one thread per process. (If you don't specify -c, jdftx will by default launch as many threads as needed to occupy the full node.)
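
    (Concretely, the single-node tests could look something like the following; the file names are placeholders and -i/-o are just the usual jdftx input/output options:)

    # 4 threads, no MPI
    jdftx -c 4 -i surf-solv.in -o test-threads.out

    # 4 MPI processes, 1 thread each
    mpirun -n 4 jdftx -c 1 -i surf-solv.in -o test-mpi.out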

    Also note that JDFTx currently does not parallelize plane waves or bands over MPI, so you should use at most as many MPI processes as there are reduced k-points in your calculation. That would be 1 in your 1x1x1 k-point configuration and 9 in your 4x4x1 case (check the nStates line in the initialization for the number of reduced k-points). Any processes beyond that would be wasted. Instead, use as many cores as you can via threads: if you were running 4x4x1, I would recommend "mpirun -n 4 jdftx -c 4 ..." to use 4 threads each on 4 nodes, i.e. 4 MPI processes instead of the 16 you presumably used initially.
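
    (As a sketch, a launch matched to the k-point count for a hypothetical 4x4x1 run on 4 nodes might look like this; placing one process per node is assumed to be handled by the job scheduler or a hostfile, and file names are placeholders:)

    # 4 MPI processes (at most nStates = 9 reduced k-points), 4 threads per process
    mpirun -n 4 jdftx -c 4 -i surf.in -o surf.out

    # For the 1x1x1 (Gamma-only) case: a single process using all cores of one node
    jdftx -c 16 -i surf.in -o surf.out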

    Hope that helps!

    Best,
    Shankar

     
  • feng zimin

    feng zimin - 2015-10-02

    Hi Shankar,
    Yes it helped a lot!
    But there is definitely something wrong with the parallelization: I tried it without MPI or threads (just one bare process), and this time the memory usage stayed minimal and did NOT grow during the SCF iterations... When I ran the parallel calculation, EACH of the parallel processes consumed 200+ times more memory than this single process without parallelization.
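
    (For anyone comparing these cases, per-process memory and thread counts can be watched during a run with something like the following; the process name jdftx is assumed:)

    # Print PID, thread count (NLWP), resident and virtual memory of each jdftx process every 5 s
    watch -n 5 'ps -C jdftx -o pid,nlwp,rss,vsz,cmd'
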
    Could there be something wrong with the compilation? May I know which MPI and compiler you used? Is there a version of jDFTx newer than the August 2015 one?
    Thanks a lot!
    Sincerely,
    fzm

     
  • Ravishankar Sundararaman

    Hi Feng,

    August 2015 should be recent enough; there haven't been any major updates since.

    It does sound like an MPI-specific memory issue. My local copy of JDFTx is compiled with gcc 4.8.2 and OpenMPI 1.6.5 on Ubuntu, but we have routinely run it with other compiler/MPI combinations without issue, especially on supercomputing clusters, e.g. Cray compiler wrappers (Intel/gcc) with MVAPICH2, the Intel compilers with MPICH, etc.

    What MPI and compiler do you have?

    Shankar

     
  • feng zimin

    feng zimin - 2015-10-06

    Hi Shankar,

    So it is now certain that the problem lies in the threads: with or without MPI, as long as I use more than one thread, the memory usage goes crazy.

    I used both gcc 4.8 and 4.9. The MPI I used is openmpi 1.6.3.

    Do you have any suggestions regarding the threads? Thanks!

    fzm

     
  • Ravishankar Sundararaman

    Hmm, this is rather strange. Can you please send me an example output file for a case with threads that went crazy?

    Thanks,
    Shankar

     
  • feng zimin

    feng zimin - 2015-10-06

    Yes, it's here. This one is without MPI but with 16 threads.

    The 'top' screen a few seconds before the crash looks like this. Note the 89.3% memory usage.


    top - 13:14:21 up 7 days, 33 min, 1 user, load average: 12.01, 7.15, 3.58
    Tasks: 375 total, 1 running, 374 sleeping, 0 stopped, 0 zombie
    Cpu(s): 17.3%us, 46.2%sy, 0.0%ni, 36.3%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 66068796k total, 65789588k used, 279208k free, 675536k buffers
    Swap: 32767992k total, 1536988k used, 31231004k free, 540412k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    826 xs8830 20 0 66.5g 56g 12m S 987.0 89.3 58:47.04 jdftx
    25854 xs8830 20 0 95696 1768 852 S 0.0 0.0 0:00.02 sshd
    25900 xs8830 20 0 107m 1872 1408 S 0.0 0.0 0:00.01 bash
    30584 xs8830 20 0 15284 1468 936 R 0.0 0.0 0:01.24 top
    64711 xs8830 20 0 105m 1432 1164 S 0.0 0.0 0:00.02 bash

    ============================

    And the final part of the output file reads:


    ElecMinimize: Iter: 31 Etot: -274.102426145761569 |grad|_K: 1.799e-06 alpha: 9.016e-01 linmin: -3.112e-07 cgtest: 1.663e-04
    ElecMinimize: Iter: 32 Etot: -274.102426998682176 |grad|_K: 1.699e-06 alpha: 7.502e-01 linmin: 5.582e-06 cgtest: 4.682e-05
    ElecMinimize: Iter: 33 Etot: -274.102427847729814 |grad|_K: 1.697e-06 alpha: 8.374e-01 linmin: 7.160e-06 cgtest: 8.522e-05
    ElecMinimize: Iter: 34 Etot: -274.102428606806825 |grad|_K: 1.565e-06 alpha: 7.503e-01 linmin: 5.307e-06 cgtest: 7.898e-05

    Stack trace:
    0: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z10printStackb+0x21) [0x2b36b649d501]
    1: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z14stackTraceExiti+0xd) [0x2b36b649d64d]
    2: /lib64/libc.so.6() [0x38b50326a0]
    3: /lib64/libc.so.6(gsignal+0x35) [0x38b5032625]
    4: /lib64/libc.so.6(abort+0x175) [0x38b5033e05]
    5: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d) [0x2b36ba7aeb6d]
    6: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0x8cbb6) [0x2b36ba7acbb6]
    7: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0x8cc01) [0x2b36ba7acc01]
    8: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0x8ce18) [0x2b36ba7ace18]
    9: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(_ZSt20__throw_system_errori+0x7f) [0x2b36ba7d9a5f]
    10: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(+0xbc063) [0x2b36ba7dc063]
    11: /gpfsFS1/scratch/nobackup/projets/gc029/opt/gcc-5.2.0/lib64/libstdc++.so.6(_ZNSt6thread15_M_start_threadESt10shared_ptrINS_10_Impl_baseEE+0x3d) [0x2b36ba7dc0ad]
    12: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z12threadLaunchIFvmmPFviiiPK7complexPS0_d7matrix3IdEPK7vector3IiES6_IdEdEmiS2_S3_dS5_PS7_SA_dEJSC_miS2_S3_dS5_SD_SA_dEEviPT_mDpT0_+0x3a1) [0x2b36b666d271]
    13: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z19precond_inv_kineticRK12ColumnBundled+0x1d8) [0x2b36b6659608]
    14: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN8ElecVars18orthonormalizeGradEiRK10diagMatrixRK12ColumnBundleRS3_dPS3_P6matrixS9_+0x2db) [0x2b36b664903b]
    15: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN8ElecVars17elecEnergyAndGradER8EnergiesP12ElecGradientS3_b+0x5f7) [0x2b36b664c567]
    16: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN13ElecMinimizer7computeEP12ElecGradient+0x45) [0x2b36b662d265]
    17: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN15MinimizePrivate10linminQuadI12ElecGradientEEbR11MinimizableIT_ERK14MinimizeParamsRKS3_dRdSB_RS3_+0x2fe) [0x2b36b663015e]
    18: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z12elecMinimizeR10Everything+0x3cf) [0x2b36b662dd2f]
    19: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_Z17elecFluidMinimizeR10Everything+0xcb) [0x2b36b662e8fb]
    20: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN14IonicMinimizer7computeEP13IonicGradient+0x46) [0x2b36b65b69c6]
    21: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN15MinimizePrivate16linminCubicWolfeI13IonicGradientEEbR11MinimizableIT_ERK14MinimizeParamsRKS3_dRdSB_RS3_+0xe0) [0x2b36b65b7460]
    22: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN11MinimizableI13IonicGradientE5lBFGSERK14MinimizeParams+0x9bb) [0x2b36b65b91ab]
    23: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN11MinimizableI13IonicGradientE8minimizeERK14MinimizeParams+0x6d1) [0x2b36b65b9e41]
    24: /gpfsFS1/scratch/nobackup/projets/gc029/cs6292/jdftx_openmpi/jdftx/libjdftx.so(_ZN14IonicMinimizer8minimizeERK14MinimizeParams+0xd) [0x2b36b65b6c9d]
    25: /gpfsFS1/scratch/nobackup/projets/gc029/opt/jdftx_mpi/bin/jdftx(main+0xc7c) [0x41cf7c]
    26: /lib64/libc.so.6(__libc_start_main+0xfd) [0x38b501ed5d]
    27: /gpfsFS1/scratch/nobackup/projets/gc029/opt/jdftx_mpi/bin/jdftx() [0x40ff89]
    Writing 'jdftx-stacktrace' (for use with script printStackTrace): done.

    --------------------------------------------------------------------------
    MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
    with errorcode 1.

    NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
    You may or may not see output from other processes, depending on
    exactly when Open MPI kills them.

    ============================

     
  • Ravishankar Sundararaman

    Hmm, I'm unable to reproduce the issue. I tried your input with one and with four threads (without MPI), and the memory consumption remains similar (~110 MB vacuum, ~170 MB fluid). Can you check whether there is a minimum number of threads at which this issue shows up?
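
    (One way to run such a sweep quickly, as a sketch with placeholder file names and GNU time for the peak-memory readout:)

    # Short runs at several thread counts; compare the peak resident memory afterwards
    for n in 1 2 4 8 16; do
        /usr/bin/time -v jdftx -c $n -i surf.in -o test-c$n.out 2> time-c$n.log
    done
    grep "Maximum resident set size" time-c*.log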

    Can you tell me more about your compilation? Are you using MKL or ATLAS for your LAPACK/BLAS? Are you linking to any non-standard thread libraries?

     
  • feng zimin

    feng zimin - 2015-10-08

    We have a computer technician here. I guess he compiled it with MKL.

    I was told that by linking to "standard" libraries only, the memory usage remained reasonable.
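
    (For reference, which BLAS/LAPACK and threading libraries a given build actually pulls in can be checked with ldd; the library path below is a placeholder:)

    # List the math/threading libraries a particular jdftx build is linked against
    ldd path/to/libjdftx.so | grep -iE 'mkl|blas|lapack|atlas|omp|pthread'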

    So this issue is definitely not related to jDFTx itself; we will keep working on it on our end. Thank you for all your help over the past few days.

     
  • Ravishankar Sundararaman

    Hi Feng,

    Thanks for narrowing it down!

    I have used JDFTx with MKL previously without this issue. I will check whether recent MKL updates create this issue for some reason. It would be great to know the exact MKL version on your cluster for comparison.

    Best,
    Shankar

     
  • feng zimin

    feng zimin - 2015-10-08

    Sure!
    As soon as I get any further news from our technician I will let you know.

     
  • feng zimin

    feng zimin - 2015-10-09

    Hi Shankar,
    The technician told me that the problematic version was compiled with the Intel 2015u3 compiler suite. Perhaps there should be a warning about that.

     
  • Ravishankar Sundararaman

    Hi Feng,

    I also encountered this bug recently on NERSC after their latest Cray software update. This seems to be related to the memory leaks suggested here:

    https://software.intel.com/en-us/node/528564

    I have updated the latest jdftx (svn revision 1204) to disable MKL's potentially problematic internal memory management. It seems to fix the issue for me, and it would be great to know if it does for you too!
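
    (For builds that cannot be updated right away, a possible interim workaround, based on Intel's documentation of the MKL memory manager and worth verifying against your MKL version, is to disable MKL's internal buffer pool via an environment variable in the job script; file names below are placeholders:)

    # Ask MKL not to use its internal fast memory manager (may cost a little performance)
    export MKL_DISABLE_FAST_MM=1
    # With Open MPI, -x forwards the variable to processes on remote nodes
    mpirun -n 4 -x MKL_DISABLE_FAST_MM jdftx -c 4 -i surf.in -o surf.out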

    Also, MKL could give you a 20-50% speedup depending on the relative time spent in BLAS3 operations, so it would be worthwhile to see whether you can get the updated JDFTx to run fine with MKL.

    Best,
    Shankar

     