
#74 Problem with large parallel run: increasing memory usage with increasing number of nodes

1.0
accepted
None
2022-08-08
2022-05-09
No

Hi all,

There is an unexpected behavior when running a (relatively small) simulation on a large number of nodes in a cluster. With an increasing number of compute nodes, the part of the grid solved by each node decreases, so my expectation was that the memory usage of each node should decrease as well. This is the case up to a limited number of nodes, but beyond a certain node count the memory usage increases substantially, which limits the usability of the code. The same behavior occurs with both OpenMPI and MPT (the HPE MPI implementation); I am reporting here the results with OpenMPI. As a comparison, I have run the same test case with OpenFOAM v2106, and there the memory usage follows the expected behavior.

The details of the test are:

Test case: lid-driven cavity 3D, 8 million grid elements, fixedIter variant https://develop.openfoam.com/committees/hpc/-/tree/cavity-updates/microbenchmarks/cavity-3d/8M/fixedIter
Solver: icoFoam
Compiler: GCC 9.2.0
MPI: OpenMPI 4.0.5
Compute cluster: Hawk supercomputer at HLRS https://www.hlrs.de/systems/hpe-apollo-hawk/
Nodes: 2x AMD EPYC 7742 processors (2 x 64 cores), 256 GB DDR4 RAM
Interconnect: InfiniBand HDR200

The figure below shows the mean memory used by each node, recorded with the shell command "free" at each time step. With OpenFOAM, the memory usage stabilizes around 16 GB from 16 nodes (2048 cores) up to 128 nodes (16384 cores), while with foam-extend the memory usage starts to increase rapidly at 16 and 32 nodes (2048 and 4096 cores). A run with foam-extend on 128 nodes crashed due to running out of memory.

I have also used valgrind to detect any memory leaks, and I have seen no difference between foam-extend and OpenFOAM.

Please contact me if I can help by running any other test case on a large system.

Best regards,

Flavio Galeazzo

1 Attachment

Discussion

  • Sergey Lesnik

    Sergey Lesnik - 2022-07-29

    Hi Flavio,

    The problem comes from the Pstream class while allocating the linear and tree communication lists (discovered using valgrind's massif tool). The lists have N entries, and each entry is of type commsStruct, which is itself roughly of size N, where N is the number of MPI ranks. Thus, these lists are by design of size N^2.

    The difference to the OpenFOAM version is that there the lists are not allocated at start-up: they are only sized to N, and each entry (commsStruct) is constructed (and therefore allocated) only when the overloaded operator[] is called on the list. With this lazy evaluation introduced in Pstream, the large lists no longer show up in massif's output. I also cleaned up some private members, which were needed only for the allocation of the communication lists.

    Please try out the attached patch on your large setups to be sure the bug is fixed. The patch is to be applied from $WM_PROJECT_DIR (tested on the ubuntu2004 branch, commit b42fb8a34696e21).
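
    For illustration, here is a minimal sketch of the lazy-evaluation idea described above. It is not the actual Pstream/patch code; the class lazyCommsList and the simplified commsStruct are assumptions made for this example only:

        // Sketch: the list is sized to N ranks up front, but each commsStruct
        // (itself ~N large) is only built on first access via operator[],
        // so memory stays O(N) until an entry is actually used.
        #include <memory>
        #include <vector>

        struct commsStruct
        {
            std::vector<int> below_;    // placeholder for the ~N-sized schedule data

            explicit commsStruct(int nProcs)
            :
                below_(nProcs, -1)
            {}
        };

        class lazyCommsList
        {
            int nProcs_;
            mutable std::vector<std::unique_ptr<commsStruct>> entries_;

        public:
            explicit lazyCommsList(int nProcs)
            :
                nProcs_(nProcs),
                entries_(nProcs)        // N null pointers, no commsStruct allocated yet
            {}

            // Construct the entry on demand instead of at start-up
            const commsStruct& operator[](int proci) const
            {
                if (!entries_[proci])
                {
                    entries_[proci] = std::make_unique<commsStruct>(nProcs_);
                }
                return *entries_[proci];
            }
        };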

     
  • Hrvoje Jasak

    Hrvoje Jasak - 2022-07-31

    Hi Guys,

    Sergey, thank you - excellent work. I have applied the patch on my machine: how can I test that everything is correct? Is it safe to push this into nextRelease and run with it for a while?

    Hrv

     
  • Hrvoje Jasak

    Hrvoje Jasak - 2022-07-31
    • status: open --> accepted
    • assigned_to: Hrvoje Jasak
     
  • Sergey Lesnik

    Sergey Lesnik - 2022-08-02

    Hi Hrvoje,

    To be 100% sure that the bug is fixed, we should wait for the results from Flavio's run with 128 nodes.

    For testing locally, you can run valgrind's massif with and without the patch and compare the allocated memory. In order to spot the difference, you'll need a decent number of ranks; I used 1024, which produces lists of 4 MB. I took the standard 2D cavity case with 1000x1000 cells. The command to run:
    mpirun -np 1024 --oversubscribe valgrind --tool=massif icoFoam -parallel

    If you get an error regarding opened pipe/descriptor limit, here is a solution:
    https://superuser.com/questions/1200539/cannot-increase-open-file-limit-past-4096-ubuntu/1200818#1200818

    After the run, visualize one of the massif.out files written by valgrind with the massif-visualizer tool. Without the patch, you'll find the two bottom entries from the attached screenshot. With the patch applied, these are absent and the total peak memory per rank is lower by 8 MB.

    It should be safe to push it to nextRelease. I only deleted private members, and access to the communication lists should always go via operator[], which is now overloaded.

    Sergey

     
  • Flavio Galeazzo

    Flavio Galeazzo - 2022-08-02

    Hi guys,

    I have applied the patch and prepared the large runs on the Hawk supercomputer. These large runs always take a while, as they really push the inode limit of the storage system. I should have the results in a couple of days.

    Flavio

     
  • Flavio Galeazzo

    Flavio Galeazzo - 2022-08-08

    Hi guys,

    Currently I can run foam-extend with up to 32 nodes (4096 cores) on the Hawk supercomputer due to an inode limit in the file system. My tests using 32 nodes show that the patch significantly decreased the memory usage, from 38.5 GB to 16.1 GB. The results are summarized in the attached figure, comparing foam-extend-4.1 with and without the patch and OpenFOAM v2106, all using the MPT (HPE) MPI library. It seems that we will be able to perform larger runs with foam-extend with the patch. Thank you Sergey for this!

    Flavio

     
