On Mon, Nov 11, 2013 at 3:51 PM, Derek Gaston <friedmud@gmail.com> wrote:
I've mentioned this a few times in passing - but now I have concrete evidence: Outputting the solution takes a TON of memory... way more than it should.  This is something we've seen for a very long time - and it's one of the main reasons why many of our parallel jobs die...

Firstly, here's the graph:

This is a run with 2 systems in it - the first one has 40 variables and totals about 25 million DoFs... the second one is two variables and comes out to a little over 1 million DoFs.  This job is spread out across 160 MPI processes (and we're going to be looking at the aggregate memory across all of those).

Quick correction:  We are only looking at the memory of the rank 0 MPI process in this graph.  The memory profile of all the other ranks pretty much matches this one, though.


The two lines you're seeing are for the exact same run - but the green one is doing output (Exodus in this case, although the type doesn't matter) and the blue one has output completely turned off.  Thanks to our awesome memory logger, I can tell you that those huge green spikes are occurring in EquationSystems::build_solution_vector().

The problem in there is two-fold:

1.  System::update_global_solution() does a localization (to all processors!) of the entire solution vector!  That's a really terrible idea - especially since we're only going to access local entries in the solution vector in build_solution_vector()!  The normal current_local_solution should suffice - without any of this localization at all....

2.  The global solution vector in build_solution_vector() (which is called "soln" in the function) is number_of_nodes*number_of_variables entries long - AND it gets allocated on every processor... AND at the end we do an awesome all-to-all global sum of that guy... even though it's only going to get used on processor zero for serialized output formats....

When doing serialized output (like Exodus) that solution vector should only be allocated on processor 0.  Every other processor should have a much shorter vector that is num_local_nodes*num_vars long... and store another vector that is the mapping into the global one (or something along those lines).  Then, at the end, each processor should sum its entries into the correct positions on processor 0 (with some sort of Gather, I suspect).

When doing parallelized output (like Nemesis) that nodes*vars length vector should _never_ be allocated!  Instead, simply the pieces on each processor that are going to be output should be built and passed along.

Yes - right now we are building a global nodes*vars-length solution vector on every processor, even for our parallel output formats.

As a first cut we're definitely going to try removing the call to update_global_solution() and just use current_local_solution instead.  We'll report back with another memory graph of that tomorrow.

To do the rest we might need a bit of brainstorming - but if anyone is feeling like they want to get in there and fix this stuff - please do!


Libmesh-devel mailing list