That graph is pretty awesome - thanks! I'm gonna have to digest it, but if I'm right in reading the steep ramp up to "(global_index_map end)" as map construction, I think there's some small room for improvement with a different data structure.
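Here's a rough sketch of the kind of thing I have in mind. It assumes global_index_map is a std::map from element IDs to global indices that gets built once and is only read afterwards. FlatIndexMap, id_type and entry are made-up names, not anything in the library. A sorted std::vector of (key, value) pairs keeps the entries in one contiguous block, while std::map allocates a separate tree node with parent/child pointers for every entry, so the flat version should need noticeably less memory for a big map:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical names: id_type, entry and FlatIndexMap are illustrative only.
using id_type = std::uint64_t;
using entry = std::pair<id_type, id_type>;  // (element ID, global index)

// Read-mostly replacement for std::map<id_type, id_type>: build once,
// then look keys up with binary search over contiguous storage.
class FlatIndexMap {
public:
  explicit FlatIndexMap(std::vector<entry> entries) : entries_(std::move(entries)) {
    std::sort(entries_.begin(), entries_.end(),
              [](const entry& a, const entry& b) { return a.first < b.first; });
    // Like std::map, keys are assumed to be unique.
    assert(std::adjacent_find(entries_.begin(), entries_.end(),
                              [](const entry& a, const entry& b) { return a.first == b.first; })
           == entries_.end());
  }

  // Pointer to the value stored for `key`, or nullptr if the key is absent.
  const id_type* find(id_type key) const {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), key,
                               [](const entry& e, id_type k) { return e.first < k; });
    return (it != entries_.end() && it->first == key) ? &it->second : nullptr;
  }

  std::size_t size() const { return entries_.size(); }

private:
  std::vector<entry> entries_;  // sorted by key; roughly sizeof(entry) bytes per entry
};
```

Lookups are O(log n) binary searches in both cases. The catch is that inserting into the flat version after construction costs O(n), so it only pays off if the map really is filled in one go.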
I've uploaded a slightly better memory usage graph for the 200^3 case:
(Don't read anything into the time axis of the graph: I've inserted artificial delays around print statements so the labels would be more legible.)
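If anyone wants to produce this kind of labeled graph themselves, a checkpoint print along these lines should be enough. It's a Linux-only sketch that reads the resident set size from /proc/self/statm. report_memory() and its pause argument are hypothetical helpers, and the pause plays the role of the artificial delays mentioned above. Run under MPI, each rank would print its own number, so you'd get per-processor figures directly:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <unistd.h>

// Resident set size in bytes from /proc/self/statm (Linux only).
// The second field is the number of resident pages. Returns 0 on failure.
std::size_t resident_bytes() {
  std::ifstream statm("/proc/self/statm");
  std::size_t total_pages = 0, resident_pages = 0;
  if (!(statm >> total_pages >> resident_pages)) return 0;
  return resident_pages * static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
}

// Hypothetical checkpoint: print a label and the current RSS, then
// optionally sleep so the label is easy to line up with a memory graph.
std::size_t report_memory(const std::string& label,
                          std::chrono::milliseconds pause = std::chrono::milliseconds(0)) {
  const std::size_t rss = resident_bytes();
  std::cerr << "[mem] " << label << ": " << rss / (1024 * 1024) << " MiB\n";
  if (pause.count() > 0) std::this_thread::sleep_for(pause);
  return rss;
}
```

A call like report_memory("300.a", std::chrono::milliseconds(500)) would then mark one of the labeled points below.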
Here's a description of the labeled points:
100: The start of MetisPartitioner::_do_partition()
200.a/b: Wraps the creation of the 'vwgt' and 'part' vectors. Note that the memory usage doesn't change here; this might be an OS-level optimization that doesn't actually allocate physical memory until you write to it.
300.a/b: Wraps the creation of the global_index_map object.
400.a/b: Wraps the creation of the 'xadj', 'adjncy', and 'graph' objects. vwgt is also written to during this time, so this is presumably where its memory actually gets allocated. Note that 'graph' has been deallocated by the time we reach 400.b.
450.a/b: Wraps just the creation of the 'xadj' and 'adjncy' vectors.
500.a/b: Wraps the actual call to Metis.
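The lazy-allocation idea for 200.a/b is easy to check in isolation. The sketch below is Linux-only, reuses the /proc/self/statm trick, and demo_lazy_allocation and PagingResult are made-up names. It mallocs a large buffer, records the RSS, then writes one byte per page and records it again. On Linux a large, untouched malloc'd block usually doesn't show up in the resident set until its pages are written, which would be consistent with the jump between 200 and 400:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <unistd.h>

// Resident set size in bytes from /proc/self/statm (Linux only).
std::size_t resident_bytes() {
  std::ifstream statm("/proc/self/statm");
  std::size_t total_pages = 0, resident_pages = 0;
  if (!(statm >> total_pages >> resident_pages)) return 0;
  return resident_pages * static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
}

struct PagingResult {
  std::size_t before = 0, after_alloc = 0, after_touch = 0;
};

// Hypothetical demo: allocate `bytes`, then fault the pages in one by one.
PagingResult demo_lazy_allocation(std::size_t bytes) {
  PagingResult r;
  r.before = resident_bytes();
  void* raw = std::malloc(bytes);
  if (raw == nullptr) return r;
  r.after_alloc = resident_bytes();  // usually close to `before`
  volatile char* buf = static_cast<volatile char*>(raw);  // volatile: keep the writes
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  assert(page > 0);
  for (std::size_t i = 0; i < bytes; i += page) buf[i] = 1;  // first write to each page
  r.after_touch = resident_bytes();
  std::free(raw);
  return r;
}
```

Comparing after_alloc with after_touch shows where the memory really gets committed.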
.) Keep in mind that these numbers are _total_ memory used on 2 processors, so the amount per processor is roughly half of what is shown.
.) The memory usage before and after the Metis call is definitely not equal, but you can't conclude it's a leak without verifying with valgrind; it could just be the OS (or the memory allocator) choosing not to release some freed memory...
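On that last point: with glibc in particular, free() often keeps memory inside the process's heap for reuse instead of handing it back to the kernel, so RSS can stay high with no leak at all. Here's a rough Linux/glibc-only sketch. demo_free_then_trim and TrimResult are hypothetical, while malloc_trim is a real glibc extension declared in <malloc.h>. It frees most of a pile of small allocations and then explicitly asks glibc to release what it can:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <vector>
#include <malloc.h>  // glibc-specific: declares malloc_trim
#include <unistd.h>

// Resident set size in bytes from /proc/self/statm (Linux only).
std::size_t resident_bytes() {
  std::ifstream statm("/proc/self/statm");
  std::size_t total_pages = 0, resident_pages = 0;
  if (!(statm >> total_pages >> resident_pages)) return 0;
  return resident_pages * static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
}

struct TrimResult {
  std::size_t after_free = 0, after_trim = 0;
  int trimmed = 0;  // malloc_trim's return value: 1 if memory was released
};

TrimResult demo_free_then_trim(std::size_t count, std::size_t block) {
  assert(block > 0);
  std::vector<void*> blocks;
  blocks.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    void* p = std::malloc(block);
    if (p == nullptr) break;
    volatile char* bytes = static_cast<volatile char*>(p);
    for (std::size_t j = 0; j < block; j += 64) bytes[j] = 1;  // make the pages resident
    blocks.push_back(p);
  }
  TrimResult r;
  if (blocks.empty()) return r;
  // Free everything except the last block. It usually sits near the top of
  // the heap, which keeps glibc from simply shrinking the heap on free().
  for (std::size_t i = 0; i + 1 < blocks.size(); ++i) std::free(blocks[i]);
  r.after_free = resident_bytes();
  r.trimmed = malloc_trim(0);  // ask glibc to return unused heap pages to the kernel
  r.after_trim = resident_bytes();
  std::free(blocks.back());
  return r;
}
```

The gap between after_free and after_trim gives a feel for how much freed memory can linger in the allocator, though the exact numbers depend on the glibc version and its tuning.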