On Tue, Oct 29, 2013 at 9:32 AM, Cody Permann <codypermann@gmail.com> wrote:

On Tue, Oct 29, 2013 at 5:54 AM, ernestol <ernestol@lncc.br> wrote:

> I am using a cluster with 23 nodes for a total of 184 cores, and each node
> has 16 GB of RAM. I was thinking that the problem may be in the code,
> because if I run on up to 3 processors I don't have any problems, but when
> I try with 4 or more I get this error.

So you have 8 cores per node, and 2 GB of RAM per core, which is pretty standard.

I ran your 200^3 code on my Mac workstation and watched the memory usage in Activity Monitor.

The results were somewhat surprising as I added cores:

1 core:  2.22 GB/core
2 cores: 4.0 GB/core
3 cores: slightly more than 4.0 GB/core
4 cores: machine went into swap (I think) after approaching about 3.5 GB/core, but the code eventually finished
5 cores: machine again went into swap at around 3.3 GB/core, but finished eventually

My workstation has 20 GB of RAM, so including the OS I can see how approaching 16 GB (4 or 5 cores at roughly 3.3-3.5 GB each) might push it into swap.

But what is happening when we go from 1 to 2 cores that causes the per-core memory usage to double?!

Note that in all cases the memory quickly jumps to about 2.22 GB/core.  In the 1-processor case it stays there, but in the 2-5 processor cases, after reaching ~2 GB/core, it slowly ramps up to the roughly 4 GB/core listed above.
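For reference, here is a rough way to do the same per-process memory watch from a terminal on the cluster, where Activity Monitor isn't available.  "my_app" is just a placeholder for the actual executable name, and RSS is reported in kilobytes by most ps implementations:

    # Sample the resident set size (RSS) of every running copy of the solver
    # once per second; "my_app" is a placeholder for the real executable name.
    while true; do
        ps -o pid=,rss=,comm= -p "$(pgrep -d, my_app)"
        echo "---"
        sleep 1
    done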

This, combined with the error message you received (which comes from Metis), leads me to believe that the partitioner is taking up a ton of memory (the partitioner doesn't run on 1 processor).  So the questions become:

1.) Is the partitioner taking up a lot more memory than it conceivably should?  (Seems like yes.)
2.) Is it taking up more than it used to?  I.e., has a bug been introduced recently?  (Metis and Parmetis were last updated in April 2013, so "recently" is quite plausible.)

I don't know whether reverting to a prior version of Metis/Parmetis is easily done at this point, but the relevant hashes where the refresh happened seem to be:

e80824e86a
1c4b6a0d12
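If someone wants to try the pre-refresh Metis/Parmetis, something along these lines should do it (assuming those are the refresh commits in the libmesh repository and that e80824e86a is the earlier of the two):

    # Check out the tree as it was just before the Metis/Parmetis refresh,
    # then reconfigure, rebuild, and rerun the 200^3 case to compare
    # per-core memory usage against the current head.
    git checkout -b pre-metis-refresh e80824e86a~1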

I may take a stab at this after lunch... Cody has been seeing similar issues recently as well.

-- 
John