From: John P. <jwp...@gm...> - 2013-10-29 17:19:45
On Tue, Oct 29, 2013 at 9:32 AM, Cody Permann <cod...@gm...> wrote:
> On Tue, Oct 29, 2013 at 5:54 AM, ernestol <ern...@ln...> wrote:
>
>> I am using a cluster with 23 nodes for a total of 184 cores, and each node
>> additionally has 16GB of RAM. I was thinking that the problem may be in the
>> code, because if I run on up to 3 processors I don't have any problems, but
>> when I try with 4 or more I get this problem.

So you have 8 cores per node and 2 GB of RAM per core, which is pretty standard.

I ran your 200^3 code on my Mac workstation and watched the memory usage in Activity Monitor. The results were somewhat surprising as I added cores:

1 core: 2.22 Gb/core
2 cores: 4.0 Gb/core
3 cores: slightly more than 4.0 Gb/core
4 cores: machine went into swap (I think) after approaching about 3.5 Gb/core, but the code eventually finished
5 cores: machine again went into swap at around 3.3 Gb/core but finished eventually

My workstation has 20 Gb of RAM, so including the OS I can see how approaching 16 Gb might push it into swap.

But what is happening when we go from 1 to 2 cores that causes the memory usage per core to double?!

Note that in all cases the memory quickly jumps to about 2.22 Gb/core. In the 1-processor case it stays there, but in the 2-5 processor cases, after reaching 2 Gb/core, it slowly ramps up to the approximately 4 Gb/core listed above.

This, combined with the error message you received (which comes from Metis), leads me to believe that the partitioner is taking up a ton of memory (the partitioner doesn't run on 1 proc). So the questions become:

1.) Is the partitioner taking up a lot more memory than it conceivably should? (Seems like yes.)
2.) Is it taking up more than it used to? I.e., has a bug been introduced recently? (Metis and Parmetis were last updated in April 2013, so pretty recently, actually.)

I don't know whether reverting to a prior version of Metis/Parmetis is easily done at this point, but the relevant hashes where the refresh happened seem to be:

e80824e86a
1c4b6a0d12

I may take a stab at this after lunch... Cody has been seeing similar issues recently as well.

--
John
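(For scale, and assuming the 200^3 case is a HEX8 grid from build_cube: that is 8,000,000 elements and roughly 8.1 million nodes, and with SerialMesh every processor stores a complete copy of the mesh, so the ~2.22 Gb baseline above is replicated on each core before the partitioner even runs.)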
From: John P. <jwp...@gm...> - 2013-10-29 18:31:37
On Tue, Oct 29, 2013 at 11:19 AM, John Peterson <jwp...@gm...> wrote:
> This, combined with the error message you received (which comes from Metis), leads me
> to believe that the partitioner is taking up a ton of memory (the partitioner doesn't
> run on 1 proc). [...] I may take a stab at this after lunch...

I confirmed that changing the partitioner does seem to reduce the overall memory usage appreciably.

LinearPartitioner
1 core: 2.22 Gb/core
2 cores: about 2.7 Gb/core peak
3 cores: same as 2 cores
4 cores: about 2.6 Gb/core

CentroidPartitioner
1 core: 2.22
2 cores: about 3 Gb/core peak
4 cores: about 2.8 Gb/core peak

SFCPartitioner
1 core: 2.22
2 cores: slightly > 3 Gb/core peak
4 cores: almost exactly the same Gb/core as the 2-core case

Activity Monitor does not provide a huge amount of accuracy, but I think the trends are about the same for the Linear, Centroid, and SFC partitioners, and they make a lot more sense than the Metis results. In particular, I was able to run on 4 cores without going into swap.

--
John
From: John P. <jwp...@gm...> - 2013-10-29 19:58:33
On Tue, Oct 29, 2013 at 1:38 PM, ernestol <ern...@ln...> wrote:
> Thanks for the answers.
>
> So which of the three partitioners do you recommend, and how can I change it?

I wouldn't say any of them is actually "recommended" for production code, but you can certainly try them by first including the relevant headers:

#include "libmesh/linear_partitioner.h"
#include "libmesh/centroid_partitioner.h"
#include "libmesh/sfc_partitioner.h"

and then picking one of them _before_ calling build_cube:

Mesh mesh;

// Choose a non-default partitioner
// mesh.partitioner().reset(new LinearPartitioner);
// mesh.partitioner().reset(new CentroidPartitioner);
mesh.partitioner().reset(new SFCPartitioner);

--
John
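(Putting those pieces together, a complete minimal driver might look like the sketch below. The 200^3 size, unit-cube bounds, and HEX8 element type are just illustrative, and depending on your libMesh version the Mesh constructor may want an explicit communicator; the point is only that the partitioner is swapped out before build_cube(), which partitions the mesh as it finishes.)

#include "libmesh/libmesh.h"
#include "libmesh/mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/enum_elem_type.h"
#include "libmesh/sfc_partitioner.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);

  Mesh mesh;

  // Replace the default partitioner before the mesh is generated.
  mesh.partitioner().reset(new SFCPartitioner);

  // build_cube() calls prepare_for_use(), which partitions the mesh,
  // so the partitioner must already be set at this point.
  MeshTools::Generation::build_cube (mesh, 200, 200, 200,
                                     0., 1., 0., 1., 0., 1.,
                                     HEX8);

  return 0;
}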
From: John P. <jwp...@gm...> - 2013-10-29 20:48:53
On Tue, Oct 29, 2013 at 12:31 PM, John Peterson <jwp...@gm...> wrote:
> I confirmed that changing the partitioner does seem to reduce the overall memory
> usage appreciably.

I just checked out the hash immediately prior to the latest Metis/Parmetis refresh (git co 5771c42933), ran the same tests again, and got basically the same results on the 200^3 case.

So I don't think the Metis/Parmetis refresh introduced any new memory bugs...

Just for the hell of it, I also tried some other problem sizes, and in going from 1 core to 2 cores (Metis off to Metis on) the memory usage per core always increases (to within the accuracy of Activity Monitor) by a factor between 1.5 and 1.9:

100^3: 300 -> 500 Mb/core (1.67X)
150^3: 975 -> 1700 Mb/core (1.75X)
175^3: 1.5 -> 2.8 Gb/core (1.87X)
200^3: 2.22 -> 4 Gb/core (1.80X)
225^3: 3.15 -> 4.75 Gb/core (1.5X)

So I guess it's possible that Metis has always been like this, but we just haven't noticed it because we don't run problems this big (with SerialMesh) very often? Also, the memory usage does go back down after the partitioning step is complete, so as long as you can survive the memory spike, you can still run an actual problem...

We have a more fine-grained memory checker tool here that I'm going to try in a bit, and I'm also going to try the same tests with ParallelMesh/Parmetis.

Ben, it looks like we currently base our partitioning algorithm choice solely on the number of partitions... Do you recall if PartGraphKway is any more memory efficient than the PartGraphRecursive algorithm? If so, perhaps we could base our algorithm choice on the size of the mesh requested as well as the number of partitions... I might experiment with this a bit as well.

--
John
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-29 21:17:08
If forced to choose one of those options for a cube, though, I'd suggest the SFC option.

-Ben
From: John P. <jwp...@gm...> - 2013-10-29 22:09:04
On Tue, Oct 29, 2013 at 2:48 PM, John Peterson <jwp...@gm...> wrote:
> Just for the hell of it, I also tried some other problem sizes, and in going from 1
> core to 2 cores (Metis off to Metis on) the memory usage per core always increases
> (to within the accuracy of Activity Monitor) by a factor between 1.5 and 1.9 [...]

Using a more accurate memory logger evened these numbers out quite a bit. It is a nearly universal 1.9X increase in peak memory per core to use Metis:

150^3
-----
3864660 / 1001284 / 2 = 1.9298 (per core)

175^3
-----
6118764 / 1570976 / 2 = 1.9474 (per core)

200^3
-----
9070180 / 2333592 / 2 = 1.9434 (per core)

(The numbers are the _total_ peak memory for 2 procs, then the peak memory for 1 proc; dividing by 2 gives the per-core ratio.)

> We have a more fine-grained memory checker tool here that I'm going to try in a bit,
> and I'm also going to try the same tests with ParallelMesh/Parmetis.

The numbers are a bit better when using ParallelMesh with Parmetis (rather than Metis) as the partitioner, but not great: peak memory per core increases by about 1.4X when using the partitioner.

150^3
-----
4147908 / 1433204 / 2 = 1.4470 (per core)

175^3
-----
6483244 / 2258816 / 2 = 1.4350 (per core)

200^3
-----
9783764 / 3356264 / 2 = 1.4575 (per core)

So, in summary: if you use Metis/Parmetis, don't assume that because the Mesh alone takes up 2 gigs on 1 processor you can safely run the same problem in, say, 8 gigs on 4 procs. In reality, you are looking at about 1.9 * 2 gigs/proc * 4 procs = 15.2 gigs for SerialMesh, or 1.45 * 2 * 4 = 11.6 gigs for ParallelMesh...

> Ben, it looks like we currently base our partitioning algorithm choice solely on the
> number of partitions... Do you recall if PartGraphKway is any more memory efficient
> than the PartGraphRecursive algorithm? If so, perhaps we could base our algorithm
> choice on the size of the mesh requested as well as the number of partitions...

Testing the PartGraphKway algorithm now, will report back with results...

--
John
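(The "more accurate memory logger" above is an in-house tool, but for anyone wanting to reproduce this kind of measurement, a rough stand-in is to sample peak resident set size with getrusage(); this is just a sketch, not the tool that produced the numbers above.)

#include <sys/resource.h>
#include <cstdio>

// Peak resident set size of the calling process.  Note the units differ
// by platform: kilobytes on Linux, bytes on OS X.
long peak_rss()
{
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  return usage.ru_maxrss;
}

int main()
{
  // ... build and partition the mesh here ...
  std::printf("peak RSS so far: %ld\n", peak_rss());
  return 0;
}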
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-29 22:22:52
Thanks, John. I've had limited access today but hopefully can get caught up and contribute something tonight/tomorrow.

-Ben
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-30 14:54:33
On Oct 29, 2013, at 5:08 PM, John Peterson <jwp...@gm...> wrote:
> Ben, it looks like we currently base our partitioning algorithm choice solely on the
> number of partitions... Do you recall if PartGraphKway is any more memory efficient
> than the PartGraphRecursive algorithm? If so, perhaps we could base our algorithm
> choice on the size of the mesh requested as well as the number of partitions...

IIRC that's just a guideline provided from the Metis manual - that there is a tradeoff in algorithm performance based on the number of partitions requested. Looks like your experimentation confirms there is no major memory benefit.

-Ben
From: John P. <jwp...@gm...> - 2013-10-29 23:01:58
On Tue, Oct 29, 2013 at 4:08 PM, John Peterson <jwp...@gm...> wrote:
> Testing the PartGraphKway algorithm now, will report back with results...

Memory usage for PartGraphKway is basically identical to that for PartGraphRecursive.

I guess it makes sense if we are allocating most of the memory up front ourselves and not much is actually being allocated by Metis itself...

--
John
From: Derek G. <fri...@gm...> - 2013-10-29 23:14:30
Just to add something to this - we've been seeing a memory "leak" associated with Metis during adaptive simulations in parallel... every time it repartitions, it doesn't seem like we get all the memory back. I don't remember if we ran that through valgrind yet or not. It may not actually "leak", but it might accumulate over time...

Derek
From: Cody P. <cod...@gm...> - 2013-10-30 14:43:01
On Tue, Oct 29, 2013 at 3:17 PM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:
> If forced to choose one of those options for a cube, though, I'd suggest the SFC
> option.

Thanks Ben! I wasn't even aware of that Partitioner. I just tried it on my very large 3D cube domain simulation and it's giving me a 5% boost in performance over linear with no other changes. I'm running on 120 processors across 60 nodes + threading (using tons of memory). I guess the communication pattern really makes that much difference. Also, that's a low estimate: I have an expensive postprocessor that runs at the end of the timestep that's being added into the timestep timer, so the actual solve performance boost might be closer to 10%!

Cody
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-30 14:56:53
> Thanks Ben! I wasn't even aware of that Partitioner. I just tried it on my very
> large 3D cube domain simulation and it's giving me a 5% boost in performance over
> linear with no other changes.

Excellent - that's an old space-filling curve partitioner from a Carter Edwards class project. It has Hilbert and Morton ordering, but I believe Hilbert is the default. For general meshes I'd expect a graph partitioner to be a better default, but for cubes and sensible numbers of processors the Hilbert space-filling curve could be faster.

-Ben
From: Cody P. <cod...@gm...> - 2013-10-30 15:03:12
Very interesting; we are all about making sensible default choices for our users. We might make MOOSE default to sfc_hilbert for meshes built with the internal generator, at least until we figure out how to make Metis/Parmetis work better. I can't even come close to getting this problem to run on this many processors when using Metis right now - I run out of memory at about 1/8th this number of cores... More investigation will be necessary.

Cody
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-30 15:08:54
On Oct 30, 2013, at 10:03 AM, Cody Permann <cod...@gm...> wrote:
> We might make MOOSE default to sfc_hilbert for meshes built with the internal
> generator, at least until we figure out how to make Metis/Parmetis work better.

Definitely. I wonder if having build_cube() set the SFC partitioner might be a good way to go.

Let me know how things compare with an AMR problem, especially with ParallelMesh if you can - Parmetis has a diffusion-based repartitioning scheme that seeks to minimize data movement, which could be better than just repartitioning the whole mesh regardless of its initial distribution (which is what SFC would do for adaptive repartitioning). Of course, not allocating that much memory might make it a net speed win!

-Ben
From: Cody P. <cod...@gm...> - 2013-10-30 15:14:45
On Wed, Oct 30, 2013 at 9:08 AM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:
> Let me know how things compare with an AMR problem, especially with ParallelMesh if
> you can - Parmetis has a diffusion-based repartitioning scheme that seeks to minimize
> data movement...

I confirmed all this before with my 2D simulations. The Parmetis partitioner did an excellent job of minimizing movement. Unfortunately, I can't get Parmetis to scale to this mesh size with this number of processors (80x80x80 with 120 procs). It's likely the same Metis problem that John's seeing. I'm excited to try AMR; I'll keep you posted.

Cody
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-30 20:42:05
Back to the subject issue of Metis memory usage -

https://github.com/libMesh/libmesh/blob/master/src/partitioning/metis_partitioner.C#L105

We build a std::map<> from Elem* to a unique sorted contiguous ID, as Metis only considers the active elements and needs some contiguous numbering. I expect that gets quite big, and maybe should be refactored to use a sorted std::vector<std::pair<Elem*, dof_id_type> > instead?

We could build it in one pass, sort it, and then use it with a binary search or something.

-Ben
From: Cody P. <cod...@gm...> - 2013-10-30 21:01:11
Wow. If this does indeed fix the issue, then I can think of a lot of memory-hog areas in MOOSE that we might have to clean up sooner rather than later. I hope the overhead of the tree doesn't dominate the value_type stored so much that it blows up our total usage by 200%! On the other hand, sizeof(std::pair<Elem*, dof_id_type>) is probably about half that of a single node in the equivalent red/black tree when you consider the left/right pointers... yikes!

Cody
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-30 21:09:20
Yeah, before I get too carried away I should probably just try running the existing code path twice: once as-is, and again with the underlying Metis call commented out, making the partitioner a big, expensive no-op.

Actually, John, if you have a chance could you rerun one of the cases you have data for, but just comment out the call to Metis? Hopefully the memory usage will drop, verifying Metis is the issue.

It should suffice to comment out the Metis call and add a

std::fill (part.begin(), part.end(), 0);

instead, provided it's this simple stand-alone case where the mesh is not used!

-Ben
From: John P. <jwp...@gm...> - 2013-10-30 21:28:30
On Wed, Oct 30, 2013 at 3:09 PM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:
> Actually, John, if you have a chance could you rerun one of the cases you have data
> for, but just comment out the call to Metis? Hopefully the memory usage will drop,
> verifying Metis is the issue.

Yep, I can certainly do that, but I think this is already verified just by looking at the difference in memory usage between the Centroid/Linear/SFC partitioners and Metis that I posted in one of the prior emails this week.

Switching from std::map to a sorted vector sounds like a good idea... especially considering the use case here.

I was also wondering about the size of the

std::vector<std::vector<dof_id_type> > graph(n_active_elem);

that gets temporarily created... it should be approximately n_active_elem * n_neighbors * sizeof(unsigned) in size, but maybe that still pales in comparison to the std::map...

--
John
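(Back-of-the-envelope, assuming 4-byte dof_id_type entries, ~6 face neighbors per hex, and 24-byte std::vector headers on a 64-bit build: for the 200^3 case the adjacency data in 'graph' is about 8e6 * 6 * 4 bytes = ~192 Mb, and the vector-of-vectors bookkeeping adds roughly another 8e6 * 24 bytes = ~192 Mb plus per-allocation heap overhead. A std::map with 8e6 (Elem*, dof_id_type) entries, at something like 48 bytes per tree node plus allocator overhead, lands around 0.5 Gb, so the two temporaries may actually be in the same ballpark.)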
From: John P. <jwp...@gm...> - 2013-10-30 23:55:27
On Wed, Oct 30, 2013 at 3:27 PM, John Peterson <jwp...@gm...> wrote:
> Yep, I can certainly do that, but I think this is already verified just by looking at
> the difference in memory usage between the Centroid/Linear/SFC partitioners and Metis
> that I posted in one of the prior emails this week.

Here's a link to a plot of total memory usage (across 2 procs) for the 200^3 case, annotated at different points in the simulation:

https://drive.google.com/file/d/0B9BK7pg8se_iWmloaHNhOTJSNUE/edit?usp=sharing

The plot didn't quite include all the annotations I was expecting, but I do have some more precise numbers:

1. Before/after building global_index_map: 6653660 - 5615440 K = 0.99 Gb total, half a gig per core.

2. Begin/end of the call to Metis: 7628896 - 7460828 = 0.16 Gb; we actually have slightly _more_ memory free when Metis finishes (plus/minus sampling error), so I don't think there are any major leaks in Metis.

3. The ramp between "global_index_map end" and "graph alloc" is the time when the graph is filled up and when the entries in vwgt, which was allocated earlier, are finally being touched. Could be the OS is finally assigning vwgt actual memory during this time?

I would have thought we would recover more memory when the graph is deallocated, which happens just before the call to PartGraphRecursive (you can see a slight dip there)... I'll have to try and instrument it a bit more carefully tomorrow.

--
John
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-31 00:37:45
That graph is pretty awesome - thanks! I'm gonna have to digest that, but I think there could be some small room for improvement with a different data structure - if I interpret the steep ramp up to (global_index_map end) as map construction.

-Ben
From: John P. <jwp...@gm...> - 2013-10-31 16:47:38
On Wed, Oct 30, 2013 at 6:37 PM, Kirk, Benjamin (JSC-EG311) <ben...@na...> wrote:
> That graph is pretty awesome - thanks! I'm gonna have to digest that, but I think
> there could be some small room for improvement with a different data structure - if I
> interpret the steep ramp up to (global_index_map end) as map construction.

I've uploaded a slightly better memory usage graph for the 200^3 case:

https://drive.google.com/file/d/0B9BK7pg8se_ia1YxSUFkb19TSTg/edit?usp=sharing

(Don't read anything into the time axis of the graph: I've inserted artificial delays around the print statements so the labels would be more legible.)

Here's a description of the labeled points:

100: The start of MetisPartitioner::_do_partition().

200.a/b: Wraps the creation of the 'vwgt' and 'part' vectors. Note that the memory usage doesn't change here; this might be an OS-level optimization that doesn't actually allocate the memory until you write to it.

300.a/b: Wraps the creation of the global_index_map object.

400.a/b: Wraps the creation of the 'xadj', 'adjncy', and 'graph' objects. vwgt is also written to during this time, so it must finally be allocated. Note that 'graph' has been deallocated by the time we reach 400.b.

450.a/b: Wraps just the creation of the 'xadj' and 'adjncy' vectors.

500.a/b: Wraps the actual call to Metis.

Remarks:

.) Keep in mind that these numbers are the _total_ memory used on 2 processors, so the amount per proc is half what is shown.

.) The memory usage before/after the Metis call is definitely not equal, but you can't conclude it's a leak without valgrind verification; it could just be the OS choosing not to deallocate some memory...

.) The behavior changes depending on the size of the problem. For example, if you run the 100^3 case, https://drive.google.com/file/d/0B9BK7pg8se_icS0wSllhWE5wSWM/edit?usp=sharing, it looks like the OS decides not to actually deallocate memory between 450.b and 400.b...

--
John
From: John P. <jwp...@gm...> - 2013-10-31 21:07:42
> We build a std::map<> from Elem* to a unique sorted contiguous ID, as Metis only
> considers the active elements and needs some contiguous numbering. I expect that gets
> quite big, and maybe should be refactored to use a sorted
> std::vector<std::pair<Elem*, dof_id_type> > instead?

You are quite correct about this. I ran our memory logger to compare the memory usage of a map<int,int> and a vector of pair<int,int>, both with 8M elements. The results were pretty striking:

https://drive.google.com/file/d/0B9BK7pg8se_iaWFzbHlEemxmX0U/edit?usp=sharing

The map's peak memory usage is almost exactly 6X that of the vector's... The extra memory presumably comes from the "color", "parent", "left", and "right" data members stored at each node of the RB tree. The last 3 of those are each 8-byte pointers on 64-bit machines...

--
John
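(The arithmetic is roughly consistent with that. Below is a crude stand-in for the tree node layout; the real node type is internal to the standard library and implementation-dependent, so this is only an estimate.)

#include <cstdio>
#include <utility>

// Approximate layout of one red/black tree node holding a map<int,int> entry.
struct ApproxRbNode
{
  int                        color;   // enum, padded to 8 bytes on a 64-bit build
  ApproxRbNode *             parent;
  ApproxRbNode *             left;
  ApproxRbNode *             right;
  std::pair<const int, int>  value;   // the actual key/value payload
};

int main()
{
  std::printf("vector<pair<int,int> > entry: %zu bytes\n", sizeof(std::pair<int,int>)); // 8
  std::printf("map<int,int> node (approx):   %zu bytes\n", sizeof(ApproxRbNode));       // ~40
  return 0;
}

(That's roughly 40 bytes per map entry, each in its own heap allocation with its own allocator overhead, versus 8 contiguous bytes per vector entry - in the neighborhood of the 6X peak-memory difference measured above.)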
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-10-31 21:14:25
> You are quite correct about this. I ran our memory logger to compare the memory usage
> of a map<int,int> and a vector of pair<int,int>, both with 8M elements. The results
> were pretty striking.

Wow, sounds like we need a MaplikeVector<>, to borrow Roy's terminology from the ParallelMesh stuff. Something that supports

foo.insert(key,value);
foo.sort();
val = foo[key]; // where we assert we've been sorted

I'll see if I can get one started in include/utils. Presumably that would be of interest to MOOSE as well, no?

-Ben
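(A minimal sketch of that interface follows; the class name and details here are hypothetical, not an existing libMesh utility. The trade-off versus std::map is that all insertions have to happen before the sort()/lookup phase, which is exactly the usage pattern in the partitioner.)

#include <vector>
#include <algorithm>
#include <cassert>
#include <utility>

// Map-like lookup on top of a single sorted, contiguous vector of pairs.
template <typename Key, typename Val>
class MaplikeVector
{
public:
  MaplikeVector() : _sorted(false) {}

  void insert(const Key & key, const Val & val)
  {
    _data.push_back(std::make_pair(key, val));
    _sorted = false;
  }

  void sort()
  {
    std::sort(_data.begin(), _data.end(), KeyLess());
    _sorted = true;
  }

  const Val & operator[](const Key & key) const
  {
    assert(_sorted); // sort() must be called after the insertion phase
    typename std::vector<std::pair<Key, Val> >::const_iterator it =
      std::lower_bound(_data.begin(), _data.end(), key, KeyLess());
    assert(it != _data.end() && it->first == key);
    return it->second;
  }

private:
  struct KeyLess
  {
    bool operator()(const std::pair<Key, Val> & a,
                    const std::pair<Key, Val> & b) const { return a.first < b.first; }
    bool operator()(const std::pair<Key, Val> & a, const Key & b) const { return a.first < b; }
    bool operator()(const Key & a, const std::pair<Key, Val> & b) const { return a < b.first; }
  };

  std::vector<std::pair<Key, Val> > _data;
  bool _sorted;
};

int main()
{
  MaplikeVector<int, int> foo;
  foo.insert(42, 7);
  foo.insert(13, 3);
  foo.sort();
  int val = foo[42]; // val == 7
  (void) val;
  return 0;
}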
From: Roy S. <roy...@ic...> - 2013-11-01 11:56:48
On Thu, 31 Oct 2013, John Peterson wrote:
>> We build a std::map<> from Elem* to a unique sorted contiguous ID, as Metis only
>> considers the active elements and needs some contiguous numbering. I expect that
>> gets quite big, and maybe should be refactored to use a sorted
>> std::vector<std::pair<Elem*, dof_id_type> > instead?
>
> You are quite correct about this. I ran our memory logger to compare the memory usage
> of a map<int,int> and a vector of pair<int,int>, both with 8M elements. The results
> were pretty striking:
>
> https://drive.google.com/file/d/0B9BK7pg8se_iaWFzbHlEemxmX0U/edit?usp=sharing
>
> The map's peak memory usage is almost exactly 6X that of the vector's...

Wow, nasty. I was expecting 3X; I forgot about the sizeof(int*) == 2*sizeof(int) issue.

It's not really relevant here, but out of curiosity, could you give unordered_map a try?

---
Roy