From: robert <rob...@un...> - 2011-08-30 15:56:11
Hello,

I am trying to get my libMesh code to run in parallel. Thanks to your help, most of it already works. However, I still have some small problems where I don't yet understand libMesh well enough to solve them alone. One of them is doing mesh refinement in parallel:

    Mesh new_mesh;
    TetGenIO TETGEN(new_mesh);

    {
      // read TetGen file
      OStringStream inmesh;
      inmesh << "geometry_wings/export_pov/out/"
             << ReadMesh[actual_pulse] << "_box.1.ele";
      TETGEN.read(inmesh.str().c_str());
    }

    set_subdomains(&TETGEN, new_mesh); // works
    set_boundary_ids(new_mesh);        // works

    MeshRefinement mesh_refinement(new_mesh);       // see note below!
    mesh_refinement.uniformly_refine(no_unirefine); // see note below!

    new_mesh.prepare_for_use();
    EquationSystems new_equation_systems(new_mesh);

The mesh_refinement.uniformly_refine(1) takes some minutes when I do it in serial on my PC. However, on the BlueGene/P it doesn't come to an end (I aborted after 15 min). Do I have to prepare my mesh in a special way for use in parallel? One thing I have tried was to call prepare_for_use() before the refinement. I thought this made sense because the mesh is then partitioned first. But the problem remains.

Mesh Information:
  mesh_dimension()=3
  spatial_dimension()=3
  n_nodes()=447273
  n_local_nodes()=447273
  n_elem()=2994336
  n_local_elem()=2994336
  n_active_elem()=2661632
  n_subdomains()=2
  n_processors()=1
  processor_id()=0

Robert
From: John P. <jwp...@gm...> - 2011-08-30 16:08:53
On Tue, Aug 30, 2011 at 9:56 AM, robert <rob...@un...> wrote:
>
> The mesh_refinement.uniformly_refine(1) takes some minutes when I do it
> in serial on my PC. However, on the BlueGene/P it doesn't come to an
> end (I aborted after 15 min). Do I have to prepare my mesh in a special
> way for use in parallel?

Your mesh is serial unless you specifically compiled libmesh with
--enable-parmesh.

What this means is that the mesh data structure (and any refinements,
etc.) is duplicated on all processors -- convenient for writing code,
not great for conserving memory.

And it's particularly bad on multicore shared memory architectures: you
will have one copy of the mesh per MPI process, and typically that means
one per core.

> n_nodes()=447273
> n_elem()=2994336

You may be going into swap depending on how little memory is available
per core on this machine, but 3M elements isn't particularly huge... so:

.) How much memory do you have per core?
.) What is going on in 'top' (or the equivalent on the BlueGene) while
your code is running?
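One more thing you can do is ask the library directly what kind of mesh
you ended up with. An untested sketch (I'm assuming the MeshBase
accessors is_serial(), n_local_elem(), and processor_id(), and the
0.7.x-era header name):

    #include <iostream>
    #include "mesh_base.h"  // header name in 0.7.x-era libMesh, I believe

    // Print what each rank thinks it has.  A replicated (serial) Mesh
    // reports is_serial() == true on every rank, even though the
    // partitioner still assigns ownership, so n_local_elem() can be
    // smaller than n_elem() either way.
    void report_mesh_distribution (const libMesh::MeshBase & mesh)
    {
      std::cout << "rank " << mesh.processor_id()
                << ": is_serial() = " << mesh.is_serial()
                << ", n_local_elem() = " << mesh.n_local_elem()
                << " of " << mesh.n_elem() << std::endl;
    }

If that prints is_serial() = 1 on every rank, you know the mesh is being
replicated.

-- John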
From: robert <rob...@un...> - 2011-08-30 16:36:58
On Tuesday, 2011-08-30 at 10:08 -0600, John Peterson wrote:
> Your mesh is serial unless you specifically compiled libmesh with
> --enable-parmesh.

Ok, but it nevertheless shouldn't be slower than on my PC, or am I
wrong? And would this option just work with the existing code, or do I
have to change to ParallelMesh?

> .) How much memory do you have per core?

SMP mode: 4 GB of physical memory available to the MPI task running on
each node (4 cores/node). For testing and learning I only use a
partition of 32 nodes.

> .) What is going on in 'top' (or the equivalent on the BlueGene) while
> your code is running?

What do you mean by this? Excuse the question -- I am just not the most
experienced person with computers.
From: John P. <jwp...@gm...> - 2011-08-30 16:49:05
On Tue, Aug 30, 2011 at 10:36 AM, robert <rob...@un...> wrote:
>
> Ok, but it nevertheless shouldn't be slower than on my PC, or am I
> wrong?

It will be extremely slow if it goes into swap.

> And would this option just work with the existing code, or do I have
> to change to ParallelMesh?

It should work with existing code. Hopefully.

> SMP mode: 4 GB of physical memory available to the MPI task running on
> each node (4 cores/node)

I can't tell if you mean 4GB/node or 4GB/core. If it's the latter
(16 GB/node) then I don't see how you can possibly be running into
swap. If it's the former then there could be an issue.

> For testing and learning I only use a partition of 32 nodes.

32 nodes or 32 cores? I don't know the details of your cluster, so it
may be obvious, but make sure you aren't accidentally running too many
MPI processes on a given node.

> What do you mean by this?

On Linux/UNIX systems, top is a program that prints which processes are
currently running, the relative amount of CPU they are using, and their
memory consumption. If you can access the compute node where your code
is running and see how much memory/CPU each of its processes is
consuming, you can get some idea of whether the code is running into
swap (there will be very low CPU utilization and very high memory
consumption).
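If you can't get an interactive shell on the compute nodes (often the
case on BlueGene), you can instead have each rank report its own memory
use. A minimal sketch for ordinary Linux nodes -- the lightweight BG/P
compute kernel may or may not support this call:

    #include <sys/resource.h>
    #include <iostream>

    // Untested: print this rank's peak resident set size.  On Linux,
    // ru_maxrss is reported in kilobytes; a rank whose peak RSS is
    // close to the node's physical memory is a swap suspect.
    void print_peak_memory (unsigned int rank)
    {
      struct rusage usage;
      if (getrusage(RUSAGE_SELF, &usage) == 0)
        std::cout << "rank " << rank << ": peak RSS = "
                  << usage.ru_maxrss << " kB" << std::endl;
    }

You could call this right after uniformly_refine(), passing your rank
id (libMesh::processor_id() or equivalent).

-- John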
From: robert <rob...@un...> - 2011-08-30 18:23:28
On Tuesday, 2011-08-30 at 10:48 -0600, John Peterson wrote:
> 32 nodes or 32 cores? I don't know the details of your cluster, so it
> may be obvious, but make sure you aren't accidentally running too many
> MPI processes on a given node.

As far as I understand it:

  1 node = 4 cores
  4 GB/node

For testing and learning I only used a partition of 32 nodes. I have
just changed to 128 nodes, but this doesn't change anything.

If I am running into swap and I use --enable-parmesh, this wouldn't
change much (since I have one copy of the mesh per MPI process), right?

> If you can access the compute node where your code is running and see
> how much memory/CPU each of its processes is consuming, you can get
> some idea of whether the code is running into swap.

top - 20:19:21 up 35 days, 8:55, 51 users, load average: 0.01, 0.29, 0.45
Tasks: 399 total, 1 running, 397 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  31985140k total, 31158420k used,  826720k free,  274980k buffers
Swap:  8393952k total,     160k used, 8393792k free, 16572876k cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 2955 bodner  16   0  3392 1932 1244 R    1  0.0  0:00.69 top
 6602 bodner  15   0 14296 3248 1864 S    0  0.0  0:10.11 sshd
 2829 bodner  15   0 19604 3892 3092 S    0  0.0  0:00.17 mpirun
  ...

The last one is the process of interest.
From: John P. <jwp...@gm...> - 2011-08-30 18:34:53
On Tue, Aug 30, 2011 at 12:23 PM, robert <rob...@un...> wrote:
>
> As far as I understand it:
>
>   1 node = 4 cores
>   4 GB/node

This doesn't match the output of the top command you posted below. The
total memory given there is 31,985,140 kilobytes = 30.5 gigabytes.

Does the cluster you are on have a public information web page? That
would probably help clear things up...

> If I am running into swap and I use --enable-parmesh, this wouldn't
> change much (since I have one copy of the mesh per MPI process),
> right?

The idea would be to run fewer processes per node. For example, you
could run 1 MPI process each on 128 different nodes; then each of the
individual processes would have access to the full amount of RAM on its
node. The method for doing this is again cluster dependent; I don't
know if it's possible on your particular cluster.

>  2829 bodner  15   0 19604 3892 3092 S    0  0.0  0:00.17 mpirun
>
> The last one is the process of interest.

Actually, none of these is interesting... we would need to see the
actual processes that mpirun spawned. That is, if you ran something
like

  mpirun -np 4 ./foo

you would need to look for the four instances of "foo" in the top
output and see how much CPU/memory they are consuming.
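By the way, if your machine uses the stock BlueGene/P job launcher, the
tasks-per-node count is usually set with a mode flag. If I remember the
BG/P documentation correctly (please check yours; "your_app" is just a
placeholder), something like

  mpirun -mode SMP -np 128 ./your_app

runs one task per node (DUAL would be two, VN all four), which is the
easiest way to give each MPI process the node's full 4GB.

-- John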
From: robert <rob...@un...> - 2011-08-30 18:51:46
On Tuesday, 2011-08-30 at 12:34 -0600, John Peterson wrote:
> The idea would be to run fewer processes per node. For example, you
> could run 1 MPI process each on 128 different nodes; then each of the
> individual processes would have access to the full amount of RAM on
> its node.

It is possible to run 1, 2 or 4 processes per node. If I run 2 or 4
processes I get:

  Error! ***Memory allocation failed for SetUpCoarseGraph: gdata. Requested size: 107754020 bytes
  Error! ***Memory allocation failed for SetUpCoarseGraph: gdata. Requested size: 107754020 bytes
  ...

For 1 process it works, but very, very slowly.
From: John P. <jwp...@gm...> - 2011-08-30 19:03:17
On Tue, Aug 30, 2011 at 12:34 PM, John Peterson <jwp...@gm...> wrote:
> This doesn't match the output of the top command you posted below.
> The total memory given there is 31,985,140 kilobytes = 30.5 gigabytes.

The 32GB from your top command is for the head node. It does appear
that there are 4GB of physical memory on each compute node.

> It is possible to run 1, 2 or 4 processes per node. If I run 2 or 4
> processes I get:
>
>   Error! ***Memory allocation failed for SetUpCoarseGraph: gdata.
>   Requested size: 107754020 bytes

This function is in Metis, so you are running out of memory during the
mesh partitioning stage.

For one process, it's possible you are not running out of memory but
are going into swap... which is making the code run really slowly.

I'd try recompiling libmesh with parallel mesh enabled... but I'm still
surprised that 4GB is not enough memory.

Are the 2,994,336 elements in the mesh you posted before or after
uniform refinement?
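As a rough sanity check (the per-object sizes here are guesses, not
measured numbers): your output shows 2,661,632 active elements out of
2,994,336 total, which is consistent with one uniform refinement of a
332,704-element mesh (332,704 parents plus 8x as many children). At,
say, ~200 bytes per Elem with refinement bookkeeping and ~80 bytes per
Node, the refined mesh alone is something like

  3.0e6 elements * ~200 B + 4.5e5 nodes * ~80 B ~= 0.6 GB per copy.

With a replicated mesh and 4 ranks on a 4GB node that's ~2.4 GB before
Metis allocates its graph (your error message alone asked for ~108 MB),
so the 2- and 4-process failures are at least plausible.

-- John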
From: robert <rob...@un...> - 2011-08-30 19:19:32
> Are the 2,994,336 elements in the mesh you posted before or after
> uniform refinement?

They are after the refinement, which I calculated on my PC.

In addition to libMesh, I wrote a class which reads some text files
containing geological information. The class also includes some
functions for handling the data -- I basically use them to set
subdomains in the system. Since this is the first time I have worked in
parallel, every node also has a copy of this class. But the total size
of the text files read in is about 4 MB, so I didn't consider it big
enough to cause problems. Maybe you have a different opinion about it?
From: robert <rob...@un...> - 2011-08-31 07:31:38
On Tuesday, 2011-08-30 at 21:19 +0200, robert wrote:
> They are after the refinement, which I calculated on my PC.

Running the job with --enable-parmesh doesn't change anything. I run
only one MPI job per node, but it is still quite slow.

Robert
From: John P. <jwp...@gm...> - 2011-08-31 14:47:17
On Wed, Aug 31, 2011 at 1:31 AM, robert <rob...@un...> wrote:
>
> Running the job with --enable-parmesh doesn't change anything. I run
> only one MPI job per node, but it is still quite slow.

You need to **reconfigure libmesh** with --enable-parmesh and then
recompile it.
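If you want to double-check which build you are actually linking
against, I believe the configure option ends up as a #define in
libmesh_config.h (I'm quoting the macro name from memory, so verify it
in your installed header):

    /* Untested: make the build fail loudly if this libmesh was not
       configured with --enable-parmesh. */
    #include "libmesh_config.h"
    #ifndef LIBMESH_ENABLE_PARMESH
    #  error "libmesh was built without --enable-parmesh"
    #endif

-- John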
From: robert <rob...@un...> - 2011-08-31 14:55:01
On Wednesday, 2011-08-31 at 08:46 -0600, John Peterson wrote:
> You need to **reconfigure libmesh** with --enable-parmesh and then
> recompile it.

Sorry, my last post was not clear. I did recompile libmesh with

  ./configure --enable-parmesh --enable-slepc --disable-shared
  make

before I posted the above message.
From: robert <rob...@un...> - 2011-08-31 16:55:23
On Wednesday, 2011-08-31 at 16:55 +0200, robert wrote:
> Sorry, my last post was not clear. I did recompile libmesh with
>
>   ./configure --enable-parmesh --enable-slepc --disable-shared
>   make
>
> before I posted the above message.

Is there something else I could be doing wrong, e.g. when reading the
mesh with TetGenIO?

I am quite sure that the memory of the nodes is not the problem. I just
monitored the memory my PC uses during the refinement: 1.6 GiB of
3.2 GiB -- and that was with other applications also running.

Robert