From: Magnus H. <Mag...@oh...> - 2008-01-24 15:25:46
Hello world,

First of all, please excuse the cross-posting, but I would like to address both the deal.II and libMesh communities.

I am currently looking into designing a new 3D finite element geo-electric code, and I have come across deal.II and libMesh. Both libraries look quite promising, each having their advantages/disadvantages. On the plus side, deal.II supports Nedelec elements, which I am quite interested in using. However, libMesh allows the degrees of freedom to be distributed (in addition to parallelising the linear algebra). A major design factor is memory requirements, which also seems to be one of the main distinguishing factors between the two libraries. Do you have estimates for the memory requirements of the stiffness matrix and load vector? What size problems can reasonably be run?

At this stage our plans are still quite vague, and this is just an initial probing question to get a feel for the capabilities of the two libraries.

Many thanks

Regards
magnus

--
Dr Magnus Hagdorn
Research Geophysicist
OHM Ltd
The Technology Centre
Claymore Drive
Aberdeen Science and Energy Park
Aberdeen, AB23 8GD
www.OHMsurveys.com
+44.870.4296581
From: John P. <pet...@cf...> - 2008-01-24 16:47:57
I accidentally replied off-list, oops ...
From: Wolfgang B. <ban...@ma...> - 2008-01-24 15:57:44
> I am currently looking into designing a new 3D finite element
> geo-electric code, and I have come across deal.II and libMesh. Both
> libraries look quite promising, each having their
> advantages/disadvantages. On the plus side, deal.II supports Nedelec
> elements, which I am quite interested in using. However, libMesh
> allows the degrees of freedom to be distributed (in addition to
> parallelising the linear algebra). A major design factor is memory
> requirements, which also seems to be one of the main distinguishing
> factors between the two libraries. Do you have estimates for the
> memory requirements of the stiffness matrix and load vector?

That depends on the problem you solve. For the Laplace equation in 2d with Q1 elements you have 9 entries per row of the matrix, so you are talking about 108 bytes in the matrix per degree of freedom, plus storage for the solution vector, right hand side, and temporary vectors in the CG method. For the Laplace equation in 2d, the linear system is therefore not the largest memory consumer of a finite element program; deal.II needs several hundred (but less than 1000) bytes per degree of freedom overall.

That changes if you go to 3d and to systems of equations, because there you have many more entries per row of the matrix (for example, for an equation like 'curl curl B = f' you would have 81 entries per row of the matrix already for the lowest order elements). So you would already consume ~1000 bytes per degree of freedom for the linear system alone. This is independent of the library you use, unless you go to schemes where the matrix is not stored. Higher order elements would use even more memory, of course. In these cases, the memory footprint of the linear system is larger than that of the grid, DoF handler, and the rest together.

> What size problems can reasonably be run?

On a machine with 16GB, using deal.II, you can solve linear systems with a few million degrees of freedom (in 2d) or about a million (in 3d) without too much trouble. On clusters we have solved problems with >20 million unknowns.

Now let's see what our friends from the competition have to say ;-)

Cheers
W.

-------------------------------------------------------------------------
Wolfgang Bangerth        email: ban...@ma...
                         www:   http://www.math.tamu.edu/~bangerth/
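To make the arithmetic behind these figures explicit, here is a small back-of-the-envelope calculation. It assumes a typical compressed-row storage scheme holding one 8-byte double plus one 4-byte column index per nonzero; that assumption is how 9 entries per row become 108 bytes per degree of freedom.

    #include <cstdio>

    int main()
    {
      // Bytes per stored nonzero in a compressed-row (CSR) format:
      // an 8-byte double value plus a 4-byte integer column index.
      const int bytes_per_entry = 8 + 4;

      // 2d Laplace, Q1 elements: each vertex couples with itself and
      // its 8 neighbours, i.e. 9 nonzeros per matrix row.
      std::printf("2d Laplace, Q1       : %d bytes/DoF\n",
                  9 * bytes_per_entry);    // = 108

      // 3d 'curl curl B = f', lowest-order elements: ~81 nonzeros per
      // row, so roughly 1000 bytes per DoF for the matrix alone, on
      // top of which come solution, right hand side, and work vectors.
      std::printf("3d curl-curl, order 1: %d bytes/DoF\n",
                  81 * bytes_per_entry);   // = 972
    }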
From: Roy S. <roy...@ic...> - 2008-01-24 18:06:34
On Thu, 24 Jan 2008, Wolfgang Bangerth wrote:

> Now let's see what our friends from the competition have to say ;-)

Competition? Does this mean you've added triangles, tets, prisms and pyramids while we weren't looking? ;-)

I think you've basically covered it - the biggest memory expenditure is typically the system matrix, which when running on clusters should be the same for either library. The only things I can think of to add are:

libMesh does have hooks into matrix-free solvers to avoid that memory usage, but they're fairly new; if you want to see an example using them you'll have to check the mailing list, as I don't think Ben has committed it to the SVN head yet.

libMesh also now has a distributed ParallelMesh, but it's also new, SVN head only, and hasn't been tested on huge problems yet. Still, if you're solving on a 256-CPU cluster, it will be a nice option to store one copy of the mesh (plus a copy or three of each "ghost" element and node) instead of 256 copies.

libMesh just hooks into PETSc and LASPACK for sparse linear algebra, whereas deal.II has its own multithreaded linear solvers (which IIRC were more efficient than PETSc?) for shared memory systems. If you're running on a four-core workstation, for example, I think deal.II only needs to store one copy of the mesh, whereas with libMesh you need to either use ParallelMesh or you end up storing four copies of a SerialMesh, one for each process. Can deal.II mix multithreading with PETSc? If so, then they've got the same advantage on clusters.

As John pointed out on the libMesh list (what, no cross-post to the deal.II list? you can't start a fun flamewar on just one mailing list...), your memory usage depends not just on how many DoFs you have but on how they're connected. That's complicated, though - using tets instead of hexes generally reduces your sparse matrix bandwidth slightly, and reduces the number of pointers stored per element, but I think the increase in the number of elements stored probably makes the memory usage worse overall.

---
Roy
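For readers who haven't seen the matrix-free approach Roy mentions: Krylov solvers only ever need matrix-vector products, so you can hand the solver a callback instead of a stored matrix. Below is a minimal, hypothetical sketch using PETSc's MatShell mechanism (the calling sequence shown is the present-day PETSc API rather than the 2008 one, and the trivial operator y = 2x stands in for a real element-loop product):

    #include <petscksp.h>

    // Matrix-free operator: the solver calls this instead of reading
    // stored matrix entries. Here a trivial y = 2*x stands in for an
    // element loop that would compute A*x on the fly.
    PetscErrorCode ApplyOperator(Mat A, Vec x, Vec y)
    {
      VecCopy(x, y);
      VecScale(y, 2.0);
      return 0;
    }

    int main(int argc, char **argv)
    {
      PetscInitialize(&argc, &argv, NULL, NULL);

      const PetscInt n = 1000;  // hypothetical local problem size

      Mat A;
      MatCreateShell(PETSC_COMM_WORLD, n, n, PETSC_DETERMINE,
                     PETSC_DETERMINE, NULL, &A);
      MatShellSetOperation(A, MATOP_MULT, (void (*)(void)) ApplyOperator);

      Vec x, b;
      VecCreateMPI(PETSC_COMM_WORLD, n, PETSC_DETERMINE, &b);
      VecDuplicate(b, &x);
      VecSet(b, 1.0);

      KSP ksp;
      PC  pc;
      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetOperators(ksp, A, A);
      KSPGetPC(ksp, &pc);
      PCSetType(pc, PCNONE);  // no stored entries, so no ILU and friends
      KSPSolve(ksp, b, x);    // solves 2*x = 1 componentwise

      KSPDestroy(&ksp);
      VecDestroy(&x);
      VecDestroy(&b);
      MatDestroy(&A);
      PetscFinalize();
    }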
From: Benjamin K. <ben...@na...> - 2008-01-24 18:36:11
> libMesh just hooks into PETSc and LASPACK for sparse linear algebra,
> whereas deal.II has its own multithreaded linear solvers (which IIRC
> were more efficient than PETSc?) for shared memory systems. If you're
> running on a four-core workstation, for example, I think deal.II only
> needs to store one copy of the mesh, whereas with libMesh you need to
> either use ParallelMesh or you end up storing four copies of a
> SerialMesh, one for each process. Can deal.II mix multithreading with
> PETSc? If so, then they've got the same advantage on clusters.

I'm actually quite interested in the answer to this question. Has *anyone* gotten PETSc working well inside a multithreaded program? It should be possible to call PETSc only from a single thread, but that kinda kills the point.

Along those lines, Derek, do you know about threading in Trilinos?

Our new parallel mesh and parallel algorithm implementations are clearly the answer for large-scale distributed memory machines, but when we are talking about 16+ cores per node, threading has its place even inside a distributed memory mesh implementation.

-Ben
From: Wolfgang B. <ban...@ma...> - 2008-01-24 19:07:29
> I'm actually quite interested in the answer to this question. Has
> *anyone* gotten PETSc working well inside a multithreaded program? It
> should be possible to call PETSc only from a single thread, but that
> kinda kills the point.

You can call it from several threads, but you have to synchronise with a mutex. If you assemble a linear system, copying local to global doesn't take that much time (this is the only part that needs to be serialised), whereas computing the local contributions runs nicely in parallel.

> Along those lines, Derek, do you know about threading in Trilinos?

I would be interested in hearing about that too.

> Our new parallel mesh and parallel algorithm implementations are
> clearly the answer for large-scale distributed memory machines, but
> when we are talking about 16+ cores per node, threading has its place
> even inside a distributed memory mesh implementation.

And that point isn't going to go away anytime soon :-)

W.

-------------------------------------------------------------------------
Wolfgang Bangerth        email: ban...@ma...
                         www:   http://www.math.tamu.edu/~bangerth/
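A minimal sketch of the pattern Wolfgang describes, written with C++11 threads purely for illustration (deal.II provides its own Threads wrappers; the matrix and cell types here are made-up stand-ins). The expensive local computations run concurrently, and only the cheap copy-local-to-global step is serialised behind a mutex:

    #include <cstddef>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct GlobalMatrix
    {
      // Stand-in for the library's sparse matrix; add() accumulates
      // one entry into the global system.
      void add(int /*i*/, int /*j*/, double /*value*/) {}
    };

    struct LocalContribution
    {
      std::vector<int>    dofs;    // global DoF indices of this cell
      std::vector<double> values;  // dense local matrix, row-major
    };

    std::mutex assembly_mutex;

    LocalContribution compute_cell_matrix(int /*cell*/)
    {
      // Quadrature and shape-function work would happen here; it
      // touches only thread-local data, so no locking. (Stubbed out.)
      return LocalContribution{};
    }

    void assemble_range(GlobalMatrix &A, int first_cell, int last_cell)
    {
      for (int cell = first_cell; cell < last_cell; ++cell)
      {
        const LocalContribution local = compute_cell_matrix(cell);

        // Copying local to global is the only step touching shared
        // state; it is cheap compared to computing the local matrix,
        // so serialising it behind a mutex costs little.
        const std::lock_guard<std::mutex> lock(assembly_mutex);
        const std::size_t n = local.dofs.size();
        for (std::size_t i = 0; i < n; ++i)
          for (std::size_t j = 0; j < n; ++j)
            A.add(local.dofs[i], local.dofs[j], local.values[i * n + j]);
      }
    }

    int main()
    {
      GlobalMatrix A;
      const int n_cells = 1024, n_threads = 4;

      std::vector<std::thread> workers;
      for (int t = 0; t < n_threads; ++t)
        workers.emplace_back(assemble_range, std::ref(A),
                             t * n_cells / n_threads,
                             (t + 1) * n_cells / n_threads);
      for (std::thread &w : workers)
        w.join();
    }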
From: Derek G. <fri...@gm...> - 2008-01-24 20:02:03
On Jan 24, 2008 12:07 PM, Wolfgang Bangerth <ban...@ma...> wrote:
> > Along those lines, Derek, do you know about threading in Trilinos?
>
> I would be interested in hearing about that too.

I really don't know about threading and Trilinos... there have been some investigations into whether or not it's a good idea... but I haven't seen an MPI+SMP solver come out of any of those investigations.

A lot of people (including myself) are still skeptical that it's even a good idea. I personally think that the complexity involved in creating MPI+SMP software outweighs any potential gains. MPI software is hard to write... and so is SMP... put the two together and you are just asking for trouble.

But I am also personally interested in looking at these issues. It is clear that machines are headed in this direction... but I still haven't had anyone show me that software written this way would provide a dramatic enough speedup to justify the complexity of the software...

Derek
From: Wolfgang B. <ban...@ma...> - 2008-01-24 22:09:23
> A lot of people (including myself) are still skeptical that it's even
> a good idea. I personally think that the complexity involved in
> creating MPI+SMP software outweighs any potential gains. MPI software
> is hard to write... and so is SMP... put the two together and you are
> just asking for trouble.

I totally disagree; it's all a matter of making it simple. Of course, if you write threads code using calls to pthread_create & friends, then it's a pain, but if you can write

  Thread thread_id;
  thread_id = spawn matrix.vec_mult (v);
  // do something else
  Vector w = thread_id.return_value();

or something similar to run the matrix-vector multiplication on a separate thread, then it's a different matter. This is by and large the syntax we have in deal.II.

Best
W.

-------------------------------------------------------------------------
Wolfgang Bangerth        email: ban...@ma...
                         www:   http://www.math.tamu.edu/~bangerth/
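Wolfgang's snippet is deal.II-flavoured pseudocode; the same "spawn, keep working, fetch the result" pattern can be written today in portable C++ like this (std::async is used only for illustration and is not the deal.II API; vec_mult is a made-up stand-in for matrix.vec_mult):

    #include <future>
    #include <vector>

    // Stand-in for matrix.vec_mult(v).
    std::vector<double> vec_mult(const std::vector<double> &v)
    {
      return std::vector<double>(v.size(), 0.0);
    }

    int main()
    {
      const std::vector<double> v(100, 1.0);

      // Spawn the matrix-vector product on a separate thread...
      std::future<std::vector<double>> thread_id =
        std::async(std::launch::async, vec_mult, std::cref(v));

      // ... do something else in the meantime ...

      // ... then block until the result is available.
      const std::vector<double> w = thread_id.get();
    }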
From: Benjamin K. <ben...@na...> - 2008-01-25 17:06:20
>> A lot of people (including myself) are still skeptical that it's even
>> a good idea. I personally think that the complexity involved in
>> creating MPI+SMP software outweighs any potential gains. MPI software
>> is hard to write... and so is SMP... put the two together and you are
>> just asking for trouble.
>
> I totally disagree; it's all a matter of making it simple. Of course,
> if you write threads code using calls to pthread_create & friends,
> then it's a pain, but if you can write
>
>   Thread thread_id;
>   thread_id = spawn matrix.vec_mult (v);
>   // do something else
>   Vector w = thread_id.return_value();
>
> or something similar to run the matrix-vector multiplication on a
> separate thread, then it's a different matter. This is by and large
> the syntax we have in deal.II.

I'm with Wolfgang here - ultimately I'd like to partition for nodes and use thread-based parallelism on each node. I am not sure this will work with PETSc, though, and I need to look into that some more. Certainly shared memory within a node makes the load balancing problem *much* easier for on-node cores.

The issue is that PETSc does everything on a per-MPI-process basis and, to the best of my knowledge, does not use threads itself to implement e.g. its matvec. And threaded BLAS will only take you so far... So you could assemble the system matrix with threads on 16 cores in a node, but when PETSc solves the linear system I think it will only be using one core.

Wolfgang's PETSc flag was new to me and I'm going to look into it, although this is not promising:
http://www-unix.mcs.anl.gov/petsc/petsc-2/miscellaneous/petscthreads.html

-Ben
From: Wolfgang B. <ban...@ma...> - 2008-01-25 17:14:42
> The issue is that PETSc does everything on a per-MPI-process basis
> and, to the best of my knowledge, does not use threads itself to
> implement e.g. its matvec. And threaded BLAS will only take you so
> far... So you could assemble the system matrix with threads on 16
> cores in a node, but when PETSc solves the linear system I think it
> will only be using one core.

Yes, the PETSc people need to do some work there as well.

> Wolfgang's PETSc flag was new to me and I'm going to look into it,
> although this is not promising:
> http://www-unix.mcs.anl.gov/petsc/petsc-2/miscellaneous/petscthreads.html

That page has been there for at least 4 years. My reading is that all it says is that you can't call PETSc functions from different threads at the same time. I don't think they mean to say that PETSc can't internally use threads to do its computations (and that is what we want it to do, of course).

I'm going to see if I can't figure out what the flag's name was. I believe it was still internal, and may only exist in PETSc dev.

W.

-------------------------------------------------------------------------
Wolfgang Bangerth        email: ban...@ma...
                         www:   http://www.math.tamu.edu/~bangerth/
From: Magnus H. <Mag...@oh...> - 2008-01-25 17:22:25
On Fri, 2008-01-25 at 11:14 -0600, Wolfgang Bangerth wrote:

> > The issue is that PETSc does everything on a per-MPI-process basis
> > and, to the best of my knowledge, does not use threads itself to
> > implement e.g. its matvec. And threaded BLAS will only take you so
> > far... So you could assemble the system matrix with threads on 16
> > cores in a node, but when PETSc solves the linear system I think it
> > will only be using one core.
>
> Yes, the PETSc people need to do some work there as well.
>
> > Wolfgang's PETSc flag was new to me and I'm going to look into it,
> > although this is not promising:
> > http://www-unix.mcs.anl.gov/petsc/petsc-2/miscellaneous/petscthreads.html
>
> That page has been there for at least 4 years. My reading is that all
> it says is that you can't call PETSc functions from different threads
> at the same time. I don't think they mean to say that PETSc can't
> internally use threads to do its computations (and that is what we
> want it to do, of course).
>
> I'm going to see if I can't figure out what the flag's name was. I
> believe it was still internal, and may only exist in PETSc dev.

We tried using PETSc in an OpenMP program and yes, it is *not* thread-safe. PETSc maintains state in some global variables :-(.

Cheers
magnus

--
Dr Magnus Hagdorn
Research Geophysicist
OHM Ltd
The Technology Centre
Claymore Drive
Aberdeen Science and Energy Park
Aberdeen, AB23 8GD
www.OHMsurveys.com
+44.870.4296581
From: Wolfgang B. <ban...@ma...> - 2008-01-24 19:05:45
> > Now let's see what our friends from the competition have to say ;-)
>
> Competition? Does this mean you've added triangles, tets, prisms and
> pyramids while we weren't looking? ;-)

:-) I guess I could come up with other things instead ;-)

> libMesh does have hooks into matrix-free solvers to avoid that memory
> usage, but they're fairly new;

deal.II can do that as well, but not...

> libMesh also now has a distributed ParallelMesh, but it's also new,
> SVN head only, and hasn't been tested on huge problems yet. Still, if
> you're solving on a 256-CPU cluster, it will be a nice option to store
> one copy of the mesh (plus a copy or three of each "ghost" element and
> node) instead of 256 copies.

...this one.

> libMesh just hooks into PETSc and LASPACK for sparse linear algebra,
> whereas deal.II has its own multithreaded linear solvers (which IIRC
> were more efficient than PETSc?) for shared memory systems.

Yes. Maybe more importantly (because it scales very nicely), we also assemble linear systems in parallel on multicore systems.

> SerialMesh, one for each process. Can deal.II mix multithreading with
> PETSc?

For assembling, yes (which also works on individual cluster nodes, of course). For solvers we just hand things over to PETSc. I hear PETSc has some magic flag that lets it run multithreaded, but I don't know how to turn it on.

W.

-------------------------------------------------------------------------
Wolfgang Bangerth        email: ban...@ma...
                         www:   http://www.math.tamu.edu/~bangerth/
From: Roy S. <roy...@ic...> - 2008-01-24 19:39:51
On Thu, 24 Jan 2008, Wolfgang Bangerth wrote:

>>> Now let's see what our friends from the competition have to say ;-)
>>
>> Competition? Does this mean you've added triangles, tets, prisms and
>> pyramids while we weren't looking? ;-)
>
> :-) I guess I could come up with other things instead ;-)

What, first library to add NURBS wins?

>> libMesh just hooks into PETSc and LASPACK for sparse linear algebra,
>> whereas deal.II has its own multithreaded linear solvers (which IIRC
>> were more efficient than PETSc?) for shared memory systems.
>
> Yes. Maybe more importantly (because it scales very nicely), we also
> assemble linear systems in parallel on multicore systems.

Just to make sure it's clear: so does libMesh, just by using separate processes instead of separate threads. The assembly time scales like 1/N with the number of cores; our problem is just that the SerialMesh memory usage scales like N.

>> SerialMesh, one for each process. Can deal.II mix multithreading with
>> PETSc?
>
> For assembling, yes (which also works on individual cluster nodes, of
> course). For solvers we just hand things over to PETSc.
>
> I hear PETSc has some magic flag that lets it run multithreaded, but I
> don't know how to turn it on.

We ought to figure that out. Multi-level shared+distributed memory systems are here to stay. The alternative to threading is to use multiple processes which allocate blocks of shared memory, but that's 99% as hard to debug and about 400% harder to code. It's probably less portable, too; I think differing shared memory APIs were part of the SysV/BSD rift.

---
Roy
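For concreteness, the "multiple processes sharing memory blocks" alternative Roy mentions boils down to something like the POSIX sketch below (the block name and size are made up, and the genuinely hard part he alludes to, synchronising concurrent access between processes, is omitted entirely):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main()
    {
      // Create a named shared-memory block that sibling processes on
      // the same node could also shm_open() and map.
      const int fd = shm_open("/demo_block", O_CREAT | O_RDWR, 0600);
      if (fd < 0) { std::perror("shm_open"); return 1; }
      if (ftruncate(fd, 4096) != 0) { std::perror("ftruncate"); return 1; }

      void *mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
      if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }
      double *data = static_cast<double *>(mem);

      data[0] = 3.14;  // visible to every process mapping "/demo_block"

      munmap(mem, 4096);
      close(fd);
      shm_unlink("/demo_block");  // remove the name when done
    }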
From: Wolfgang B. <ban...@ma...> - 2008-01-27 05:56:28
> I hear PETSc has some magic flag that lets it run multithreaded, but I
> don't know how to turn it on.

Below is Matt's answer to my question. I guess I misunderstood back then how this is going to work; it isn't quite what I had in mind... The link to the function he mentions is this:
http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCOPENMP.html

W.

--------------------------------
Date: Fri, 25 Jan 2008 12:06:54 -0600
From: Matthew Knepley
To: Wolfgang Bangerth
Subject: Re: Multithreading in PETSc

It took me a while to parse this. We do not support threads per se, but we have a preconditioner called OPENMP. It is a true Barryism, since there is only MPI in it :) It is for serial jobs. Basically, it starts up a large communicator but only uses the first process for normal PETSc work. When you get to the solver, it brings in all the other MPI processes, which in a good implementation are threads. There are examples in the development documentation on the web if you go to PC/OPENMP.

Warning: you need a simply ridiculous number of command line options to drive it, but it shows great speedup. The guy at Harvard loves it.

   Matt

On Jan 25, 2008 11:16 AM, Wolfgang Bangerth <ban...@ma...> wrote:
>
> Hi Matt,
> back in Boulder you said that there is a flag that makes PETSc
> internally use threads to speed up things. This topic has come up
> again in a discussion, and it reminded me that I wanted to ask you
> about the secret way to enable this feature. Is there a short answer?
>
> Best
> Wolfgang
>
> -------------------------------------------------------------------------
> Wolfgang Bangerth        email: ban...@ma...
>                          www:   http://www.math.tamu.edu/~bangerth/