From: Joa L. <li...@jo...> - 2010-01-15 19:03:24
Hi again,

thanks for all the answers, some whooshing by over my head. But I think I got the essential:

1) If I go for threads, I will have to write the assembly part myself (no problem, threads I know).

2) Threads alone will not use all my cores for the matrix inversion, error estimates, etc. Sometimes they will, sometimes not.

3) MPI would be a nice solution, since it would let me use all my cores, or many computers and their cores, and so on. This with or without threading... up to me and my skills/taste.

But, never having used MPI... If I want to use libmesh the way I do, as a library used by a library that I can change at runtime (dlopen...), I would have to make all my code "understand" MPI so that I can start it with e.g. mpiexec. And then? To me it seems quite hard... Could I use MPI only in a library called by my program, without resorting to std::system("mpiexec ...") somewhere in the code (maybe not that bad?)?

cheers
Joa
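The short answer to that last question is yes: a library can initialize MPI itself. A minimal sketch of the usual lazy-initialization guard for a dlopen()ed library, where plugin_init and plugin_finalize are hypothetical names rather than libMesh API:

    #include <mpi.h>

    // Hypothetical entry points for a dlopen()ed solver library.
    // MPI_Initialized() may legally be called before MPI_Init(), so
    // the library can set up MPI lazily without knowing how the host
    // program was started.
    void plugin_init(int *argc, char ***argv)
    {
      int initialized = 0;
      MPI_Initialized(&initialized);
      if (!initialized)
        MPI_Init(argc, argv);
    }

    void plugin_finalize()
    {
      int finalized = 0;
      MPI_Finalized(&finalized);
      if (!finalized)
        MPI_Finalize();
    }

The catch is the process count: if nothing launched extra ranks, MPI_COMM_WORLD contains a single process, so the program is correct but not parallel. That is why MPI programs are normally started under mpiexec; MPI-2's MPI_Comm_spawn can start additional ranks from a running process, though support for it varies between implementations.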
On Thu, Jan 14, 2010 at 09:40:28PM +0100, Jed Brown wrote:
> On Thu, 14 Jan 2010 14:11:04 -0600, "Kirk, Benjamin (JSC-EG311)" <ben...@na...> wrote:
> > From a PETSc point of view that is certainly true, but for libMesh
> > applications we allocate the sparsity exactly, so there is no allocation
> > inside MatSetValues. Even though MatSetValues is not thread-safe, when
> > the allocation is correct, putting a mutex around the element matrix
> > insertion has caused no noticeable performance degradation out to 16
> > threads...
>
> Interesting, but this just means that your integration is very
> expensive. If the problem is simple, so that integration is very cheap,
> then the cost of insertion will matter. I know that the FEniCS group
> has seen some cases where insertion costs are significant compared to
> integration and solves, but I think it was only an issue for
> heterogeneous Helmholtz operators (which show up in many semi-implicit
> schemes).
>
> > Depending on the architecture, we've seen 4-socket 4-core nodes perform
> > *better* (lower wall-clock time) using 4 MPI tasks on the node than 16.
>
> This isn't uncommon: the cores end up just competing for memory
> bandwidth. Also, the extra cores mean smaller subdomains, which
> increases your iteration counts with most preconditioners.
>
> > Of course, as previously discussed, this only has a real impact on runtime
> > if you spend a nontrivial amount of time in the matrix assembly. This is
> > true in some of my applications, where the matrix assembly time is
> > comparable to (or sometimes greater than) the linear solve time.
>
> Yeah, I try to avoid assembly in these cases by lagging the
> preconditioner and doing more things matrix-free (the matrix-free
> operations can certainly benefit from threading; you'll probably realize
> a greater speedup this way since you can parallelize the mat-vecs, which
> are CPU-bound instead of memory-bound when done unassembled -- usually
> by storing the nonlinearity at quadrature points).
>
> Jed
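A minimal sketch of the locking pattern Ben describes above; the names (Ke, dof_indices, n_dofs) are illustrative, and the point is that the expensive integration runs concurrently while only the insertion into the global PETSc matrix is serialized:

    #include <mutex>
    #include <petscmat.h>

    // One global lock shared by all assembly threads.
    static std::mutex insertion_mutex;

    void insert_element(Mat A, const PetscInt *dof_indices, PetscInt n_dofs,
                        const PetscScalar *Ke)
    {
      // ...the expensive per-element integration that produced Ke
      // happened outside the lock, in parallel across threads...

      // MatSetValues() is not thread-safe, so serialize only this call.
      std::lock_guard<std::mutex> guard(insertion_mutex);
      MatSetValues(A, n_dofs, dof_indices, n_dofs, dof_indices,
                   Ke, ADD_VALUES);
    }

With the sparsity preallocated exactly, the critical section is just a short scatter of values into existing storage, which is presumably why the mutex costs so little in Ben's measurements.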
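And a minimal sketch of the matrix-free route Jed suggests, using PETSc's shell matrices; the mat-vec body and the context layout are placeholders for application code:

    #include <petscmat.h>

    // A "shell" matrix: PETSc calls back into user code for each
    // mat-vec, so no global matrix is ever assembled.
    static PetscErrorCode MyMult(Mat A, Vec x, Vec y)
    {
      void *ctx;
      MatShellGetContext(A, &ctx);
      // ...apply the operator to x and store the result in y, e.g.
      // using nonlinearity data held at quadrature points in ctx...
      return 0;
    }

    // n = local rows/cols, N = global rows/cols, assumed square.
    Mat create_shell_operator(MPI_Comm comm, PetscInt n, PetscInt N,
                              void *ctx)
    {
      Mat A;
      MatCreateShell(comm, n, n, N, N, ctx, &A);
      MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMult);
      return A;
    }

The mat-vec callback is where threading pays off: it touches only local, unassembled data, so it parallelizes without the insertion bottleneck above.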