From: Joa L. <li...@jo...> - 2010-01-15 19:03:24
Hi again,

thanks for all the answers, some whooshing by over my head. But I think I got the essential:

1) If I go for threads, I will have to write the assembly part myself (no problem, threads I know).

2) Threads alone will not use all my cores for the matrix inversion, error estimates, etc. Sometimes they will, sometimes not.

3) MPI would be a nice solution, since it would let me use all my cores, or many computers and their cores, and so on. This with or without threading... up to me and my skills/taste.

But, never having used MPI... If I want to use libmesh the way I do, as a library used by a library that I can change at runtime (dlopen...), I would have to make all my code "understand" MPI so that I can start it with e.g. mpiexec. And then? To me it seems quite hard... Could I use MPI only in a library called by my program, without resorting to std::system("mpiexec ...") somewhere in the code (maybe not that bad?)?

cheers
Joa
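The short answer to that last question is yes: a library can initialize MPI itself. A minimal sketch of the usual lazy-initialization guard for a dlopen()ed library, where plugin_init and plugin_finalize are hypothetical names rather than libMesh API:

    #include <mpi.h>

    // Hypothetical entry points for a dlopen()ed solver library.
    // MPI_Initialized() may legally be called before MPI_Init(), so
    // the library can set up MPI lazily without knowing how the host
    // program was started.
    void plugin_init(int *argc, char ***argv)
    {
      int initialized = 0;
      MPI_Initialized(&initialized);
      if (!initialized)
        MPI_Init(argc, argv);
    }

    void plugin_finalize()
    {
      int finalized = 0;
      MPI_Finalized(&finalized);
      if (!finalized)
        MPI_Finalize();
    }

The catch is the process count: if nothing launched extra ranks, MPI_COMM_WORLD contains a single process, so the program is correct but not parallel. That is why MPI programs are normally started under mpiexec; MPI-2's MPI_Comm_spawn can start additional ranks from a running process, though support for it varies between implementations.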
On Thu, Jan 14, 2010 at 09:40:28PM +0100, Jed Brown wrote:
> On Thu, 14 Jan 2010 14:11:04 -0600, "Kirk, Benjamin (JSC-EG311)" <ben...@na...> wrote:
> > From a PETSc point of view that is certainly true, but for libMesh
> > applications we allocate the sparsity exactly, so there is no allocation
> > inside MatSetValues. Even though MatSetValues is not thread-safe, when
> > the allocation is correct, putting a mutex around the element matrix
> > insertion has caused no noticeable performance degradation out to 16
> > threads...
>
> Interesting, but this just means that your integration is very
> expensive. If the problem is simple, so that integration is very cheap,
> then the cost of insertion will matter. I know that the FEniCS group
> has seen some cases where insertion costs are significant compared to
> integration and solves, but I think it was only an issue for
> heterogeneous Helmholtz operators (which show up in many semi-implicit
> schemes).
>
> > Depending on the architecture, we've seen 4-socket 4-core nodes perform
> > *better* (lower wall-clock time) using 4 MPI tasks on the node than 16.
>
> This isn't uncommon: the cores end up just competing for memory
> bandwidth. Also, the extra cores mean smaller subdomains, which
> increases your iteration counts with most preconditioners.
>
> > Of course, as previously discussed, this only has a real impact on runtime
> > if you spend a nontrivial amount of time in the matrix assembly. This is
> > true in some of my applications, where the matrix assembly time is
> > comparable to (or sometimes greater than) the linear solve time.
>
> Yeah, I try to avoid assembly in these cases by lagging the
> preconditioner and doing more things matrix-free (the matrix-free
> operations can certainly benefit from threading; you'll probably realize
> a greater speedup this way since you can parallelize the mat-vecs, which
> are CPU-bound instead of memory-bound when done unassembled -- usually
> by storing the nonlinearity at quadrature points).
>
> Jed
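A minimal sketch of the locking pattern Ben describes above; the names (Ke, dof_indices, n_dofs) are illustrative, and the point is that the expensive integration runs concurrently while only the insertion into the global PETSc matrix is serialized:

    #include <mutex>
    #include <petscmat.h>

    // One global lock shared by all assembly threads.
    static std::mutex insertion_mutex;

    void insert_element(Mat A, const PetscInt *dof_indices, PetscInt n_dofs,
                        const PetscScalar *Ke)
    {
      // ...the expensive per-element integration that produced Ke
      // happened outside the lock, in parallel across threads...

      // MatSetValues() is not thread-safe, so serialize only this call.
      std::lock_guard<std::mutex> guard(insertion_mutex);
      MatSetValues(A, n_dofs, dof_indices, n_dofs, dof_indices,
                   Ke, ADD_VALUES);
    }

With the sparsity preallocated exactly, the critical section is just a short scatter of values into existing storage, which is presumably why the mutex costs so little in Ben's measurements.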
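And a minimal sketch of the matrix-free route Jed suggests, using PETSc's shell matrices; the mat-vec body and the context layout are placeholders for application code:

    #include <petscmat.h>

    // A "shell" matrix: PETSc calls back into user code for each
    // mat-vec, so no global matrix is ever assembled.
    static PetscErrorCode MyMult(Mat A, Vec x, Vec y)
    {
      void *ctx;
      MatShellGetContext(A, &ctx);
      // ...apply the operator to x and store the result in y, e.g.
      // using nonlinearity data held at quadrature points in ctx...
      return 0;
    }

    // n = local rows/cols, N = global rows/cols, assumed square.
    Mat create_shell_operator(MPI_Comm comm, PetscInt n, PetscInt N,
                              void *ctx)
    {
      Mat A;
      MatCreateShell(comm, n, n, N, N, ctx, &A);
      MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMult);
      return A;
    }

The mat-vec callback is where threading pays off: it touches only local, unassembled data, so it parallelizes without the insertion bottleneck above.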