From: Yujie <rec...@gm...> - 2010-08-09 20:46:34
Dear libMesh developers,

I remember from a previous discussion that a multi-threaded example was going to be added, but I can't find it. Do you have any plan for it? Thanks a lot.

Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-09 21:04:34
On Mon, 9 Aug 2010, Yujie wrote:

> I remember from a previous discussion that a multi-threaded example
> was going to be added, but I can't find it. Do you have any plan
> for it? Thanks a lot.

Not short-term. Your best bet is to look at one of the multi-threaded functions in the library; search for ConstElemRange to find all the appropriate ones.

You're not using FEMSystem by any chance? Adding multithreading to the assembly there has been at the bottom of my TODO list for a while; if I knew someone else needed it, it would be closer to the top.
---
Roy
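For reference, the multithreaded functions Roy points to follow libMesh's StoredRange pattern: split a ConstElemRange across TBB workers and run a function object over each subrange via Threads::parallel_for. A minimal sketch, assuming a libMesh build configured with TBB; AssemblyBody and its element kernel are illustrative placeholders, and header paths and namespacing vary between libMesh versions:

    #include "libmesh/elem_range.h"
    #include "libmesh/threads.h"
    #include "libmesh/mesh_base.h"
    #include "libmesh/elem.h"

    using namespace libMesh;

    // Function object applied by Threads::parallel_for: each worker
    // thread receives a disjoint subrange of elements, so the loop
    // body needs no locking until it touches shared state.
    class AssemblyBody
    {
    public:
      void operator() (const ConstElemRange & range) const
      {
        for (ConstElemRange::const_iterator it = range.begin();
             it != range.end(); ++it)
          {
            const Elem * elem = *it;
            // ... compute this element's contribution here ...
            (void) elem;  // placeholder: silence unused warning
          }
      }
    };

    void threaded_element_loop (const MeshBase & mesh)
    {
      ConstElemRange range (mesh.active_local_elements_begin(),
                            mesh.active_local_elements_end());
      Threads::parallel_for (range, AssemblyBody());
    }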
From: Yujie <rec...@gm...> - 2010-08-09 21:11:22
Dear Roy,

Does multi-threading only benefit performance in the assembly? Is there an implementation barrier in the current libMesh and PETSc? If so, is there any solution for it? I remember Derek had a presentation/paper about this, but I can't find the information on the libMesh website. It looks like multi-threading is not welcome in libMesh :).

Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-09 21:25:02
On Mon, 9 Aug 2010, Yujie wrote:

> Does multi-threading only benefit performance in the assembly?

Lots of chunks of the library itself are multithreaded; however, a typical application will spend most of its time in the solver and most of the rest in assembly, so those are the things that need the speedup.

> Is there an implementation barrier in the current libMesh and PETSc?

PETSc isn't multithreaded:
http://www.mcs.anl.gov/petsc/petsc-as/miscellaneous/petscthreads.html
(It's not even thread-safe if called from multiple threads at once, but the TBB-based threading in libMesh only uses PETSc in a safe way.)

> If so, is there any solution for it?

With a matrix-free solver (i.e. approximating matrix-vector products with finite differences) or an explicit method, your code will spend a greater fraction of its time in assembly and can then benefit more from threading it.

> It looks like multi-threading is not welcome in libMesh :).

It's in the library itself; how much more welcome do you want? ;-) What's lacking in libMesh is a multithreaded solver, which would be quite welcome; it's just that we're all too busy (and too disinclined to reinvent the wheel) to write one ourselves. It's just too easy to run a single MPI process per core instead.
---
Roy
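To make the matrix-free option concrete: with PETSc's SNES you can either run an existing code with the -snes_mf (or -snes_mf_operator) runtime option, or build the finite-difference operator explicitly. A minimal sketch, with error checking omitted; the residual callback is assumed to have been registered via SNESSetFunction(), and the exact SNESSetJacobian() callback signature varies between PETSc releases, so that call is only described in a comment:

    #include <petscsnes.h>

    // Create a MATMFFD "matrix" whose MatMult() applies J*v by
    // finite-differencing the residual function attached to the
    // SNES; no matrix entries are ever stored or assembled.
    Mat create_matrix_free_jacobian (SNES snes)
    {
      Mat J;
      MatCreateSNESMF (snes, &J);
      // Pass J to SNESSetJacobian() as the operator, keeping any
      // assembled matrix only for building the preconditioner.
      return J;
    }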
From: Yujie <rec...@gm...> - 2010-08-09 21:36:16
Dear Roy,

Thanks for your reply. Based on it, if one wants a combined multithreading and MPI implementation of FEM in libMesh, PETSc plus TBB is the choice? The performance improvement could then be realized not only in matrix assembly but also in the solver. It looks like multi-threaded matrix assembly is easier to realize than a multi-threaded solver. Do you know of any solver packages that are multi-threaded? How about Trilinos? Thanks so much.

Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-09 21:48:09
On Mon, 9 Aug 2010, Yujie wrote:

> Based on it, if one wants a combined multithreading and MPI
> implementation of FEM in libMesh, PETSc plus TBB is the choice?

It will work. But it's often an inferior choice to using PETSc without threads - if you have the same number of cores as MPI processes, then you might as well just use one thread per core. If you have more cores than MPI processes, then you'll see lower memory usage and slightly more efficient CPU usage in the multithreaded parts of the code, but in the other parts you'll have cores doing nothing while waiting on PETSc.

> Do you know of any solver packages that are multi-threaded? How
> about Trilinos?

I don't think so. Trilinos has something called ThreadPool, so they're at least taking steps in the right direction, but neither Epetra nor AztecOO seems to use it yet. I could be wrong, though.

It's a shame. Five years ago I would have guessed that by now *every* major solver package would be multithreaded and we'd just be worrying about which ones had the better GPU support.
---
Roy
From: Yujie <rec...@gm...> - 2010-08-09 22:05:12
Dear Roy,

You mean that even with a combined multithreading and MPI implementation, one needs to find the balance between the multithreaded computation and the communication between processes? However, sometimes one wants to use more CPU cores to accelerate the computation, meaning more processes/nodes are used, and then more data communication between processes slows the computation down. In that case, can a combined multithreading and MPI implementation not help? Also, from your comments, are GPU-based multithreading packages more efficient than their CPU counterparts, even though the latter are not popular currently? Thanks a lot.

Regards,
Yujie
From: Derek G. <fri...@gm...> - 2010-08-10 15:41:00
On Aug 9, 2010, at 4:05 PM, Yujie wrote:

> However, sometimes one wants to use more CPU cores to accelerate
> the computation, meaning more processes/nodes are used, and then
> more data communication between processes slows the computation
> down. In that case, can a combined multithreading and MPI
> implementation not help?

This is mostly fantasy (one that I have had as well... but still fantasy). In my experience with multithreading, the overhead of communication is basically NEVER enough to justify using threads over more MPI processes (for a typical full-Jacobian solve, where you have to fill a matrix and then hand it off to an MPI-based solver like PETSc or Trilinos). Every core you add to PETSc will improve your speed more than adding threads for assembly.

The ONLY way it makes sense is (like Roy said) if you are spending 90% of your time in assembly... like if you are doing JFNK or an explicit method (which is what we do here at INL). THEN you can see some benefits for certain problems.

There is one edge case here: large-memory jobs. Sometimes you have a job that for whatever reason uses a lot of memory per MPI process... and in some cases it might use so much that you can't fully pack the nodes on your supercomputer (i.e. you have to leave cores sitting around unused because of the lack of memory). In that specific situation, if you can make use of those idle cores by threading your assembly, you can pick up some extra speed.

In general, I just wouldn't pursue it unless you are doing JFNK or an explicit scheme. Otherwise it's just not going to pay off.

Derek
From: Jed B. <je...@59...> - 2010-08-10 20:34:50
On Tue, 10 Aug 2010 09:40:46 -0600, Derek Gaston <fri...@gm...> wrote:

> On Aug 9, 2010, at 4:05 PM, Yujie wrote:
>
>> However, sometimes one wants to use more CPU cores to accelerate
>> the computation, meaning more processes/nodes are used, and then
>> more data communication between processes slows the computation
>> down.

Note that this communication is generally done by mapping pages between processes; it is much cheaper than serializing over a network. Please present a benchmark showing that this cost is significant for your application before concluding that you need to use a hybrid programming model.

>> In that case, can a combined multithreading and MPI implementation
>> not help?

Note that you have to pay for a reentrant MPI too; the locks are not free, and this use case is less well tested.

> This is mostly fantasy (one that I have had as well... but still
> fantasy). In my experience with multithreading, the overhead of
> communication is basically NEVER enough to justify using threads
> over more MPI processes (for a typical full-Jacobian solve, where
> you have to fill a matrix and then hand it off to an MPI-based
> solver like PETSc or Trilinos). Every core you add to PETSc will
> improve your speed more than adding threads for assembly.

Note that many operations are bandwidth-limited, and adding more cores doesn't always help. As a simple example, on one 6-core socket of an XT5, we have the following numbers for STREAM Triad, the bandwidth-bound kernel a(i) = b(i) + q*c(i) (numbers courtesy of Dinesh Kaushik):

  Threads   Total (MB/s)   Per core (MB/s)
        1           8448             8448
        2          10112             5056
        4          10715             2679
        6          10482             1747

In contrast, BlueGene/P produces:

  Threads   Total (MB/s)   Per core (MB/s)
        1           2266             2266
        2           4529             2264
        4           8903             2226

PETSc has support for user-defined preconditioners using OpenMP, but this is rarely used, and I'm not aware of any cases where it has been used to beat an intelligent MPI-only preconditioner.

PETSc-dev now has CUDA support for all vector and many sparse matrix kernels (no source-level changes required).

The non-reentrant parts of PETSc are mostly in the profiling/logging/debugging code. Making PETSc reentrant would not be terribly deep work, but it would take a fair amount of time to add all the fine-grained locks (and this would cost some performance; logging functions are potentially performance-sensitive). I don't think it's possible to make the sparse matrix data formats reentrant without unacceptable performance/usability impact, so MatSetValues() would always have to take a (coarse-grained) lock. The PETSc team doesn't see this offering sufficient benefit to justify the complexity and implementation effort. If you have non-contrived use cases where it is an unambiguous win, you may be able to make a case for getting it done.

> There is one edge case here: large-memory jobs. Sometimes you have
> a job that for whatever reason uses a lot of memory per MPI
> process... and in some cases it might use so much that you can't
> fully pack the nodes on your supercomputer (i.e. you have to leave
> cores sitting around unused because of the lack of memory). In that
> specific situation, if you can make use of those idle cores by
> threading your assembly, you can pick up some extra speed.

This is a relevant scenario, particularly if the mesh or geometric model is not distributed.

Jed
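The coarse-grained lock Jed mentions is the pattern a threaded assembly loop can use today: compute element matrices concurrently, then serialize only the insertion into the non-thread-safe PETSc matrix. A minimal sketch; the function and variable names below are illustrative rather than libMesh's own, and Threads::spin_mutex assumes a TBB-enabled libMesh build:

    #include <vector>
    #include <petscmat.h>
    #include "libmesh/threads.h"

    // One global lock guarding all PETSc matrix insertions.
    libMesh::Threads::spin_mutex petsc_mutex;

    // Called by each worker thread once it has finished computing a
    // dense element matrix Ke (row-major) for the given dof indices.
    void insert_element_matrix (Mat A,
                                const std::vector<PetscInt> & dofs,
                                const std::vector<PetscScalar> & Ke)
    {
      const PetscInt n = static_cast<PetscInt> (dofs.size());

      // Only the insertion is serialized; the expensive quadrature
      // and physics work above this call stays fully parallel.
      libMesh::Threads::spin_mutex::scoped_lock lock (petsc_mutex);
      MatSetValues (A, n, &dofs[0], n, &dofs[0], &Ke[0], ADD_VALUES);
    }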
From: Derek G. <fri...@gm...> - 2010-08-10 21:06:21
Hey Jed, I just wanted to thank you for your insightful commentary! Your presence on this list is greatly appreciated! It's definitely always good to get some inside knowledge into where PETSc is headed...

Derek
From: Yujie <rec...@gm...> - 2010-08-10 20:51:20
Thank you very much, Jed.

I made a mistake: I had thought that the MPI processes of a computation within one node of a cluster need to use Ethernet for data communication. With MPICH2 it is done inside the node, meaning the multithreading technique can't help much against slow data communication between nodes. The problem is that the dimension of my problem is small.

Regarding GPU-based PETSc, is there an implicit multithreaded MPI solver when a distributed matrix is used? I have noticed that some functions for SEQ and MPI Vec and AIJ Mat have been added. Regarding MATDENSE, do you have any plan for it? Are there any challenges for SEQ and MPI dense matrices? Thanks again.

Regards,
Yujie
From: Jed B. <je...@59...> - 2010-08-10 20:58:59
On Tue, 10 Aug 2010 15:51:18 -0500, Yujie <rec...@gm...> wrote:

> Thank you very much, Jed.
>
> I made a mistake: I had thought that the MPI processes of a
> computation within one node of a cluster need to use Ethernet for
> data communication. With MPICH2 it is done inside the node,

Open MPI (and all other modern implementations) do this too.

> Regarding GPU-based PETSc, is there an implicit multithreaded MPI
> solver when a distributed matrix is used?

If I understand your question, then yes, the MPI matrix formats are supported and the GPU kernels are "multithreaded".

> I have noticed that some functions for SEQ and MPI Vec and AIJ Mat
> have been added. Regarding MATDENSE, do you have any plan for it?
> Are there any challenges for SEQ and MPI dense matrices?

The dense cases are much easier than the sparse ones (and offer more potential for GPU speedup). I don't know if it will happen in the next few weeks (it's not much work, but someone has to get to it), but it should certainly be there for the next release (probably Q1 2011).

Jed
From: Yujie <rec...@gm...> - 2010-08-10 21:10:01
Thanks so much, Jed :)

Best Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-10 21:11:21
On Tue, 10 Aug 2010, Jed Brown wrote:

> On Tue, 10 Aug 2010 15:51:18 -0500, Yujie <rec...@gm...> wrote:
>> I made a mistake: I had thought that the MPI processes of a
>> computation within one node of a cluster need to use Ethernet for
>> data communication. With MPICH2 it is done inside the node,
>
> Open MPI (and all other modern implementations) do this too.

Even non-modern implementations will do better than Yujie assumed: TCP/IP data sent to the local IP address doesn't touch the network card, regardless of whether the MPI stack realizes that it's sending to someplace local. I'd guess that a modern MPI stack can use shared memory locally and avoid RAM-to-RAM copies, but even RAM-to-RAM should be much faster than RAM-to-NIC-to-NIC-to-RAM.
---
Roy
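Jed's earlier request for a benchmark is easy to satisfy: a minimal MPI ping-pong, run once with both ranks on one node and once across two nodes, shows how large the intranode advantage really is on a given machine. A sketch; the message size and repetition count are arbitrary choices:

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main (int argc, char ** argv)
    {
      MPI_Init (&argc, &argv);
      int rank;
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);

      const int n = 1 << 20;       // 8 MB of doubles per message
      const int reps = 100;
      std::vector<double> buf (n);

      MPI_Barrier (MPI_COMM_WORLD);
      const double t0 = MPI_Wtime ();
      for (int r = 0; r < reps; ++r)
        if (rank == 0)
          {
            MPI_Send (&buf[0], n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv (&buf[0], n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
          }
        else if (rank == 1)
          {
            MPI_Recv (&buf[0], n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
            MPI_Send (&buf[0], n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
          }
      const double t1 = MPI_Wtime ();

      if (rank == 0)
        std::printf ("%g s per round trip\n", (t1 - t0) / reps);

      MPI_Finalize ();
      return 0;
    }

Run it with two ranks on one node and then on two nodes; the Open MPI transport-selection options Jed describes below also let you isolate the shared-memory path from the TCP path on a single node.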
From: Jed B. <je...@59...> - 2010-08-10 21:27:14
On Tue, 10 Aug 2010 16:11:23 -0500 (CDT), Roy Stogner <roy...@ic...> wrote:

> Even non-modern implementations will do better than Yujie assumed:
> TCP/IP data sent to the local IP address doesn't touch the network
> card, regardless of whether the MPI stack realizes that it's
> sending to someplace local. I'd guess that a modern MPI stack can
> use shared memory locally and avoid RAM-to-RAM copies, but even
> RAM-to-RAM should be much faster than RAM-to-NIC-to-NIC-to-RAM.

Yes, but TCP needs more context switches and copying through kernel space, so the hit is still not trivial. Also, if you have nice network hardware, you can have it do the copy via DMA, which requires no system calls, and your program can do something useful while the messages are delivered. I have heard second-hand reports of this sometimes being faster than the shared-memory transport layer when overlapping communication and computation, but I don't have a concrete example to point to.

It's really easy to experiment with this when using Open MPI (MPICH2 has a similar ability, but it is controlled through environment variables and I don't recall the details). Run with 'mpiexec -mca btl self,tcp' to use TCP even for local communication (as Roy says, this never touches the network device, but it still involves copying through kernel space). The default will include 'sm', which is the dedicated shared-memory transport layer. With InfiniBand, try '-mca btl self,openib' to use the HCA for local communication as well (usually this involves no context switches, since the registered memory is copied directly by the HCA, but it of course means the data has to actually reach the device, which is usually slower than the memory bus).

Jed