From: Yujie <rec...@gm...> - 2010-08-09 20:46:34
Dear libMesh developers,

I remember from a previous discussion that a multi-threaded example was going to be added, but I can't find it. Do you have any plan for it? Thanks a lot.

Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-09 21:04:34
On Mon, 9 Aug 2010, Yujie wrote:

> I remember from a previous discussion that a multi-threaded example
> was going to be added, but I can't find it. Do you have any plan
> for it? Thanks a lot.

Not short-term. Your best bet is to look at one of the multi-threaded functions in the library; search for ConstElemRange to find all the appropriate ones.

You're not using FEMSystem by any chance? Adding multithreading to the assembly there has been at the bottom of my TODO list for a while; if I knew someone else needed it, it would be closer to the top.
---
Roy
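For reference, the multithreaded functions Roy points to follow libMesh's StoredRange pattern: split a ConstElemRange across TBB workers and run a function object over each subrange via Threads::parallel_for. A minimal sketch, assuming a libMesh build configured with TBB; AssemblyBody and its element kernel are illustrative placeholders, and header paths and namespacing vary between libMesh versions:

    #include "libmesh/elem_range.h"
    #include "libmesh/threads.h"
    #include "libmesh/mesh_base.h"
    #include "libmesh/elem.h"

    using namespace libMesh;

    // Function object applied by Threads::parallel_for: each worker
    // thread receives a disjoint subrange of elements, so the loop
    // body needs no locking until it touches shared state.
    class AssemblyBody
    {
    public:
      void operator() (const ConstElemRange & range) const
      {
        for (ConstElemRange::const_iterator it = range.begin();
             it != range.end(); ++it)
          {
            const Elem * elem = *it;
            // ... compute this element's contribution here ...
            (void) elem;  // placeholder: silence unused warning
          }
      }
    };

    void threaded_element_loop (const MeshBase & mesh)
    {
      ConstElemRange range (mesh.active_local_elements_begin(),
                            mesh.active_local_elements_end());
      Threads::parallel_for (range, AssemblyBody());
    }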
From: Yujie <rec...@gm...> - 2010-08-09 21:11:22
Dear Roy,

Does multi-threading only benefit performance in the assembly? Is there an implementation barrier in the current libMesh and PETSc? If so, is there any solution for it? I remember Derek had a presentation/paper about this, but I can't find the information on the libMesh website. It looks like multi-threading is not welcome in libMesh :).

Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-09 21:25:02
On Mon, 9 Aug 2010, Yujie wrote:

> Does multi-threading only benefit performance in the assembly?

Lots of chunks of the library itself are multithreaded; however, a typical application will spend most of its time in the solver and most of the rest in assembly, so those are the things that need the speedup.

> Is there an implementation barrier in the current libMesh and PETSc?

PETSc isn't multithreaded:
http://www.mcs.anl.gov/petsc/petsc-as/miscellaneous/petscthreads.html
(It's not even thread-safe if called from multiple threads at once, but the TBB-based threading in libMesh only uses PETSc in a safe way.)

> If so, is there any solution for it?

With a matrix-free solver (i.e. approximating matrix-vector products with finite differences) or an explicit method, your code will spend a greater fraction of its time in assembly and can then benefit more from threading it.

> It looks like multi-threading is not welcome in libMesh :).

It's in the library itself; how much more welcome do you want? ;-) What's lacking in libMesh is a multithreaded solver, which would be quite welcome; it's just that we're all too busy (and too disinclined to reinvent the wheel) to write one ourselves. It's just too easy to run a single MPI process per core instead.
---
Roy
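To make the matrix-free option concrete: with PETSc's SNES you can either run an existing code with the -snes_mf (or -snes_mf_operator) runtime option, or build the finite-difference operator explicitly. A minimal sketch, with error checking omitted; the residual callback is assumed to have been registered via SNESSetFunction(), and the exact SNESSetJacobian() callback signature varies between PETSc releases, so that call is only described in a comment:

    #include <petscsnes.h>

    // Create a MATMFFD "matrix" whose MatMult() applies J*v by
    // finite-differencing the residual function attached to the
    // SNES; no matrix entries are ever stored or assembled.
    Mat create_matrix_free_jacobian (SNES snes)
    {
      Mat J;
      MatCreateSNESMF (snes, &J);
      // Pass J to SNESSetJacobian() as the operator, keeping any
      // assembled matrix only for building the preconditioner.
      return J;
    }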
From: Yujie <rec...@gm...> - 2010-08-09 21:36:16
Dear Roy,

Thanks for your reply. Based on it, if one wants a combined multithreading and MPI implementation of FEM in libMesh, PETSc plus TBB is the choice? The performance improvement could then be realized not only in matrix assembly but also in the solver. It looks like multi-threaded matrix assembly is easier to realize than a multi-threaded solver. Do you know of any solver packages that are multi-threaded? How about Trilinos? Thanks so much.

Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-09 21:48:09
On Mon, 9 Aug 2010, Yujie wrote:

> Based on it, if one wants a combined multithreading and MPI
> implementation of FEM in libMesh, PETSc plus TBB is the choice?

It will work. But it's often an inferior choice to using PETSc without threads - if you have the same number of cores as MPI processes, then you might as well just use one thread per core. If you have more cores than MPI processes, then you'll see lower memory usage and slightly more efficient CPU usage in the multithreaded parts of the code, but in the other parts you'll have cores doing nothing while waiting on PETSc.

> Do you know of any solver packages that are multi-threaded? How
> about Trilinos?

I don't think so. Trilinos has something called ThreadPool, so they're at least taking steps in the right direction, but neither Epetra nor AztecOO seems to use it yet. I could be wrong, though.

It's a shame. Five years ago I would have guessed that by now *every* major solver package would be multithreaded and we'd just be worrying about which ones had the better GPU support.
---
Roy
From: Yujie <rec...@gm...> - 2010-08-09 22:05:12
Dear Roy,

You mean that even with a combined multithreading and MPI implementation, one needs to find the balance between the multithreaded computation and the communication between processes? However, sometimes one wants to use more CPU cores to accelerate the computation, meaning more processes/nodes are used, and then more data communication between processes slows the computation down. In that case, can a combined multithreading and MPI implementation not help? Also, from your comments, are GPU-based multithreading packages more efficient than their CPU counterparts, even though the latter are not popular currently? Thanks a lot.

Regards,
Yujie
From: Derek G. <fri...@gm...> - 2010-08-10 15:41:00
On Aug 9, 2010, at 4:05 PM, Yujie wrote:

> However, sometimes one wants to use more CPU cores to accelerate
> the computation, meaning more processes/nodes are used, and then
> more data communication between processes slows the computation
> down. In that case, can a combined multithreading and MPI
> implementation not help?

This is mostly fantasy (one that I have had as well... but still fantasy). In my experience with multithreading, the overhead of communication is basically NEVER enough to justify using threads over more MPI processes (for a typical full-Jacobian solve, where you have to fill a matrix and then hand it off to an MPI-based solver like PETSc or Trilinos). Every core you add to PETSc will improve your speed more than adding threads for assembly.

The ONLY way it makes sense is (like Roy said) if you are spending 90% of your time in assembly... like if you are doing JFNK or an explicit method (which is what we do here at INL). THEN you can see some benefits for certain problems.

There is one edge case here: large-memory jobs. Sometimes you have a job that for whatever reason uses a lot of memory per MPI process... and in some cases it might use so much that you can't fully pack the nodes on your supercomputer (i.e. you have to leave cores sitting around unused because of the lack of memory). In that specific situation, if you can make use of those idle cores by threading your assembly, you can pick up some extra speed.

In general, I just wouldn't pursue it unless you are doing JFNK or an explicit scheme. Otherwise it's just not going to pay off.

Derek
From: Jed B. <je...@59...> - 2010-08-10 20:34:50
On Tue, 10 Aug 2010 09:40:46 -0600, Derek Gaston <fri...@gm...> wrote:

> On Aug 9, 2010, at 4:05 PM, Yujie wrote:
>
>> However, sometimes one wants to use more CPU cores to accelerate
>> the computation, meaning more processes/nodes are used, and then
>> more data communication between processes slows the computation
>> down.

Note that this communication is generally done by mapping pages between processes; it is much cheaper than serializing over a network. Please present a benchmark showing that this cost is significant for your application before concluding that you need to use a hybrid programming model.

>> In that case, can a combined multithreading and MPI implementation
>> not help?

Note that you have to pay for a reentrant MPI too; the locks are not free, and this use case is less well tested.

> This is mostly fantasy (one that I have had as well... but still
> fantasy). In my experience with multithreading, the overhead of
> communication is basically NEVER enough to justify using threads
> over more MPI processes (for a typical full-Jacobian solve, where
> you have to fill a matrix and then hand it off to an MPI-based
> solver like PETSc or Trilinos). Every core you add to PETSc will
> improve your speed more than adding threads for assembly.

Note that many operations are bandwidth-limited, and adding more cores doesn't always help. As a simple example, on one 6-core socket of an XT5, we have the following numbers for STREAM Triad, the bandwidth-bound kernel a(i) = b(i) + q*c(i) (numbers courtesy of Dinesh Kaushik):

  Threads   Total (MB/s)   Per core (MB/s)
        1           8448             8448
        2          10112             5056
        4          10715             2679
        6          10482             1747

In contrast, BlueGene/P produces:

  Threads   Total (MB/s)   Per core (MB/s)
        1           2266             2266
        2           4529             2264
        4           8903             2226

PETSc has support for user-defined preconditioners using OpenMP, but this is rarely used, and I'm not aware of any cases where it has been used to beat an intelligent MPI-only preconditioner.

PETSc-dev now has CUDA support for all vector and many sparse matrix kernels (no source-level changes required).

The non-reentrant parts of PETSc are mostly in the profiling/logging/debugging code. Making PETSc reentrant would not be terribly deep work, but it would take a fair amount of time to add all the fine-grained locks (and this would cost some performance; logging functions are potentially performance-sensitive). I don't think it's possible to make the sparse matrix data formats reentrant without unacceptable performance/usability impact, so MatSetValues() would always have to take a (coarse-grained) lock. The PETSc team doesn't see this offering sufficient benefit to justify the complexity and implementation effort. If you have non-contrived use cases where it is an unambiguous win, you may be able to make a case for getting it done.

> There is one edge case here: large-memory jobs. Sometimes you have
> a job that for whatever reason uses a lot of memory per MPI
> process... and in some cases it might use so much that you can't
> fully pack the nodes on your supercomputer (i.e. you have to leave
> cores sitting around unused because of the lack of memory). In that
> specific situation, if you can make use of those idle cores by
> threading your assembly, you can pick up some extra speed.

This is a relevant scenario, particularly if the mesh or geometric model is not distributed.

Jed
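The coarse-grained lock Jed mentions is the pattern a threaded assembly loop can use today: compute element matrices concurrently, then serialize only the insertion into the non-thread-safe PETSc matrix. A minimal sketch; the function and variable names below are illustrative rather than libMesh's own, and Threads::spin_mutex assumes a TBB-enabled libMesh build:

    #include <vector>
    #include <petscmat.h>
    #include "libmesh/threads.h"

    // One global lock guarding all PETSc matrix insertions.
    libMesh::Threads::spin_mutex petsc_mutex;

    // Called by each worker thread once it has finished computing a
    // dense element matrix Ke (row-major) for the given dof indices.
    void insert_element_matrix (Mat A,
                                const std::vector<PetscInt> & dofs,
                                const std::vector<PetscScalar> & Ke)
    {
      const PetscInt n = static_cast<PetscInt> (dofs.size());

      // Only the insertion is serialized; the expensive quadrature
      // and physics work above this call stays fully parallel.
      libMesh::Threads::spin_mutex::scoped_lock lock (petsc_mutex);
      MatSetValues (A, n, &dofs[0], n, &dofs[0], &Ke[0], ADD_VALUES);
    }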
From: Derek G. <fri...@gm...> - 2010-08-10 21:06:21
Hey Jed, I just wanted to thank you for your insightful commentary! Your presence on this list is greatly appreciated! It's definitely always good to get some inside knowledge into where PETSc is headed...

Derek
From: Yujie <rec...@gm...> - 2010-08-10 20:51:20
Thank you very much, Jed.

I made a mistake: I had thought that the MPI processes of a computation within one node of a cluster need to use Ethernet for data communication. With MPICH2 it is done inside the node, meaning the multithreading technique can't help much against slow data communication between nodes. The problem is that the dimension of my problem is small.

Regarding GPU-based PETSc, is there an implicit multithreaded MPI solver when a distributed matrix is used? I have noticed that some functions for SEQ and MPI Vec and AIJ Mat have been added. Regarding MATDENSE, do you have any plan for it? Are there any challenges for SEQ and MPI dense matrices? Thanks again.

Regards,
Yujie
From: Jed B. <je...@59...> - 2010-08-10 20:58:59
On Tue, 10 Aug 2010 15:51:18 -0500, Yujie <rec...@gm...> wrote:

> Thank you very much, Jed.
>
> I made a mistake: I had thought that the MPI processes of a
> computation within one node of a cluster need to use Ethernet for
> data communication. With MPICH2 it is done inside the node,

Open MPI (and all other modern implementations) do this too.

> Regarding GPU-based PETSc, is there an implicit multithreaded MPI
> solver when a distributed matrix is used?

If I understand your question, then yes, the MPI matrix formats are supported and the GPU kernels are "multithreaded".

> I have noticed that some functions for SEQ and MPI Vec and AIJ Mat
> have been added. Regarding MATDENSE, do you have any plan for it?
> Are there any challenges for SEQ and MPI dense matrices?

The dense cases are much easier than the sparse ones (and offer more potential for GPU speedup). I don't know if it will happen in the next few weeks (it's not much work, but someone has to get to it), but it should certainly be there for the next release (probably Q1 2011).

Jed
From: Yujie <rec...@gm...> - 2010-08-10 21:10:01
Thanks so much, Jed :)

Best Regards,
Yujie
From: Roy S. <roy...@ic...> - 2010-08-10 21:11:21
On Tue, 10 Aug 2010, Jed Brown wrote:

> On Tue, 10 Aug 2010 15:51:18 -0500, Yujie <rec...@gm...> wrote:
>> I made a mistake: I had thought that the MPI processes of a
>> computation within one node of a cluster need to use Ethernet for
>> data communication. With MPICH2 it is done inside the node,
>
> Open MPI (and all other modern implementations) do this too.

Even non-modern implementations will do better than Yujie assumed: TCP/IP data sent to the local IP address doesn't touch the network card, regardless of whether the MPI stack realizes that it's sending to someplace local. I'd guess that a modern MPI stack can use shared memory locally and avoid RAM-to-RAM copies, but even RAM-to-RAM should be much faster than RAM-to-NIC-to-NIC-to-RAM.
---
Roy
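Jed's earlier request for a benchmark is easy to satisfy: a minimal MPI ping-pong, run once with both ranks on one node and once across two nodes, shows how large the intranode advantage really is on a given machine. A sketch; the message size and repetition count are arbitrary choices:

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main (int argc, char ** argv)
    {
      MPI_Init (&argc, &argv);
      int rank;
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);

      const int n = 1 << 20;       // 8 MB of doubles per message
      const int reps = 100;
      std::vector<double> buf (n);

      MPI_Barrier (MPI_COMM_WORLD);
      const double t0 = MPI_Wtime ();
      for (int r = 0; r < reps; ++r)
        if (rank == 0)
          {
            MPI_Send (&buf[0], n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv (&buf[0], n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
          }
        else if (rank == 1)
          {
            MPI_Recv (&buf[0], n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
            MPI_Send (&buf[0], n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
          }
      const double t1 = MPI_Wtime ();

      if (rank == 0)
        std::printf ("%g s per round trip\n", (t1 - t0) / reps);

      MPI_Finalize ();
      return 0;
    }

Run it with two ranks on one node and then on two nodes; the Open MPI transport-selection options Jed describes below also let you isolate the shared-memory path from the TCP path on a single node.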
From: Jed B. <je...@59...> - 2010-08-10 21:27:14
On Tue, 10 Aug 2010 16:11:23 -0500 (CDT), Roy Stogner <roy...@ic...> wrote:

> Even non-modern implementations will do better than Yujie assumed:
> TCP/IP data sent to the local IP address doesn't touch the network
> card, regardless of whether the MPI stack realizes that it's
> sending to someplace local. I'd guess that a modern MPI stack can
> use shared memory locally and avoid RAM-to-RAM copies, but even
> RAM-to-RAM should be much faster than RAM-to-NIC-to-NIC-to-RAM.

Yes, but TCP needs more context switches and copying through kernel space, so the hit is still not trivial. Also, if you have nice network hardware, you can have it do the copy via DMA, which requires no system calls, and your program can do something useful while the messages are delivered. I have heard second-hand reports of this sometimes being faster than the shared-memory transport layer when overlapping communication and computation, but I don't have a concrete example to point to.

It's really easy to experiment with this when using Open MPI (MPICH2 has a similar ability, but it is controlled through environment variables and I don't recall the details). Run with 'mpiexec -mca btl self,tcp' to use TCP even for local communication (as Roy says, this never touches the network device, but it still involves copying through kernel space). The default will include 'sm', which is the dedicated shared-memory transport layer. With InfiniBand, try '-mca btl self,openib' to use the HCA for local communication as well (usually this involves no context switches, since the registered memory is copied directly by the HCA, but it of course means the data has to actually reach the device, which is usually slower than the memory bus).

Jed