From: Derek G. <fri...@gm...> - 2008-06-24 18:30:37
|
Hey guys,

Using Roy's workaround so that partitioning doesn't happen with ParallelMesh, I've been able to run some pretty big problems today, and I thought I would share some numbers. All I'm doing is solving pure diffusion with a Dirichlet BC and a forcing function in 3d on hexes.... but I'm doing it completely matrix free using the NonlinearSystem class. The point of these runs is to do some parallel scaling tests. From previous runs with SerialMesh I knew that I needed more than 2 million dofs to see any kind of good parallel scaling over 128 procs.

Wanting to get good scaling up to 1024 procs, I did some quick calculations (using some small problems on my 8GB-of-RAM desktop) and figured I could fit 80 million DOFs on 128 procs (each proc has 2GB of RAM) using ParallelMesh.... it turns out I was pretty far off! In fact, I was off so far that I ended up bringing down half of our supercomputer as it started swapping like crazy! After rebooting a few nodes, I'm now running a slightly smaller problem at 10 million DOFs. I have it running on 64, 128 and 256 processors (the 512 and 1024 jobs are still in the queue). This gives me an interesting opportunity to look at the memory scaling using ParallelMesh, since I don't have a matrix involved. Here is how much each proc is using:

  #CPU : #MB/proc
   256 : 200-700
   128 : 350-700
    64 : 450-800

First thing to note is that the #MB/proc is a _range_. This range is taken from me watching "top" on one of the compute nodes. The memory usage of each process _oscillates_ between the two numbers listed about every 5 seconds. Based on watching "xosview" I believe that the high numbers occur during communication steps, while the low numbers are during "assembly" (residual computation). At this point these are just guesses, but I've been watching these things for a few weeks now and kind of have a feel for what's going on.

Second thing to note is that the upper number (700, 700, 800) is about the same in each case. This is somewhat of a bummer, since it means that just adding more procs isn't going to let me run an appreciably larger problem. I'm guessing that we have some serialized vectors that at this point are contributing much more memory than the mesh is...

Anyway, I just thought I would share some data with everyone. This is by no means a request for anything to happen... nor is it a bug report or anything of the sort. The fact of the matter is that I will be able to run something like 30 million DOFs on this computer without problem... and that is freaking sweet! That will more than satisfy my goals this year. But I am going to keep an eye out for where some fat might be trimmed along the way... and maybe we can get the memory usage scaling a bit better...

Derek |
From: John P. <jwp...@gm...> - 2008-06-24 18:46:03
|
Hi Derek,

On Tue, Jun 24, 2008 at 1:30 PM, Derek Gaston <fri...@gm...> wrote:
>
> Here is how much each proc is using:
>
>   #CPU : #MB/proc
>    256 : 200-700
>    128 : 350-700
>     64 : 450-800

Thanks! These numbers are pretty interesting. I know there are some places in the code where we go ahead and build a serial std::vector of size global n_elem or n_nodes. I think this happens sometimes on all processors and sometimes on just one, so it's possible we will run out of memory on one CPU way before the others...

> First thing to note is that the #MB/proc is a _range_. This range is
> taken from me watching "top" on one of the compute nodes. The memory
> usage of each process _oscillates_ between the two numbers listed
> about every 5 seconds. Based on watching "xosview" I believe
> that the high numbers occur during communication steps, while the low
> numbers are during "assembly" (residual computation). At this point
> these are just guesses, but I've been watching these things for a few
> weeks now and kind of have a feel for what's going on.

I believe mpip (http://mpip.sourceforge.net/) can report more detailed memory usage, but implementing the profiling does require additional work. I'd be interested to see how uniform the memory usage is across all the procs. We have some people around here who know about mpip, so if you find some strange things happening I might be able to take a closer look.

Also, it sounds like now might be a good time to start thinking about our StructuredMesh implementation again... unless your eventual problem will be unstructured and you don't care. We could probably run some truly enormous matrix-free problems on structured grids....

-- John |
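P.S. By "serial std::vector of size global n_elem" I mean patterns like this (a made-up example, not a specific spot in the code):

  // made-up example of the pattern: every processor allocates storage
  // proportional to the *global* mesh size, not to what it owns
  std::vector<unsigned int> global_elem_map (mesh.n_elem());
  // at 10 million elements that's ~40MB per processor, regardless of
  // how few elements this processor actually owns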
From: Derek G. <fri...@gm...> - 2008-06-24 18:52:49
|
On Jun 24, 2008, at 12:46 PM, John Peterson wrote:

> I believe mpip (http://mpip.sourceforge.net/) can report more detailed
> memory usage, but implementing the profiling does require additional
> work.

Thanks for the tip... I'll check into it when I get a chance (which might be next year!).

> I'd be interested to see how uniform the memory usage is across
> all the procs.

Pretty dang uniform. They basically stay right together.

> Also, it sounds like now might be a good time to start thinking about
> our StructuredMesh implementation again... unless your eventual problem
> will be unstructured and you don't care. We could probably run some
> truly enormous matrix-free problems on structured grids....

We're definitely going to be unstructured. Right now the geometry is just a cylinder... but later it's going to have "cracks" in it... along with moving mesh (which is coming online now) and contact (which we've just started with)... and even more complicated geometries are yet to come.

Derek |
From: Roy S. <ro...@st...> - 2008-06-24 18:50:30
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> Using Roy's workaround so that partitioning doesn't happen with
> ParallelMesh I've been able to run some pretty big problems today, and
> I thought I would share some numbers. All I'm doing is solving pure
> diffusion with a Dirichlet BC and a forcing function in 3d on
> hexes....

First order elements?

> but I'm doing it completely matrix free using the
> NonlinearSystem class.

Thanks; this is interesting stuff. All the numbers I've taken down have been with matrix and preconditioner included, on smaller problems.

> First thing to note is that the #MB/proc is a _range_. This range is
> taken from me watching "top" on one of the compute nodes. The memory
> usage of each process _oscillates_ between the two numbers listed
> about every 5 seconds. Based on watching "xosview" I believe
> that the high numbers occur during communication steps,

Define "communication steps" - I assume you're not talking about synching up ghost DoFs during a solve?

Are you doing mesh refinement at each oscillation? If that's it, then there's two likely culprits, but I don't know if they're sufficient explanation:

We still use global error vectors; these need to be (optionally) moved to a mapvector sort of structure just like the mesh node and element vectors were. A single-precision float per element would be 40MB on each node for 10 million elements.

We use serialized DoF vectors in System::project_vector(). This is even worse because of the double precision; each vector must cost 80MB for 10 million DoFs.

But even that adds up to only 120MB; if you're seeing ~500MB usage jumps, I'm not sure where the rest is coming from. I don't suppose you can pin down exactly where the memory's being allocated?
---
Roy |
From: Derek G. <fri...@gm...> - 2008-06-24 19:02:54
|
On Jun 24, 2008, at 12:50 PM, Roy Stogner wrote:

> First order elements?

Yep.

> Define "communication steps" - I assume you're not talking about
> synching up ghost DoFs during a solve?

I really haven't looked into it. It's more of a "feeling": you can watch the cpus peg out for a few seconds... then they all go crazy and the network traffic spikes. It could be something specific to the way the NonlinearSystem solve progresses... maybe syncing the global solution so that it can be passed into compute_residual()? I don't know.

> Are you doing mesh refinement at each oscillation? If that's it, then
> there's two likely culprits, but I don't know if they're sufficient
> explanation:
>
> We still use global error vectors; these need to be (optionally) moved
> to a mapvector sort of structure just like the mesh node and element
> vectors were. A single-precision float per element would be 40MB on
> each node for 10 million elements.
>
> We use serialized DoF vectors in System::project_vector(). This is
> even worse because of the double precision; each vector must cost 80MB
> for 10 million DoFs.

There is no refinement happening at all (well, beyond the initial uniform refinements to get up to the problem size I want). I'm not even looping around "system.solve()". This is just one big solve.

> But even that adds up to only 120MB; if you're seeing ~500MB usage
> jumps, I'm not sure where the rest is coming from. I don't suppose
> you can pin down exactly where the memory's being allocated?

I'm not exactly sure how to even start down that path. John suggested mpip to get some info out. Do you have an idea of how to find out what the memory swings are? I'm probably just going to start with print statements and see if I can "see" when the memory spikes are occurring.

Derek |
From: Roy S. <ro...@st...> - 2008-06-24 19:05:34
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> I thought I would share some numbers. All I'm doing is solving pure
> diffusion with a Dirichlet BC and a forcing function in 3d on
> hexes.... but I'm doing it completely matrix free using the
> NonlinearSystem class.

Another question: what NonlinearSystem class? If you're using NonlinearImplicitSystem, doesn't that allocate a matrix whether you and PETSc eventually use it or not?
---
Roy |
From: Roy S. <ro...@st...> - 2008-06-24 19:19:24
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> I really haven't looked into it. It's more of a "feeling": you can watch the
> cpus peg out for a few seconds... then they all go crazy and the network
> traffic spikes. It could be something specific to the way the
> NonlinearSystem solve progresses... maybe syncing the global solution so
> that it can be passed into compute_residual()? I don't know.
>
> There is no refinement happening at all (well, beyond the initial uniform
> refinements to get up to the problem size I want). I'm not even looping
> around "system.solve()". This is just one big solve.
>
> I'm probably just going to start with print statements and see if I can "see"
> when the memory spikes are occurring.

I've got a good guess as to where the memory spikes are occurring: check out __libmesh_petsc_snes_residual() in petsc_nonlinear_solver.C. Is that X_local a serial vector?

I think I noticed this problem and tried to avoid it in petsc_diff_solver.C; you might cut and paste some of that code into Ben's solver to see if using my "swap vectors, System::update" localization works any better.
---
Roy |
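P.S. If I'm remembering the code right, that callback localizes the incoming PETSc vector into a full-length serial copy before calling the user residual - something like this (paraphrasing from memory, not the exact source):

  // paraphrased sketch of __libmesh_petsc_snes_residual() -- not the exact code
  PetscVector<Number> X_global(x), R(r);

  // a full-length local copy: every processor holds all n_dofs values,
  // so at 10 million DoFs that's ~80MB of temporary storage per process
  PetscVector<Number> X_local(X_global.size());
  X_global.localize(X_local);

  solver->residual(X_local, R);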
From: Derek G. <fri...@gm...> - 2008-06-24 19:23:06
|
On Jun 24, 2008, at 1:19 PM, Roy Stogner wrote:

> I've got a good guess as to where the memory spikes are occurring:
> check out __libmesh_petsc_snes_residual() in petsc_nonlinear_solver.C.
> Is that X_local a serial vector?

Good catch Roy... that's along the lines of what I was thinking.

> I think I noticed this problem and tried to avoid it in
> petsc_diff_solver.C; you might cut and paste some of that code into
> Ben's solver to see if using my "swap vectors, System::update"
> localization works any better.

I will try that. Thanks!

Derek |
From: Benjamin K. <ben...@na...> - 2008-06-24 19:27:39
|
>> I've got a good guess as to where the memory spikes are occurring:
>> check out __libmesh_petsc_snes_residual() in petsc_nonlinear_solver.C.
>> Is that X_local a serial vector?
>
> Good catch Roy... that's along the lines of what I was thinking.
>
>> I think I noticed this problem and tried to avoid it in
>> petsc_diff_solver.C; you might cut and paste some of that code into
>> Ben's solver to see if using my "swap vectors, System::update"
>> localization works any better.
>
> I will try that. Thanks!

Please advise of the outcome.

Big picture, the concept of serialized vectors needs to be wholesale replaced with vectors + ghost padding. PETSc offers a way to do this, and Trilinos has something similar. What I think is needed is that we implement this functionality in DistributedVector<> and then derive PetscVector<>/EpetraVector<> from this class.

-Ben |
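P.S. The PETSc facility I have in mind is (roughly) its ghosted vectors - this is just a sketch of the concept, not how we would actually wire it into the library, and the index values are made up:

  /* rough sketch of PETSc's ghosted vectors -- the indices here are made up */
  PetscInt n_local  = 1000;            /* entries owned by this processor */
  PetscInt ghosts[] = {5, 17, 23};     /* global indices owned elsewhere  */
  Vec v;

  VecCreateGhost (PETSC_COMM_WORLD, n_local, PETSC_DECIDE,
                  3, ghosts, &v);

  /* ... compute with the locally owned entries of v ... */

  /* pull current values of the ghost entries over from their owners */
  VecGhostUpdateBegin (v, INSERT_VALUES, SCATTER_FORWARD);
  VecGhostUpdateEnd   (v, INSERT_VALUES, SCATTER_FORWARD);

  /* the "local form" exposes owned + ghost entries contiguously */
  Vec v_local;
  VecGhostGetLocalForm     (v, &v_local);
  /* ... read ghost values through v_local ... */
  VecGhostRestoreLocalForm (v, &v_local);

  VecDestroy (v);   /* PETSc 2.x-era call signature */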
From: Roy S. <ro...@st...> - 2008-06-24 19:38:05
|
On Tue, 24 Jun 2008, Benjamin Kirk wrote:

> Big picture, the concept of serialized vectors needs to be wholesale
> replaced with vectors + ghost padding. PETSc offers a way to do this, and
> Trilinos has something similar.

Keep in mind, for anywhere we still need to use serialized vectors (Derek's case not included), the padding's going to have to be more complicated than just "include ghost dofs". For that HP selection code of mine, for example, you might need to go several levels of elements out from the current partition to get enough DoFs to do a proper coarsening, and for System::project_vector() there may be no relation between the current dof numbering and the numbering of the old dofs we're trying to project onto them.

> What I think is needed is that we implement this functionality in
> DistributedVector<> and then derive PetscVector<>/EpetraVector<> from this
> class.

Assuming you mean NumericVector not DistributedVector, that sounds like an excellent idea.
---
Roy |
From: Benjamin K. <ben...@na...> - 2008-06-24 19:44:01
|
> Assuming you mean NumericVector not DistributedVector, that sounds
> like an excellent idea

Actually, I meant DistributedVector<>, and the inheritance would change. But your point is well taken. The implementation could just as easily be done in NumericVector<>, and then the DistributedVector<> would become obsolete. It could be retained for compatibility, but wouldn't need to implement anything.

-Ben |
From: Roy S. <ro...@st...> - 2008-06-24 19:51:50
|
On Tue, 24 Jun 2008, Benjamin Kirk wrote:

>> Assuming you mean NumericVector not DistributedVector, that sounds
>> like an excellent idea
>
> Actually, I meant DistributedVector<>, and the inheritance would change.
> But your point is well taken. The implementation could just as easily be
> done in NumericVector<>, and then the DistributedVector<> would become
> obsolete. It could be retained for compatibility, but wouldn't need to
> implement anything.

Well, actually I'd like to see DistributedVector retained as a leaf class, even if we refactor the inheritance to supply new support for distributed ghost dofs in the "trunk" above NumericVector. Yay PETSc, yay Trilinos, but I still kind of like the idea of having a bare-bones parallel vector implementation that doesn't require a large third party package to be compiled in.
---
Roy |
From: John P. <jwp...@gm...> - 2008-06-24 19:59:16
|
On Tue, Jun 24, 2008 at 2:51 PM, Roy Stogner <ro...@st...> wrote:
>
> On Tue, 24 Jun 2008, Benjamin Kirk wrote:
>
>>> Assuming you mean NumericVector not DistributedVector, that sounds
>>> like an excellent idea
>>
>> Actually, I meant DistributedVector<>, and the inheritance would change.
>> But your point is well taken. The implementation could just as easily be
>> done in NumericVector<>, and then the DistributedVector<> would become
>> obsolete. It could be retained for compatibility, but wouldn't need to
>> implement anything.
>
> Well, actually I'd like to see DistributedVector retained as a leaf
> class, even if we refactor the inheritance to supply new support for
> distributed ghost dofs in the "trunk" above NumericVector. Yay PETSc,
> yay Trilinos, but I still kind of like the idea of having a bare-bones
> parallel vector implementation that doesn't require a large third
> party package to be compiled in.

I agree. I've always thought it would be cool to have a home-grown DistributedMatrix to go with it as well (LibMesh's own SparseMatrix implementation) so I'd like to see it stay as a leaf class if possible... Now that so many supercomputer sites have PETSc or Trilinos installed this doesn't seem very necessary, but as LibMesh is getting packaged by other distros, it would be cool if it had some parallel linear algebra abilities "out of the box".

-- John |
From: Benjamin K. <ben...@na...> - 2008-06-24 20:11:46
|
> I agree. I've always thought it would be cool to have a home-grown
> DistributedMatrix to go with it as well (LibMesh's own SparseMatrix
> implementation) so I'd like to see it stay as a leaf class if
> possible...

That would be my intent. But by pushing the majority of the implementation into NumericVector<> I would bet the derived PetscVector<>/EpetraVector<> classes could get a lot smaller. I don't know if anything would be specialized in DistributedVector<> any more, but keeping it as a class derived from NumericVector<> is not a problem.

> Now that so many supercomputer sites have PETSc or Trilinos installed
> this doesn't seem very necessary, but as LibMesh is getting packaged
> by other distros, it would be cool if it had some parallel linear
> algebra abilities "out of the box".

My mind heads down this path every so often too... Who wants to handle the preconditioning part? ;-)

Thanks for the quadrature rule updates!

-Ben |
From: Derek G. <fri...@gm...> - 2008-06-24 19:20:45
|
On Jun 24, 2008, at 1:05 PM, Roy Stogner wrote:

> Another question: what NonlinearSystem class? If you're using
> NonlinearImplicitSystem, doesn't that allocate a matrix whether you
> and PETSc eventually use it or not?

Yes, NonlinearImplicitSystem.

Crap... I just went and dug down into ImplicitSystem... and you're right! There is a matrix that's getting created and sized! Hmmm.... we might have to turn that off somehow...

Derek |
From: John P. <jwp...@gm...> - 2008-06-24 19:25:26
|
On Tue, Jun 24, 2008 at 2:20 PM, Derek Gaston <fri...@gm...> wrote:
> On Jun 24, 2008, at 1:05 PM, Roy Stogner wrote:
>
>> Another question: what NonlinearSystem class? If you're using
>> NonlinearImplicitSystem, doesn't that allocate a matrix whether you
>> and PETSc eventually use it or not?
>
> Yes, NonlinearImplicitSystem.
>
> Crap... I just went and dug down into ImplicitSystem... and you're
> right! There is a matrix that's getting created and sized! Hmmm....
> we might have to turn that off somehow...

If it's getting repeatedly created and destroyed, that might explain the big swings in mem. usage. But if it just gets created once and then tossed out at the end of the run, I'm not sure it would cause this. OTOH, Derek, this may be why your "back of the envelope" problem size calculation caused the machines to swap ;-)

-- John |
From: Benjamin K. <ben...@na...> - 2008-06-24 19:29:13
|
> OTOH, Derek, this may be why your "back of the envelope" problem size
> calculation caused the machines to swap ;-)

FWIW, I have no swap on my compute nodes. (No disk, for that matter.) I'd rather have a memory allocation request that does not fit in RAM kill the process than swap!!

-ben |
From: Derek G. <fri...@gm...> - 2008-06-24 19:32:56
|
On Jun 24, 2008, at 1:29 PM, Benjamin Kirk wrote:

> FWIW, I have no swap on my compute nodes. (No disk, for that matter.)
> I'd rather have a memory allocation request that does not fit in RAM
> kill the process than swap!!

I completely agree with this. I've even gone to having no (or very little) swap on my desktop for the same reason. If I'm ever swapping I've done something horribly wrong... and it just needs to be killed. With more than 2 or 4 gigs of RAM there should never be a _good_ reason to swap.

Derek |
From: Derek G. <fri...@gm...> - 2008-06-24 21:32:39
|
On Jun 24, 2008, at 1:19 PM, Roy Stogner wrote:

> I think I noticed this problem and tried to avoid it in
> petsc_diff_solver.C; you might cut and paste some of that code into
> Ben's solver to see if using my "swap vectors, System::update"
> localization works any better.

Ok - I've tried to implement this, but I must be off somewhere. What I've done is add a _system reference inside of NonlinearSolver (the same way there is one in DiffSolver) and then I'm using code like this inside of __libmesh_petsc_snes_residual:

  PetscVector<Number> X_global(x), R(r);
  PetscVector<Number>& X_local =
    *dynamic_cast<PetscVector<Number>*>(sys.solution.get());

  X_global.swap(X_local);
  sys.update();
  X_global.swap(X_local);

  solver->residual (X_global, R);

This seems to work ok in 2D... but in 3D it is generating NaNs for some reason. Someone please point out my logic troubles...

Derek |
From: Roy S. <ro...@st...> - 2008-06-24 22:14:53
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> Ok - I've tried to implement this, but I must be off somewhere. What I've
> done is add a _system reference inside of NonlinearSolver (the same way there
> is one in DiffSolver) and then I'm using code like this inside of
> __libmesh_petsc_snes_residual:
>
>   PetscVector<Number> X_global(x), R(r);
>   PetscVector<Number>& X_local =
>     *dynamic_cast<PetscVector<Number>*>(sys.solution.get());
>
>   X_global.swap(X_local);
>   sys.update();
>   X_global.swap(X_local);
>
>   solver->residual (X_global, R);
>
> This seems to work ok in 2D... but in 3D it is generating NaNs for some
> reason. Someone please point out my logic troubles...

Very strange. I can't see why the dimensionality would make a difference (particularly on a conforming mesh), but I must point out that my own brief tests of PetscDiffSolver were all on 2D problems; once I found out that my NewtonSolver was performing better, I lost interest in trying to tweak SNES options for 3D.

Here's what I'm trying to figure out, though: the DiffSystem::assembly() call in PetscDiffSolver uses current_local_solution to plug into the weighted residual equations, thus ensuring that DoFs which are owned by another processor have correct values. How does that work here? When you swap the solution out again, the current_local_solution doesn't come with it - you're still passing to residual() a vector that only has locally owned DoFs, right?

Also: Ben, is current_local_solution a serial vector? It certainly looks that way to me, but I've always been a little confused by the solution/current_local_solution divide, so maybe I'm missing some sort of PETSc magic under the hood.
---
Roy |
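P.S. If that is indeed the problem, my completely untested guess would be to hand residual() the freshly-updated local copy instead of the swapped-back X_global - something like the following, though I don't know what Ben's residual() interface really expects:

  // untested guess, not a verified fix: pass the localized copy that
  // System::update() just refreshed, instead of the swapped-back X_global
  X_global.swap(X_local);
  sys.update();                 // refreshes sys.current_local_solution from sys.solution
  X_global.swap(X_local);

  solver->residual (*sys.current_local_solution, R);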
From: Benjamin K. <ben...@na...> - 2008-06-25 20:55:07
|
> Also: Ben, is current_local_solution a serial vector? It certainly
> looks that way to me, but I've always been a little confused by the
> solution/current_local_solution divide, so maybe I'm missing some sort
> of PETSc magic under the hood.

In the beginning... I understood a little about parallel computing and less about PETSc.

"solution" is a parallel vector, and "current_local_solution" is a serialized version of a *subset* of "solution". Thus it is full-sized, but only a fraction of the entries are correctly populated from remote processors.

Now, let "global" and "local" be, uh, global and local vectors, respectively. Calling

  global.localize(local);

is kinda like an allreduce - at the end each processor has a complete picture of the global vector stored in local.

  global.localize(local, index_list); // this is what is done in
                                      // System::update()

copies only the subset of "global" to "local" as specified by "index_list".

I'm not sure how PETSc is going about implementing these sends, but I bet in the former case the underlying memory allocations when you get to 100s of processors are what you are seeing.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Where I want this to go is something like the following:

  DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);

where "vec" is a vector which stores n_local entries, and also has "ghost storage" for array_of_remote_indices.size() entries. You then go about your business, computing with "vec" and performing operations which update the local entries, and when you need to get synchronized copies of the remote values you call something like

  vec.localize();

This would effectively eliminate current_local_solution -- everything would be done in-place from solution.

-Ben |
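P.S. Very roughly, the kind of interface I'm picturing - nothing like this exists in the library yet, and every name and signature here is purely illustrative:

  // purely illustrative sketch -- nothing like this exists yet
  #include <vector>

  template <typename T>
  class DistributedVector
  {
  public:
    // n_local owned entries, plus ghost storage for the listed remote indices
    DistributedVector (const unsigned int n_local,
                       const std::vector<unsigned int> &remote_indices)
      : _values(n_local + remote_indices.size()),
        _ghost_indices(remote_indices) {}

    // pull fresh copies of the ghost entries from their owning processors;
    // everything stays in place, so no separate current_local_solution needed
    void localize ();

  private:
    std::vector<T> _values;                    // owned values followed by ghost padding
    std::vector<unsigned int> _ghost_indices;  // global ids of the ghost entries
  };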
From: Roy S. <ro...@st...> - 2008-06-25 21:09:13
|
On Wed, 25 Jun 2008, Benjamin Kirk wrote:

> DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);
>
> Where "vec" is a vector which stores n_local entries, and also has "ghost
> storage" for array_of_remote_indices.size() entries.

We'd also need an "unsigned int first_local_index" argument for the offset of n_local, right? In any case that sounds like an excellent idea.

We'll also want to have some higher level APIs where array_of_remote_indices is implied to come from the system's send_list, of course, but if we get the lower level stuff right first that'll make it easier to handle the exceptional cases like System::project_vector.

> You then go about your business, computing with "vec" and performing
> operations which update the local entries, and when you need to get
> synchronized copies of the remote values you call something like
>
> vec.localize();
>
> This would effectively eliminate current_local_solution -- everything would
> be done in-place from solution.

Sounds good.

While you're explaining the fundamentals of our parallel vector structures to me, could you explain the VecAssemblyBegin / VecAssemblyEnd pair in PetscVector::close? That's what does all the communication required by VecSetValues, right? What's the reason for the dual API call; does the communication get started asynchronously by Begin() and then End() blocks waiting for completion?
---
Roy |
From: Benjamin K. <ben...@na...> - 2008-06-25 21:21:54
|
>> DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);
>>
>> Where "vec" is a vector which stores n_local entries, and also has "ghost
>> storage" for array_of_remote_indices.size() entries.
>
> We'd also need an "unsigned int first_local_index" argument for the
> offset of n_local, right? In any case that sounds like an excellent
> idea.

Is there ever a case where you need it? Was thinking it can always be computed from the partial sums of n_local over all the processors lower than you. Similarly, n_global really would not be used except to assert that it is the same as the sum of n_local.

...

> While you're explaining the fundamentals of our parallel vector
> structures to me, could you explain the VecAssemblyBegin /
> VecAssemblyEnd pair in PetscVector::close? That's what does all the
> communication required by VecSetValues, right? What's the reason for
> the dual API call; does the communication get started asynchronously
> by Begin() and then End() blocks waiting for completion?

That is absolutely the case. I fretted for a while about implementing them separately, but I feared there would be a million places you could get out of sync. For example, you call vec.close_start() at the end of your matrix assembly routine, then vec.close_finish() before the linear solver? What computation would you want to slip between those two? I figured we'd be setting ourselves up for a most frequently-asked question, and it doesn't seem to have hurt anything to date...

-Ben |
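P.S. In other words, close() just makes the two calls back to back - something like this sketch (not the verbatim source, and without the error checking the real code has):

  // sketch of the idea -- not the verbatim libMesh source
  template <typename T>
  void PetscVector<T>::close ()
  {
    // starts the communication for any off-processor VecSetValues() entries...
    VecAssemblyBegin (_vec);
    // ...and blocks until every processor has received its values
    VecAssemblyEnd (_vec);
  }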
From: Roy S. <ro...@st...> - 2008-06-25 21:36:08
|
On Wed, 25 Jun 2008, Benjamin Kirk wrote:

>>> DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);
>>>
>>> Where "vec" is a vector which stores n_local entries, and also has "ghost
>>> storage" for array_of_remote_indices.size() entries.
>>
>> We'd also need an "unsigned int first_local_index" argument for the
>> offset of n_local, right? In any case that sounds like an excellent
>> idea.
>
> Is there ever a case where you need it? Was thinking it can always be
> computed from the partial sums of n_local over all the processors lower than
> you.

Hmm... good point. But since we already do that operation in the DofMap, we could probably avoid doing it a second time (except in debug mode, where we'd double-check) by making vector constructors more explicit. It's not a big communication, but it is all-to-all.

> Similarly, n_global really would not be used except to assert that it is the
> same as the sum of n_local.

In that case let's leave it out too; forcing n_global to be supplied but then computing local offsets inside the constructor pretty much forces redundant communication, I think.

>> While you're explaining the fundamentals of our parallel vector
>> structures to me, could you explain the VecAssemblyBegin /
>> VecAssemblyEnd pair in PetscVector::close? That's what does all the
>> communication required by VecSetValues, right? What's the reason for
>> the dual API call; does the communication get started asynchronously
>> by Begin() and then End() blocks waiting for completion?
>
> That is absolutely the case. I fretted for a while about implementing them
> separately, but I feared there would be a million places you could get out
> of sync. For example, you call vec.close_start() at the end of your matrix
> assembly routine, then vec.close_finish() before the linear solver? What
> computation would you want to slip between those two? I figured we'd be
> setting ourselves up for a most frequently-asked question, and it doesn't
> seem to have hurt anything to date...

I just went through the code looking for places where we could profitably use a two-part close, and the best thing I could come up with is that it might potentially be a more efficient way to close multiple vectors/matrices at once. But that's not worth futzing with the API for; if someone ever does find some computation to usefully slip in between assembly and solve, we'll add close_start() and close_finish() member functions then.

If the PETSc people went to all the trouble of adding asynchronous communication to their assembly code, were they smart enough to start that communication as soon as possible? Because there is one bit of computation that could definitely be started before the start of the assembly has been communicated: the rest of the assembly.
---
Roy |
From: Vijay M <vi...@gm...> - 2008-06-25 23:42:47
|
> I just went through the code looking for places where we could
> profitably use a two-part close, and the best thing I could come up
> with is that it might potentially be a more efficient way to close
> multiple vectors/matrices at once. But that's not worth futzing with
> the API for; if someone ever does find some computation to usefully
> slip in between assembly and solve, we'll add close_start() and
> close_finish() member functions then.

I think you could provide a hook, using a function pointer, to be called between the assembly_begin and assembly_end calls. And making this an optional argument would also ensure backward compatibility. Just a thought! |
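Something along these lines, say - just a sketch of the idea, no such overload exists in the library today:

  // sketch only -- no such overload exists in libMesh today
  template <typename T>
  void PetscVector<T>::close (void (*work_between)(void *) /* = NULL */,
                              void *context               /* = NULL */)
  {
    VecAssemblyBegin (_vec);

    // optional user-supplied work, overlapped with the in-flight communication
    if (work_between)
      work_between (context);

    VecAssemblyEnd (_vec);
  }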