From: Derek G. <fri...@gm...> - 2008-06-24 18:30:37
|
Hey guys,

Using Roy's workaround so that partitioning doesn't happen with ParallelMesh, I've been able to run some pretty big problems today, and I thought I would share some numbers. All I'm doing is solving pure diffusion with a Dirichlet BC and a forcing function in 3d on hexes.... but I'm doing it completely matrix free using the NonlinearSystem class. The point of these runs is to do some parallel scaling tests. From previous runs with SerialMesh I knew that I needed more than 2 million dofs to see any kind of good parallel scaling over 128 procs.

Wanting to get good scaling up to 1024 procs, I did some quick calculations (using some small problems on my 8GB-of-RAM desktop) and figured I could fit 80 million DOFs on 128 procs (each proc has 2GB of RAM) using ParallelMesh.... it turns out I was pretty far off! In fact, I was off so far that I ended up bringing down half of our supercomputer as it started swapping like crazy! After rebooting a few nodes, I'm now running a slightly smaller problem at 10 million DOFs. I have it running on 64, 128 and 256 processors (the 512 and 1024 jobs are still in the queue). This gives me an interesting opportunity to look at the memory scaling using ParallelMesh, since I don't have a matrix involved. Here is how much each proc is using:

  #CPU : #MB/proc
   256 : 200-700
   128 : 350-700
    64 : 450-800

First thing to note is that the #MB/proc is a _range_. This range is taken from me watching "top" on one of the compute nodes. The memory usage of each process _oscillates_ between the two numbers listed about every 5 seconds. Based on watching "xosview" I believe that the high numbers occur during communication steps, while the low numbers are during "assembly" (residual computation). At this point these are just guesses, but I've been watching these things for a few weeks now and kind of have a feel for what's going on.

Second thing to note is that the upper number (700, 700, 800) is about the same in each case. This is somewhat of a bummer, since it means that just adding more procs isn't going to let me run an appreciably larger problem. I'm guessing that we have some serialized vectors that at this point are contributing much more memory than the mesh is...

Anyway, I just thought I would share some data with everyone. This is by no means a request for anything to happen... nor is it a bug report or anything of the sort. The fact of the matter is that I will be able to run something like 30 million DOFs on this computer without problem... and that is freaking sweet! That will more than satisfy my goals this year. But I am going to keep an eye out for where some fat might be trimmed along the way... and maybe we can get the memory usage scaling a bit better...

Derek |
From: John P. <jwp...@gm...> - 2008-06-24 18:46:03
|
Hi Derek,

On Tue, Jun 24, 2008 at 1:30 PM, Derek Gaston <fri...@gm...> wrote:
>
> Here is how much each proc is using:
>
>   #CPU : #MB/proc
>    256 : 200-700
>    128 : 350-700
>     64 : 450-800

Thanks! These numbers are pretty interesting. I know there are some places in the code where we go ahead and build a serial std::vector of size global n_elem or n_nodes. I think this happens sometimes on all processors and sometimes on just one, so it's possible we will run out of memory on one CPU way before the others...

> First thing to note is that the #MB/proc is a _range_. This range is
> taken from me watching "top" on one of the compute nodes. The memory
> usage of each process _oscillates_ between the two numbers listed
> about every 5 seconds. Based on watching "xosview" I believe
> that the high numbers occur during communication steps, while the low
> numbers are during "assembly" (residual computation). At this point
> these are just guesses, but I've been watching these things for a few
> weeks now and kind of have a feel for what's going on.

I believe mpip (http://mpip.sourceforge.net/) can report more detailed memory usage, but implementing the profiling does require additional work. I'd be interested to see how uniform the memory usage is across all the procs. We have some people around here who know about mpip, so if you find some strange things happening I might be able to take a closer look.

Also, it sounds like now might be a good time to start thinking about our StructuredMesh implementation again... unless your eventual problem will be unstructured and you don't care. We could probably run some truly enormous matrix-free problems on structured grids....

-- John |
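P.S. By "serial std::vector of size global n_elem" I mean patterns like this (a made-up example, not a specific spot in the code):

  // made-up example of the pattern: every processor allocates storage
  // proportional to the *global* mesh size, not to what it owns
  std::vector<unsigned int> global_elem_map (mesh.n_elem());
  // at 10 million elements that's ~40MB per processor, regardless of
  // how few elements this processor actually owns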
From: Derek G. <fri...@gm...> - 2008-06-24 18:52:49
|
On Jun 24, 2008, at 12:46 PM, John Peterson wrote:

> I believe mpip (http://mpip.sourceforge.net/) can report more detailed
> memory usage, but implementing the profiling does require additional
> work.

Thanks for the tip... I'll check into it when I get a chance (which might be next year!).

> I'd be interested to see how uniform the memory usage is across
> all the procs.

Pretty dang uniform. They basically stay right together.

> Also, it sounds like now might be a good time to start thinking about
> our StructuredMesh implementation again... unless your eventual problem
> will be unstructured and you don't care. We could probably run some
> truly enormous matrix-free problems on structured grids....

We're definitely going to be unstructured. Right now the geometry is just a cylinder... but later it's going to have "cracks" in it... along with moving mesh (which is coming online now) and contact (which we've just started with)... and even more complicated geometries are yet to come.

Derek |
From: Roy S. <ro...@st...> - 2008-06-24 18:50:30
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> Using Roy's workaround so that partitioning doesn't happen with
> ParallelMesh I've been able to run some pretty big problems today, and
> I thought I would share some numbers. All I'm doing is solving pure
> diffusion with a Dirichlet BC and a forcing function in 3d on
> hexes....

First order elements?

> but I'm doing it completely matrix free using the
> NonlinearSystem class.

Thanks; this is interesting stuff. All the numbers I've taken down have been with matrix and preconditioner included, on smaller problems.

> First thing to note is that the #MB/proc is a _range_. This range is
> taken from me watching "top" on one of the compute nodes. The memory
> usage of each process _oscillates_ between the two numbers listed
> about every 5 seconds. Based on watching "xosview" I believe
> that the high numbers occur during communication steps,

Define "communication steps" - I assume you're not talking about synching up ghost DoFs during a solve?

Are you doing mesh refinement at each oscillation? If that's it, then there's two likely culprits, but I don't know if they're sufficient explanation:

We still use global error vectors; these need to be (optionally) moved to a mapvector sort of structure just like the mesh node and element vectors were. A single-precision float per element would be 40MB on each node for 10 million elements.

We use serialized DoF vectors in System::project_vector(). This is even worse because of the double precision; each vector must cost 80MB for 10 million DoFs.

But even that adds up to only 120MB; if you're seeing ~500MB usage jumps, I'm not sure where the rest is coming from. I don't suppose you can pin down exactly where the memory's being allocated?
---
Roy |
From: Derek G. <fri...@gm...> - 2008-06-24 19:02:54
|
On Jun 24, 2008, at 12:50 PM, Roy Stogner wrote:

> First order elements?

Yep.

> Define "communication steps" - I assume you're not talking about
> synching up ghost DoFs during a solve?

I really haven't looked into it. It's more of a "feeling": you can watch the cpus peg out for a few seconds... then they all go crazy and the network traffic spikes. It could be something specific to the way the NonlinearSystem solve progresses... maybe syncing the global solution so that it can be passed into compute_residual()? I don't know.

> Are you doing mesh refinement at each oscillation? If that's it, then
> there's two likely culprits, but I don't know if they're sufficient
> explanation:
>
> We still use global error vectors; these need to be (optionally) moved
> to a mapvector sort of structure just like the mesh node and element
> vectors were. A single-precision float per element would be 40MB on
> each node for 10 million elements.
>
> We use serialized DoF vectors in System::project_vector(). This is
> even worse because of the double precision; each vector must cost 80MB
> for 10 million DoFs.

There is no refinement happening at all (well, beyond the initial uniform refinements to get up to the problem size I want). I'm not even looping around "system.solve()". This is just one big solve.

> But even that adds up to only 120MB; if you're seeing ~500MB usage
> jumps, I'm not sure where the rest is coming from. I don't suppose
> you can pin down exactly where the memory's being allocated?

I'm not exactly sure how to even start down that path. John suggested mpip to get some info out. Do you have an idea of how to find out what the memory swings are? I'm probably just going to start with print statements and see if I can "see" when the memory spikes are occurring.

Derek |
From: Roy S. <ro...@st...> - 2008-06-24 19:05:34
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> I thought I would share some numbers. All I'm doing is solving pure
> diffusion with a Dirichlet BC and a forcing function in 3d on
> hexes.... but I'm doing it completely matrix free using the
> NonlinearSystem class.

Another question: what NonlinearSystem class? If you're using NonlinearImplicitSystem, doesn't that allocate a matrix whether you and PETSc eventually use it or not?
---
Roy |
From: Roy S. <ro...@st...> - 2008-06-24 19:19:24
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> I really haven't looked into it. It's more of a "feeling": you can watch the
> cpus peg out for a few seconds... then they all go crazy and the network
> traffic spikes. It could be something specific to the way the
> NonlinearSystem solve progresses... maybe syncing the global solution so
> that it can be passed into compute_residual()? I don't know.
>
> There is no refinement happening at all (well, beyond the initial uniform
> refinements to get up to the problem size I want). I'm not even looping
> around "system.solve()". This is just one big solve.
>
> I'm probably just going to start with print statements and see if I can "see"
> when the memory spikes are occurring.

I've got a good guess as to where the memory spikes are occurring: check out __libmesh_petsc_snes_residual() in petsc_nonlinear_solver.C. Is that X_local a serial vector?

I think I noticed this problem and tried to avoid it in petsc_diff_solver.C; you might cut and paste some of that code into Ben's solver to see if using my "swap vectors, System::update" localization works any better.
---
Roy |
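P.S. If I'm remembering the code right, that callback localizes the incoming PETSc vector into a full-length serial copy before calling the user residual - something like this (paraphrasing from memory, not the exact source):

  // paraphrased sketch of __libmesh_petsc_snes_residual() -- not the exact code
  PetscVector<Number> X_global(x), R(r);

  // a full-length local copy: every processor holds all n_dofs values,
  // so at 10 million DoFs that's ~80MB of temporary storage per process
  PetscVector<Number> X_local(X_global.size());
  X_global.localize(X_local);

  solver->residual(X_local, R);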
From: Derek G. <fri...@gm...> - 2008-06-24 19:23:06
|
On Jun 24, 2008, at 1:19 PM, Roy Stogner wrote:

> I've got a good guess as to where the memory spikes are occurring:
> check out __libmesh_petsc_snes_residual() in petsc_nonlinear_solver.C.
> Is that X_local a serial vector?

Good catch Roy... that's along the lines of what I was thinking.

> I think I noticed this problem and tried to avoid it in
> petsc_diff_solver.C; you might cut and paste some of that code into
> Ben's solver to see if using my "swap vectors, System::update"
> localization works any better.

I will try that. Thanks!

Derek |
From: Benjamin K. <ben...@na...> - 2008-06-24 19:27:39
|
>> I've got a good guess as to where the memory spikes are occurring:
>> check out __libmesh_petsc_snes_residual() in petsc_nonlinear_solver.C.
>> Is that X_local a serial vector?
>
> Good catch Roy... that's along the lines of what I was thinking.
>
>> I think I noticed this problem and tried to avoid it in
>> petsc_diff_solver.C; you might cut and paste some of that code into
>> Ben's solver to see if using my "swap vectors, System::update"
>> localization works any better.
>
> I will try that. Thanks!

Please advise of the outcome.

Big picture, the concept of serialized vectors needs to be wholesale replaced with vectors + ghost padding. PETSc offers a way to do this, and Trilinos has something similar. What I think is needed is that we implement this functionality in DistributedVector<> and then derive PetscVector<>/EpetraVector<> from this class.

-Ben |
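P.S. The PETSc facility I have in mind is (roughly) its ghosted vectors - this is just a sketch of the concept, not how we would actually wire it into the library, and the index values are made up:

  /* rough sketch of PETSc's ghosted vectors -- the indices here are made up */
  PetscInt n_local  = 1000;            /* entries owned by this processor */
  PetscInt ghosts[] = {5, 17, 23};     /* global indices owned elsewhere  */
  Vec v;

  VecCreateGhost (PETSC_COMM_WORLD, n_local, PETSC_DECIDE,
                  3, ghosts, &v);

  /* ... compute with the locally owned entries of v ... */

  /* pull current values of the ghost entries over from their owners */
  VecGhostUpdateBegin (v, INSERT_VALUES, SCATTER_FORWARD);
  VecGhostUpdateEnd   (v, INSERT_VALUES, SCATTER_FORWARD);

  /* the "local form" exposes owned + ghost entries contiguously */
  Vec v_local;
  VecGhostGetLocalForm     (v, &v_local);
  /* ... read ghost values through v_local ... */
  VecGhostRestoreLocalForm (v, &v_local);

  VecDestroy (v);   /* PETSc 2.x-era call signature */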
From: Roy S. <ro...@st...> - 2008-06-24 19:38:05
|
On Tue, 24 Jun 2008, Benjamin Kirk wrote:

> Big picture, the concept of serialized vectors needs to be wholesale
> replaced with vectors + ghost padding. PETSc offers a way to do this, and
> Trilinos has something similar.

Keep in mind, for anywhere we still need to use serialized vectors (Derek's case not included), the padding's going to have to be more complicated than just "include ghost dofs". For that HP selection code of mine, for example, you might need to go several levels of elements out from the current partition to get enough DoFs to do a proper coarsening, and for System::project_vector() there may be no relation between the current dof numbering and the numbering of the old dofs we're trying to project onto them.

> What I think is needed is that we implement this functionality in
> DistributedVector<> and then derive PetscVector<>/EpetraVector<> from this
> class.

Assuming you mean NumericVector not DistributedVector, that sounds like an excellent idea.
---
Roy |
From: Benjamin K. <ben...@na...> - 2008-06-24 19:44:01
|
> Assuming you mean NumericVector not DistributedVector, that sounds
> like an excellent idea

Actually, I meant DistributedVector<>, and the inheritance would change. But your point is well taken. The implementation could just as easily be done in NumericVector<>, and then the DistributedVector<> would become obsolete. It could be retained for compatibility, but wouldn't need to implement anything.

-Ben |
From: Roy S. <ro...@st...> - 2008-06-24 19:51:50
|
On Tue, 24 Jun 2008, Benjamin Kirk wrote:

>> Assuming you mean NumericVector not DistributedVector, that sounds
>> like an excellent idea
>
> Actually, I meant DistributedVector<>, and the inheritance would change.
> But your point is well taken. The implementation could just as easily be
> done in NumericVector<>, and then the DistributedVector<> would become
> obsolete. It could be retained for compatibility, but wouldn't need to
> implement anything.

Well, actually I'd like to see DistributedVector retained as a leaf class, even if we refactor the inheritance to supply new support for distributed ghost dofs in the "trunk" above NumericVector. Yay PETSc, yay Trilinos, but I still kind of like the idea of having a bare-bones parallel vector implementation that doesn't require a large third party package to be compiled in.
---
Roy |
From: John P. <jwp...@gm...> - 2008-06-24 19:59:16
|
On Tue, Jun 24, 2008 at 2:51 PM, Roy Stogner <ro...@st...> wrote:
>
> On Tue, 24 Jun 2008, Benjamin Kirk wrote:
>
>>> Assuming you mean NumericVector not DistributedVector, that sounds
>>> like an excellent idea
>>
>> Actually, I meant DistributedVector<>, and the inheritance would change.
>> But your point is well taken. The implementation could just as easily be
>> done in NumericVector<>, and then the DistributedVector<> would become
>> obsolete. It could be retained for compatibility, but wouldn't need to
>> implement anything.
>
> Well, actually I'd like to see DistributedVector retained as a leaf
> class, even if we refactor the inheritance to supply new support for
> distributed ghost dofs in the "trunk" above NumericVector. Yay PETSc,
> yay Trilinos, but I still kind of like the idea of having a bare-bones
> parallel vector implementation that doesn't require a large third
> party package to be compiled in.

I agree. I've always thought it would be cool to have a home-grown DistributedMatrix to go with it as well (LibMesh's own SparseMatrix implementation) so I'd like to see it stay as a leaf class if possible... Now that so many supercomputer sites have PETSc or Trilinos installed this doesn't seem very necessary, but as LibMesh is getting packaged by other distros, it would be cool if it had some parallel linear algebra abilities "out of the box".

-- John |
From: Benjamin K. <ben...@na...> - 2008-06-24 20:11:46
|
> I agree. I've always thought it would be cool to have a home-grown
> DistributedMatrix to go with it as well (LibMesh's own SparseMatrix
> implementation) so I'd like to see it stay as a leaf class if
> possible...

That would be my intent. But by pushing the majority of the implementation into NumericVector<> I would bet the derived PetscVector<>/EpetraVector<> classes could get a lot smaller. I don't know if anything would be specialized in DistributedVector<> any more, but keeping it as a class derived from NumericVector<> is not a problem.

> Now that so many supercomputer sites have PETSc or Trilinos installed
> this doesn't seem very necessary, but as LibMesh is getting packaged
> by other distros, it would be cool if it had some parallel linear
> algebra abilities "out of the box".

My mind heads down this path every so often too... Who wants to handle the preconditioning part? ;-)

Thanks for the quadrature rule updates!

-Ben |
From: Derek G. <fri...@gm...> - 2008-06-24 19:20:45
|
On Jun 24, 2008, at 1:05 PM, Roy Stogner wrote:

> Another question: what NonlinearSystem class? If you're using
> NonlinearImplicitSystem, doesn't that allocate a matrix whether you
> and PETSc eventually use it or not?

Yes, NonlinearImplicitSystem.

Crap... I just went and dug down into ImplicitSystem... and you're right! There is a matrix that's getting created and sized! Hmmm.... we might have to turn that off somehow...

Derek |
From: John P. <jwp...@gm...> - 2008-06-24 19:25:26
|
On Tue, Jun 24, 2008 at 2:20 PM, Derek Gaston <fri...@gm...> wrote:
> On Jun 24, 2008, at 1:05 PM, Roy Stogner wrote:
>
>> Another question: what NonlinearSystem class? If you're using
>> NonlinearImplicitSystem, doesn't that allocate a matrix whether you
>> and PETSc eventually use it or not?
>
> Yes, NonlinearImplicitSystem.
>
> Crap... I just went and dug down into ImplicitSystem... and you're
> right! There is a matrix that's getting created and sized! Hmmm....
> we might have to turn that off somehow...

If it's getting repeatedly created and destroyed, that might explain the big swings in mem. usage. But if it just gets created once and then tossed out at the end of the run, I'm not sure it would cause this. OTOH, Derek, this may be why your "back of the envelope" problem size calculation caused the machines to swap ;-)

-- John |
From: Benjamin K. <ben...@na...> - 2008-06-24 19:29:13
|
> OTOH, Derek, this may be why your "back of the envelope" problem size
> calculation caused the machines to swap ;-)

FWIW, I have no swap on my compute nodes. (No disk, for that matter.) I'd rather have a memory allocation request that does not fit in RAM kill the process than swap!!

-ben |
From: Derek G. <fri...@gm...> - 2008-06-24 19:32:56
|
On Jun 24, 2008, at 1:29 PM, Benjamin Kirk wrote:

> FWIW, I have no swap on my compute nodes. (No disk, for that matter.)
> I'd rather have a memory allocation request that does not fit in RAM
> kill the process than swap!!

I completely agree with this. I've even gone to having no (or very little) swap on my desktop for the same reason. If I'm ever swapping I've done something horribly wrong... and it just needs to be killed. With more than 2 or 4 gigs of RAM there should never be a _good_ reason to swap.

Derek |
From: Derek G. <fri...@gm...> - 2008-06-24 21:32:39
|
On Jun 24, 2008, at 1:19 PM, Roy Stogner wrote:

> I think I noticed this problem and tried to avoid it in
> petsc_diff_solver.C; you might cut and paste some of that code into
> Ben's solver to see if using my "swap vectors, System::update"
> localization works any better.

Ok - I've tried to implement this, but I must be off somewhere. What I've done is add a _system reference inside of NonlinearSolver (the same way there is one in DiffSolver) and then I'm using code like this inside of __libmesh_petsc_snes_residual:

  PetscVector<Number> X_global(x), R(r);
  PetscVector<Number>& X_local =
    *dynamic_cast<PetscVector<Number>*>(sys.solution.get());

  X_global.swap(X_local);
  sys.update();
  X_global.swap(X_local);

  solver->residual (X_global, R);

This seems to work ok in 2D... but in 3D it is generating NaNs for some reason. Someone please point out my logic troubles...

Derek |
From: Roy S. <ro...@st...> - 2008-06-24 22:14:53
|
On Tue, 24 Jun 2008, Derek Gaston wrote:

> Ok - I've tried to implement this, but I must be off somewhere. What I've
> done is add a _system reference inside of NonlinearSolver (the same way there
> is one in DiffSolver) and then I'm using code like this inside of
> __libmesh_petsc_snes_residual:
>
>   PetscVector<Number> X_global(x), R(r);
>   PetscVector<Number>& X_local =
>     *dynamic_cast<PetscVector<Number>*>(sys.solution.get());
>
>   X_global.swap(X_local);
>   sys.update();
>   X_global.swap(X_local);
>
>   solver->residual (X_global, R);
>
> This seems to work ok in 2D... but in 3D it is generating NaNs for some
> reason. Someone please point out my logic troubles...

Very strange. I can't see why the dimensionality would make a difference (particularly on a conforming mesh), but I must point out that my own brief tests of PetscDiffSolver were all on 2D problems; once I found out that my NewtonSolver was performing better, I lost interest in trying to tweak SNES options for 3D.

Here's what I'm trying to figure out, though: the DiffSystem::assembly() call in PetscDiffSolver uses current_local_solution to plug into the weighted residual equations, thus ensuring that DoFs which are owned by another processor have correct values. How does that work here? When you swap the solution out again, the current_local_solution doesn't come with it - you're still passing to residual() a vector that only has locally owned DoFs, right?

Also: Ben, is current_local_solution a serial vector? It certainly looks that way to me, but I've always been a little confused by the solution/current_local_solution divide, so maybe I'm missing some sort of PETSc magic under the hood.
---
Roy |
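P.S. If that is indeed the problem, my completely untested guess would be to hand residual() the freshly-updated local copy instead of the swapped-back X_global - something like the following, though I don't know what Ben's residual() interface really expects:

  // untested guess, not a verified fix: pass the localized copy that
  // System::update() just refreshed, instead of the swapped-back X_global
  X_global.swap(X_local);
  sys.update();                 // refreshes sys.current_local_solution from sys.solution
  X_global.swap(X_local);

  solver->residual (*sys.current_local_solution, R);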
From: Benjamin K. <ben...@na...> - 2008-06-25 20:55:07
|
> Also: Ben, is current_local_solution a serial vector? It certainly
> looks that way to me, but I've always been a little confused by the
> solution/current_local_solution divide, so maybe I'm missing some sort
> of PETSc magic under the hood.

In the beginning... I understood a little about parallel computing and less about PETSc.

"solution" is a parallel vector, and "current_local_solution" is a serialized version of a *subset* of "solution". Thus it is full-sized, but only a fraction of the entries are correctly populated from remote processors.

Now, let "global" and "local" be, uh, global and local vectors, respectively. Calling

  global.localize(local);

is kinda like an allreduce - at the end each processor has a complete picture of the global vector stored in local.

  global.localize(local, index_list); // this is what is done in
                                      // System::update()

copies only the subset of "global" to "local" as specified by "index_list".

I'm not sure how PETSc is going about implementing these sends, but I bet in the former case the underlying memory allocations when you get to 100s of processors are what you are seeing.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Where I want this to go is something like the following:

  DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);

where "vec" is a vector which stores n_local entries, and also has "ghost storage" for array_of_remote_indices.size() entries. You then go about your business, computing with "vec" and performing operations which update the local entries, and when you need to get synchronized copies of the remote values you call something like

  vec.localize();

This would effectively eliminate current_local_solution -- everything would be done in-place from solution.

-Ben |
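P.S. Very roughly, the kind of interface I'm picturing - nothing like this exists in the library yet, and every name and signature here is purely illustrative:

  // purely illustrative sketch -- nothing like this exists yet
  #include <vector>

  template <typename T>
  class DistributedVector
  {
  public:
    // n_local owned entries, plus ghost storage for the listed remote indices
    DistributedVector (const unsigned int n_local,
                       const std::vector<unsigned int> &remote_indices)
      : _values(n_local + remote_indices.size()),
        _ghost_indices(remote_indices) {}

    // pull fresh copies of the ghost entries from their owning processors;
    // everything stays in place, so no separate current_local_solution needed
    void localize ();

  private:
    std::vector<T> _values;                    // owned values followed by ghost padding
    std::vector<unsigned int> _ghost_indices;  // global ids of the ghost entries
  };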
From: Roy S. <ro...@st...> - 2008-06-25 21:09:13
|
On Wed, 25 Jun 2008, Benjamin Kirk wrote:

> DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);
>
> Where "vec" is a vector which stores n_local entries, and also has "ghost
> storage" for array_of_remote_indices.size() entries.

We'd also need an "unsigned int first_local_index" argument for the offset of n_local, right? In any case that sounds like an excellent idea.

We'll also want to have some higher level APIs where array_of_remote_indices is implied to come from the system's send_list, of course, but if we get the lower level stuff right first that'll make it easier to handle the exceptional cases like System::project_vector.

> You then go about your business, computing with "vec" and performing
> operations which update the local entries, and when you need to get
> synchronized copies of the remote values you call something like
>
> vec.localize();
>
> This would effectively eliminate current_local_solution -- everything would
> be done in-place from solution.

Sounds good.

While you're explaining the fundamentals of our parallel vector structures to me, could you explain the VecAssemblyBegin / VecAssemblyEnd pair in PetscVector::close? That's what does all the communication required by VecSetValues, right? What's the reason for the dual API call; does the communication get started asynchronously by Begin() and then End() blocks waiting for completion?
---
Roy |
From: Benjamin K. <ben...@na...> - 2008-06-25 21:21:54
|
>> DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);
>>
>> Where "vec" is a vector which stores n_local entries, and also has "ghost
>> storage" for array_of_remote_indices.size() entries.
>
> We'd also need an "unsigned int first_local_index" argument for the
> offset of n_local, right? In any case that sounds like an excellent
> idea.

Is there ever a case where you need it? Was thinking it can always be computed from the partial sums of n_local over all the processors lower than you. Similarly, n_global really would not be used except to assert that it is the same as the sum of n_local.

...

> While you're explaining the fundamentals of our parallel vector
> structures to me, could you explain the VecAssemblyBegin /
> VecAssemblyEnd pair in PetscVector::close? That's what does all the
> communication required by VecSetValues, right? What's the reason for
> the dual API call; does the communication get started asynchronously
> by Begin() and then End() blocks waiting for completion?

That is absolutely the case. I fretted for a while about implementing them separately, but I feared there would be a million places you could get out of sync. For example, you call vec.close_start() at the end of your matrix assembly routine, then vec.close_finish() before the linear solver? What computation would you want to slip between those two? I figured we'd be setting ourselves up for a most frequently-asked question, and it doesn't seem to have hurt anything to date...

-Ben |
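P.S. In other words, close() just makes the two calls back to back - something like this sketch (not the verbatim source, and without the error checking the real code has):

  // sketch of the idea -- not the verbatim libMesh source
  template <typename T>
  void PetscVector<T>::close ()
  {
    // starts the communication for any off-processor VecSetValues() entries...
    VecAssemblyBegin (_vec);
    // ...and blocks until every processor has received its values
    VecAssemblyEnd (_vec);
  }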
From: Roy S. <ro...@st...> - 2008-06-25 21:36:08
|
On Wed, 25 Jun 2008, Benjamin Kirk wrote:

>>> DistributedVector<Real> vec(n_global, n_local, array_of_remote_indices);
>>>
>>> Where "vec" is a vector which stores n_local entries, and also has "ghost
>>> storage" for array_of_remote_indices.size() entries.
>>
>> We'd also need an "unsigned int first_local_index" argument for the
>> offset of n_local, right? In any case that sounds like an excellent
>> idea.
>
> Is there ever a case where you need it? Was thinking it can always be
> computed from the partial sums of n_local over all the processors lower than
> you.

Hmm... good point. But since we already do that operation in the DofMap, we could probably avoid doing it a second time (except in debug mode, where we'd double-check) by making vector constructors more explicit. It's not a big communication, but it is all-to-all.

> Similarly, n_global really would not be used except to assert that it is the
> same as the sum of n_local.

In that case let's leave it out too; forcing n_global to be supplied but then computing local offsets inside the constructor pretty much forces redundant communication, I think.

>> While you're explaining the fundamentals of our parallel vector
>> structures to me, could you explain the VecAssemblyBegin /
>> VecAssemblyEnd pair in PetscVector::close? That's what does all the
>> communication required by VecSetValues, right? What's the reason for
>> the dual API call; does the communication get started asynchronously
>> by Begin() and then End() blocks waiting for completion?
>
> That is absolutely the case. I fretted for a while about implementing them
> separately, but I feared there would be a million places you could get out
> of sync. For example, you call vec.close_start() at the end of your matrix
> assembly routine, then vec.close_finish() before the linear solver? What
> computation would you want to slip between those two? I figured we'd be
> setting ourselves up for a most frequently-asked question, and it doesn't
> seem to have hurt anything to date...

I just went through the code looking for places where we could profitably use a two-part close, and the best thing I could come up with is that it might potentially be a more efficient way to close multiple vectors/matrices at once. But that's not worth futzing with the API for; if someone ever does find some computation to usefully slip in between assembly and solve, we'll add close_start() and close_finish() member functions then.

If the PETSc people went to all the trouble of adding asynchronous communication to their assembly code, were they smart enough to start that communication as soon as possible? Because there is one bit of computation that could definitely be started before the start of the assembly has been communicated: the rest of the assembly.
---
Roy |
From: Vijay M <vi...@gm...> - 2008-06-25 23:42:47
|
> I just went through the code looking for places where we could
> profitably use a two-part close, and the best thing I could come up
> with is that it might potentially be a more efficient way to close
> multiple vectors/matrices at once. But that's not worth futzing with
> the API for; if someone ever does find some computation to usefully
> slip in between assembly and solve, we'll add close_start() and
> close_finish() member functions then.

I think you could provide a hook, using a function pointer, to be called between the assembly_begin and assembly_end calls. And making this an optional argument would also ensure backward compatibility. Just a thought! |
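Something along these lines, say - just a sketch of the idea, no such overload exists in the library today:

  // sketch only -- no such overload exists in libMesh today
  template <typename T>
  void PetscVector<T>::close (void (*work_between)(void *) /* = NULL */,
                              void *context               /* = NULL */)
  {
    VecAssemblyBegin (_vec);

    // optional user-supplied work, overlapped with the in-flight communication
    if (work_between)
      work_between (context);

    VecAssemblyEnd (_vec);
  }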