Anyone who wants to help with this is certainly welcome! I have
responded directly to the points below. All input is appreciated...
Remember, there are no stupid ideas, just stupid people :)
(I have deliberately taken my time putting this together... I would
really appreciate people letting me know what they think.)
Ahmed Elsheikh wrote:
>Hi all,
>
>I am currently taking a course in Parallel algorithms and I am
>considering working on some parts of libmesh as a part of the course
>project.
>
>As per the email of Bill Barth, is there a plan for that?
>
>
Yes, there is a loose plan. I have been thinking about it for a long
time...
>Initially, I'll list my understanding of the current status of the code.
>1. Initially the unstructured mesh is read by the master processor using
>the Mesh::read function, and then all the mesh node coordinates are packed
>into a vector implemented as std::vector<Real>. This vector is
>broadcast to all the processors. Similarly, an elements vector that
>contains element types and connectivity information is broadcast to
>all processors. Now all the processors have a copy of the whole mesh
>information.
>
>
Yes, and any boundary condition information is broadcast. Since only
processor 0 reads the mesh, the amount of disk I/O is independent of the
number of processors. As a practical matter, for clusters and other
architectures without high-performance parallel I/O this approach is
probably just as good as messing with a parallel data format. So, in
the future processor 0 will still probably be the only one that reads
the mesh from disk. The only difference is that it may broadcast it in
chunks to handle the case when the entire mesh is too large to fit on
any one processor.
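The chunking step could look something like the toy sketch below (plain
C++, no MPI; split_into_chunks is a hypothetical helper, not anything in
the library). In the real code each chunk would be the payload of one
MPI_Bcast, so processor 0 never holds more than one chunk of remote data
at a time:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: split a flat coordinate vector into fixed-size
// chunks so the whole mesh never has to be sent (or held) at once.
std::vector<std::vector<double>>
split_into_chunks(const std::vector<double>& data, std::size_t chunk_size)
{
  std::vector<std::vector<double>> chunks;
  for (std::size_t i = 0; i < data.size(); i += chunk_size)
    chunks.emplace_back(data.begin() + i,
                        data.begin() + std::min(i + chunk_size, data.size()));
  return chunks;
}
```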
>2. After that a preparation step is done, which includes node
>renumbering, finding neighbor information, and finally partitioning
>the mesh using the METIS library. The result from METIS is assigned to
>each element using a function called elem->set_processor_id(), which sets
>a private data member of the element class to the owner processor
>number.
>
>
Yes. For the meshes we (I?) usually use, Metis is really fast. There
are a number of other partitioners that may be used, though.
>3. The classical steps in the finite element solution, which include
>building the stiffness matrix of each element and then mapping the local
>element stiffness matrix into the global stiffness matrix, are done on each
>processor using a local element iterator which uses the processor id as
>a predicate.
>
>
Yes -- after the DOFs are associated with the nodes and/or elements of the
mesh. The DOF numbering depends on the partitioning (as required
by the parallel matrix & vector decomposition).
The local_elem_iterators are quite useful for hiding a lot of the
complexity of dealing with partitioned meshes. They can also be used
to parallelize certain algorithms. For example, if you want to find the
minimum h for the entire mesh, each processor can look at its local
elements and then do communication to find the min over all subdomains.
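That min-h pattern can be sketched as follows (the Elem struct and
function names are illustrative stand-ins, not the library's classes;
in the real code the scan would use a local_elem_iterator and the
reduction would be an MPI_Allreduce with MPI_MIN):

```cpp
#include <vector>
#include <algorithm>
#include <limits>

// Toy element: just a size and an owning processor.
struct Elem { double h; unsigned int processor_id; };

// Each processor scans only the elements it owns -- the iterator's
// predicate in the real code.
double local_min_h(const std::vector<Elem>& mesh, unsigned int my_pid)
{
  double h_min = std::numeric_limits<double>::max();
  for (const Elem& e : mesh)
    if (e.processor_id == my_pid)
      h_min = std::min(h_min, e.h);
  return h_min;
}

// Stand-in for the MPI_Allreduce over the per-processor results.
double global_min_h(const std::vector<double>& per_proc_min)
{
  return *std::min_element(per_proc_min.begin(), per_proc_min.end());
}
```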
>4. Either the LASPack or PETSc library, which hide most of the
>parallelization details of the iterative solver, is called.
>
>
Only the PETSc linear solvers may be used in parallel; LASPack assumes
serial data. There is another package out there called Aztec that also
provides parallel iterative solvers, which could be wrapped in a couple
of classes to provide another option for parallel iterative solvers.
The DistributedVector<> class provides some parallel vector
functionality, and I use it for explicit methods in parallel without
PETSc. It could also be used in a matrix-free approach without PETSc,
but the iterative solver algorithms would need to be coded up.
>5. The resulting solution is broadcast to all the processors such
>that every processor has the solution results for the whole mesh,
>which is already stored on each processor.
>
>
Almost -- when the DOFs are distributed, each processor constructs a
"send_list". The send_list contains the indices of all the DOFs the
processor needs to compute *its* portion of the solution. This includes
all the DOFs that belong to that processor (which are stored locally)
and *some* DOFs from neighboring elements. In general, the nodal DOFs
that in any way connect to a local element, and all the DOFs from
face-neighboring elements that belong to other processors, are in the
send_list. Clearly all the nodal DOFs are needed to compute the local
solution. All the DOFs on a face neighbor are needed to compute the
neighbor's gradient at the local, shared face. This is necessary, for
example, when computing the flux jump terms in the Kelly error indicator.
After each solve step only the subset of the solution vector specified
by the send_list is then communicated to each processor. Theoretically,
this means there is less data transferred than if the whole solution
were broadcast to all the processors after each solve. This really only
affects high-end scalability, however, since for reasonable numbers of
processors there is not much more overhead involved in an Allreduce
(which is what is required to get the global solution on every processor).
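Conceptually, assembling a send_list boils down to collecting, in sorted
order and without duplicates, every DOF index connected to an element the
processor can see: its own elements plus the face-neighbor remote copies.
A minimal sketch (the Element struct and function name are illustrative,
not the actual DofMap API):

```cpp
#include <vector>
#include <set>
#include <cstddef>

// Toy element: just the global DOF indices that touch it.
struct Element
{
  std::vector<std::size_t> dof_indices;
};

// Union of the DOFs of every visible element (local elements and
// face-neighbor remote copies alike).
std::vector<std::size_t>
build_send_list(const std::vector<Element>& visible_elements)
{
  std::set<std::size_t> needed;          // std::set => sorted, unique
  for (const Element& e : visible_elements)
    for (std::size_t dof : e.dof_indices)
      needed.insert(dof);
  return std::vector<std::size_t>(needed.begin(), needed.end());
}
```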
>6 If adaptive mesh refinement is used the results are analyzed using
>some error estimation techniques and a refinement or coarsening flag is
>assigned to each mesh element.
>
>
Yes. The refinement and coarsening flags are specified after analyzing
the error distribution in some way. There are currently several methods
for selecting elements for coarsening/refinement: a fixed fraction of the
elements, a fixed fraction of the error, some statistical quantity, etc...
These flags are treated as input, sort of like "suggestions." By
convention, any element flagged for refinement *will* be refined.
However, other elements may also be refined to maintain a level-1
compatible mesh (if desired) or to ensure smooth transitions (limiting
the maximum level mismatch at a node, if desired). Elements flagged for
coarsening will only be coarsened if 1.) all the other children want to
be coarsened as well and 2.) coarsening the element will still maintain
a level-1 mesh (if desired), even with any specified refinement.
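The level-1 rule can be illustrated on a simplified 1-D chain of
elements (a sketch only -- the real code works on the mesh graph, and
the names here are made up): flags are promoted until no two neighbors
would differ by more than one refinement level, first by cancelling a
coarsening request and only then by forcing extra refinement.

```cpp
#include <vector>
#include <cstddef>

enum Flag { COARSEN = -1, DO_NOTHING = 0, REFINE = 1 };

// levels[i] = current refinement level of element i (assumed level-1
// compatible on input); flags[i] = suggested flag.  Returns flags
// adjusted so |new_level(i) - new_level(i+1)| <= 1 for all neighbors.
std::vector<int> enforce_level_one(std::vector<int> levels,
                                   std::vector<int> flags)
{
  bool changed = true;
  while (changed)                 // iterate until the flags settle
  {
    changed = false;
    for (std::size_t i = 0; i + 1 < levels.size(); ++i)
    {
      const int li = levels[i]   + flags[i];
      const int lj = levels[i+1] + flags[i+1];
      if (li - lj > 1)            // element i+1 would lag too far behind
      {
        if      (flags[i+1] == COARSEN) { flags[i+1] = DO_NOTHING; changed = true; }
        else if (flags[i+1] != REFINE)  { flags[i+1] = REFINE;     changed = true; }
      }
      else if (lj - li > 1)       // symmetric case
      {
        if      (flags[i] == COARSEN)   { flags[i] = DO_NOTHING;   changed = true; }
        else if (flags[i] != REFINE)    { flags[i] = REFINE;       changed = true; }
      }
    }
  }
  return flags;
}
```

Note how the second element's coarsening request is merely cancelled,
not turned into refinement, when its neighbor refines.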
>7. In the current implementation, the error estimation is done on the
>mesh elements owned by each local processor and the results are
>broadcast to all the processors.
>
>
Yes. For flagging purposes the result does not need to be gathered,
provided that any required statistics can be computed in parallel.
>8. The mesh refinement and coarsening is done for the whole mesh on each
>processor, which is another fault in the current implementation.
>
>
Yes. Arguably the biggest fault...
>I think true parallelization will have many challenges...
>
>First, dealing with nodes and edges on the boundary of each partition,
>which need to be present on all the processors that share that
>boundary.
>
>Second, the error estimator needs some information from across the
>boundary, which is hard to get unless overlapping boundaries between
>processors are implemented or element-by-element messaging across the
>partition boundaries is done.
>
>Third, during mesh adaptation nonconforming nodes may result on
>the partition boundaries.
>
>I am currently thinking about a sort of overlapping partitioning, where
>each processor stores the elements which belong to its partition only,
>in addition to all neighboring elements on the boundary. These elements
>should still retain their processor id and will be called remote copies.
>No local calculation is done for remote copies, but after each round of
>calculation each element that is a neighbor to a remote copy should send
>the calculated results to the processor that owns that neighbor.
>
>One of the issues I am thinking about now is the parent information when
>the children are distributed across the partition boundaries!! Another
>important issue is the migration of mesh elements across the boundaries
>for dynamic load balancing.
>
>Feed Back, current status, comments ... are all needed :)
>
>Regards,
>
>Ahmed
>
>
There are some challenges, but it is certainly doable. What you suggest is
very much in line with the previous discussions we have had... Let me
summarize my current thoughts. PLEASE, ALL FEEL FREE TO ADD TO THIS!
Parallel Mesh Layout:

- Each processor owns a portion of the mesh, as in the current scheme.
(For uniform element types and uniform approximation orders (no local
p-refinement) these subdomains are balanced when they are equally
sized. Otherwise, what it means to be load-balanced is a little more
ambiguous... If the linear solver is the bottleneck then balanced means
equal DOFs per processor; if the matrix assembly is the bottleneck then
something like n_elem_dofs^2*n_quadrature_points is the proper weight
for each element, and this can be quite different for different element
types...)
- The partitioning is performed on the active elements only, as in the
current mesh.
- Each processor stores all the elements it owns, as well as all the
elements that are connected to those elements via faces *or* edges.
This is required so that, in the refinement procedure, any new nodes
created on edges or faces are properly located.
- Furthermore, when a child element is assigned to a processor,
that element's parent (and its parent's parent... all the way up to the
level-0 element) is also stored on that processor (as a "remote copy").
The consequence of this is that, for data distribution purposes, it
would be possible to redistribute *only* the level-0 elements and then
recreate their children via their refinement trees.
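Walking the ancestor chain up to the level-0 element is the key
operation here. A minimal sketch (the TreeElem struct is a made-up
stand-in for the element class with its parent pointer):

```cpp
#include <vector>

// Toy refinement-tree node: an id and a parent pointer
// (null for a level-0 element).
struct TreeElem
{
  int id;
  TreeElem* parent;
};

// All ancestors of a child, from its parent up to the level-0 element.
// These are exactly the elements that must travel with the child as
// "remote copies" when it is assigned to a processor.
std::vector<int> ancestor_ids(const TreeElem* child)
{
  std::vector<int> ids;
  for (const TreeElem* e = child->parent; e != nullptr; e = e->parent)
    ids.push_back(e->id);
  return ids;
}
```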
Comments:
By storing the local elements and any remote element connected to a
local element, nearly all of the existing algorithms will transfer
cleanly. It is possible, however, that the total number of elements
stored on a processor will be several times the number of elements
actually owned by that processor. For anything but a random partition
this number is bounded, and is proportional to the number of processor
subdomains adjacent to a given subdomain. Since this is independent of
the *total* number of processor subdomains, the approach is scalable.
Adaptivity:

Requiring that parent elements be present wherever children are
present means that most of the current refinement & coarsening machinery
will go through unchanged. The general approach will be as follows:
- An error indicator/estimator is computed for all the local elements.
- This data is analyzed in parallel.
- Each processor flags its elements as appropriate and broadcasts those
flags to all the processors that contain copies of its elements.
- Each processor will modify the flags for its elements based on any
specified requirements (level-1, etc...). It must then resubmit its
modified flags to the processors that have copies of its elements. This
process can be repeated until the flags do not change (although I think
it can be shown that twice is sufficient?).
- Each processor will refine/coarsen *all* the elements it has stored,
regardless of whether or not it owns them. This is necessary so that
new nodes on processor interfaces will be found and properly connected.
- The new, active elements are partitioned. This will define which
processors each level-0 element needs to be copied to. It will be
copied with all its children to each of these processors.
DOF Distribution:

- Largely unchanged. You mentioned that the solution for each "remote
copy" needs to be obtained, but I think the current approach of
broadcasting only those values in the send_list (which is smaller) is
sufficient.
Output:

- For the time being (until parallel data formats are better defined)
processor 0 will still write all data to disk. The major change is that
the data will be transferred to processor 0 in chunks in case the entire
mesh is too big to fit on one processor.
Input:

- The basic problem here is how to take a serial file, read it on one
processor, and distribute pieces of it to each processor. Keep in mind
that the initial partitioning may be arbitrarily bad, so we need to
repartition and redistribute the distributed input mesh before doing
anything else. Since the initial partitioning may be essentially
random, the repartitioning should not require any neighbor information.
Here are my thoughts:
- First, forget all the mesh input formats we support. They can still
be read serially in a suboptimal way, but it is hopelessly complicated
to try to perform clever parallel I/O for file formats that are serial
by design. So, I suggest we focus on XDA/XDR for optimal I/O. Like I
said, other formats can still be handled, but if a file can't be read on
one processor you are in trouble... (which begs the question, how did you
create it?)
- Processor 0 opens the XDA/R file. Its header defines the total number
of nodes, the total connectivity size, and the total number of
elements. This can be used to determine how many elements to put on
each processor.
Since XDA/R was built with this in mind, the rest is straightforward...
- A block of elements (defined by its connectivity) is read for the i-th
processor and broadcast to it. This is repeated for each processor, but
only one block is read at a time, so the total size of the file is
irrelevant.
- Now, each processor knows the connectivity for its elements, which
defines the global nodes it actually needs. It does not have any nodes
yet, though...
- Processor 0 reads a block of nodes, sized appropriately so it fits in
memory. This is a subset of the global nodes, which is then broadcast
to *all* the processors. Each processor compares these nodes to the ones
it actually needs and keeps only the relevant nodes (those that are
connected to its elements). This is repeated until all the nodes are
read. Again, the total number of nodes is irrelevant, so long as each
processor can hold the data for its own subdomain.
- The process is repeated for the boundary conditions.
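The node-filtering step above can be sketched as follows (the Node
struct and function name are hypothetical; the real block would arrive
via MPI_Bcast):

```cpp
#include <vector>
#include <set>
#include <map>
#include <cstddef>

// Toy node: a global id and its coordinates.
struct Node { std::size_t global_id; double x, y, z; };

// From a broadcast block of nodes, keep only those whose global ids
// appear in this processor's element connectivity; discard the rest.
void keep_needed_nodes(const std::vector<Node>& block,
                       const std::set<std::size_t>& needed_ids,
                       std::map<std::size_t, Node>& my_nodes)
{
  for (const Node& n : block)
    if (needed_ids.count(n.global_id))
      my_nodes[n.global_id] = n;
}
```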
Now, each processor has some elements and their nodes, but the
distribution could be arbitrarily bad. Additionally, there is no
neighbor information available, and if the partitioning is quite bad
then storing all the neighbors may not be feasible, and finding them
would certainly require a lot of communication.
Fortunately, John developed a parallel sort that can be used in
conjunction with a space-filling curve to define an initial partitioning
based on the information available at this point. The space-filling
curve partitioning is then used to redistribute the elements in such a
way that the partitioning makes some sense.
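To make the idea concrete, here is a sketch using a 2-D Morton (Z-order)
key rather than whatever curve John's sort actually uses -- the names and
the choice of curve are illustrative only. Elements are sorted by the key
of their centroid, and the sorted list is cut into equal pieces, one per
processor, so elements nearby in space tend to land on the same processor:

```cpp
#include <vector>
#include <algorithm>
#include <utility>
#include <cstdint>
#include <cstddef>

// Interleave the bits of x and y into a single Z-order key.
std::uint64_t morton2d(std::uint32_t x, std::uint32_t y)
{
  std::uint64_t key = 0;
  for (int b = 0; b < 32; ++b)
  {
    key |= (std::uint64_t)((x >> b) & 1u) << (2*b);
    key |= (std::uint64_t)((y >> b) & 1u) << (2*b + 1);
  }
  return key;
}

// Given (key, element id) pairs with ids in [0, n), assign each element
// a processor by sorting on the key and slicing the list evenly.
std::vector<unsigned int>
sfc_partition(std::vector<std::pair<std::uint64_t, unsigned int>> keys_ids,
              unsigned int n_procs)
{
  std::sort(keys_ids.begin(), keys_ids.end());
  std::vector<unsigned int> pid(keys_ids.size());
  for (std::size_t i = 0; i < keys_ids.size(); ++i)
    pid[keys_ids[i].second] = (unsigned int)(i * n_procs / keys_ids.size());
  return pid;
}
```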
After the redistribution, each processor contains all its elements, but
no neighbors. A bounding box (or sphere) for each processor subdomain
can be computed. The intersection of bounding boxes defines all the
processors with elements that *may* be neighbors. With this in mind,
each processor computes the neighbors for the elements it owns. This
will leave a number of elements without full neighbor connectivity.
These elements are identified and copied to all the processors with
intersecting bounding boxes. Those processors do the same with their
elements. If a received element is a neighbor of a local element we
keep it and define it as a "remote copy"; otherwise we throw it away.
After processing the elements from the processors with intersecting
bounding boxes, any element without a full set of neighbors is
necessarily on a physical domain boundary. Furthermore, the set of
adjacent processors is then known exactly for each subdomain, and it is
necessarily a subset of those with intersecting bounding boxes.
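The bounding-box test itself is the standard axis-aligned overlap check
(sketch; the BBox struct is a stand-in for whatever the library would
store per subdomain):

```cpp
// Axis-aligned bounding box of a processor subdomain.
struct BBox { double min[3], max[3]; };

// Two boxes intersect unless they are separated along some axis.
bool boxes_intersect(const BBox& a, const BBox& b)
{
  for (int d = 0; d < 3; ++d)
    if (a.max[d] < b.min[d] || b.max[d] < a.min[d])
      return false;
  return true;
}
```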
 That is what I have been thinking. 

A few practical (implementation) issues:
1.) Of course, all of this will require that MPI be available. However,
I think the library should still work serially without MPI. So, there
is a bit of a problem... We either #ifdef *everything*, or we provide
something like "mpiuni" (as in PETSc) that implements the MPI functions we
use for the trivial case of one processor. I prefer the latter approach
since it should not be that hard, and code development/maintenance
should be much easier.
Alternatively, we could create a class (or classes) that handles the
message passing functions we need. This can just do the right thing for
one processor and then call MPI for more than one processor. I like
this approach since it hides all the nasty MPI syntax. Ideas?
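The wrapper idea might look like this (all names hypothetical -- a
sketch of the serial fallback only, with the MPI path left as a comment
since it depends on how the build is configured):

```cpp
namespace Parallel
{
  // In a real build this would come from MPI_Comm_size(MPI_COMM_WORLD, ...);
  // here we hard-code the serial case.
  inline unsigned int n_processors() { return 1; }

  // Global minimum across processors.  Serially it is the identity; in
  // parallel it would call MPI_Allreduce(..., MPI_MIN, ...).
  inline double min(double local_value)
  {
    if (n_processors() == 1)
      return local_value;            // nothing to do for one processor
#ifdef HAVE_MPI
    // double global;
    // MPI_Allreduce(&local_value, &global, 1, MPI_DOUBLE, MPI_MIN, comm);
    // return global;
#endif
    return local_value;
  }
}
```

Every call site then reads the same whether or not MPI is compiled in,
which is exactly the point.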
2.) Once the set of neighboring subdomains is identified we probably
want to create an MPI communicator that contains this information.
MPI_COMM_WORLD should probably only be used when you expect to talk to
any other processor...
3.) As parallel mesh generation and other such stuff matures, parallel
mesh file formats are likely to become more common. We should
keep an eye on this and incorporate truly parallel I/O when feasible. I
already stated that on clusters without high-performance parallel file
systems there will likely be little benefit, but on the other hand
machines like the Origin 3000 with really fast I/O would see a major
benefit.
Ben
