From: Benjamin S. Kirk <benkirk@cf...>  2004-03-10 05:13:11

Anyone who wants to help with this is certainly welcome! I have responded
directly to the points below. All input is appreciated... Remember, there are
no stupid ideas, just stupid people :) (I have deliberately taken my time
putting this together... I would really appreciate people letting me know
what they think.)

Ahmed Elsheikh wrote:

>Hi all,
>
>I am currently taking a course in parallel algorithms and I am considering
>working on some parts of libmesh as a part of the course project.
>
>As of the email of Bill Barth, is there a plan for that?

Yes, there is a loose plan. I have been thinking about it for a long time...

>Initially, I'll list my understanding of the current status of the code.
>
>1- Initially the unstructured mesh is read by the master processor using the
>Mesh::read function, and then all the mesh node coordinates are packed into
>a vector implemented as std::vector<Real>. This vector is broadcast to all
>the processors. Similarly, an elements vector that contains element types
>and connectivity information is broadcast to all processors. Now all the
>processors have a copy of the whole mesh.

Yes, and any boundary condition information is broadcast as well. Since only
processor 0 reads the mesh, the amount of disk I/O is independent of the
number of processors. As a practical matter, for clusters and other
architectures without high-performance parallel I/O this approach is
probably just as good as messing with a parallel data format. So, in the
future processor 0 will still probably be the only one that reads the mesh
from disk. The only difference is that it may broadcast the mesh in chunks
to handle the case when the entire mesh is too large to fit on any one
processor.

>2- After that, a preparation step is done which includes node renumbering,
>finding neighbor information, and finally partitioning the mesh using the
>METIS library.
>The results from METIS are assigned to each element using a function called
>elem->set_processor_id(), which sets a private data member of the element
>class to the owning processor number.

Yes. For the meshes we (I?) usually use, METIS is really fast. There are a
number of other partitioners that may be used, though.

>3- The classical steps in the finite element solution, including building
>the stiffness matrix of each element and then mapping the local element
>stiffness matrix into the global stiffness matrix, are done on each
>processor using a local element iterator which uses the processor id as a
>predicate.

Yes, after the DOFs are associated with the nodes and/or elements of the
mesh. The DOF numbering depends on the partitioning (as required by the
parallel matrix & vector decomposition). The local element iterators are
quite useful for hiding a lot of the complexity of dealing with partitioned
meshes. They can also be used to parallelize certain algorithms. For
example, if you want to find the minimum h for the entire mesh, each
processor can look at its local elements and then do communication to find
the min over each subdomain.

>4- Either the LASPack or PETSc library, which hides most of the
>parallelization details of the iterative solver, is called.

Only PETSc linear solvers may be used in parallel; LASPack assumes serial
data. There is another package out there called Aztec that also provides
parallel iterative solvers, which could be wrapped in a couple of classes to
provide another option for parallel iterative solvers. The
DistributedVector<> class provides some parallel vector functionality, and I
use it for explicit methods in parallel without PETSc. It could also be used
in a matrix-free approach without PETSc, but the iterative solver algorithms
would need to be coded up.
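The minimum-h example above can be sketched roughly like this. This is
illustrative, not libMesh code: the vector of element sizes stands in for a
loop over the local elements, and HAVE_MPI is an assumed configure-time
macro, so on one processor the reduction is simply the identity:

```cpp
#include <algorithm>
#include <vector>

#ifdef HAVE_MPI
#include <mpi.h>
#endif

// Sketch of the per-subdomain minimum-h pattern described above.
double global_min_h(const std::vector<double>& local_h)
{
  // Each processor first finds the minimum over its own elements...
  double h_min = *std::min_element(local_h.begin(), local_h.end());

#ifdef HAVE_MPI
  // ...then a single reduction combines the per-subdomain minima.
  MPI_Allreduce(MPI_IN_PLACE, &h_min, 1, MPI_DOUBLE, MPI_MIN,
                MPI_COMM_WORLD);
#endif

  return h_min;  // on one processor the local min *is* the global min
}
```

The same local-iterate-then-reduce pattern works for any associative
operation (max, sum, etc.).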
>5- The resulting solution is broadcast all-to-all such that all the
>processors have the solution for the whole mesh, which is already stored on
>each processor.

Almost. When the DOFs are distributed, each processor constructs a
"send_list". The send_list contains the indices of all the DOFs the
processor needs to compute *its* portion of the solution. This includes all
the DOFs that belong to that processor (which are stored locally) and *some*
DOFs from neighboring elements. In general, the nodal DOFs that in any way
connect to a local element, and all the DOFs from face-neighboring elements
that belong to other processors, are in the send_list. Clearly all the nodal
DOFs are needed to compute the local solution. All the DOFs on a face
neighbor are needed to compute the neighbor's gradient at the local, shared
face. This is necessary, for example, when computing the flux jump terms in
the Kelly error indicator. After each solve step, only the subset of the
solution vector specified by the send_list is broadcast to each processor.
Theoretically, this means less data is transferred than if the whole
solution were broadcast to all the processors after each solve. This really
only affects high-end scalability, however, since for reasonable numbers of
processors there is not much more overhead involved in an Allreduce (which
is what is required to get the global solution on every processor).

>6- If adaptive mesh refinement is used, the results are analyzed using some
>error estimation technique, and a refinement or coarsening flag is assigned
>to each mesh element.

Yes. The refinement and coarsening flags are specified after analyzing the
error distribution in some way. There are currently several methods for
selecting elements for coarsening/refinement: a fixed fraction of the
elements, a fixed fraction of the error, some statistical quantity, etc.
These flags are considered as input, sort of like "suggestions."
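As an illustration of the first of those strategies, flagging a fixed
fraction of the elements based on a per-element error indicator might look
something like this (the names here are illustrative, not the actual libMesh
API):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

enum Flag { COARSEN, DO_NOTHING, REFINE };

// Hypothetical sketch of "fixed fraction of elements" flagging: mark
// the worst refine_fraction of elements for refinement and the best
// coarsen_fraction for coarsening; everything else is left alone.
std::vector<Flag> flag_by_fixed_fraction(const std::vector<double>& error,
                                         double refine_fraction,
                                         double coarsen_fraction)
{
  const std::size_t n = error.size();

  // Sort element indices from largest to smallest error indicator.
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(),
            [&error](std::size_t a, std::size_t b)
            { return error[a] > error[b]; });

  std::vector<Flag> flags(n, DO_NOTHING);
  const std::size_t n_refine  = static_cast<std::size_t>(refine_fraction  * n);
  const std::size_t n_coarsen = static_cast<std::size_t>(coarsen_fraction * n);

  for (std::size_t i = 0; i < n_refine; ++i)
    flags[idx[i]] = REFINE;                 // highest-error elements
  for (std::size_t i = 0; i < n_coarsen; ++i)
    flags[idx[n - 1 - i]] = COARSEN;        // lowest-error elements

  return flags;
}
```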
By convention, any element flagged for refinement *will* be refined.
However, other elements may also be refined to maintain a level-1 compatible
mesh (if desired) or to ensure smooth transitions (limiting the maximum
level mismatch at a node, if desired). Elements flagged for coarsening will
only be coarsened if 1.) all the other children want to be coarsened as
well, and 2.) coarsening the element will still maintain a level-1 mesh (if
desired), even with any specified refinement.

>7- In the current implementation, the error estimation is done on the mesh
>elements owned by each processor, and the result is broadcast to all the
>processors.

Yes. For flagging purposes the result does not need to be broadcast,
provided that any required statistics can be done in parallel.

>8- The mesh refinement and coarsening are done for the whole mesh on each
>processor, which is another fault in the current implementation.

Yes. Arguably the biggest fault...

>I think true parallelization will have many challenges...
>
>First, dealing with nodes and edges on the boundary of each partition,
>which need to be present on all the processors that share that boundary.
>
>Second, the error estimator needs some information from across the
>boundary, which is hard to get unless overlapping boundaries between
>processors are implemented or element-by-element messaging across the
>partition boundaries is done.
>
>Third, during mesh adaptation, non-conforming nodes may result on the
>partition boundaries.
>
>I am currently thinking about a sort of overlapping partitioning, where
>each processor stores the elements which belong to its partition only, in
>addition to all neighboring elements on the boundary. These elements should
>still retain their processor ids and will be called remote copies.
>No local calculation is done for remote copies, but after each round of
>calculation each element that is a neighbor to a remote copy should send
>the calculated results to the processor that owns that neighbor.
>
>One of the issues I am thinking about now is the parent information when
>the children are distributed across the partition boundaries!! Another
>important issue is the migration of mesh elements across the boundaries for
>dynamic load balancing.
>
>Feedback, current status, comments... are all needed :)
>
>Regards,
>
>Ahmed

There are some challenges, but it is certainly doable. What you suggest is
very much in line with the previous discussions we have had... Let me
summarize my current thoughts. PLEASE, ALL FEEL FREE TO ADD TO THIS!

Parallel Mesh Layout:

- Each processor owns a portion of the mesh, as in the current scheme. (For
  uniform element types and uniform approximation orders (no local
  p-refinement), these subdomains are balanced when they are equally sized.
  Otherwise, what it means to be load-balanced is a little more
  ambiguous... If the linear solver is the bottleneck, then balanced means
  equal DOFs per processor; if the matrix assembly is the bottleneck, then
  something like n_elem_dofs^2*n_quadrature_points is the proper weight for
  each element, and this can be quite different for different element
  types...)

- The partitioning is performed on the active elements only, as in the
  current mesh.

- Each processor stores all the elements it owns, as well as all the
  elements that are connected to those elements via faces *or* edges. This
  is required so that, in the refinement procedure, any new nodes created on
  edges or faces are properly located.

- Furthermore, when a child element is assigned to a processor, that
  element's parent (and its parent's parent... all the way to the level-0
  element) is also stored on that processor (as a "remote copy").
The consequence of this is that, for data distribution purposes, it would be
possible to redistribute *only* the level-0 elements and then recreate their
children via their refinement trees.

Comments: By storing the local elements and any remote element connected to
a local element, nearly all of the existing algorithms will transfer
cleanly. It is possible, however, that the total number of elements stored
on a processor will be several times the number of elements actually owned
by that processor. For anything but a random partition this number is
bounded, and is proportional to the number of processor subdomains adjacent
to a given subdomain. Since this is independent of the *total* number of
processor subdomains, the approach is scalable.

Adaptivity:

- Requiring that parent elements be present wherever children are present
  means that most of the current refinement & coarsening machinery will go
  through unchanged. The general approach will be as follows:

- An error indicator/estimator is computed for all the local elements.

- This data is analyzed in parallel.

- Each processor flags its elements as appropriate and broadcasts those
  flags to all the processors that contain copies of its elements.

- Each processor then modifies the flags for its elements based on any
  specified requirements (level-1, etc.). It must then resubmit its modified
  flags to the processors that have copies of its elements. This process can
  be repeated until the flags do not change (although I think it can be
  shown that twice is sufficient?).

- Each processor refines/coarsens *all* the elements it has stored,
  regardless of whether or not it owns them. This is necessary so that new
  nodes on processor interfaces will be found and properly connected.

- The new, active elements are partitioned. This defines which processors
  each level-0 element needs to be copied to. It will be copied, with all
  its children, to each of these processors.

DOF Distribution:

- Largely unchanged.
You mentioned that the solution for each "remote copy" needs to be obtained,
but I think the current approach of broadcasting only those values in the
send_list (which is smaller) is sufficient.

Output:

- For the time being (until parallel data formats are better defined),
  processor 0 will still write all data to disk. The major change is that
  the data will be transferred to processor 0 in chunks, in case the entire
  mesh is too big to fit on one processor.

Input:

- The basic problem here is how to take a serial file, read it on one
  processor, and distribute pieces of it to each processor. Keep in mind
  that the initial partitioning may be arbitrarily bad, so we need to
  repartition and redistribute the distributed input mesh before doing
  anything else. Since you can imagine the initial partitioning is
  essentially random, the repartitioning should not require any neighbor
  information. Here are my thoughts:

- First, forget all the mesh input formats we support. They can still be
  read serially in a suboptimal way, but it is hopelessly complicated to try
  to perform clever parallel I/O for file formats that are serial by design.
  So, I suggest we focus on XDA/XDR for optimal I/O. Like I said, other
  formats can still be handled, but if a mesh can't be read on one processor
  you are in trouble... (which begs the question, how did you create it?)

- Processor 0 opens the XDA/R file. Its header defines the total number of
  nodes, the total connectivity size, and the total number of elements. This
  can be used to determine how many elements to put on each processor. Since
  XDA/R was built with this in mind, the rest is straightforward...

- A block of elements (defined by its connectivity) is read for the i-th
  processor and broadcast to it. This is repeated for each processor, but
  only one block is read at a time, so the total size of the file is
  irrelevant.

- Now each processor knows the connectivity for its elements, which defines
  the global nodes it actually needs.
  It does not have any nodes yet, though...

- Processor 0 reads a block of nodes, sized appropriately so they fit. This
  is a subset of the global nodes, which is then broadcast to *all* the
  processors. Each processor compares these nodes to the ones it actually
  needs and only keeps the relevant nodes (those that are connected to its
  elements). This is repeated until all the nodes are read. Again, the total
  number of nodes is irrelevant, so long as each processor can hold the data
  for its subdomain.

- The process is repeated for the boundary conditions.

Now each processor has some elements and their nodes, but the distribution
could be arbitrarily bad. Additionally, there is no neighbor information
available, and if the partitioning is quite bad then storing all the
neighbors may not be feasible, and finding them would certainly require a
lot of communication. Fortunately, John developed a parallel sort that can
be used in conjunction with a space-filling curve to define an initial
partitioning based on the information available at this point. The
space-filling curve partitioning is then used to redistribute the elements
in such a way that the partitioning makes some sense.

After the redistribution, each processor contains all its elements, but no
neighbors. A bounding box (or sphere) for each processor subdomain can be
computed. The intersection of bounding boxes defines all the processors with
elements that *may* be neighbors. With this in mind, each processor computes
the neighbors for the elements it owns. This will leave a number of elements
without full neighbor connectivity. These elements are identified and copied
to all the processors with intersecting bounding boxes. Those processors do
the same with their elements. If a received element is a neighbor of a local
element, we keep it and define it as a "remote copy"; otherwise we throw it
away.
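The bounding-box test behind that exchange is simple enough to sketch here
(the struct is illustrative, not a libMesh type): two subdomains can only
share element neighbors if their axis-aligned boxes intersect, and two boxes
are disjoint exactly when they are separated along some axis.

```cpp
// Hypothetical axis-aligned bounding box for one processor subdomain.
struct BoundingBox
{
  double min[3];
  double max[3];
};

// Conservative test: returns true if the boxes intersect, i.e. the two
// subdomains *may* contain neighboring elements.
bool may_have_neighbors(const BoundingBox& a, const BoundingBox& b)
{
  // Disjoint if separated along any one coordinate axis.
  for (int d = 0; d < 3; ++d)
    if (a.max[d] < b.min[d] || b.max[d] < a.min[d])
      return false;
  return true;
}
```

Note that the test can report false positives (intersecting boxes with no
shared neighbors), which is fine: the candidate elements are simply thrown
away after the exchange, as described above.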
After processing the elements from the processors with intersecting bounding
boxes, any element without a full set of neighbors is necessarily on a
physical domain boundary. Furthermore, the set of adjacent processors is
then known exactly for each subdomain, and it is necessarily a subset of
those with intersecting bounding boxes.

That is what I have been thinking. A few practical (implementation) issues:

1.) Of course, all of this will require that MPI be available. However, I
think the library should still work serially without MPI. So, there is a bit
of a problem... We either #ifdef *everything*, or we provide something like
"mpiuni" (in PETSc) that implements the MPI functions we use for the trivial
case of one processor. I prefer the latter approach since it should not be
that hard, and code development/maintenance should be much easier.
Alternatively, we could create a class (or classes) that handles the
message-passing functions we need. This could just do the right thing for
one processor and then call MPI for more than one processor. I like this
approach since it hides all the nasty MPI syntax. Ideas?

2.) Once the set of neighboring subdomains is identified, we probably want
to create an MPI communicator that contains this information. MPI_COMM_WORLD
should probably only be used when you expect to talk to any other
processor...

3.) As parallel mesh generation and other such tools mature, parallel mesh
file formats are likely to become more common. We should keep an eye on this
and incorporate truly parallel I/O when feasible. I already stated that on
clusters without high-performance parallel file systems there will likely be
little benefit, but on the other hand, machines like the Origin 3000 with
really fast I/O would see a major benefit.

Ben
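P.S. Just to make 1.) concrete, here is a rough sketch of the wrapper-class
idea. The class name and the HAVE_MPI macro are illustrative, not an
existing libMesh API; serially every reduction is the identity, and with MPI
each call would forward to MPI_Allreduce:

```cpp
#ifdef HAVE_MPI
#include <mpi.h>
#endif

// Thin wrapper that hides the message-passing syntax from the rest of
// the library.  On one processor there is nothing to communicate, so
// each reduction simply returns its input.
class Communicator
{
public:
  double min(double local) const { return allreduce(local, 'm'); }
  double max(double local) const { return allreduce(local, 'M'); }
  double sum(double local) const { return allreduce(local, 's'); }

private:
  double allreduce(double local, char op) const
  {
#ifdef HAVE_MPI
    MPI_Op mpi_op = (op == 'm') ? MPI_MIN
                  : (op == 'M') ? MPI_MAX
                  :               MPI_SUM;
    double global = local;
    MPI_Allreduce(MPI_IN_PLACE, &global, 1, MPI_DOUBLE, mpi_op,
                  MPI_COMM_WORLD);
    return global;
#else
    (void) op;     // one processor: the local value is the global value
    return local;
#endif
  }
};
```

The same pattern extends to send/receive, broadcast, etc., and keeps the
#ifdefs in one place instead of scattered through the library.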