From: David Knezevic <david.knezevic@ba...>  20070825 11:41:02

Hi, I'm planning to implement a method similar to the alternating direction implicit (ADI) method in libMesh, and I was hoping to get a bit of advice :) For concreteness, I'll refer to the heat equation in the (x,y) plane (the PDE I'm solving is more complicated, but all the important issues I need to resolve are captured by the heat equation). So the idea is that I pick a mesh point on the xaxis which gives us a 1D crosssection of the solution in the y direction. I then apply the split differential operator to this ydirection crosssection in order to update those values. I then repeat this for all the mesh points on the xaxis, and then do the same for points on the yaxis, and when all these 1D updates are complete, I've finished a timestep. Now that's easy enough, but what I'd really like to do is implement this in parallel. Clearly the algorithm parallelizes very efficiently, as you could do all the updates in a given direction concurrently on different processors. However, this isn't the normal model of parallel computing in libMesh, because instead of using many processors for one large matrix assembly and solve, I need to do a large number of serial solves concurrently. For the PDE I'm solving, I'm using a spectral method in one direction, and I want to couple it to a libMesh finite element code in the other direction using this ADIlike approach. I've already implemented my spectral method using PETSc, and I've got it working in parallel. The approach I'm using is that I store a large MPI dense matrix which holds the solution data. This gets partitioned so that each processor is assigned a certain number of rows of this parallel matrix, and each processor performs its solves locally (i.e. sequentially). This turned out to be easy to implement in PETSc, because you can just create sequential vectors and matrices on each processor, and do a singleprocessor solve by passing the MPI communicator PETSC_COMM_SELF to the KSP object, even when on a parallel architecture. So I've got this part of my system running, and it's working well on lonestar at the moment. As I said, though, I want to couple this spectral method to libMesh, and to get the whole thing working in parallel so I need to get libMesh to do multiple solves concurrently as well. I imagine this should be possible, and would probably just require that I define some new data types. At the moment, the PETSc data types in libMesh detect whether or not there is more than one processor, and if there is more than one processor it creates a parallel matrix or vector... So I could define, say, PetscSeqMatrix, which could inherit from PetscMatrix but would always be sequential? Similarly for the solving, I would guess I'd just need to create a sequential KSP object... All this could be housed in an ADISystem solver? How about matrix assembly? Would there be any snags in ensuring that was done sequentially? Anyway, I just wanted to write down a few thoughts on this... any feedback would be most appreciated. Cheers, Dave 
From: John Peterson <peterson@cf...>  20070827 16:50:42

David Knezevic writes: > Hi, > > So I could define, say, PetscSeqMatrix, which could inherit from > PetscMatrix but would always be sequential? Similarly for the > solving, I would guess I'd just need to create a sequential KSP > object... All this could be housed in an ADISystem solver? How about adding a new constructor which takes an argument (or adding optional arguments to existing constructors) to force sequential matrix construction even when running in parallel? Might be easier than inheriting. > How about matrix assembly? Would there be any snags in ensuring that > was done sequentially? I think it depends on how are you partitioning the 2D mesh. (There is still a logically2D mesh, even though you are just doing these 1D solves, correct?) If you were able to partition the 2D mesh into a sequence of 1D strips a few elements wide, I think the current_local_solution would have all the values necessary to do your sequential assemblies. John 
From: Roy Stogner <roystgnr@ic...>  20070827 17:45:41

On Mon, 27 Aug 2007, John Peterson wrote: > David Knezevic writes: > > > So I could define, say, PetscSeqMatrix, which could inherit from > > PetscMatrix but would always be sequential? Similarly for the > > solving, I would guess I'd just need to create a sequential KSP > > object... All this could be housed in an ADISystem solver? > > How about adding a new constructor which takes an argument (or adding > optional arguments to existing constructors) to force sequential matrix > construction even when running in parallel? Might be easier than > inheriting. I think John's got the right idea here; the PETSc APIs for the most part just take pointers to Mat structs, so you could use inheritance in our interface without having to copy and paste a ton of code. > I think it depends on how are you partitioning the 2D mesh. (There is still > a logically2D mesh, even though you are just doing these 1D solves, correct?) > If you were able to partition the 2D mesh into a sequence of 1D strips a > few elements wide, I think the current_local_solution would have all the > values necessary to do your sequential assemblies. The trouble is that you'd need to partition the 2D mesh into 1D strips horizontally to do the solves in that direction, then vertically to do the solves in that direction, etc. repeatedly. This would just mean writing your own partitioner, but I'd worry about performance: even if there's no projection step required, how much overhead do we have in libMesh's code for transferring values and their ownership from processor to processor? It's the sort of thing we've never worried about optimizing because in our applications it only happens once per timestep and it's negligible compared to the linear solves. That might not be the case if repartitioning/dof renumbering/etc was *in* a linear solve as part of each iteration. All the solutions I can think of off the top of my head are pretty ugly; I'll post again if an idea pops up. On the other hand, I suppose if you're just dealing with 2D problems, you may not care about wall clock time in a practical sense and you can report any performance figures as iteration counts.  Roy 
From: Derek Gaston <friedmud@gm...>  20070827 21:10:56

I like the idea of writing your own partitioner to dole out the strips. I did actually do some performance analysis of the partitioners in libMesh a couple of years ago... and found that they are way down in the noise for our normal implicit calculations. It could be that with a scheme such as this they become much more apparent, but I actually kind of doubt it. Running through the mesh and assigning a number to each element is a pretty dang fast thing to do... I would be interested to hear about timing studies done with such a partitioner though... Derek On 8/27/07, Roy Stogner <roystgnr@...> wrote: > On Mon, 27 Aug 2007, John Peterson wrote: > > > David Knezevic writes: > > > > > So I could define, say, PetscSeqMatrix, which could inherit from > > > PetscMatrix but would always be sequential? Similarly for the > > > solving, I would guess I'd just need to create a sequential KSP > > > object... All this could be housed in an ADISystem solver? > > > > How about adding a new constructor which takes an argument (or adding > > optional arguments to existing constructors) to force sequential matrix > > construction even when running in parallel? Might be easier than > > inheriting. > > I think John's got the right idea here; the PETSc APIs for the most > part just take pointers to Mat structs, so you could use inheritance > in our interface without having to copy and paste a ton of code. > > > I think it depends on how are you partitioning the 2D mesh. (There is still > > a logically2D mesh, even though you are just doing these 1D solves, correct?) > > If you were able to partition the 2D mesh into a sequence of 1D strips a > > few elements wide, I think the current_local_solution would have all the > > values necessary to do your sequential assemblies. > > The trouble is that you'd need to partition the 2D mesh into 1D strips > horizontally to do the solves in that direction, then vertically to do > the solves in that direction, etc. repeatedly. This would just mean > writing your own partitioner, but I'd worry about performance: even if > there's no projection step required, how much overhead do we have in > libMesh's code for transferring values and their ownership from > processor to processor? It's the sort of thing we've never worried > about optimizing because in our applications it only happens once per > timestep and it's negligible compared to the linear solves. That > might not be the case if repartitioning/dof renumbering/etc was *in* a > linear solve as part of each iteration. > > All the solutions I can think of off the top of my head are pretty > ugly; I'll post again if an idea pops up. On the other hand, I > suppose if you're just dealing with 2D problems, you may not care > about wall clock time in a practical sense and you can report any > performance figures as iteration counts. >  > Roy > >  > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Libmeshusers mailing list > Libmeshusers@... > https://lists.sourceforge.net/lists/listinfo/libmeshusers > 
From: David Knezevic <david.knezevic@ba...>  20070828 10:26:53

> The trouble is that you'd need to partition the 2D mesh into 1D strips > horizontally to do the solves in that direction, then vertically to do > the solves in that direction, etc. repeatedly. This would just mean > writing your own partitioner, but I'd worry about performance: even if > there's no projection step required, how much overhead do we have in > libMesh's code for transferring values and their ownership from > processor to processor? It's the sort of thing we've never worried > about optimizing because in our applications it only happens once per > timestep and it's negligible compared to the linear solves. That > might not be the case if repartitioning/dof renumbering/etc was *in* a > linear solve as part of each iteration. In my email to John, I outlined my plan for the partitioning. I wonder if you think it's appropriate? My idea is to store a dense PETSc matrix, external to any libMesh data structures, and use that to do the simple partitioning such that each processor gets an equal number of rows. I'll just load the solution data from this matrix into the libMesh solution vector as required. You're right about the need to cut the data both horizontally and vertically. At the end of all the horizontal solves, I'll transpose the dense PETSc matrix. This takes care of all the necessary communications, and I'm sure it's a nontrivial operation, but I doubt this transposing it will take a large proportion of the overall computation time. This remains to be seen though :)  Dave 
From: David Knezevic <david.knezevic@ba...>  20070828 10:20:16

> Ohh, I think I see now. The dense matrix is really just a simple > way of getting > and storing rows (or columns) of solution data. Exactly. And the thing is, I've been talking about taking 1D strips from a logically 2D mesh, but what I'm actually implementing is taking "2D" strips from a 4D mesh. That is, the PDE I'm solving is posed in a 4D domain which is the cartesian product of two 2D domains, and I'm splitting the PDE and applying the ADI method in order to solve it. A single 2D strips is again just a row or column of the dense matrix (it doesn't matter how many dimensions the strips represent, because in the end they're just nodal values of the solution which will be loaded onto a current_local_solution vector). > I'll be curious to see what happens if some of the 1D problems are > harder > (and thus take longer to finish) than others. In such a case, you > may be > waiting on 1 or 2 CPUs to finish their quota while the others sit > idle. Definitely, but luckily in my application all the subproblems are equally hard. Well, actually, all the subproblems in a given direction are equally hard, but I'm only doing one direction at a time, since otherwise the solves would interfere with one another. The nice thing is that all the solves in a given direction are completely independent. > A different approach might be to use a "clientserver" structure, > where, as each "client" processor finishes a 1D solve, it asks the > "server" process for the next row/column to work on, until all are > finished. Since you've already got a dense parallel matrix with the > values, the communication part is mostly taken care of. This would > probably be a bit more complexity than necessary though, especially if > all the solves take about the same amount of time. I think this clientserver structure would give a lot more flexibility. As a first cut, though, I think I'll implement it in the simplest way, which is to just to allow PETSc to partition the dense matrix of solution data so that each processor gets an equal number of rows (which is done by default when you create an MPIDenseMatrix), and then can happily do all the solves on it's local rows, one at a time. In order to do the solves in the other direction, I just transpose the dense matrix and again PETSc partitions it so that each processor gets an equal number of rows of the transposed matrix. But I think that if, later on, I was able to try problems large enough that it was beneficial to have multiple processors working on each subproblem, then the clientserver structure would give the flexibility to handle that situation. But yeah, for the moment, the partitioning that I'm doing is so trivial that I don't think I'll need to worry about implementing anything more complicated than what I described above. Cheers, Dave 