From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-24 12:53:55
|
Let me first describe the use case I have before I get to the question, and what I think is a problem… I have two sets of physics, A & B, which are coupled through boundary data along their common boundary interface. For a number of reasons, physics A and B are implemented in standalone codes. In this case one or both happens to be a libMesh app. Now, I want to pass data between the two. This is the genesis of my "would you be interested in a point cloud interpolation?" question some months ago. To pass data between the two, I of course want to use libMesh. And I'd like to do this all in one app instead of through file I/O or anything like that.

In pseudocode, what I envision is

  LibMeshInit();

  PointCloudData shared_data;

  MPI_Comm_split(COMM_WORLD, SPLIT_COMM);

  PhysicsA physicsA(SPLIT_COMM, data);
  PhysicsB physicsB(SPLIT_COMM, data);

  while (time < tmax)
    physicsA.advance(dt);
    physicsB.advance(dt);
  done

Inside each advance method, each physics puts/gets data from the shared_data structure. And they each run on a subset of the total number of processors.

But the problem: main() and physicsA (resp. physicsB) will coexist on some of the processors. And recall at least one of them also uses libMesh. As I see it now, all the extern data we have (especially the MPI communicator!) makes this impossible. The only way to properly do it I can think of is to move all that extern stuff into the LibMeshInit object, and have it proliferate through the object tree - in particular make it into the mesh, equation systems, etc…

Am I missing something? This doesn't happen in QUESO integration because on any one processor there is at most one libMesh app, right?

This is a pretty valuable use case, and I'm inclined to fix it, but it could be a major change and I don't want to jump into it without thinking it through…

-Ben |
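A minimal sketch of the split described above (the helper name and the half-and-half rank assignment are hypothetical; real code would pass the resulting color to MPI_Comm_split to build each physics' sub-communicator):

```cpp
#include <cassert>

// Hypothetical helper: decide which sub-communicator ("color") a rank joins.
// MPI_Comm_split(MPI_COMM_WORLD, physics_color(rank, size), rank, &split_comm)
// would then hand physics A the first half of the ranks and physics B the rest.
enum PhysicsColor { PHYSICS_A = 0, PHYSICS_B = 1 };

inline PhysicsColor physics_color(int world_rank, int world_size)
{
  // First half of the processors run physics A, the remainder run physics B.
  return (world_rank < world_size / 2) ? PHYSICS_A : PHYSICS_B;
}
```

With 4 ranks, ranks 0-1 land in physics A's communicator and ranks 2-3 in physics B's; main() still spans all 4, which is exactly where the shared global state becomes a problem.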
From: Paul T. B. <ptb...@gm...> - 2012-10-24 13:25:08
|
On Wed, Oct 24, 2012 at 7:53 AM, Kirk, Benjamin (JSC-EG311) < ben...@na...> wrote: > This doesn't happen in Queso integration because on any one processor > there is at most one libMesh app, right? > Haven't thought about the other parts of this, but I can immediately answer this part. Yes, in all the QUESO apps I've built, any one processor is running only 1 instance of libMesh. |
From: Roy S. <roy...@ic...> - 2012-10-24 13:38:39
|
On Wed, 24 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > But the problem: main() and physicsA (resp. physicsB) will coexist > on some of the processors. And recall at least one of them also > uses libMesh. As I see it now, all the extern data we have > (especially the MPI communicator!) makes this impossible. The only > way to properly do it I can think of is to move all that extern > stuff into the LibMeshInit object, and have it proliferate through > the object tree - in particular make it into the mesh, equation > systems, etc… This was one of the original design goals of the libMeshInit object, actually. But the only design goal I needed right away was something like getting RAII semantics working for a code that wanted to do proper cleanup after an exception throw, so I stopped once that was accomplished before trying to figure out how to hit any of the difficult goals. > Am I missing something? This doesn't happen in QUESO integration > because on any one processor there is at most one libMesh app, > right? One app at one time, anyway. IIRC the only fix for QUESO integration was to make sure we could init libMesh::COMM_WORLD to be something other than MPI_COMM_WORLD. > This is a pretty valuable use case, and I'm inclined to fix it, but > it could be a major change and I don't want to jump into it without > thinking it through… Agreed. The big problems seem to be backwards compatibility and per-object data storage. 1) I assume we don't want to break the code of everyone using libMesh::processor_id(), etc? But if libMeshInit's constructor sets a global pointer to "this", then global functions could grab previously-global state through that pointer, and only multi-LibMeshInit-aware apps would need to switch to my_libmesh->processor_id() instead. 2) How many of our classes would need to start stashing a pointer-to-LibMeshInit so they'd be multi-LibMeshInit compatible? 
I suppose only the MPI/TBB aware classes would need this, and they're all heavyweight enough that an extra pointer wouldn't matter? --- Roy |
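Roy's point (1) can be sketched as follows — a minimal, hypothetical model (not the actual libMesh implementation) of a constructor that publishes itself through a global pointer so the legacy global functions keep working:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: the init object registers itself globally on
// construction, so old-style global calls can forward through the pointer
// while multi-init-aware apps call my_libmesh->processor_id() directly.
struct LibMeshInit
{
  explicit LibMeshInit(unsigned int rank) : _rank(rank) { _global = this; }
  ~LibMeshInit() { if (_global == this) _global = nullptr; }

  unsigned int processor_id() const { return _rank; }

  static LibMeshInit * global() { return _global; }

private:
  unsigned int _rank;
  static LibMeshInit * _global;
};

LibMeshInit * LibMeshInit::_global = nullptr;

// Backwards-compatible global function, forwarding through the pointer.
unsigned int processor_id() { return LibMeshInit::global()->processor_id(); }
```

Legacy code calling the free processor_id() sees no API change; only code that constructs several init objects has to be explicit about which one it means.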
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-24 13:56:50
|
On Oct 24, 2012, at 6:38 AM, Roy Stogner <roy...@ic...> wrote: >> >> This is a pretty valuable use case, and I'm inclined to fix it, but >> it could be a major change and I don't want to jump into it without >> thinking it through… > > Agreed. > > The big problems seem to be backwards compatibility and per-object > data storage. > > 1) I assume we don't want to break the code of everyone using > libMesh::processor_id(), etc? But if libMeshInit's constructor sets a > global pointer to "this", then global functions could grab > previously-global state through that pointer, and only > multi-LibMeshInit-aware apps would need to switch to > my_libmesh->processor_id() instead. Right, I think maintaining backwards compatibility with the API will be no problem. > 2) How many of our classes would need to start stashing a > pointer-to-LibMeshInit so they'd be multi-LibMeshInit compatible? I > suppose only the MPI/TBB aware classes would need this, and they're > all heavyweight enough that an extra pointer wouldn't matter? This I'm actually afraid of - but to find out I'm going to start by removing the default communicator argument (temporarily!) from the Parallel:: methods and see what happens. What really scares me is everywhere there is a parallel_verify() or whatever the communicator will need to be passed in, which means access to the LibMeshInit (or some lighter-weight object). The API would be hardly different - pass an optional LibMeshInit object to the Mesh, and then it takes care of the rest - but the internals I'm honestly a little worried about! -Ben |
From: Derek G. <fri...@gm...> - 2012-10-24 16:24:02
|
We would like this capability as well. Right now, what we do is just solve both physics over _all_ of the processors (one after another)... which isn't always ideal. It would be great to have more flexibility here - even if it comes at the cost of a bit more work anywhere that we are doing actual parallel calls... Derek On Wed, Oct 24, 2012 at 6:53 AM, Kirk, Benjamin (JSC-EG311) < ben...@na...> wrote: > Let me first describe the use case I have before I get to the question, > and what I think is a problem… > > I have two sets of physics, A & B, which are coupled through boundary data > along their common boundary interface. For a number of reasons, physics A > and B are implemented in standalone codes. In this case one or both > happens to be a libMesh app. > > Now, I want to pass data between the two. This is the genesis of my > "would you be interested in a point cloud interpolation?" question some > months ago. > > To pass data between the two, I of course want to use libMesh. And I'd > like to do this all in one app instead of through file I/O or anything like > that. > > In pseudocode, what I envision is > > > LibMeshInit(); > > PointCloudData shared_data; > > MPI_Comm_split(COMM_WORLD, SPLIT_COMM) > > PhysicsA physicsA(SPLIT_COMM,data); > PhysicsB physicsB(SPLIT_COMM,data); > > while (time < tmax) > physicsA.advance(dt); > physicsB.advance(dt); > done > > > inside each advance method, each physics puts/gets data from the > shared_data structure. And they each run on a subset of the total number > of processors. > > But the problem: main() and physicsA (resp. physicsB) will coexist on > some of the processors. And recall at least one of them also uses libMesh. > As I see it now, all the extern data we have (especially the MPI > communicator!) makes this impossible. 
The only way to properly do it I can > think of is to move all that extern stuff into the LibMeshInit object, and > have it proliferate through the object tree - in particular make it into > the mesh, equation systems, etc… > > Am I missing something? This doesn't happen in QUESO integration because > on any one processor there is at most one libMesh app, right? > > This is a pretty valuable use case, and I'm inclined to fix it, but it > could be a major change and I don't want to jump into it without thinking > it through… > > -Ben > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_sfd2d_oct > _______________________________________________ > Libmesh-devel mailing list > Lib...@li... > https://lists.sourceforge.net/lists/listinfo/libmesh-devel > |
From: Roy S. <roy...@ic...> - 2012-10-24 16:41:26
|
On Wed, 24 Oct 2012, Derek Gaston wrote: > We would like this capability as well. > Right now, what we do is just solve both physics over _all_ of the > processors (one after another)... which isn't always ideal. It > would be great to have more flexibility here - even if it comes at > the cost of a bit more work anywhere that we are doing actual > parallel calls... The big question I have is about those parallel calls (and parallel NumericVector initializations, etc): Although in the short run we'll certainly want their libMesh context (or Communicator) argument to default to the global LibMeshInit* target or its Communicator_World, what do we want to do in the long run? If we leave that default argument in place, we'll be encouraging people to write difficult-to-extend code which breaks at runtime in confusing ways when you try to extend it. We once changed all the example assembly code from "works with the mesh that example hands it" to "works with whatever parallel AMR a user might throw at it" simply because having to help users extend code in such a way is horribly un-fun. If we take that default argument away, every single Parallel call in app code will have to be changed from something like "Parallel::max(a)" into something like "Parallel::max(a, mesh.get_libmesh().Communicator_World())", and that's a huge backwards compatibility breakage to hit users with. Maybe we could play some tricks with the global LibMeshInit* target to make extensions easier to debug? E.g. the first LibMeshInit object to be constructed sets that target from NULL to "this", but then any subsequent LibMeshInit objects set it back to NULL, so that if it ever mistakenly gets used in a multi-LibMeshInit application the result is an immediate and easily-diagnosed segfault rather than a subtle MPI error/deadlock. --- Roy |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-24 17:05:20
|
On Oct 24, 2012, at 9:41 AM, Roy Stogner <roy...@ic...> wrote: > If we take that default argument away, every single Parallel call in > app code will have to be changed from something like > "Parallel::max(a)" into something like "Parallel::max(a, > mesh.get_libmesh().Communicator_World())", and that's a huge backwards > compatibility breakage to hit users with. Wait - no. I meant I was simply going to take the default argument away for development to find all the places *we* need to pass the proper communicator. I would not commit that change, only use it as a compile-time tool to find where we need to fix things (and therefore have a LibMeshInit object) passed... > Maybe we could play some tricks with the global LibMeshInit* target to > make extensions easier to debug? E.g. the first LibMeshInit object to > be constructed sets that target from NULL to "this", but then any > subsequent LibMeshInit objects set it back to NULL, so that if it ever > mistakenly gets used in a multi-LibMeshInit application the result is > an immediate and easily-diagnosed segfault rather than a subtle MPI > error/deadlock. Sure - I like that idea. -Ben |
From: Roy S. <roy...@ic...> - 2012-10-24 17:13:08
|
On Wed, 24 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > On Oct 24, 2012, at 9:41 AM, Roy Stogner <roy...@ic...> wrote: > >> If we take that default argument away, every single Parallel call in >> app code will have to be changed from something like >> "Parallel::max(a)" into something like "Parallel::max(a, >> mesh.get_libmesh().Communicator_World())", and that's a huge backwards >> compatibility breakage to hit users with. > > Wait - no. I meant I was simply going to take the default argument > away for development to find all the places *we* need to pass the > proper communicator. I would not commit that change, only use it as > a compile-time tool to find where we need to fix things (and > therefore have a LibMeshInit object) passed... Oh, I understood your meaning - I was just pointing out that *I* saw some compelling reasons for considering taking the default argument away permanently, eventually. >> Maybe we could play some tricks with the global LibMeshInit* target to >> make extensions easier to debug? E.g. the first LibMeshInit object to >> be constructed sets that target from NULL to "this", but then any >> subsequent LibMeshInit objects set it back to NULL, so that if it ever >> mistakenly gets used in a multi-LibMeshInit application the result is >> an immediate and easily-diagnosed segfault rather than a subtle MPI >> error/deadlock. > > Sure - I like that idea. In fact, forget the segfault - stick a libmesh_assert() around the argument before using it and we can actually guarantee easy debugging even on systems which don't immediately segfault upon a NULL dereference. (Yes, I did use such a system once. "Let's allocate an extra page of user memory around 0x0, to prevent NULL dereferences from immediately crashing programs" probably sounded like a great idea to some OS designer who had enough good intentions to pave a road...) --- Roy |
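The fail-fast trick Roy describes could look like this — a hypothetical sketch (names and plain assert() standing in for libmesh_assert()): the first init publishes itself, any subsequent init poisons the global, so a multi-init app that mistakenly relies on global state trips an assertion instead of a subtle MPI deadlock:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the "poison the global on second init" idea.
struct LibMeshInit
{
  LibMeshInit()
  {
    // First init sets the global target; any later init resets it to NULL
    // so accidental use of global state fails loudly in multi-init apps.
    _global = _count ? nullptr : this;
    ++_count;
  }

  static LibMeshInit & global()
  {
    assert(_global != nullptr);  // stands in for libmesh_assert()
    return *_global;
  }

  static bool global_is_valid() { return _global != nullptr; }

private:
  static LibMeshInit * _global;
  static int _count;
};

LibMeshInit * LibMeshInit::_global = nullptr;
int LibMeshInit::_count = 0;
```

A single-init app behaves exactly as before; constructing a second init invalidates the global, and any code path still leaning on it asserts immediately with an obvious backtrace.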
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-24 17:17:09
|
On Oct 24, 2012, at 10:12 AM, Roy Stogner <roy...@ic...> wrote: > Oh, I understood your meaning - I was just pointing out that *I* saw > some compelling reasons for considering taking the default argument > away permanently, eventually OK, I can see the benefits of that too (eventually!) Well in that case, a much less severe API change might be Parallel::max(a); becomes Parallel parallel(MY_COMMUNICATOR); … parallel.max(a); at least that one can be handled with only one line of new code and a query/replace! |
From: Roy S. <roy...@ic...> - 2012-10-24 17:32:29
|
On Wed, 24 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > Well in that case, a much less severe API change might be > > Parallel::max(a); > > becomes > > Parallel parallel(MY_COMMUNICATOR); > … > > parallel.max(a); > > at least that one can be handled with only one line of new code and a query/replace! What's particularly pathetic isn't that I didn't think of this myself, it's that I did think of it (quite some time ago), then forgot about the idea, then today suggested an obviously grossly inferior idea instead. This is such an aesthetic improvement over the mess of default Communicator arguments in parallel.h that I'd be tempted to change to something like your proposed API regardless of other benefits. Move the existing code from Parallel:: functions to Communicator member functions, leave the Parallel::X() functions as libmesh_deprecated() shims to libMesh::Communicator_World.X(), and old user code would remain API compatible until the deprecated stuff got removed. --- Roy |
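The shape of that refactor could be sketched as follows — a simplified, non-MPI stand-in (the reduction bodies and communicator_world() accessor are hypothetical; in the real library the member would wrap MPI_Allreduce over its communicator):

```cpp
#include <cassert>

namespace Parallel
{
  struct Communicator
  {
    // Member version: the real implementation would call
    // MPI_Allreduce(..., MPI_MAX, _comm); over a single rank (or in this
    // serial stand-in) the max of a local value is just that value.
    template <typename T>
    void max(T & /*a*/) const {}
  };

  // Stand-in for a world communicator like libMesh::CommWorld.
  inline Communicator & communicator_world()
  {
    static Communicator world;
    return world;
  }

  // Shim keeping the old Parallel::max(a) API compiling; once the library
  // is converted it would start with a libmesh_deprecated()-style warning.
  template <typename T>
  void max(T & a)
  {
    communicator_world().max(a);
  }
}
```

Old code keeps calling Parallel::max(a) unchanged, while multi-communicator apps construct their own Communicator and call the member directly - which is exactly the one-line query/replace Ben describes.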
From: Paul T. B. <ptb...@gm...> - 2012-10-24 17:38:20
|
On Wed, Oct 24, 2012 at 12:32 PM, Roy Stogner <roy...@ic...>wrote: > This is such an aesthetic improvement over the mess of default > Communicator arguments in parallel.h that I'd be tempted to change to > something like your proposed API regardless of other benefits. Move > the existing code from Parallel:: functions to Communicator member > functions, leave the Parallel::X() functions as libmesh_deprecated() > shims to libMesh::Communicator_World.X(), and old user code would > +1, FWIW. |
From: Roy S. <roy...@ic...> - 2012-10-30 22:04:32
|
On Wed, 24 Oct 2012, Paul T. Bauman wrote: > On Wed, Oct 24, 2012 at 12:32 PM, Roy Stogner <roy...@ic...> wrote: > This is such an aesthetic improvement over the mess of default > Communicator arguments in parallel.h that I'd be tempted to change to > something like your proposed API regardless of other benefits. Move > the existing code from Parallel:: functions to Communicator member > functions, leave the Parallel::X() functions as libmesh_deprecated() > shims to libMesh::Communicator_World.X(), and old user code would > > +1, FWIW. I haven't added the libmesh_deprecated() shims (and won't until all library code is updated to use the new APIs, which will probably wait until we decide on some name like libMesh::CommWorld to use as a terser reference for libMesh::Parallel::Communicator_World), but the basic task of moving APIs to Communicator and making the old global functions into shims is done. It's a particularly annoying "svn diff" to apply, since svn 1.6 doesn't support "--show-copies-as-adds" and patch doesn't like symlinks, but here's how to apply from your LIBMESH_DIR:

  wget http://users.ices.utexas.edu/~roystgnr/parallel_refactor.patch
  svn copy include/parallel/parallel.h include/parallel/parallel_implementation.h
  patch -p0 < parallel_refactor.patch
  ln -sf ../parallel/parallel_implementation.h include/libmesh/
  ln -sf ../parallel/parallel_communicator_specializations include/libmesh/

Extra testing would be appreciated. --- Roy |
From: Paul T. B. <ptb...@gm...> - 2012-10-31 02:37:11
|
On Tue, Oct 30, 2012 at 5:04 PM, Roy Stogner <roy...@ic...>wrote: > > Extra testing would be appreciated. > Applying the patch and running one of the grins examples as is with all mpi or mpi+threads worked fine for me. |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-31 15:51:24
|
I applied this on libmesh.automake, which imposes strict namespacing, and I had to add 'using namespace libMesh;' inside your anonymous namespace in parallel_implementation.h so that Parallel:: was found properly. Then testing with FIN-S showed no issues! -Ben |
From: Roy S. <roy...@ic...> - 2012-10-31 16:26:03
|
On Wed, 31 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > I applied this on libmesh.automake, which imposes strict namespacing, and I had to add > > 'using namespace libMesh;' > > inside your anonymous namespace in parallel_implementation.h so that Parallel:: was found properly. Bah - I forgot to test with namespace separation enabled. Thanks for catching that! > Then testing with FIN-S showed no issues! Thanks, to both you and Paul! Unless I hear screams from elsewhere I'll commit this afternoon. Any preferences or suggestions on the naming issue? I really don't want to replace (much less deprecate) the old Parallel::sum(x) type calls until we've got something nicer looking than Parallel::Communicator_World.sum(x) to replace them. CommWorld.sum(x)? Comm_World.sum(x)? Comm.sum(x)? MPIWorld.sum(x)? --- Roy |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-31 16:51:26
|
On Oct 31, 2012, at 11:25 AM, Roy Stogner <roy...@ic...> wrote: > CommWorld.sum(x)? Comm_World.sum(x)? Comm.sum(x)? MPIWorld.sum(x)? CommWorld.sum(x); gets my vote. Comm is too nebulous - easily could conflict with any other communicator that could exist in a program. The others just aesthetically offend me, if nothing else! -Ben |
From: Derek G. <fri...@gm...> - 2012-10-31 16:57:45
|
Give me a bit - I'll test it in about an hour or so. Derek On Wed, Oct 31, 2012 at 10:25 AM, Roy Stogner <roy...@ic...>wrote: > > On Wed, 31 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > > > I applied this on libmesh.automake, which imposes strict namespacing, > and I had to add > > > > 'using namespace libMesh;' > > > > inside your anonymous namespace in parallel_implementation.h so that > Parallel:: was found properly. > > Bah - I forgot to test with namespace separation enabled. Thanks for > catching that! > > > Then testing with FIN-S showed no issues! > > Thanks, to both you and Paul! Unless I hear screams from elsewhere > I'll commit this afternoon. > > Any preferences or suggestions on the naming issue? I really don't > want to replace (much less deprecate) the old Parallel::sum(x) type > calls until we've got something nicer looking than > Parallel::Communicator_World.sum(x) to replace them. > > CommWorld.sum(x)? Comm_World.sum(x)? Comm.sum(x)? MPIWorld.sum(x)? > --- > Roy |
From: Roy S. <roy...@ic...> - 2012-10-31 16:59:45
|
On Wed, 31 Oct 2012, Derek Gaston wrote: > Give me a bit - I'll test it in about an hour or so. I'll wait on you, thanks. > CommWorld.sum(x)? Comm_World.sum(x)? Comm.sum(x)? MPIWorld.sum(x)? So far we've got Ben's, my, and Paul's votes for libMesh::CommWorld, so I'll add that as a reference unless there's an outpouring of others' support for something else. --- Roy |
From: John P. <jwp...@gm...> - 2012-10-31 17:02:49
|
On Wed, Oct 31, 2012 at 10:59 AM, Roy Stogner <roy...@ic...> wrote: > > On Wed, 31 Oct 2012, Derek Gaston wrote: > >> Give me a bit - I'll test it in about an hour or so. > > > I'll wait on you, thanks. > >> CommWorld.sum(x)? Comm_World.sum(x)? Comm.sum(x)? MPIWorld.sum(x)? > > > So far we've got Ben's, my, and Paul's votes for libMesh::CommWorld, > so I'll add that as a reference unless there's an outpouring of > others' support for something else. libMesh::CMWRLD j/k -- John |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-31 17:19:28
|
> On Wed, 31 Oct 2012, Derek Gaston wrote: > >> Give me a bit - I'll test it in about an hour or so. > > I'll wait on you, thanks. > >> CommWorld.sum(x)? Comm_World.sum(x)? Comm.sum(x)? MPIWorld.sum(x)? > > So far we've got Ben's, my, and Paul's votes for libMesh::CommWorld, > so I'll add that as a reference unless there's an outpouring of > others' support for something else. Thanks for doing this. I'm going to use it to implement the optimization for David's reduced basis eigenvector I/O. And also to redo the embarrassing EquationSystems serialized I/O code path. In reading it last night I found some very, very poor algorithmic complexity I want to fix. -Ben |
From: Derek G. <fri...@gm...> - 2012-10-31 18:04:04
|
On Wed, Oct 31, 2012 at 11:20 AM, Kirk, Benjamin (JSC-EG311) < ben...@na...> wrote: > And also to redo the embarrassing EquationSystems serialized I/O code path. > In reading it last night I found some very, very poor algorithmic > complexity I want to fix. > Awesome Ben! Thanks! Derek |
From: Roy S. <roy...@ic...> - 2012-10-31 17:30:53
|
On Wed, 31 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > redo the embarrassing EquationSystems serialized I/O code path. In > reading it last night I found some very, very poor algorithmic > complexity I want to fix. If you're messing around in EquationSystems I/O, any chance you could excise our libHilbert near-dependence while you're at it? IIRC we worked out the "right way" to do mesh/solution numbering on the list quite a while ago, it's just a nasty enough problem that nobody's had time to implement it. It's just become more urgent now that we've got licensing issues as well as bugs to worry about. --- Roy |
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-10-31 19:02:23
|
On Oct 31, 2012, at 12:30 PM, Roy Stogner <roy...@ic...> wrote: > On Wed, 31 Oct 2012, Kirk, Benjamin (JSC-EG311) wrote: > >> redo the embarassing EquationSystems serialized I/O code path. In >> reading it last night I found some very, very poor algorithmic >> complexity I want to fix. > > If you're messing around in EquationSystems I/O, any chance you could > excise our libHilbert near-dependence while you're at it? IIRC we > worked out the "right way" to do mesh/solution numbering on the list > quite a while ago, it's just a nasty enough problem that nobody's had > time to implement it. It's just become more urgent now that we've got > licensing issues as well as bugs to worry about. I'll review the thread and see what I can do. -Ben |
From: Derek G. <fri...@gm...> - 2012-11-01 20:55:07
|
On Wed, Oct 31, 2012 at 10:59 AM, Roy Stogner <roy...@ic...>wrote: > > I'll wait on you, thanks. > Sorry I didn't get to this yesterday... you checked this in this morning, right? I've tested the current changes and it did cause one compile error in MOOSE... but it was in my code for dealing with ghosting extra elements while using ParallelMesh. I know that that code is bad now anyway (I should use the new PackedElem stuff and redo the whole thing) so I just removed that part of the code from MOOSE for now until I get time to fix it all up. Hopefully that will be soon. Other than that it looks like everything is ok. Derek |
From: Roy S. <roy...@ic...> - 2012-11-01 21:10:26
|
On Thu, 1 Nov 2012, Derek Gaston wrote: > I've tested the current changes and it did cause one compile error in MOOSE... but it was in my code for dealing with ghosting extra elements > while using ParallelMesh. I know that that code is bad now anyway (I should use the new PackedElem stuff and redo the whole thing) so I just > removed that part of the code from MOOSE for now until I get time to fix it all up. Hopefully that will be soon. > > Other than that it looks like everything is ok. Thanks! I'd like to see the compile error if it's not too much trouble, though. I was attempting to keep things entirely backwards compatible... --- Roy |