From: Tim K. <tim...@ce...> - 2009-03-09 15:23:52
Dear Roy,

On Fri, 6 Mar 2009, Roy Stogner wrote:

>> It was the "solution" vector, not "current_local_solution".  I don't
>> know why it crashed there, but that was reproducible.  Perhaps I did
>> something else wrong.  (In earlier times, when I didn't quite
>> understand the difference between these two vectors,
>
> Don't feel bad about that.  For my own part, I don't care too much
> about serial vector memory usage - I'm mostly interested in getting
> ghosted vectors working because I think the solution /
> current_local_solution distinction is the second most confusing code
> in the library,

Just out of curiosity: What is the most confusing code in the library,
then?

>> I think I sometimes cross-assigned one of them to the other in a
>> different system.  Perhaps such an assignment still exists somewhere
>> hidden in my code.)
>
> That could be a serious problem, yes.
>
> But wouldn't we be catching that with a libmesh_assert()?  The
> preconditions on operator= in petsc_vector.C seem pretty thorough.

But only in debug mode, right?  The problem with that is that compiling
libMesh in debug mode requires compiling our institute's software
package in debug mode as well, since they use the same debug macros.  I
try to avoid that because it's a lot of work, so in practice I never
use a debug-mode libMesh.  (Feature request: Use LIBMESH_DEBUG rather
than DEBUG.)

>> I'm still working on the track-down.  The problem is that the
>> application runs for more than 2 hours before it crashes, and I
>> couldn't find a way to make it crash earlier.
>
> I hope your app has restart file capabilities!

Each time I face a problem like the current one, I think "Well, next
time I'll implement it."  (-:

>> In particular, I noticed that the grid is successfully refined two
>> or three times before the crash.  But I won't capitulate.
>
> Thanks!

The crash seems to happen somewhere in the element loop in
DofMap::enforce_constraints_exactly().  (Of course, the cause may lie
elsewhere.)  More details to follow next time.

>>> On the other hand, here's something even more odd: I just realized
>>> that I'm not properly matching the ghosted vector init() function
>>> signature.  The compiler must be deciding to convert GHOSTED to
>>> true and then using the default AUTOMATIC for the fifth argument?
>>> That wouldn't explain any crashes, but if we're fast-initializing
>>> vectors we should be clearing, that might explain your slightly-off
>>> residuals.  I'll fix it in SVN.
>>
>> Did you fix that?  "svn log" doesn't show anything from you later
>> than 2009-03-04.
>
> Whoops!  Fixed it, got distracted by other work, forgot to commit the
> fix.  I'll put in that (and your patch, which still looks good)
> today.

Funnily enough, my slightly-off residuals change with your patch, but
they are still slightly off, e.g.:

  without ghosted:                  1.16561e-08
  with ghosted, without your patch: 1.16563e-08
  with ghosted, with your patch:    1.16558e-08

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: John P. <jwp...@gm...> - 2009-03-09 15:28:58
On Mon, Mar 9, 2009 at 9:23 AM, Tim Kroeger <tim...@ce...> wrote:
> (Feature request: Use LIBMESH_DEBUG rather than DEBUG.)

This would probably be OK... since asserts fire when NDEBUG is *not*
defined, if I remember correctly, and they don't care about DEBUG.
Can anyone else comment?

--
John
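For reference, a minimal sketch of the standard <cassert> behavior John
describes: assert() is compiled away only when NDEBUG is defined, and
DEBUG plays no role either way.

    // ndebug_demo.C -- standard assert() semantics in a nutshell.
    //   g++ ndebug_demo.C          -> assert is active, program aborts
    //   g++ -DNDEBUG ndebug_demo.C -> assert compiles to nothing
    //   g++ -DDEBUG ndebug_demo.C  -> DEBUG has no effect on assert
    #include <cassert>

    int main()
    {
      assert(1 + 1 == 3); // fires unless NDEBUG was defined
      return 0;
    }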
From: Tim K. <tim...@ce...> - 2009-03-09 15:41:52
On Mon, 9 Mar 2009, John Peterson wrote:

> On Mon, Mar 9, 2009 at 9:23 AM, Tim Kroeger <tim...@ce...> wrote:
>> (Feature request: Use LIBMESH_DEBUG rather than DEBUG.)
>
> This would probably be OK... since asserts fire when NDEBUG is *not*
> defined, if I remember correctly, and they don't care about DEBUG.
> Can anyone else comment?

You're right; I found it in libmesh_common.h, line 218.  But how does
that help me?  Is there a configure option that results in neither
DEBUG nor NDEBUG being defined?  Or did you mean that asserts are
always enabled unless I manually define NDEBUG?

In any case, I would like to uphold my feature request.

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: John P. <jwp...@gm...> - 2009-03-09 15:50:10
On Mon, Mar 9, 2009 at 9:41 AM, Tim Kroeger <tim...@ce...> wrote:
> On Mon, 9 Mar 2009, John Peterson wrote:
>
>> On Mon, Mar 9, 2009 at 9:23 AM, Tim Kroeger <tim...@ce...> wrote:
>>>
>>> (Feature request: Use LIBMESH_DEBUG rather than DEBUG.)
>>
>> This would probably be OK... since asserts fire when NDEBUG is *not*
>> defined, if I remember correctly, and they don't care about DEBUG.
>> Can anyone else comment?
>
> You're right; I found it in libmesh_common.h, line 218.

Yes, with libmesh_assert we wanted to mimic the behavior of the
standard library's assert.

> But how does that help me?  Is there a configure option that results
> in neither DEBUG nor NDEBUG being defined?  Or did you mean that
> asserts are always enabled unless I manually define NDEBUG?

In optimized mode you get -DNDEBUG, in debug mode you get -DDEBUG, and
in devel mode you get neither.

I was just making the point that we probably wouldn't affect other
standard library functions by changing DEBUG to LIBMESH_DEBUG.  We
might break something by changing NDEBUG to LIBMESH_NDEBUG, but I
realize that's not what you are proposing.

> In any case, I would like to uphold my feature request.

It sounds reasonable to me, unless -DDEBUG is required by other
standard library functions, which I don't think it is...

--
John
From: Tim K. <tim...@ce...> - 2009-03-09 16:07:54
On Mon, 9 Mar 2009, John Peterson wrote:

> In optimized mode you get -DNDEBUG, in debug mode you get -DDEBUG,
> and in devel mode you get neither.

I didn't even know that a "devel" mode exists.  How do I invoke it?  I
can't find it in the output of "./configure --help" or in the
installation instructions on the web.  Actually, the other two modes
aren't described there either.  If these modes are really
undocumented, I would suggest documenting them; if they are
documented, I would suggest making that documentation easier to find.

Oops, the website does say something about this, but it only mentions
"opt", "dbg", and "pro".  Is the other one called "devel", "dvl", or
"dev"?

> I was just making the point that we probably wouldn't affect other
> standard library functions by changing DEBUG to LIBMESH_DEBUG.  We
> might break something by changing NDEBUG to LIBMESH_NDEBUG, but I
> realize that's not what you are proposing.

I would vote for decoupling libMesh's debug symbols completely from
other symbols.  If somebody wants DEBUG or NDEBUG defined for use in
the standard library functions, they can define those symbols
manually, can't they?  That would be the way to go without libMesh
anyway, wouldn't it?

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
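A minimal sketch of the decoupling Tim proposes -- purely hypothetical,
not the actual libmesh_common.h code: the library keys its own
assertions to a namespaced macro and leaves DEBUG/NDEBUG entirely to
the application build.

    // Hypothetical sketch; libmesh_common.h does not read like this.
    #include <cstdio>
    #include <cstdlib>

    #ifdef LIBMESH_DEBUG
    #  define libmesh_assert(cond)                                        \
         do { if (!(cond)) {                                              \
                std::fprintf(stderr, "Assertion `%s' failed at %s:%d\n",  \
                             #cond, __FILE__, __LINE__);                   \
                std::abort(); } } while (0)
    #else
    #  define libmesh_assert(cond) ((void)0)  /* compiled away */
    #endif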
From: John P. <jwp...@gm...> - 2009-03-09 16:29:07
On Mon, Mar 9, 2009 at 10:07 AM, Tim Kroeger <tim...@ce...> wrote:
> On Mon, 9 Mar 2009, John Peterson wrote:
>
>> In optimized mode you get -DNDEBUG, in debug mode you get -DDEBUG,
>> and in devel mode you get neither.
>
> I didn't even know that a "devel" mode exists.  How do I invoke it?
> I can't find it in the output of "./configure --help" or in the
> installation instructions on the web.  Actually, the other two modes
> aren't described there either.  If these modes are really
> undocumented, I would suggest documenting them; if they are
> documented, I would suggest making that documentation easier to find.
>
> Oops, the website does say something about this, but it only mentions
> "opt", "dbg", and "pro".  Is the other one called "devel", "dvl", or
> "dev"?

METHOD=devel make

I agree, it's not well documented.  The best place to find out about
the different modes is probably Make.common.in -- not exactly the
standard location for documentation ;-)

> I would vote for decoupling libMesh's debug symbols completely from
> other symbols.  If somebody wants DEBUG or NDEBUG defined for use in
> the standard library functions, they can define those symbols
> manually, can't they?  That would be the way to go without libMesh
> anyway, wouldn't it?

Yes, but I'd also like to help out app writers who call naked asserts
in their own codes and use libMesh's makefiles to compile.  As long as
we continue to set the standard NDEBUG, they don't have to worry about
it.

--
John
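Collecting the thread's facts in one place, the build-mode invocations
and the preprocessor flags John describes (exact flags may differ
across libMesh versions):

    METHOD=opt   make   # optimized: compiles with -DNDEBUG, asserts off
    METHOD=devel make   # devel: neither -DNDEBUG nor -DDEBUG, asserts on
    METHOD=dbg   make   # debug: compiles with -DDEBUG, extra checking on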
From: Roy S. <roy...@ic...> - 2009-03-09 16:28:52
On Mon, 9 Mar 2009, Tim Kroeger wrote:

> Just out of curiosity: What is the most confusing code in the
> library, then?

A matter of opinion, but in mine it's the FE<BLAH> base class
structure.  The standard Curiously Recurring Template Pattern is
confusing enough, our version of it isn't quite the standard, and even
though the weirdness is all there for optimization's sake, we'd
probably have been much better optimized by just doing FE inits with
fewer un-inlineable function calls in the middle of loops.

Of course, this is also a personal grudge.  My first contribution to
libMesh was a new FE type - if the FE classes had been more
straightforward, I would have abandoned my Matlab- and deal.II-based
(no offense to the latter fantastic library) codes and switched to
libMesh earlier and faster.

>> But wouldn't we be catching that with a libmesh_assert()?  The
>> preconditions on operator= in petsc_vector.C seem pretty thorough.
>
> But only in debug mode, right?

Right.

> The problem with that is that compiling libMesh in debug mode
> requires compiling our institute's software package in debug mode as
> well, since they use the same debug macros.  I try to avoid that
> because it's a lot of work, so in practice I never use a debug-mode
> libMesh.

Yikes!  I'm having to work on a non-libMesh code without any debug
mode right now, and the lack of proper precondition/postcondition
testing (or even the less-than-proper test coverage in libMesh) is
really painful.  libMesh makes it easy enough to keep simultaneous
debug and optimized builds around; IMHO every software package should
do the same.

> (Feature request: Use LIBMESH_DEBUG rather than DEBUG.)

Interesting idea.  Maybe in the next version, once people have had
some time to switch from assert() to libmesh_assert(); we wouldn't
want to turn off user code debugging inadvertently.

> The crash seems to happen somewhere in the element loop in
> DofMap::enforce_constraints_exactly().  (Of course, the cause may lie
> elsewhere.)  More details to follow next time.

Thanks!

> Funnily enough, my slightly-off residuals change with your patch, but
> they are still slightly off, e.g.:
>
>   without ghosted:                  1.16561e-08
>   with ghosted, without your patch: 1.16563e-08
>   with ghosted, with your patch:    1.16558e-08

Without my patch the problem could be that your first solution iterate
might have been effectively non-zero, depending on what was in the
allocated memory... but I've got no idea what the difference from
"without ghosted" could be.

---
Roy
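For readers who haven't met the pattern Roy names, a generic CRTP
sketch -- illustrative only, not libMesh's actual FE hierarchy:

    // Curiously Recurring Template Pattern: the base class is
    // parameterized on its own derived class, so it can call derived
    // methods without virtual dispatch.
    template <typename Derived>
    struct FEBase
    {
      double evaluate (double x)
      {
        // Static dispatch to the derived implementation:
        return static_cast<Derived*>(this)->shape_value(x);
      }
    };

    struct LinearFE : public FEBase<LinearFE>
    {
      double shape_value (double x) { return 1.0 - x; } // a linear shape fn
    };

    // Usage:  LinearFE fe;  double v = fe.evaluate(0.25);  // v == 0.75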
From: Tim K. <tim...@ce...> - 2009-03-10 14:12:49
On Mon, 9 Mar 2009, Roy Stogner wrote:

> On Mon, 9 Mar 2009, Tim Kroeger wrote:
>>> But wouldn't we be catching that with a libmesh_assert()?  The
>>> preconditions on operator= in petsc_vector.C seem pretty thorough.
>>
>> But only in debug mode, right?
>
> Right.

Talking about asserts, what do you think about the attached patch?
(I'm now using METHOD=devel to try to catch my crash with asserts, and
I now suspect it's somewhere in the ghosted part of PetscVector.)

> libMesh makes it easy enough to keep simultaneous debug and optimized
> builds around; IMHO every software package should do the same.

Well, our software package tries to make that easy as well.  It's just
such a huge package that it's still difficult to work with several
versions.

Anyway, METHOD=devel works well with the non-debug version of our
software.  Actually, it makes my application crash at a completely
different point, even before any NumericVectors are used.  Would you
please have a look at the attached test.cpp?  It creates a grid,
partially refines it 5 times, and then completely coarsens it again.
To actually delete the elements that are no longer required, it calls
Mesh::contract(), which crashes for me with METHOD=devel.  (It runs
fine with METHOD=opt, though.)

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Roy S. <roy...@ic...> - 2009-03-10 15:13:02
On Tue, 10 Mar 2009, Tim Kroeger wrote:

> Talking about asserts, what do you think about the attached patch?

A very good idea - I'll commit it now.

> (I'm now using METHOD=devel to try to catch my crash with asserts,
> and I now suspect it's somewhere in the ghosted part of PetscVector.)
>
>> libMesh makes it easy enough to keep simultaneous debug and
>> optimized builds around; IMHO every software package should do the
>> same.
>
> Well, our software package tries to make that easy as well.  It's
> just such a huge package that it's still difficult to work with
> several versions.

Understandable.  Again, I can't recommend distcc and "make -j 50"
highly enough.  (As long as you're using a single compiler, anyway -
distcc is a little tricky to get working with multiple installed
versions of the same compiler.)

> Anyway, METHOD=devel works well with the non-debug version of our
> software.

Sorry I never mentioned that.  I always find myself wanting debugging
stuff turned all the way on or all the way off, but John swears by
METHOD=devel, and I suppose it can be pretty useful.

> Actually, it makes my application crash at a completely different
> point, even before any NumericVectors are used.  Would you please
> have a look at the attached test.cpp?  It creates a grid, partially
> refines it 5 times, and then completely coarsens it again.  To
> actually delete the elements that are no longer required, it calls
> Mesh::contract(), which crashes for me with METHOD=devel.  (It runs
> fine with METHOD=opt, though.)

Thanks!  This is crashing for me, too, and it definitely shouldn't be.
Looks like something going wrong with find_neighbors and subactive
elements, which is 99% likely to be my fault - I only have a little
time to check it out today, but I'll give it a shot tonight and
tomorrow as well if I can't find the bug right away.

---
Roy
From: Roy S. <roy...@ic...> - 2009-03-10 15:58:39
On Tue, 10 Mar 2009, Roy Stogner wrote:

> if I can't find the bug right away.

Found it.

In UnstructuredMesh::contract() we loop over elements in no particular
order and delete the subactive ones, but in devel or debug mode we
test elem->level() to make sure it's not 0.  elem->level() on a
non-level-0 element accesses elem->parent()->dim()... but if the
subactive element we're currently deleting is the child of one we've
already deleted, that segfaults.

It's fixed now - instead of asserting that elem->level() is non-zero,
we just assuage my paranoia by making sure elem->parent() is non-NULL.

None of my apps do two coarsenings in a row without a contract() in
between, so I would never have caught this on my own.  Thanks!

---
Roy
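A simplified sketch of the failure mode and the fix Roy describes (not
the verbatim UnstructuredMesh::contract() code):

    // Inside contract()'s unordered element loop, `elem` is subactive
    // and about to be deleted; its parent may already have been
    // deleted earlier in the same loop.

    // Before: elem->level() walks the parent pointer to compute the
    // level, so it segfaults once the parent is gone:
    //   libmesh_assert (elem->level() != 0);

    // After: check the pointer itself, without dereferencing it:
    libmesh_assert (elem->parent() != NULL);

    this->delete_elem (elem);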
From: Tim K. <tim...@ce...> - 2009-03-11 14:13:05
On Tue, 10 Mar 2009, Roy Stogner wrote:

> On Tue, 10 Mar 2009, Roy Stogner wrote:
>
>> if I can't find the bug right away.
>
> Found it.

Great.  Now the application crashes at the "original" crash point.

It happens in System::project_vector() for the case of a ghosted
vector.  This calls DofMap::enforce_constraints_exactly(), which in
line 680 of dof_map_constraints.C calls NumericVector::operator().  It
hits the libmesh_assert() near the end of
PetscVector::map_global_to_local_index(), i.e. the index supplied to
operator() is neither a local nor a ghost index.

There are 7 systems in the application (all on the same grid).  Two
ExplicitSystems and two LinearImplicitSystems are projected
successfully, but the first TransientLinearImplicitSystem triggers the
crash when projecting its _transient_old_local_solution.  That system
contains two variables, both of the same FE type.

To reproduce it more easily, I wrote the grid to an .xdr file directly
before the call to MeshRefinement::refine_and_coarsen_elements(), plus
a file containing the refinement flags of all active elements (stored
in the native order, i.e. the order in which the iterator steps
through the elements).  Then I wrote a test program that reads in the
grid, initializes a TransientLinearImplicitSystem with two variables
on that mesh, reads the refinement flags, and calls
MeshRefinement::refine_and_coarsen_elements() and
EquationSystems::reinit().  I ran this program on the same number of
processors as the main application (that is, 8) -- but that program
does *not* crash.  I think some very odd things must be going on here.
Do you have any idea how to track this down further?

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
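A self-contained, hypothetical simplification of the check Tim's run is
tripping -- the names and layout are illustrative, not PetscVector's
real implementation: a global index handed to a ghosted vector's
operator() must be either locally owned or present in the ghost map
built from the send list.

    #include <cassert>
    #include <map>

    unsigned int map_global_to_local (unsigned int i,
                                      unsigned int first_local,
                                      unsigned int last_local,
                                      const std::map<unsigned int, unsigned int> &ghost_map)
    {
      if (i >= first_local && i < last_local)
        return i - first_local;                // locally owned entry

      std::map<unsigned int, unsigned int>::const_iterator
        it = ghost_map.find(i);
      assert (it != ghost_map.end());          // <-- fires for Tim: the index
                                               // is neither local nor a
                                               // known ghost
      return (last_local - first_local) + it->second;  // ghost storage slot
    }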
From: Roy S. <roy...@ic...> - 2009-03-11 17:16:35
On Wed, 11 Mar 2009, Tim Kroeger wrote:

> It happens in System::project_vector() for the case of a ghosted
> vector.  This calls DofMap::enforce_constraints_exactly(), which in
> line 680 of dof_map_constraints.C calls NumericVector::operator().
> It hits the libmesh_assert() near the end of
> PetscVector::map_global_to_local_index(), i.e. the index supplied to
> operator() is neither a local nor a ghost index.

Hmm...  The index is pulled from a constraint-equation-expanded
local_dof_indices.  The in-element local indices should certainly be
local or ghost indices, but is it possible that we're not inserting
constraint equation dependencies into DofMap::_send_list?  That would
explain why the problem is so hard to replicate - it would be common,
even when doing a lot of adaptation, for all nonlocal constraint
equation dependencies to end up getting in the send_list anyway when
the DofMap finds them in immediately neighboring elements.

> There are 7 systems in the application (all on the same grid).  Two
> ExplicitSystems and two LinearImplicitSystems are projected
> successfully, but the first TransientLinearImplicitSystem triggers
> the crash when projecting its _transient_old_local_solution.  That
> system contains two variables, both of the same FE type.

Could you list the variables and FE types in each of the systems?  One
problem with my theory above (other than that I'll have to pore
through all of dof_map*.C to check it...) is that I'd have expected
the crash to occur earlier.

> To reproduce it more easily, I wrote the grid to an .xdr file
> directly before the call to
> MeshRefinement::refine_and_coarsen_elements(), plus a file containing
> the refinement flags of all active elements.  [...]  I ran this
> program on the same number of processors as the main application
> (that is, 8) -- but that program does *not* crash.  I think some very
> odd things must be going on here.  Do you have any idea how to track
> this down further?

Well, there's the hard way: Compile the crashing test case in devel
mode and run with -start_in_debugger (this will be easier if you can
replicate the problem on fewer than 8 CPUs).  Then pore through the
data structures at the crash by hand to find out what DofObject the
bad index is coming from.

But let's try the either-easy-or-futile way first: I'll write a
(possibly redundant) patch to make sure we're properly getting
constraint dependency dofs into the send list, and you can try running
with that.  We can at least verify or rule out my first guess before
we need more information to come up with a second guess.

---
Roy
From: Tim K. <tim...@ce...> - 2009-03-11 17:53:42
On Wed, 11 Mar 2009, Roy Stogner wrote:

> On Wed, 11 Mar 2009, Tim Kroeger wrote:
>
>> It happens in System::project_vector() for the case of a ghosted
>> vector.  This calls DofMap::enforce_constraints_exactly(), which in
>> line 680 of dof_map_constraints.C calls NumericVector::operator().
>> It hits the libmesh_assert() near the end of
>> PetscVector::map_global_to_local_index(), i.e. the index supplied
>> to operator() is neither a local nor a ghost index.
>
> Hmm...  The index is pulled from a constraint-equation-expanded
> local_dof_indices.  The in-element local indices should certainly be
> local or ghost indices, but is it possible that we're not inserting
> constraint equation dependencies into DofMap::_send_list?  That would
> explain why the problem is so hard to replicate - it would be common,
> even when doing a lot of adaptation, for all nonlocal constraint
> equation dependencies to end up getting in the send_list anyway when
> the DofMap finds them in immediately neighboring elements.

Still, I wonder why it cannot be replicated by writing the grid to a
file and re-reading it.  Will that change the distribution of the dofs
to the processors?

>> There are 7 systems in the application (all on the same grid).  Two
>> ExplicitSystems and two LinearImplicitSystems are projected
>> successfully, but the first TransientLinearImplicitSystem triggers
>> the crash when projecting its _transient_old_local_solution.  That
>> system contains two variables, both of the same FE type.
>
> Could you list the variables and FE types in each of the systems?
> One problem with my theory above (other than that I'll have to pore
> through all of dof_map*.C to check it...) is that I'd have expected
> the crash to occur earlier.

All variables are first-order Lagrange.  Hence, the systems differ
only in the system type and the number of variables:

  No.  Type                           #Vars
  -----------------------------------------
  0    ExplicitSystem                   5
  1    LinearImplicitSystem             1
  2    LinearImplicitSystem             1
  3    ExplicitSystem                   1
  4    TransientLinearImplicitSystem    2
  5    TransientLinearImplicitSystem    2
  6    TransientLinearImplicitSystem    1

As said before, the crash is triggered by system 4.  I would suspect
that it only occurs for transient systems -- although this surprises
me, since the other systems contain ghosted vectors as well, don't
they?  Another possibility is that it is triggered by vectors that
contain more than one variable -- but then again, it should at least
be triggered by system 0 as well.

>> Do you have any idea how to track this down further?
>
> Well, there's the hard way: Compile the crashing test case in devel
> mode

You mean debug mode, right?

> and run with -start_in_debugger (this will be easier if you can
> replicate the problem on fewer than 8 CPUs).  Then pore through the
> data structures at the crash by hand to find out what DofObject the
> bad index is coming from.
>
> But let's try the either-easy-or-futile way first: I'll write a
> (possibly redundant) patch to make sure we're properly getting
> constraint dependency dofs into the send list, and you can try
> running with that.  We can at least verify or rule out my first
> guess before we need more information to come up with a second guess.

I vote for the easy way and will wait for your patch.  (-:
Nevertheless, I'll try out whether the crash occurs on 2 CPUs as well.

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Tim K. <tim...@ce...> - 2009-03-12 16:28:01
On Wed, 11 Mar 2009, Tim Kroeger wrote:

> On Wed, 11 Mar 2009, Roy Stogner wrote:
>
>> But let's try the either-easy-or-futile way first: I'll write a
>> (possibly redundant) patch to make sure we're properly getting
>> constraint dependency dofs into the send list, and you can try
>> running with that.

I should mention that I will be on vacation next week.  If you manage
to write your patch today, I will be able to test it before I leave;
otherwise we will have to postpone that until I come back.

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Roy S. <roy...@ic...> - 2009-03-26 15:55:03
On Thu, 26 Mar 2009, Tim Kroeger wrote:

> You might want to check whether you can reproduce the
> non-reproducibility.  To do this, please download
> www.mevis.de/~tim/a.tar.gz and unpack it (small this time).  Then run
> the attached test.cpp on 8 CPUs, which creates a simple grid,
> assembles a matrix as given by the files, writes the complete matrix
> to another file (in PETSc binary format), and then solves.  On
> solving the system, it will crash, but that's not important in this
> case.  (The grid is so coarse that important geometric structures are
> not "seen", so that the system becomes singular.)  Run that program
> several times and rename the resulting matrix files.  Using "diff"
> you will find that they differ.  You can use the attached test2.cpp
> to get an ASCII representation of the matrices.

I needed "#include <stdlib.h>" to get test2 to work; g++ is getting
more and more nitpicky about standards compliance.

I can reproduce the non-reproducibility... but not to any level that
I'd worry about.  The global matrices differ, but every difference was
literally in the 16th or 17th significant figure!

I'd assume that PETSc assembles matrices in the order the entries are
received, the order in which they're received depends on which
processor happened to go slightly faster than which other processor,
and so the differences we're seeing might all stem from the tragedy
that, in floating-point math, you can have (A+B)+C != A+(B+C).

I can't say I'm happy about not having perfectly deterministic apps,
but I'm not sure what to do about it.  Maybe someone familiar with
PETSc MatSetValues can chime in with "We're deterministic!  Roy
doesn't know what he's talking about" or "We *can* be deterministic,
if you just configure or run with --some-obscure-option"; otherwise
I'm stuck.

Alternatively, maybe I *haven't* reproduced what you're seeing.  IIRC
you were talking about residual differences in the 6th place, not the
16th, so maybe there's something more at work there.

---
Roy
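The floating-point hazard Roy cites is easy to demonstrate in isolation
(a standalone C++ snippet, not from the thread's attachments):

    #include <cstdio>

    int main()
    {
      double a = 1e16, b = -1e16, c = 1.0;
      // In IEEE double precision, b + c rounds back to -1e16 (the
      // spacing between representable doubles near 1e16 is 2), so the
      // two grouping orders give different answers:
      std::printf("(a+b)+c = %g\n", (a + b) + c); // prints 1
      std::printf("a+(b+c) = %g\n", a + (b + c)); // prints 0
      return 0;
    }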
From: Jed B. <je...@59...> - 2009-03-26 17:15:12
On Thu 2009-03-26 10:54, Roy Stogner wrote:

> I can't say I'm happy about not having perfectly deterministic apps,
> but I'm not sure what to do about it.  Maybe someone familiar with
> PETSc MatSetValues can chime in with "We're deterministic!  Roy
> doesn't know what he's talking about" or "We *can* be deterministic,
> if you just configure or run with --some-obscure-option"; otherwise
> I'm stuck.

No, there isn't a magic option to make this deterministic.  There is a
chance you could make it so by hacking the profiling interface for
MPI_Waitany.  For example, set a global before MatAssemblyEnd and
intercept MPI_Waitany to receive in rank order when this global is
set.

The "correct" way to do this is to sort entries in ascending order
prior to summing.  In matrix assembly, that would likely kill
performance and certainly inject a lot of complexity.  It's much less
complicated for VecAssembly; see the discussion on this thread

  http://lists.mcs.anl.gov/pipermail/petsc-users/2009-January/003875.html

and the proposed patch

  http://lists.mcs.anl.gov/pipermail/petsc-dev/2009-January/001186.html

The patch has not hit petsc-dev, likely because it is a bit of work to
incorporate and the need seems to be rare.

Note that if you want to replace MatAssemblyEnd, you can implement
your own

  PetscErrorCode MatAssemblyEnd_MPIAIJ_Deterministic(Mat,MatAssemblyType);

and use it with your matrix by calling

  MatShellSetOperation(A,MATOP_ASSEMBLY_END,
                       (void(*)(void))&MatAssemblyEnd_MPIAIJ_Deterministic);

A lighter-weight, though less performant, option would be to make
matrix assembly sequential with something like

  for (int r=0; r<size; r++) {
    if (r == rank) {
      // matrix assembly code using MatSetValues(P,...);
    }
    MatAssemblyBegin(P,MAT_FLUSH_ASSEMBLY);
    MatAssemblyEnd(P,MAT_FLUSH_ASSEMBLY);
  }
  MatAssemblyBegin(P,MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(P,MAT_FINAL_ASSEMBLY);

I hope this helps.

Jed
From: John P. <jwp...@gm...> - 2009-03-11 17:58:46
On Wed, Mar 11, 2009 at 12:52 PM, Tim Kroeger <tim...@ce...> wrote:
> On Wed, 11 Mar 2009, Roy Stogner wrote:
>
>> On Wed, 11 Mar 2009, Tim Kroeger wrote:
>>
>>> It happens in System::project_vector() for the case of a ghosted
>>> vector.  This calls DofMap::enforce_constraints_exactly(), which
>>> in line 680 of dof_map_constraints.C calls
>>> NumericVector::operator().  It hits the libmesh_assert() near the
>>> end of PetscVector::map_global_to_local_index(), i.e. the index
>>> supplied to operator() is neither a local nor a ghost index.
>>
>> Hmm...  The index is pulled from a constraint-equation-expanded
>> local_dof_indices.  [...]
>
> Still, I wonder why it cannot be replicated by writing the grid to a
> file and re-reading it.  Will that change the distribution of the
> dofs to the processors?
>
> All variables are first-order Lagrange.  Hence, the systems differ
> only in the system type and the number of variables:
>
>   No.  Type                           #Vars
>   -----------------------------------------
>   0    ExplicitSystem                   5
>   1    LinearImplicitSystem             1
>   2    LinearImplicitSystem             1
>   3    ExplicitSystem                   1
>   4    TransientLinearImplicitSystem    2
>   5    TransientLinearImplicitSystem    2
>   6    TransientLinearImplicitSystem    1

Hmm... the biggest difference with the TransientLinearImplicitSystem
would probably be the "old_local_solution" and "older_local_solution"
vectors... I'm guessing you guys have already investigated those,
though?

--
John
From: Tim K. <tim...@ce...> - 2009-03-12 07:29:05
On Wed, 11 Mar 2009, John Peterson wrote:

> On Wed, Mar 11, 2009 at 12:52 PM, Tim Kroeger wrote:
>>>> but the first TransientLinearImplicitSystem triggers the crash,
>>>> when projecting its _transient_old_local_solution.  That system
>>>> contains two variables, both of the same FE type.
>>
>> All variables are first-order Lagrange.  Hence, the systems differ
>> only in the system type and the number of variables:
>>
>>   No.  Type                           #Vars
>>   -----------------------------------------
>>   0    ExplicitSystem                   5
>>   1    LinearImplicitSystem             1
>>   2    LinearImplicitSystem             1
>>   3    ExplicitSystem                   1
>>   4    TransientLinearImplicitSystem    2
>>   5    TransientLinearImplicitSystem    2
>>   6    TransientLinearImplicitSystem    1
>
> Hmm... the biggest difference with the TransientLinearImplicitSystem
> would probably be the "old_local_solution" and "older_local_solution"
> vectors... I'm guessing you guys have already investigated those,
> though?

As said above, old_local_solution is the vector that crashes first.
But I don't know what's so different about this vector compared to
current_local_solution.

Wait... there *is* a difference!  System::current_local_solution is
not projected at all.  Rather, only System::solution is projected
(employing a temporary parallel vector) and is afterwards localized to
System::current_local_solution.  Hence, Roy's suspicion that dofs
required via constraints are not explicitly included in the send list
would completely explain the problem.

By the way, why doesn't System::project_vector(), in the case of a
parallel vector, use a ghosted (rather than a serial) vector as the
temporary?

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
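A simplified sketch of the projection flow Tim describes, written as it
might look inside System -- hypothetical code, not the actual
System::project_vector implementation:

    // Hypothetical sketch: project into a temporary PARALLEL vector,
    // make it the new solution, then localize into
    // current_local_solution -- which is why current_local_solution
    // itself is never projected.
    AutoPtr<NumericVector<Number> > tmp = NumericVector<Number>::build();
    tmp->init (this->n_dofs(), this->n_local_dofs(), false, PARALLEL);

    this->project_vector (*tmp);       // fill the temporary
    this->solution->swap (*tmp);       // adopt it as the new solution

    this->solution->localize (*this->current_local_solution,
                              this->get_dof_map().get_send_list());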
From: Roy S. <roy...@ic...> - 2009-03-12 16:30:50
On Thu, 12 Mar 2009, Tim Kroeger wrote:

> By the way, why doesn't System::project_vector(), in the case of a
> parallel vector, use a ghosted (rather than a serial) vector as the
> temporary?

Because we're still getting ghosted vectors working in the first
place.  Make a small change, test the small change, repeat until they
add up to a big change, but most importantly: don't add another
related small change until after the previous one passed all its
tests.

---
Roy
From: Roy S. <roy...@ic...> - 2009-03-12 16:48:10
On Wed, 11 Mar 2009, Tim Kroeger wrote:

> Still, I wonder why it cannot be replicated by writing the grid to a
> file and re-reading it.  Will that change the distribution of the
> dofs to the processors?

Often, yes.  IIRC our partitioning results (at least with
METIS/ParMETIS, maybe not with space-filling curves) depend on mesh
history, not just the current mesh state.  So yeah, that would explain
why the problem isn't being replicated by a .xdr restart.  We've got
some parallel Mesh I/O now, and perhaps loading a different format
preserves partitioning by default?  I'd have to ask Ben or Derek.

> All variables are first-order Lagrange.  Hence, the systems differ
> only in the system type and the number of variables:
>
> As said before, the crash is triggered by system 4.  I would suspect
> that it only occurs for transient systems

That has to be the case.

> -- although this surprises me, since the other systems contain
> ghosted vectors as well, don't they?

They do, but they don't project those vectors.  solution gets
projected; current_local_solution just gets re-localized from the
result.

>> Well, there's the hard way: Compile the crashing test case in devel
>> mode
>
> You mean debug mode, right?

I think devel mode would be enough - all you need is "-g" to make gdb
tolerable.  The big change in debug mode over devel isn't debugging
symbols in the binary; it's all the GNU STL safety testing that gets
turned on.

>> But let's try the either-easy-or-futile way first: I'll write a
>> (possibly redundant) patch to make sure we're properly getting
>> constraint dependency dofs into the send list, and you can try
>> running with that.  We can at least verify or rule out my first
>> guess before we need more information to come up with a second
>> guess.
>
> I vote for the easy way and will wait for your patch.  (-:

The patch just got committed to SVN.  Unless Ben can point out some
subtle send_list machinations I've missed, the fact that it's even
theoretically possible to be missing a constraint dependency is a bug.
Let's just hope that it's the *only* bug and that we're not missing
something else as well.

> Nevertheless, I'll try out whether the crash occurs on 2 CPUs as
> well.

Let me know.  If this patch doesn't fix it, I'm out of easy ideas.

---
Roy
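A sketch of the idea behind the committed fix -- hypothetical code, not
the actual patch: when building the send list, walk each locally
stored constraint row and pull in its off-processor dependencies.

    // Hypothetical sketch, not the committed patch: every DoF on the
    // right-hand side of a locally stored constraint row must end up
    // in _send_list so its value gets ghosted to this processor.
    for (DofConstraints::const_iterator c = _dof_constraints.begin();
         c != _dof_constraints.end(); ++c)
      {
        const DofConstraintRow &row = c->second;
        for (DofConstraintRow::const_iterator d = row.begin();
             d != row.end(); ++d)
          {
            const unsigned int dof = d->first;
            if (dof < this->first_dof() || dof > this->last_dof())
              _send_list.push_back(dof);       // off-processor dependency
          }
      }
    // ... followed by the usual sort + unique pass over _send_list.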
From: Tim K. <tim...@ce...> - 2009-03-13 11:31:48
Dear Roy,

The bad news first: Your patch did not fix my crash.  However, it
again led to slightly differing residuals (the fourth version now).

On Thu, 12 Mar 2009, Roy Stogner wrote:

> On Wed, 11 Mar 2009, Tim Kroeger wrote:
>
>> Still, I wonder why it cannot be replicated by writing the grid to a
>> file and re-reading it.  Will that change the distribution of the
>> dofs to the processors?
>
> Often, yes.  IIRC our partitioning results (at least with
> METIS/ParMETIS, maybe not with space-filling curves) depend on mesh
> history, not just the current mesh state.

Okay, good to know.  Well, I will try to track all refinement steps in
order to make the problem reproducible faster.  A lot of refinement
steps happen before the EquationSystems object is created, but I guess
I will have to track those as well.  After that, only ~3 refinement
steps occur before the crash.

>> Nevertheless, I'll try out whether the crash occurs on 2 CPUs as
>> well.
>
> Let me know.

On 2 CPUs, it doesn't crash.

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Roy S. <roy...@ic...> - 2009-03-26 17:40:45
On Thu, 26 Mar 2009, Tim Kroeger wrote:

> Oops... at which line does your g++ complain if you don't include
> stdlib.h?

exit() was undefined.

> This non-reproducibility complicates checking whether the ghosted
> vectors change the code's behaviour a lot, because it is no longer
> clear what "change the behaviour" means.

That is pretty annoying from a verification standpoint.  But I'm not
sure how easy it is to get around.  Google tells me that the very MPI
standard itself doesn't guarantee a deterministic MPI_SUM.

Plus, Jed Brown's email just came in, and the thought of overriding
MatAssembly scares me.  Although, the sequentialized MatAssembly hack
he suggested was pretty interesting.  I'd be tempted to add that
behavior as a very-non-default option to FEMSystem... except, for it
to really be useful you'd need a deterministic solve() too, and I'm
sure the same distributed FP ordering problems come up there.

> Perhaps I should just run both the ghosted and the non-ghosted
> version of my application two times each and look at whether the
> final results between ghosted/non-ghosted differ considerably more
> than within these groups.  Not really convincing, but the best thing
> I can think of.  (Unfortunately, I deleted the results that I had
> from the runs some days ago since I thought of a bug...)

That's better than nothing.  You could also try comparing the
unassembled element residuals and matrices, written out to files like
you did in this test problem setup.  They won't be comparable after
even one timestep (because a non-deterministic assembly leads to a
differing solve() result, which leads to different initial conditions
at the next timestep), but you could save a restart file from either
version of the code, then load and test it in both versions.

---
Roy
From: Jed B. <je...@59...> - 2009-03-26 17:54:34
On Thu 2009-03-26 12:40, Roy Stogner wrote:

> I'd be tempted to add that behavior as a very-non-default option to
> FEMSystem... except, for it to really be useful you'd need a
> deterministic solve() too, and I'm sure the same distributed FP
> ordering problems come up there.

Nah, just MatView and confirm that they're identical.

Also, you should be able to guarantee a deterministic solve with

  -ksp_type preonly -pc_type redundant

which duplicates the entire matrix on every process and solves
redundantly on each process using LU (obviously it's not scalable).

Jed
From: Roy S. <roy...@ic...> - 2009-03-26 18:03:06
On Thu, 26 Mar 2009, Jed Brown wrote:

> On Thu 2009-03-26 12:40, Roy Stogner wrote:
>> I'd be tempted to add that behavior as a very-non-default option to
>> FEMSystem... except, for it to really be useful you'd need a
>> deterministic solve() too, and I'm sure the same distributed FP
>> ordering problems come up there.
>
> Nah, just MatView and confirm that they're identical.

If only I were confident that none of our bugs are too subtle to show
up on the first timestep.  Unfortunately, the last ghosted vector bug
we fixed only showed up after half a dozen or more refinement steps,
and IIRC our outstanding ParallelMesh bug takes at least three
refinement and coarsening steps to trigger.

> Also, you should be able to guarantee a deterministic solve with
>
>   -ksp_type preonly -pc_type redundant
>
> which duplicates the entire matrix on every process and solves
> redundantly on each process using LU (obviously it's not scalable).

Right, but for regression testing we don't care as much about
scalability as about repeatability.  This looks like it will be a big
help, thanks!

---
Roy
From: Tim K. <tim...@ce...> - 2009-03-13 15:47:33
Attachments:
test.cpp
On Fri, 13 Mar 2009, Tim Kroeger wrote:

> Okay, good to know.  Well, I will try to track all refinement steps
> in order to make the problem reproducible faster.

I did this now, and it reproduces the crash.  Let me tell you briefly
about the refinement/coarsening steps of my application: They occur at
five different places in the code, which for some reason I would like
to call SV1, SV2, PO1, PO2, and ES.  Some of them only refine, some
only coarsen, some do both.  Some of them are performed several times.
Here is a list:

  SV1  6 times  only refine
  SV2  6 times  only coarsen; after the last: Mesh::contract()
  PO1  7 times  only refine; after the last: creation of EquationSystems
  PO2  1 time   only refine
  ES   3 times  possibly both; the 3rd time triggers the crash

I attached a test program; it reads a file that you can download from
my homepage (www.mevis.de/~tim/ref-flags.gz; unzipping makes it grow
to ~5MB).  The program initializes the same base grid as my
application and then reads in all the refinement flags from the file.
The points at which the mesh is contracted and the EquationSystems
object is created are hardcoded.  Could you please check whether this
enables you to reproduce the crash?  You need to run the program on 8
processors (with ghosted vectors enabled, of course).  I used
METHOD=devel, but I guess it will crash in the other modes as well.

By the way, I observed another very strange thing: If I change the
values of {x,y,z}{min,max} of the start grid (as in the comments of
the test program), it already crashes on the first refinement step,
and at a completely different point, namely in elem.h, line 1744.
(That's the assert in Elem::compute_key() with four arguments.)  That
does not make any sense at all to me.

Anyway, complete confusion is a good state to start vacations with,
isn't it?  (-:

Best Regards,

Tim

--
Dr. Tim Kroeger
tim...@me..., tim...@ce...
Phone +49-421-218-7710, Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany