From: Roy S. <roy...@ic...> - 2009-03-13 22:33:05
On Fri, 13 Mar 2009, Tim Kroeger wrote:

> Could you please check whether this enables you to reproduce the crash?
> You need to run the program on 8 processors (with ghosted enabled, of
> course). I used METHOD=devel, but I guess it will crash in the other
> modes as well.

This is some very impressive testing work, thank you very much. I can reproduce the crash now. I should have time to look into it, probably Monday at the soonest, a week from Monday at the latest.

--- Roy
From: Roy S. <roy...@ic...> - 2009-03-17 00:01:13
On Fri, 13 Mar 2009, Tim Kroeger wrote:

> Could you please check whether this enables you to reproduce the crash?
> You need to run the program on 8 processors (with ghosted enabled, of
> course). I used METHOD=devel, but I guess it will crash in the other
> modes as well.

Well, it crashes in the other modes, but in opt mode doing crash forensics is effectively impossible, and in dbg mode the crash takes forever to reach. Curses to whoever is responsible for _GLIBCXX_DEBUG changing the algorithmic complexity of std::sort, and thanks again to John for reminding us about devel mode. gdb is lousy at debugging devel mode binaries, but it's much better than nothing.

Anyway, the crash turned out to have a couple of bugs behind it:

1. When I wrote enforce_constraints_exactly years ago, I apparently understood that each processor only had to set its local constrained dof values, and I wrote a comment to that effect... but apparently I never wrote *code* to that effect! So we were potentially trying to set constraint values we didn't own, which could depend on dof values we didn't know. An extra half-line of code fixed that.

2. When I wrote add_constraints_to_send_list days ago, I assumed it was being called after process_recursive_constraints(); I forgot that the latter had to be called late, after user constraints have had a chance to be applied. So we didn't always have all the necessary data when setting constraint values we did own. To fix that, I'm now doing these things (as well as sorting the send list) by hand in System::init_data() and EquationSystems::reinit() to ensure they happen in the right order. I really don't like the lack of modularity that implies, but I couldn't figure out how to do things in DofMap without nonintuitive API behavior or a redundant sort_send_list() operation.

This looks like it may have been a pretty nasty bug.
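The ownership rule behind fix #1 can be sketched in isolation. Everything below (`LocalRange`, `enforce_local_constraints`) is an illustrative stand-in, not libMesh's actual API; it just shows the "only set constrained dof values you own" check that the extra half-line of code enforces:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a distributed vector: each processor owns
// the contiguous dof index range [begin, end).
struct LocalRange {
    std::size_t begin, end;
    bool owns(std::size_t dof) const { return dof >= begin && dof < end; }
};

// Sketch of the fixed constraint-enforcement loop: set only the
// constrained dof values this processor owns.  Without the owns()
// check we would try to compute values that may depend on dof entries
// this processor has never seen.  Returns how many values were set.
std::size_t enforce_local_constraints(
    const std::map<std::size_t, double>& constraint_values,
    const LocalRange& range,
    std::vector<double>& local_storage)
{
    std::size_t applied = 0;
    for (const auto& [dof, value] : constraint_values) {
        if (!range.owns(dof))  // the "extra half-line of code"
            continue;
        local_storage[dof - range.begin] = value;
        ++applied;
    }
    return applied;
}
```

The point of the sketch is only the skip: non-local constraints are somebody else's responsibility, and touching them is what made the original code depend on unknown remote data.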
Under some conditions it caused ghosted vectors to crash, but I'd expect that under slightly rarer conditions it would cause serial vectors to calculate incorrect constrained dof values. This wouldn't affect my FEMSystem code (which only calls enforce_constraints_exactly on parallel vectors), so I'd never have noticed, but it could kill the accuracy of anyone whose code combined TransientSystem, parallel AMR, and bad luck!

Anyway, I've checked the fixes into SVN; now might be a good time for those of us on the bleeding edge to update.

> By the way, I observed another very strange thing: If I change the
> values of {x,y,z}{min,max} of the start grid (as in the comments of the
> test program), it crashes already on the first refinement step and at a
> completely different point, that is in elem.h, line 1744. (That's the
> assert in Elem::compute_key() with four arguments.) That does not make
> any sense at all to me.

The error doesn't seem to make sense, but then neither does your function call. ;-) You got confused about parameter order, and passed in xmin=xmax=0.0 and zmin=zmax=1.0. libMesh gets completely confused when two distinct nodes overlap - you had *every* node overlapping many others. ;-)

> Anyway, complete confusion is a good state to start vacations with,
> isn't it?

Well, I hope "it's probably fixed now" is a good way to come back.

--- Roy
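The degenerate-box mistake is easy to reproduce in miniature: with xmin == xmax, every generated x-coordinate collapses to the same value, so formerly distinct grid nodes coincide - exactly the situation the Elem::compute_key() assertion guards against. A self-contained 2D illustration (hypothetical helper, not libMesh code):

```cpp
#include <array>
#include <cstddef>
#include <set>

// Generate the node coordinates of an (nx+1) x (ny+1) tensor grid on
// [xmin,xmax] x [ymin,ymax] and count how many nodes are distinct.
// With a non-degenerate box this is (nx+1)*(ny+1); with xmin == xmax
// whole columns of nodes collapse onto each other.
std::size_t count_distinct_nodes(int nx, int ny,
                                 double xmin, double xmax,
                                 double ymin, double ymax)
{
    std::set<std::array<double, 2>> nodes;  // set deduplicates overlaps
    for (int i = 0; i <= nx; ++i)
        for (int j = 0; j <= ny; ++j)
            nodes.insert({xmin + (xmax - xmin) * i / nx,
                          ymin + (ymax - ymin) * j / ny});
    return nodes.size();
}
```

A 2x2 grid on a proper box yields 9 distinct nodes; swap in xmin = xmax = 0.0 and only 3 survive - every node overlaps two others, and any code that hashes element keys from node positions falls apart.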
From: Tim K. <tim...@ce...> - 2009-03-23 12:55:52
Dear Roy,

On Mon, 16 Mar 2009, Roy Stogner wrote:

> Anyway, I've checked the fixes into SVN; now might be a good time for
> those of us on the bleeding edge to update.

Great work! Thank you very much! My application does not crash any more. Well, at least it didn't crash at the `usual' point, and it's still running now, and so far the results are more or less equal to those with non-ghosted vectors. If this remains true until the application finishes, I think we can consider the ghosted vectors finished and enable them by default; wouldn't you agree?

Of course, my residuals are again slightly off (fifth version now). I think I should try out what happens when I switch back to non-ghosted vectors now. I will do that as soon as the current run has finished (which will not be before tomorrow).

>> By the way, I observed another very strange thing: If I change the
>> values of {x,y,z}{min,max} of the start grid (as in the comments of
>> the test program), it crashes already on the first refinement step and
>> at a completely different point, that is in elem.h, line 1744. (That's
>> the assert in Elem::compute_key() with four arguments.) That does not
>> make any sense at all to me.
>
> The error doesn't seem to make sense, but then neither does your
> function call. ;-) You got confused about parameter order, and
> passed in xmin=xmax=0.0 and zmin=zmax=1.0.

Oh, I should have known this, since I ran into that pitfall at least once before. I somehow find the required parameter order counterintuitive, but I can't state any reason why.

>> Anyway, complete confusion is a good state to start vacations with,
>> isn't it?
>
> Well, I hope "it's probably fixed now" is a good way to come back.

Well, it's actually one of the best ways to come back, which does not, however, mean that it's a good way. I mean, "come back from a vacation" and "good way" kind of contradict each other, don't they?

Best Regards,

Tim

--
Dr. Tim Kroeger   tim...@me...   Phone +49-421-218-7710
                  tim...@ce...   Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Roy S. <roy...@ic...> - 2009-03-23 14:08:48
On Mon, 23 Mar 2009, Tim Kroeger wrote:

> My application does not crash any more. Well, at least it didn't crash
> at the `usual' point, and it's still running now, and so far the
> results are more or less equal to those with non-ghosted vectors. If
> this remains true until the application finishes, I think we can
> consider the ghosted vectors finished and enable them by default;
> wouldn't you agree?

I agree with "consider as finished", and I'll be turning them on permanently in my own codes, but I'm going to wait before turning them on by default everywhere, just to get in a little more testing time.

> Of course, my residuals are again slightly off (fifth version now). I
> think I should try out what happens when I switch back to non-ghosted
> vectors now. I will do that as soon as the current run has finished
> (which will not be before tomorrow).

Thanks. This is probably the last big test - although the residuals will probably differ from before the last SVN update, if the residuals between ghosted and non-ghosted vectors differ from each other, I'd like to track down why.

--- Roy
From: Tim K. <tim...@ce...> - 2009-03-25 13:24:54
Dear Roy,

On Mon, 23 Mar 2009, Roy Stogner wrote:

>> Of course, my residuals are again slightly off (fifth version now). I
>> think I should try out what happens when I switch back to non-ghosted
>> vectors now. I will do that as soon as the current run has finished
>> (which will not be before tomorrow).
>
> Thanks. This is probably the last big test - although the residuals
> will probably differ from before the last SVN update, if the residuals
> between ghosted and non-ghosted vectors differ from each other, I'd
> like to track down why.

Unfortunately, the residuals still don't coincide. What surprises me even more is that they even differ between identical runs, i.e. with ghost dofs disabled both times. In other words, my application gets results that are not reproducible, and this has got nothing to do with ghost dofs. I will try to track this down.

Best Regards,

Tim
From: Tim K. <tim...@ce...> - 2009-03-26 09:24:23
Dear Roy,

On Wed, 25 Mar 2009, Tim Kroeger wrote:

> Unfortunately, the residuals still don't coincide. What surprises me
> even more is that they even differ between identical runs, i.e. with
> ghost dofs disabled both times. In other words, my application gets
> results that are not reproducible, and this has got nothing to do with
> ghost dofs. I will try to track this down.

I tracked it down in the sense that I can reproduce it quickly, but I still have no idea why it happens. The grid is now uniform and small. In the matrix assembly function, I wrote the dof_indices, Ke, and Fe to files (one file for each processor) directly before they are added to the system matrix/rhs.

You might want to check whether you can reproduce the non-reproducibility. To do this, please download www.mevis.de/~tim/a.tar.gz and unpack it (small this time). Then run the attached test.cpp on 8 CPUs; it creates a simple grid, assembles a matrix as given by the files, writes the complete matrix to another file (in PETSc binary format), and then solves. On solving the system, it will crash, but that's not important in this case. (The grid is so coarse that important geometric structures are not "seen", so the system becomes singular.) Run the program several times and rename the resulting matrix files. Using "diff" you will find that they differ. You can use the attached test2.cpp to get an ASCII representation of the matrices.

Best Regards,

Tim
From: Tim K. <tim...@ce...> - 2009-03-26 16:20:17
Dear Roy,

On Thu, 26 Mar 2009, Roy Stogner wrote:

> I needed "#include <stdlib.h>" to get test2 to work; g++ is getting
> more and more nitpicky about standards compliance.

Oops... at which line does your g++ complain if you don't include stdlib.h?

> I can reproduce the non-reproducibility... but not to any level that
> I'd worry about.
>
> Alternatively, maybe I *haven't* reproduced what you're seeing. IIRC
> you were talking about residual differences in the 6th place, not the
> 16th, so maybe there's something more at work there.

Well, as far as I understand it right now, the observed residual differences originate from these small matrix differences. (Of course, my matrix is much larger, and a larger number of coefficients differ between runs.)

This non-reproducibility greatly complicates checking whether the ghosted vectors change the code's behaviour, because it is no longer clear what "change the behaviour" means. Perhaps I should just run both the ghosted and the non-ghosted version of my application twice each and check whether the final results differ considerably more between the ghosted/non-ghosted groups than within them. Not really convincing, but the best thing I can think of. (Unfortunately, I deleted the results that I had from the runs some days ago since I thought of a bug...)

Best Regards,

Tim
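Tiny run-to-run matrix differences in the last decimal places are exactly what non-associative floating-point addition predicts: when MPI message arrival order or the order of per-element add operations varies between runs, the last bits of the assembled coefficients vary too, even though every run is "correct" to within rounding. A minimal single-process illustration of the effect:

```cpp
#include <vector>

// Sum a list of doubles in exactly the order given.  In a parallel
// assembly the order of contributions depends on message arrival, so
// two runs can effectively sum the "same" terms in different orders.
double ordered_sum(const std::vector<double>& terms)
{
    double sum = 0.0;
    for (double t : terms)
        sum += t;  // each rounding step depends on the running total
    return sum;
}
```

Summing {0.1, 0.2, 0.3} forwards and backwards gives results that differ in the last bit, which is why bitwise `diff` on the dumped matrices reports differences even when nothing is actually wrong.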
From: Tim K. <tim...@ce...> - 2009-04-06 07:29:37
Dear Roy,

On Thu, 26 Mar 2009, Tim Kroeger wrote:

> Perhaps I should just run both the ghosted and the non-ghosted version
> of my application twice each and check whether the final results differ
> considerably more between the ghosted/non-ghosted groups than within
> them.

I did this now: two runs with ghosted and two runs without. None of the four final results are exactly identical, but the differences are small enough to be of no importance. Also, the differences between the two groups are of the same size as those within the groups. I would vote for making ghosted vectors the default now.

Best Regards,

Tim
From: Roy S. <roy...@ic...> - 2009-04-07 15:22:42
On Mon, 6 Apr 2009, Tim Kroeger wrote:

> I would vote for making ghosted vectors the default now.

I'm tempted to agree. (Which probably means everyone else is way ahead of me; they just talked me into making --enable-second a default option, years after I wrote it.)

My last reservation: is there any performance penalty? The ghosted vectors should be more scalable than serial vectors on N processors, but they've got overhead that may cost CPU time on 2-4 processors. When you were regression testing those residuals, did you happen to take any timing data?

--- Roy
From: Derek G. <fri...@gm...> - 2009-04-07 15:43:38
On Apr 7, 2009, at 9:22 AM, Roy Stogner wrote:

> On Mon, 6 Apr 2009, Tim Kroeger wrote:
>
>> I would vote for making ghosted vectors the default now.
>
> I'm tempted to agree. (Which probably means everyone else is way
> ahead of me; they just talked me into making --enable-second a default
> option, years after I wrote it.)

Personally... I don't see much point in making it the default... unless we're going to do away with current_local_solution. How big of a deal would it be to remove that at this point?

Derek
From: Roy S. <roy...@ic...> - 2009-04-07 16:01:40
On Tue, 7 Apr 2009, Derek Gaston wrote:

> On Apr 7, 2009, at 9:22 AM, Roy Stogner wrote:
>
>> On Mon, 6 Apr 2009, Tim Kroeger wrote:
>>
>>> I would vote for making ghosted vectors the default now.
>>
>> I'm tempted to agree. (Which probably means everyone else is way
>> ahead of me; they just talked me into making --enable-second a default
>> option, years after I wrote it.)
>
> Personally... I don't see much point in making it the default...

Better memory usage by default, better scalability by default, plus better test coverage. We only caught that nasty parallel AMR bug because, while it would have manifested as slightly corrupted solutions with serial local vectors, it tripped a libmesh_assert() with ghosted vectors.

> unless we're going to do away with current_local_solution.

We are - the question is just "when".

> How big of a deal would it be to remove that at this point?

It wouldn't be too hard if we were just talking about PETSc. But unless we want to break our other interfaces, we'll need the equivalent of ghosted vectors from LASPACK and Trilinos (and DistributedVector? is anyone using that for explicit problems?) too.

--- Roy
From: Derek G. <fri...@gm...> - 2009-04-07 16:25:25
On Apr 7, 2009, at 10:01 AM, Roy Stogner wrote:

> It wouldn't be too hard if we were just talking about PETSc. But
> unless we want to break our other interfaces, we'll need the
> equivalent of ghosted vectors from LASPACK and Trilinos (and
> DistributedVector? is anyone using that for explicit problems?) too.

ah - well.... seeing as how I don't have time to help with any of that.... I don't think I get much say ;-)

As far as making it the default... I'm still just a bit wary of it introducing bugs. But if that's the only way we're going to find them... then let's do it.

Derek
From: Roy S. <roy...@ic...> - 2009-04-07 16:53:15
On Tue, 7 Apr 2009, Derek Gaston wrote:

> ah - well.... seeing as how I don't have time to help with any of
> that.... I don't think I get much say ;-)

That's too bad. For LASPACK the problem is trivial: for a serial linear algebra package, "serial", "parallel", and "ghosted" vectors are the same, and we'd just need to do a little testing. But you're probably the libMesh developer most familiar with Trilinos.

> As far as making it the default... I'm still just a bit wary of it
> introducing bugs. But if that's the only way we're going to find
> them... then let's do it.

I wouldn't want to put out 0.7.0 with ghosted vectors as the default yet, but I'm certainly confident enough to make them the default in SVN. Tim's amazingly good at finding and isolating bugs, and if his codes are coming up clean now then I'm happy.

--- Roy
From: Tim K. <tim...@ce...> - 2009-04-08 06:47:16
Dear Roy,

On Tue, 7 Apr 2009, Roy Stogner wrote:

> My last reservation: is there any performance penalty? The ghosted
> vectors should be more scalable than serial vectors on N processors,
> but they've got overhead that may cost CPU time on 2-4 processors.
> When you were regression testing those residuals, did you happen to
> take any timing data?

Yes, unfortunately, I did, but I hadn't looked at it until now. The reason is that, when I started my application in the morning, it was in no case finished before I knocked off work, but it was in all cases finished the next morning. So my intuition was that the versions were equally fast. But I forgot that there is a wide range within that window. Here's the result (hh:mm:ss):

no-ghosted-1 : 11:34:10
no-ghosted-2 : 11:35:54
ghosted-1    : 17:25:28
ghosted-2    : 16:33:23

That was with 8 CPUs each. Quite disappointing. On the other hand, using ghosted vectors would allow my application to use more CPUs without actually using more computing resources (because each node of the cluster has 8 CPUs, but without ghosted vectors I couldn't use more than 3 per node for memory reasons). What do you think, should I perform another two computations with 24 CPUs (on three nodes) and see how fast that is?

(Remark: You will certainly notice that 8 is not divisible by 3, so 8 CPUs with 3 per node doesn't make sense. What I did was 2*3+1*2 CPUs, which is of course not efficient; the point is that I originally started with 2*4 CPUs, but that turned out to sometimes run out of memory (depending on the input configuration), so I required more nodes but didn't want to change the total number of CPUs, because that might have slightly changed the results.)

Best Regards,

Tim
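For reference, converting those wall-clock times to seconds puts the ghosted runs at roughly 47% overhead in this particular 8-CPU test. A back-of-the-envelope check, using only the four timings reported above:

```cpp
// Convert hh:mm:ss to seconds.
constexpr long seconds(long h, long m, long s)
{
    return h * 3600 + m * 60 + s;
}

// Mean runtimes of the two groups from the timing data above.
constexpr double mean_non_ghosted =
    (seconds(11, 34, 10) + seconds(11, 35, 54)) / 2.0;  // 41702 s
constexpr double mean_ghosted =
    (seconds(17, 25, 28) + seconds(16, 33, 23)) / 2.0;  // 61165.5 s

// Relative cost of ghosted vectors in this one test: ~1.467, i.e. the
// ghosted runs took about 47% longer than the non-ghosted ones.
constexpr double overhead = mean_ghosted / mean_non_ghosted;
```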
From: Tim K. <tim...@ce...> - 2009-04-09 06:20:55
Dear Roy,

On Wed, 8 Apr 2009, Tim Kroeger wrote:

> What do you think, should I perform another two computations with 24
> CPUs (on three nodes) and see how fast that is?

Of course, "What do you think, should I" means "I will" in this case, and the result is:

Assertion `it!=_global_to_local_map.end()' failed.
[16] /home/tkroeger/archives/libMesh/libmesh/include/numerics/petsc_vector.h, line 956, compiled Mar 31 2009 at 08:17:12

There seems to be another bug. What do you think, should I track this down once more?

(BTW: I'll be out of office from Good Friday until Easter Monday.)

Best Regards,

Tim
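For anyone reading along: the failed assertion sits in the ghosted-vector index translation, where the PETSc vector wrapper maps global dof indices to local ghost-storage positions, and it fires when code asks for a global index that is neither owned nor listed as ghosted on that processor. A stripped-down analogue of that lookup (illustrative only, not libMesh's actual code):

```cpp
#include <cstddef>
#include <map>

// Stand-in for the ghosted part of a distributed vector: global dof
// index -> position in the local ghost storage.
using GlobalToLocalMap = std::map<std::size_t, std::size_t>;

// Analogue of the check behind the failed assertion: a lookup of a
// global index that was never registered as ghosted has no valid
// local position, so the real code asserts it!=_global_to_local_map.end().
// Here we just report whether that assert would have passed.
bool is_accessible(const GlobalToLocalMap& global_to_local_map,
                   std::size_t global_index)
{
    return global_to_local_map.find(global_index)
           != global_to_local_map.end();
}
```

So the crash means some code path on 24 CPUs requested a dof value that the send list never flagged as needed on processor 16 - a missing-ghost-entry bug rather than memory corruption.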
From: Roy S. <roy...@ic...> - 2009-04-09 12:32:16
On Thu, 9 Apr 2009, Tim Kroeger wrote:

> Assertion `it!=_global_to_local_map.end()' failed.
> [16] /home/tkroeger/archives/libMesh/libmesh/include/numerics/petsc_vector.h,
> line 956, compiled Mar 31 2009 at 08:17:12
>
> There seems to be another bug.
>
> What do you think, should I track this down once more?

Ugh, definitely. Although as an immediate start: do you have tracefiles turned on? It would be interesting to see the whole stack trace from where that assertion failed, and that might not even require re-running the code.

--- Roy
From: Tim K. <tim...@ce...> - 2009-04-09 14:22:51
Dear Roy,

On Thu, 9 Apr 2009, Roy Stogner wrote:

> Ugh, definitely. Although as an immediate start: do you have
> tracefiles turned on?

No, I didn't. Actually, until now I didn't know this option existed. I have restarted the run with that option enabled, and the application will now also once again write all the refinement flags to a file, so that the bug can hopefully be reproduced more easily.

BTW: Why is --enable-tracefiles not enabled by default? It shouldn't bother anybody, since it doesn't do anything as long as the application doesn't crash, does it? And if it *does* crash, the option is very useful.

Best Regards,

Tim
From: Roy S. <roy...@ic...> - 2009-04-09 14:30:31
On Thu, 9 Apr 2009, Tim Kroeger wrote:

> BTW: Why is --enable-tracefiles not enabled by default? It shouldn't
> bother anybody, since it doesn't do anything as long as the application
> doesn't crash, does it? And if it *does* crash, the option is very
> useful.

Hmm... I think I was worried about compatibility. Check out the nastiness in print_trace.C; the current implementation requires gcc and glibc-compatible non-standard features. I guess we've now got those safely wrapped in specific autoconf tests for name demangling and backtracing, though.

Any other votes for/against enabling tracefiles by default?

--- Roy
From: Tim K. <tim...@ce...> - 2009-04-14 07:52:10
Attachments:
test.cpp
traceout_16_1432.txt
Dear Roy,

On Thu, 9 Apr 2009, Tim Kroeger wrote:

> On Thu, 9 Apr 2009, Roy Stogner wrote:
>
>> Ugh, definitely. Although as an immediate start: do you have
>> tracefiles turned on?
>
> No, I didn't. Actually, until now I didn't know this option existed.
> I have restarted the run with that option enabled, and the application
> will now also once again write all the refinement flags to a file, so
> that the bug can hopefully be reproduced more easily.

The tracefile is attached now. I don't think it will help a lot, though. Also, I attached a test program that reproduces the crash. It's very similar to the one I sent you a couple of weeks ago. Again, you can find the file it reads the refinement flags from at www.mevis.de/~tim/ref-flags.gz (190KB, unzips to 19MB). You need to run the program on 24 CPUs. Unfortunately, it takes quite a while before it crashes, since the first 50 refinement/coarsening steps are performed successfully; the crash occurs at the 51st step. Let me know whether you can reproduce the crash.

Best Regards,

Tim