From: Roy S. <roy...@ic...> - 2009-01-31 00:49:43
|
Ben had the brilliant idea of doing boundary data output by creating a boundary mesh, solving a local system to project functions of interior data onto that mesh, then outputting the results. Unfortunately, with certain partitionings, this can put *all* your boundary dofs on a single processor. This processor wants to create matrices and vectors. It now creates an MPI matrix (thanks to a bugfix of Ben's and one of my own; bad things happen if you try to do parallel operations on sequential matrices or mix them)... but how does it know to create an MPI vector? It sees n_local == n_global, it knows that sometimes we like creating sequential vectors on parallel runs... I think we need another argument to init. Probably a boolean with a sane default value so as not to break API compatibility, but I'm not sure on the details yet. Ben, you can see the current state of our merged code in ~roystgnr/fins.devel *on the ICES filesystem* - but I'm posting to libmesh-devel because anyone's opinions would be welcome. --- Roy |
From: Tim K. <tim...@ce...> - 2009-01-31 17:01:24
|
Dear Roy, On Fri, 30 Jan 2009, Roy Stogner wrote: > I think we need another argument to init. Probably a boolean with a > sane default value so as not to break API compatibility, but I'm not > sure on the details yet. What about an enum type for that? It could take the values AUTOMATIC (which would be the default and mimic the current implementation), SERIAL, PARALLEL, and GHOSTED. Also, the vector should then store this information (preferably in the base class) and make it readable. This would then also replace my is_ghosted() method. Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
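Tim's proposal can be sketched in a few lines. This is a toy model, not the eventual libMesh API: the class and member names here are illustrative, and AUTOMATIC simply reproduces the old "serial iff the whole vector lives on this processor" heuristic that Roy described.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the enum-based init() proposal (illustrative names).
enum ParallelType { AUTOMATIC, SERIAL, PARALLEL, GHOSTED };

class ToyVector {
public:
  void init(std::size_t n, std::size_t n_local,
            ParallelType t = AUTOMATIC) {
    // AUTOMATIC mimics the current behavior: a vector is serial
    // exactly when every entry is local to this processor.
    if (t == AUTOMATIC)
      _type = (n == n_local) ? SERIAL : PARALLEL;
    else
      _type = t;
    _n = n;
    _n_local = n_local;
  }

  // The vector stores its type and makes it readable, as Tim
  // suggested; is_ghosted() becomes a trivial query.
  ParallelType type() const { return _type; }
  bool is_ghosted() const { return _type == GHOSTED; }

private:
  ParallelType _type = AUTOMATIC;
  std::size_t _n = 0, _n_local = 0;
};
```

With an explicit PARALLEL argument, Roy's boundary-mesh case can request an MPI vector even when n_local == n_global on one processor, while existing callers keep the AUTOMATIC default.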
From: Roy S. <roy...@ic...> - 2009-01-31 20:24:03
|
On Sat, 31 Jan 2009, Tim Kroeger wrote: > On Fri, 30 Jan 2009, Roy Stogner wrote: > >> I think we need another argument to init. Probably a boolean with a >> sane default value so as not to break API compatibility, but I'm not >> sure on the details yet. > > What about an enum type for that? It could take the values AUTOMATIC (which > would be the default and mimic the current implementation), SERIAL, PARALLEL, > and GHOSTED. Also, the vector should then store this information (preferably > in the base class) and make it readable. This would then also replace my > is_ghosted() method. This sounds like a big improvement over my boolean idea. Unless anyone else has input, I'll implement this tomorrow. --- Roy |
From: Roy S. <roy...@ic...> - 2009-02-03 18:28:53
Attachments:
roypatch-feb3
|
On Sat, 31 Jan 2009, Roy Stogner wrote: > On Sat, 31 Jan 2009, Tim Kroeger wrote: > >> On Fri, 30 Jan 2009, Roy Stogner wrote: >> >>> I think we need another argument to init. Probably a boolean with a >>> sane default value so as not to break API compatibility, but I'm not >>> sure on the details yet. >> >> What about an enum type for that? It could take the values AUTOMATIC >> (which would be the default and mimic the current implementation), SERIAL, >> PARALLEL, and GHOSTED. Also, the vector should then store this information >> (preferably in the base class) and make it readable. This would then also >> replace my is_ghosted() method. > > This sounds like a big improvement over my boolean idea. Unless > anyone else has input, I'll implement this tomorrow. This is all in svn now. Everything still seems to be working, but I've been making a lot of commits to core parts of the code lately; people should feel free to yell at me if they think they've found a regression. I'm also open to suggestions as to how to make debug-mode gcc shut up about a hidden overloaded NumericVector::init() version. How does a function with arguments "int, int, vector, bool, enum" hide a function with arguments "NumericVector, bool"? Even with implicit Vector/vector constructors from int and with default function arguments for the bool and enum, there should be no ambiguity here... The part of the code that really *is* a regression is attached, as yet-another patch version. As far as I can tell, the most prominent (but hopefully increasingly lonely) current bug is that PetscVector::localize() doesn't seem to properly update a GHOSTED current_local_solution vector from a PARALLEL solution vector; it seems to just leave current_local_solution with zeros. Not sure how soon I'll get time to work on this again... Tim, could you take a look? --- Roy |
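The hiding gcc warns about is by name, not by signature: a derived-class init() hides *all* base-class init() overloads during name lookup, before overload resolution ever sees them, so there need be no ambiguity for the warning to fire. A using-declaration re-exposes the base overloads. A minimal illustration with toy classes (not the actual libMesh hierarchy):

```cpp
#include <cassert>
#include <vector>

struct Base {
  virtual ~Base() = default;
  // Base overload taking another vector-like object.
  virtual int init(const Base&, bool /*fast*/ = false) { return 1; }
};

struct Derived : Base {
  // Without this using-declaration, the overload below hides
  // Base::init entirely -- name lookup stops at the first scope
  // containing "init" -- which is what -Woverloaded-virtual flags.
  using Base::init;

  // A new overload with a completely different signature still
  // hides the base version; the two signatures never compete in
  // overload resolution without the using-declaration above.
  int init(int /*n*/, int /*n_local*/,
           const std::vector<int>& /*ghosts*/ = {},
           bool /*fast*/ = false) { return 2; }
};
```

Here `d.init(b)` compiles only because of `using Base::init;`; delete that line and the call fails even though the signatures could never be confused.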
From: Tim K. <tim...@ce...> - 2009-02-05 16:21:00
|
On Tue, 3 Feb 2009, Roy Stogner wrote: > The part of the code that really *is* a regression is attached, as > yet-another patch version. As far as I can tell, the most prominent > (but hopefully increasingly lonely) current bug is that > PetscVector::localize() doesn't seem to properly update a GHOSTED > current_local_solution vector from a PARALLEL solution vector; it > seems to just leave current_local_solution with zeros. Not sure how > soon I'll get time to work on this again... Tim, could you take a > look? Now that I read this mail again, I'm actually quite sure that the bug you were observing was exactly due to my bug in close() (item (2) of last mail). Sorry for keeping you in the dark... Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Roy S. <roy...@ic...> - 2009-02-04 23:38:52
Attachments:
roypatch-feb4
|
On Tue, 3 Feb 2009, Roy Stogner wrote: > The part of the code that really *is* a regression is attached, as > yet-another patch version. As far as I can tell, the most prominent > (but hopefully increasingly lonely) current bug is that > PetscVector::localize() doesn't seem to properly update a GHOSTED > current_local_solution vector from a PARALLEL solution vector; it > seems to just leave current_local_solution with zeros. Found two bugs in new libMesh code; one in the way the new DofMap was building a smaller send_list; the other in the localize() function. An svn diff with both bugs fixed is attached. There's still something wrong with the localize() called from System::update(), but now it's not a bug so much as my failure to understand how PETSc is supposed to work: in a simple failing test case (the first update() in ex14 run on 2 procs), the index sets appear to be built correctly, but the scatter doesn't update current_local_solution the way I expect it to. The bug's either in localize() or in operator()(), both of which are fairly small simple functions but which are based on PETSc APIs I'm not familiar enough with. --- Roy |
From: Tim K. <tim...@ce...> - 2009-02-05 16:17:29
Attachments:
tims-feb5-change-of-roys-feb4-patch
|
Dear Roy, Thank you for your work and your patch. I have looked at it and have the following comments: (1) In the ghosted version of PetscVector::init(), you have placed an assert that checks whether the mesh is disjoint. Why is this necessary? (2) In PetscVector::close(), I currently wonder why I used SCATTER_REVERSE in the call to VecGhostUpdateBegin/End(). This seems to be wrong; it should be SCATTER_FORWARD instead. (3) It so happened that I looked at PetscVector<T>::operator=(const std::vector<T>& v) and noticed that in the ghosted case, it might be appropriate to set _is_closed to false. This is because for ghosted vectors, close() will also update the ghost values automatically. What do you think? I attached a patch that fixes (2), but again, you might find it more convenient to perform that change manually. (The patch doesn't do anything else.) On Wed, 4 Feb 2009, Roy Stogner wrote: > There's still something wrong with the localize() called from > System::update(), but now it's not a bug so much as my failure to > understand how PETSc is supposed to work: in a simple failing test > case (the first update() in ex14 run on 2 procs), the index sets > appear to be built correctly, but the scatter doesn't update > current_local_solution the way I expect it to. The bug's either in > localize() or in operator()(), both of which are fairly small simple > functions but which are based on PETSc APIs I'm not familiar enough > with. Might it be that this was just due to my item (2) above? Actually, for ghosted vectors, I think localize() isn't required to be called at all, is it? I mean, close() will update the ghost values anyway, and there is nothing more to be updated. Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
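The scatter direction in Tim's item (2) matters because the two directions do opposite things: a forward ghost update pushes each owner's value out to its ghost copies (what close() needs after assembly), while a reverse update accumulates ghost contributions back into the owner. A toy model of the two directions, without PETSc (VecGhostUpdateBegin/End with SCATTER_FORWARD vs. SCATTER_REVERSE behave analogously on real ghosted Vecs, with MPI in between):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy ghost exchange: each (owner_slot, ghost_slot) pair links an
// owned entry with its ghost copy.  Everything lives in one array
// here; on real hardware the slots sit in different address spaces.
using Links = std::vector<std::pair<int, int>>;

// Forward scatter: owner -> ghost (refresh the ghost copies).
// This is what a ghosted close() wants.
void ghost_update_forward(std::vector<double>& v, const Links& links) {
  for (const auto& l : links)
    v[l.second] = v[l.first];
}

// Reverse scatter: ghost -> owner (accumulate contributions).
// Using this in close() -- Tim's bug -- adds whatever stale values
// sit in the ghost slots into the owned entries instead of
// refreshing the ghosts, then leaves the ghosts unrefreshed.
void ghost_update_reverse(std::vector<double>& v, const Links& links) {
  for (const auto& l : links) {
    v[l.first] += v[l.second];
    v[l.second] = 0.0;
  }
}
```

Reverse mode has its legitimate use after additive assembly into ghost slots; it is only wrong as the final "make the ghosts consistent" step.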
From: Roy S. <roy...@ic...> - 2009-02-05 16:43:57
|
On Thu, 5 Feb 2009, Tim Kroeger wrote: > (1) In the ghosted version of PetscVector::init(), you have placed an assert > that checks whether the mesh is disjoint. Why is this necessary? No - it's an assert that checks whether someone's trying to initialize a disjoint ghosted vector, which should never happen unless the mesh is disjoint. If for some reason someone really wants to run a fully coupled system on a disjoint mesh, then they take out the assert. If someone's running on a connected mesh and that assert trips, then it's caught a problem with the code. > (2) In PetscVector::close(), I currently wonder why I used SCATTER_REVERSE in > the call to VecGhostUpdateBegin/End(). This seems to be wrong, it should be > SCATTER_FORWARD instead. Thanks; I'll fix that now. > (3) It so happened that I looked at PetscVector<T>::operator=(const > std::vector<T>& v) and noticed that in the ghosted case, it might be > appropiate to set _is_closed to false. This is because for ghosted vectors, > close() will also update the ghost values automatically. What do you think? Hmmm... we don't want to change _is_closed (requiring the user to close() after operator= would be an API change, which I'd like to avoid as much as possible), but it looks like we need to do *something* within operator= itself - as is it won't be setting the ghosted values at all. > On Wed, 4 Feb 2009, Roy Stogner wrote: > >> There's still something wrong with the localize() called from >> System::update(), but now it's not a bug so much as my failure to >> understand how PETSc is supposed to work: in a simple failing test >> case (the first update() in ex14 run on 2 procs), the index sets >> appear to be built correctly, but the scatter doesn't update >> current_local_solution the way I expect it to. The bug's either in >> localize() or in operator()(), both of which are fairly small simple >> functions but which are based on PETSc APIs I'm not familiar enough >> with. 
> > Might it be that this was just due to my item (2) above? I don't think so - the failure was either *within* the localize function (whose results I tested before the function even exited) or within the operator() function (which I used to perform the tests). But let me check... Nope, ex14 gives the same problem even after your close() fix. > Actually, for ghosted vectors, I think localize() isn't required to > be called at all, is it? Baby steps, baby steps. Right now we've turned current_local_solution and kin from serial vectors into ghosted vectors, but for that to work properly we need correct behavior from solution->localize(current_local_solution, send_list) Eventually, we'll get rid of current_local_solution entirely and just use ghosted vectors everywhere we use parallel vectors now... but that's too big a change to make in one jump without testing, and it's a change I don't want to make at all until our EpetraVector interface (and preferably our DistributedVector, but that can wait) supports it. --- Roy |
From: Tim K. <tim...@ce...> - 2009-02-06 07:47:30
|
Dear Roy, On Thu, 5 Feb 2009, Roy Stogner wrote: > On Thu, 5 Feb 2009, Tim Kroeger wrote: > >> (3) It so happened that I looked at PetscVector<T>::operator=(const >> std::vector<T>& v) and noticed that in the ghosted case, it might be >> appropriate to set _is_closed to false. This is because for ghosted >> vectors, close() will also update the ghost values automatically. What do >> you think? > > Hmmm... we don't want to change _is_closed (requiring the user to > close() after operator= would be an API change, which I'd like to > avoid as much as possible), but it looks like we need to do > *something* within operator= itself - as is it won't be setting the > ghosted values at all. Well, what about calling this->close() inside operator=()? Or, if that seems too hard for you, copy the ghost-specific part of close() to operator=(). >> Might it be that this was just due to my item (2) above? > > Nope, ex14 gives the same problem even after your close() fix. Can you send me a minimal example for that problem? I'm not sure whether I can help you, but I could at least try. >> Actually, for ghosted vectors, I think localize() isn't required to >> be called at all, is it? > > Baby steps, baby steps. Sure, but what I mean is, wouldn't it make sense to do something like the following:

  template <typename T>
  void PetscVector<T>::localize (NumericVector<T>& v_local_in) const
  {
    if (this->type() == GHOSTED)
      {
        v_local_in = *this;
        v_local_in.close();
      }
    else
      {
        // current code to be placed here.
      }
  }

Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Roy S. <roy...@ic...> - 2009-02-06 19:11:37
Attachments:
ghost_activation_patch
|
On Fri, 6 Feb 2009, Tim Kroeger wrote: > On Thu, 5 Feb 2009, Roy Stogner wrote: > >> Hmmm... we don't want to change _is_closed (requiring the user to >> close() after operator= would be an API change, which I'd like to >> avoid as much as possible), but it looks like we need to do >> *something* within operator= itself - as is it won't be setting the >> ghosted values at all. > > Well, what about calling this->close() inside operator=()? That'll work. >> Nope, ex14 gives the same problem even after your close() fix. > > Can you send me a minimal example for that problem? I'm not sure whether I > can help you, but I could at least try. Actually, it looks like you've already helped me: close() inside operator= didn't do it, but I didn't realize you meant we should do a close inside localize() as well. I wish the PETSc docs made it more clear that that was necessary (Scatter just doesn't set any ghost nodes without a GhostUpdate too, huh?), but whatever fix works is fine by me. This particular fix worked for the problem at hand, but ex14 still isn't working past the first solve - I suspect it may be a flaw in the new projection case; I'll try to track it down this weekend. To make these patches shorter, I've separated out all the new code that either clearly works (smaller send_lists), is only for debugging (weaker asserts) or doesn't have to be used by default (more ghosted vector implementation code and fixes) and committed it to svn. The remaining patch (that just switches to ghosted initialization for formerly serial vectors) is attached. --- Roy |
From: Roy S. <roy...@ic...> - 2009-02-06 20:27:31
|
On Fri, 6 Feb 2009, Roy Stogner wrote: > This particular fix worked for the problem at hand, but ex14 still > isn't working past the first solve - I suspect it may be a flaw in the > new projection case; I'll try to track it down this weekend. ex14 is fixed now; I hadn't completely fixed the bug in ghosted close(), and that was ruining ghost dof values after multiple syncs. I suspect that ghosted current_solutions are now working for steady problems. Transient stuff like ex10 is still subtly broken, though; there's got to still be a bug in projection of ghosted vectors. --- Roy |
From: Tim K. <tim...@ce...> - 2009-02-18 14:43:13
|
Dear Roy, What's the current state of the ghosted vectors? Are you waiting for some input from me, or did you just not have time in the meantime? Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Roy S. <roy...@ic...> - 2009-02-18 14:52:15
|
On Wed, 18 Feb 2009, Tim Kroeger wrote: > What's the current state of the ghosted vectors? They appear to be working for stationary problems, but there's definitely a bug which corrupts transient results - possibly in the System::project_vector code, possibly in the enforce_constraints_exactly it calls, probably in both since the ghost-specific code is very similar. > Are you waiting for some input from me, or did you just not have > time in the meantime? I'm swamped right now. The next step is to reduce ex10 to a simpler case (maybe solving du/dt = 0 in nD with some non-zero ICs?) where I can step through the whole projection in gdb, but it's likely to be weeks before I can get to that. --- Roy |
From: Roy S. <roy...@ic...> - 2009-02-24 22:56:44
|
On Wed, 18 Feb 2009, Roy Stogner wrote: > I'm swamped right now. The next step is to reduce ex10 to a simpler > case (maybe solving du/dt = 0 in nD with some non-zero ICs?) where I > can step through the whole projection in gdb, but it's likely to be > weeks before I can get to that. I apparently underestimated the productivity benefits of trying to avoid twiddling my thumbs while the supercomputer I'm using undergoes maintenance. The problem in ex10 (and in my reduced version of it) was this line:

  *system.old_local_solution = *system.current_local_solution;

which correctly copied local but not ghosted degree of freedom values. Explicitly correcting ghost dofs at the end of operator=() fixed it:

  // Make sure ghost dofs are copied
  if (this->type() == GHOSTED)
    this->close();

And ex10 seems to be working correctly now as well, as do the rest of the example codes. I'll keep trying to trigger bugs with my application codes, and I'll do some performance benchmarking, before I consider turning on ghosted vectors in SVN. But additional user testing would be nice too. I think I'll make --enable-ghosted-vectors a configure option to make the code easier to play with while we're still not confident about it. Besides, this can't be the right fix, can it? Why should any parallel communication be required to copy one consistent ghosted vector to another? It seems as if VecCopy is simply failing to copy the whole vec, and I don't like not understanding why. --- Roy |
From: Roy S. <roy...@ic...> - 2009-02-25 00:15:05
|
On Tue, 24 Feb 2009, Roy Stogner wrote: > I think I'll make --enable-ghosted-vectors a configure option to > make the code easier to play with while we're still not confident > about it. Quick update: It's ./configure --enable-ghosted It's not turned on by default or by --enable-everything --- Roy |
From: Tim K. <tim...@ce...> - 2009-02-25 10:19:09
|
On Tue, 24 Feb 2009, Roy Stogner wrote: > On Wed, 18 Feb 2009, Roy Stogner wrote: > >> I'm swamped right now. The next step is to reduce ex10 to a simpler >> case (maybe solving du/dt = 0 in nD with some non-zero ICs?) where I >> can step through the whole projection in gdb, but it's likely to be >> weeks before I can get to that. > > I apparently underestimated the productivity benefits of trying to > avoid twiddling my thumbs while the supercomputer I'm using undergoes > maintenance. Sounds great, although that sentence seems to have a wrong parity of negations. Anyway, great that you found the time to work on ghosted vectors. I'm currently trying to test the ghosted vectors on my application, but the queue on the cluster is full, so I have to wait. > Besides, this can't be the right fix, can it? Why should any parallel > communication be required to copy one consistent ghosted vector to > another? It seems as if VecCopy is simply failing to copy the whole > vec, and I don't like not understanding why. Good question. Perhaps one should apply VecCopy to the local form (obtained via VecGhostGetLocalForm())? Also, I wonder whether other methods, such as VecScale(), might possibly suffer from similar problems? I'll ask in the petsc-users list and let you know what they say. Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Jed B. <je...@59...> - 2009-02-25 10:39:47
|
On Wed 2009-02-25 11:18, Tim Kroeger wrote: > On Tue, 24 Feb 2009, Roy Stogner wrote: > > Besides, this can't be the right fix, can it? Why should any parallel > > communication be required to copy one consistent ghosted vector to > > another? It seems as if VecCopy is simply failing to copy the whole > > vec, and I don't like not understanding why. > > Good question. Perhaps one should apply VecCopy to the local form > (obtained via VecGhostGetLocalForm())? Also, I wonder whether other > methods, such as VecScale(), might possibly suffer from similar > problems? Roy, are you familiar with the distinction between the local form and the global form? One chunk of memory is allocated for the owned portion of the global degrees of freedom plus the ghosts. The local form is a sequential vector (it doesn't know anything about other processes, and operations on it do not need to be called collectively) which includes all the ghosted values. The global form is a parallel vector (operations need to be called collectively since it uses a parallel communicator) that only uses the part of the vector corresponding to owned degrees of freedom. This means that operations on the parallel vector behave just like those on an unghosted parallel vector: they don't touch the ghost degrees of freedom. Operations on the local form will of course involve the ghosted values; it doesn't even know the difference between owned and ghosted dofs. The VecGhost* functions are the only ones that understand this relationship. I hope that helps. Jed |
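Jed's memory layout can be modelled in a few lines: one buffer holds the owned entries followed by the ghosts; the "global form" is a view of just the owned prefix, the "local form" a view of the whole buffer. An operation routed through the global view (as VecCopy on the parallel Vec is) never touches the ghost tail, which is exactly the stale-ghost behavior Roy saw. Toy sketch, no PETSc (the real Vecs share storage this way via VecGhostGetLocalForm):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One allocation per process: owned degrees of freedom first,
// then the ghost copies.
struct GhostedVec {
  std::vector<double> buf;   // size = n_owned + n_ghost
  std::size_t n_owned;

  // "Global form" copy: sees only the owned prefix, like
  // VecCopy on the parallel Vec.  Ghost slots keep old values.
  void copy_global(const GhostedVec& src) {
    for (std::size_t i = 0; i < n_owned; ++i)
      buf[i] = src.buf[i];
  }

  // "Local form" copy: sees owned + ghost entries, like copying
  // through the sequential local form.  No communication needed,
  // because the source's ghosts are already consistent.
  void copy_local(const GhostedVec& src) {
    buf = src.buf;
  }
};
```

So copying one *consistent* ghosted vector to another needs no parallel communication after all; it just has to go through the local form rather than the global one.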
From: Tim K. <tim...@ce...> - 2009-02-25 16:10:37
|
Dear Roy, On Wed, 25 Feb 2009, Tim Kroeger wrote: > I'm currently trying to test the ghosted vectors on my application, > but the queue on the cluster is full, so I have to wait. Update: My application crashes with the ghosted vectors, but I'm not yet sure whether it's a bug in the ghosted code or a bug in my application that just got triggered by the ghost dofs. I'll try to track this down, but that will be next week. >> Besides, this can't be the right fix, can it? Why should any parallel >> communication be required to copy one consistent ghosted vector to >> another? It seems as if VecCopy is simply failing to copy the whole >> vec, and I don't like not understanding why. > > Good question. Perhaps one should apply VecCopy to the local form > (obtained via VecGhostGetLocalForm())? Also, I wonder whether other > methods, such as VecScale(), might possibly suffer from similar > problems? > > I'll ask in the petsc-users list and let you know what they say. The PETSc people say that the local form has to be used (or VecGhostUpdate*(), as we used in PetscVector::close()). See also Jed's reply. I'll go through the PetscVector code once more and change this in all methods that are expected to work without parallel communication. This will most probably also happen next week. Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Tim K. <tim...@ce...> - 2009-03-02 13:43:04
Attachments:
patch
|
Dear Roy, On Wed, 25 Feb 2009, Tim Kroeger wrote: > On Wed, 25 Feb 2009, Tim Kroeger wrote: > >>> Besides, this can't be the right fix, can it? Why should any parallel >>> communication be required to copy one consistent ghosted vector to >>> another? It seems as if VecCopy is simply failing to copy the whole >>> vec, and I don't like not understanding why. >> >> Good question. Perhaps one should apply VecCopy to the local form >> (obtained via VecGhostGetLocalForm())? Also, I wonder whether other >> methods, such as VecScale(), might possibly suffer from similar >> problems? >> >> I'll ask in the petsc-users list and let you know what they say. > > The PETSc people say that the local form has to be used (or > VecGhostUpdate*(), as we used in PetscVector::close()). See also > Jed's reply. > > I'll go through the PetscVector code once more and change this in all > methods that are expected to work without parallel communication. > This will most probably also happen next week. I've done this now, see attachment. I explicitly used the local form for the ghosted case in the following methods:

  PetscVector::zero(void)
  PetscVector::add(const T)
  PetscVector::add(const T, const NumericVector&)
  PetscVector::abs(void)
  PetscVector::operator=(const T)
  PetscVector::operator=(const PetscVector&)
  PetscVector::pointwiseMult(const NumericVector&, const NumericVector&)

I wonder whether we should do this in PetscVector::operator=(const std::vector<T>&). At least in its "case 1", it should be possible to do without parallel communication using the local form, but in "case 2", it's of course impossible. I left this method unchanged, since the version using this->close() should certainly work as well. All the other functions seem not to require the local form to me, because they either (a) require closing the vector afterwards, or (b) perform parallel communication, or (c) do not modify the vector, or (d) do not directly use PETSc methods. 
It would be nice if you would double-check my changes before committing. One never knows. Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
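The pattern behind Tim's patch is the same for every method in his list: bracket the pointwise operation with "get the local form / restore it" so the operation runs over owned *and* ghost entries, keeping the ghosts consistent with no communication. A toy version of that pattern (real code would use VecGhostGetLocalForm/VecGhostRestoreLocalForm around VecSet, VecScale, etc.; the class and member names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy ghosted vector: owned entries followed by ghost copies.
struct ToyGhosted {
  std::vector<double> buf;
  std::size_t n_owned;
  bool ghosted;

  // Apply a pointwise operation.  For a ghosted vector we act on
  // the full "local form" (owned + ghosts), so the ghost copies
  // stay consistent without any parallel communication.  For an
  // unghosted vector only the owned entries exist.
  void apply(const std::function<void(double&)>& op) {
    std::size_t n = ghosted ? buf.size() : n_owned;
    for (std::size_t i = 0; i < n; ++i)
      op(buf[i]);
  }

  void zero()          { apply([](double& x) { x = 0.0; }); }
  void scale(double s) { apply([s](double& x) { x *= s; }); }
};
```

Had scale() gone only through the global form, the ghost slot would keep its unscaled value, which is the "totally wrong results" failure mode discussed later in the thread.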
From: Tim K. <tim...@ce...> - 2009-03-02 15:02:44
|
Dear Roy, On Wed, 25 Feb 2009, Tim Kroeger wrote: > On Wed, 25 Feb 2009, Tim Kroeger wrote: > >> I'm currently trying to test the ghosted vectors on my application, >> but the queue on the cluster is full, so I have to wait. > > Update: My application crashes with the ghosted vectors, but I'm not > yet sure whether it's a bug in the ghosted code or a bug in my > application that just got triggered by the ghost dofs. I'll try to > track this down, but that will be next week. Update: It doesn't crash any more. With the patch that I sent you in the previous mail, it seems to work. Well, at least the first 8 time steps of my application do the same as they did without the ghost dofs. And the memory requirements are less than 50% of before (on 8 processors). However, (a) the results do not *exactly* coincide with the previous ones; that is, I get e.g. a residual norm of 1.16562e-08 instead of 1.16561e-08 after 608 linear iterations (although using the same number of processors), and (b) I had to add a.close() between a=b and a.scale(), where a and b are *System::solution of two different systems. Both facts surprise me. In particular, (b) surprises me for two reasons, that is (b1) PetscVector::operator=(const PetscVector&) should not require closing the vector afterwards, and (b2) I wonder whether (and, if yes, why) you are using ghosted vectors for System::solution; I thought you were using them only for System::current_local_solution. What do you think? Does this look like more errors or not? Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Roy S. <roy...@ic...> - 2009-03-03 04:26:08
Attachments:
vectest.C
|
On Mon, 2 Mar 2009, Tim Kroeger wrote: > Update: It doesn't crash any more. With the patch that I sent you in the > previous mail, it seems to work. That's interesting. Did you ever track down the source of the problem? The patch you provided looks like it could have fixed some inaccuracy bugs with ghosted vectors, but I can't see what code path in the old code might have triggered an actual crash. The patch looks good on first read and run through. Could use a little refactoring, but so could our whole PetscVector interface to begin with. I'll commit it some time this week if I don't find any subtle problems first. > Well, at least the first 8 time steps of my application do the same > as they did without the ghost dofs. And the memory requirements are > less than 50% of before (on 8 processors). Really? That's very good to hear. Is that with SerialMesh? > However, (a) the results do not *exactly* coincide with the previous ones; > that it, I get e.g. a residual norm of 1.16562e-08 instead of 1.16561e-08 > after 608 linear iterations (although using the same number of processors), Hmm... maybe we still have some inaccuracy bugs in there? There aren't any other confounding changes? If you configure using --disable-ghosted you get back the old results? > and (b) I had to add a.close() between a=b and a.scale(), where a and b are > *System::solution of two different systems. The systems are identical? Same mesh, same variable types added in the same order? > Both facts surprise me. In particular, (b) surprises me for two reasons, > that is (b1) PetscVector::operator=(const PetscVector&) should not require > closing the vector afterwards, I wouldn't think so either, but I'm not sure how best to divine PETSc standards there. The VecCopy man page doesn't mention needing VecAssembly afterwards, but then again neither do the VecSet* pages; there's just the brief discussion in the user's manual. 
> and (b2) I wonder whether (and, if yes, why) you are using ghosted > vectors for System::solution; I thought you were using them only for > System::current_local_solution. No, I'm not. I want to use ghosted vectors for System::solution (and get rid of current_local_solution entirely) in the long run, but scan for PARALLEL in system.C; there shouldn't be any ghosting there yet... unless our massive overloading of init() is backfiring somehow. > What do you think? Does this look like more errors or not? I don't know what to think yet... but I actually can't replicate problem (b) myself with the attached vectest.C - I assume it bombs out for you with the usual PETSc "object in wrong state" error? --- Roy |
From: Tim K. <tim...@ce...> - 2009-03-05 14:59:58
|
Dear Roy, On Mon, 2 Mar 2009, Roy Stogner wrote: > On Mon, 2 Mar 2009, Tim Kroeger wrote: > >> Update: It doesn't crash any more. With the patch that I sent you in the >> previous mail, it seems to work. > > That's interesting. Did you ever track down the source of the > problem? The patch you provided looks like it could have fixed some > inaccuracy bugs with ghosted vectors, but I can't see what code path > in the old code might have triggered an actual crash. Well, actually the crash was due to the missing a.close() between a=b and a.scale(). After adding that, but before my latest patch, it didn't crash any more, but it produced totally wrong results. That led me to the idea that PetscVector::scale() (and hence other methods) should use the local form. >> Well, at least the first 8 time steps of my application do the same >> as they did without the ghost dofs. And the memory requirements are >> less than 50% of before (on 8 processors). > > Really? That's very good to hear. Yes. Well, the point is that my application uses a large number of systems, some of which have a lot of additional vectors. I expected it to be like this; that's why I pushed the ghosted vectors. (-: > Is that with SerialMesh? Yes. However, I now have another problem: After the first ~20 time steps of my application, it starts to refine the grid and -- crashes. I have no idea why, but I will try to track that down. >> However, (a) the results do not *exactly* coincide with the previous ones; >> that is, I get e.g. a residual norm of 1.16562e-08 instead of 1.16561e-08 >> after 608 linear iterations (although using the same number of processors), > > Hmm... maybe we still have some inaccuracy bugs in there? There > aren't any other confounding changes? If you configure using > --disable-ghosted you get back the old results? I haven't tested this yet. I will try to find the cause of the crash first. (Re-configuring takes a lot of time, you know.) 
>> and (b) I had to add a.close() between a=b and a.scale(), where a and b are >> *System::solution of two different systems. > > The systems are identical? Same mesh, same variable types added in > the same order? The mesh is the same, and also the variables. One is an ExplicitSystem, the other a LinearImplicitSystem, but that shouldn't be important, should it? >> Both facts surprise me. In particular, (b) surprises me for two reasons, >> that is (b1) PetscVector::operator=(const PetscVector&) should not require >> closing the vector afterwards, > > I wouldn't think so either, but I'm not sure how best to divine PETSc > standards there. The VecCopy man page doesn't mention needing > VecAssembly afterwards, but then again neither do the VecSet* pages; > there's just the brief discussion in the user's manual. > > I don't know what to think yet... but I actually can't replicate > problem (b) myself with the attached vectest.C - I can't either. Very strange. > I assume it bombs out > for you with the usual PETSc "object in wrong state" error? Yes, that's what it did. I'll keep you informed. Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: Roy S. <roy...@ic...> - 2009-03-05 16:36:59
Attachments:
vectest2.C
|
On Thu, 5 Mar 2009, Tim Kroeger wrote:

> Well, actually the crash was due to the missing a.close() between
> a=b and a.scale(). After adding that, but before my latest patch,
> it didn't crash any more, but it produced totally wrong results.
> That led me to the idea that PetscVector::scale() (and hence other
> methods) should use the local form.

This is surprising. What were you calling scale() on? As of now the
only ghosted vectors are *supposed* to be current_local_solution and
kin.

> However, I now have another problem: After the first ~20 time steps
> of my application, it starts to refine the grid and -- crashes. I
> have no idea why, but I will try to track that down.

Thanks. The ghosted vectors seem to do fine with our AMR/C examples
and with my apps, and without a test case demonstrating the bug I
don't know where I'd begin to fix it.

>>> However, (a) the results do not *exactly* coincide with the
>>> previous ones; that is, I get e.g. a residual norm of 1.16562e-08
>>> instead of 1.16561e-08 after 608 linear iterations (although using
>>> the same number of processors),
>>
>> Hmm... maybe we still have some inaccuracy bugs in there? There
>> aren't any other confounding changes? If you configure using
>> --disable-ghosted you get back the old results?
>
> I haven't tested this yet. I will try to find the cause of the
> crash first.

Right, first things first.

> (Re-configuring takes a lot of time, you know.)

Actually, I do all these big tests on a cluster with distcc set up.
"make -j 50" is *fantastic*. ;-)

>>> and (b) I had to add a.close() between a=b and a.scale(), where a
>>> and b are *System::solution of two different systems.
>>
>> The systems are identical? Same mesh, same variable types added in
>> the same order?
>
> The mesh is the same, and also the variables. One is an
> ExplicitSystem, the other a LinearImplicitSystem, but that shouldn't
> be important, should it?

No, they should have the same vector index ordering.

>> I don't know what to think yet... but I actually can't replicate
>> problem (b) myself with the attached vectest.C -
>
> I can't either. Very strange.

Odd. How about with this vectest2.C? I can't even get problem (b) to
trigger with ghosted vectors!

On the other hand, here's something even more odd: I just realized
that I'm not properly matching the ghosted vector init() function
signature. The compiler must be deciding to convert GHOSTED to true
and then using the default AUTOMATIC for the fifth argument? That
wouldn't explain any crashes, but if we're fast-initializing vectors
we should be clearing, that might explain your slightly-off
residuals. I'll fix it in SVN.

--- Roy
|
From: Tim K. <tim...@ce...> - 2009-03-06 14:47:47
|
Dear Roy,

On Thu, 5 Mar 2009, Roy Stogner wrote:

> On Thu, 5 Mar 2009, Tim Kroeger wrote:
>
>> Well, actually the crash was due to the missing a.close() between
>> a=b and a.scale(). After adding that, but before my latest patch,
>> it didn't crash any more, but it produced totally wrong results.
>> That led me to the idea that PetscVector::scale() (and hence other
>> methods) should use the local form.
>
> This is surprising. What were you calling scale() on? As of now
> the only ghosted vectors are *supposed* to be current_local_solution
> and kin.

It was the "solution" vector, not "current_local_solution". I don't
know why it crashed there, but that was reproducible. Perhaps I did
something else wrong. (In earlier times, when I didn't quite
understand the difference between these two vectors, I think I
sometimes cross-assigned one of them to the other in a different
system. Perhaps such an assignment still exists somewhere hidden in
my code.)

>> However, I now have another problem: After the first ~20 time steps
>> of my application, it starts to refine the grid and -- crashes. I
>> have no idea why, but I will try to track that down.
>
> Thanks. The ghosted vectors seem to do fine with our AMR/C examples
> and with my apps, and without a test case demonstrating the bug I
> don't know where I'd begin to fix it.

I'm still working on tracking it down. The problem is that the
application runs for more than 2 hours before it crashes, and I
couldn't find a way to make it crash earlier. In particular, I
noticed that the grid is successfully refined two or three times
before the crash. But I won't capitulate.

>> (Re-configuring takes a lot of time, you know.)
>
> Actually, I do all these big tests on a cluster with distcc set up.
> "make -j 50" is *fantastic*. ;-)

Good idea! I will keep that in mind for the next time. (distcc isn't
installed, but the master node has several CPUs as well.)

> Odd. How about with this vectest2.C? I can't even get problem (b)
> to trigger with ghosted vectors!

Nor can I, other than in my application.

> On the other hand, here's something even more odd: I just realized
> that I'm not properly matching the ghosted vector init() function
> signature. The compiler must be deciding to convert GHOSTED to true
> and then using the default AUTOMATIC for the fifth argument? That
> wouldn't explain any crashes, but if we're fast-initializing vectors
> we should be clearing, that might explain your slightly-off
> residuals. I'll fix it in SVN.

Did you fix that? "svn log" doesn't show anything from you later than
2009-03-04.

Best Regards,

Tim

--
Dr. Tim Kroeger                         tim...@me...
                                        tim...@ce...
Phone +49-421-218-7710                  Fax +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
|
From: Roy S. <roy...@ic...> - 2009-03-06 16:20:59
|
On Fri, 6 Mar 2009, Tim Kroeger wrote:

> On Thu, 5 Mar 2009, Roy Stogner wrote:
>
>> This is surprising. What were you calling scale() on? As of now
>> the only ghosted vectors are *supposed* to be
>> current_local_solution and kin.
>
> It was the "solution" vector, not "current_local_solution". I don't
> know why it crashed there, but that was reproducible. Perhaps I did
> something else wrong. (In earlier times, when I didn't quite
> understand the difference between these two vectors,

Don't feel bad about that. For my own part, I don't care too much
about serial vector memory usage - I'm mostly interested in getting
ghosted vectors working because I think the solution /
current_local_solution distinction is the second most confusing code
in the library, and is the most confusing code that most users have
to deal with. The sooner we can completely replace that dichotomy
with a single ghosted solution vector, the better.

> I think I sometimes cross-assigned one of them to the other in a
> different system. Perhaps such an assignment still exists somewhere
> hidden in my code.)

That could be a serious problem, yes. But wouldn't we be catching
that with a libmesh_assert()? The preconditions on operator= in
petsc_vector.C seem pretty thorough.

> I'm still working on tracking it down. The problem is that the
> application runs for more than 2 hours before it crashes, and I
> couldn't find a way to make it crash earlier.

I hope your app has restart file capabilities!

> In particular, I noticed that the grid is successfully refined two
> or three times before the crash. But I won't capitulate.

Thanks!

>> On the other hand, here's something even more odd: I just realized
>> that I'm not properly matching the ghosted vector init() function
>> signature. The compiler must be deciding to convert GHOSTED to
>> true and then using the default AUTOMATIC for the fifth argument?
>> That wouldn't explain any crashes, but if we're fast-initializing
>> vectors we should be clearing, that might explain your slightly-off
>> residuals. I'll fix it in SVN.
>
> Did you fix that? "svn log" doesn't show anything from you later
> than 2009-03-04.

Whoops! Fixed it, got distracted by other work, forgot to commit the
fix. I'll put in that (and your patch, which still looks good) today.

--- Roy
|