Re: [Libmesh-devel] here(), error(), & friends


On Sat, 1 Mar 2008, Benjamin Kirk wrote:

> std::set_terminate() returns a pointer to the function we are replacing,
> right?  Is it possible to use that information?

You're right; I wasn't looking at an up-to-date webpage.  We could
possibly handle the "user sets a terminate function before we can"
case by having our own terminate function call theirs first... except
that if their function kills the program, control never returns to
ours, so we never get to call MPI_Abort().  And since our proposed
terminate() function would itself be calling MPI_Abort() and killing
the program, it'd be hard to fault a user for writing a function that
kills the program themselves.
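
A minimal sketch of the chaining idea, to make the catch concrete.  The
names here (sketch::chained_terminate, sketch::abort_all_ranks, and so
on) are hypothetical and not libMesh code, and the MPI call is replaced
by a stand-in hook so the example builds without MPI; a real build
would presumably point that hook at something that calls
MPI_Abort(MPI_COMM_WORLD, 1).

```cpp
#include <cstdlib>
#include <exception>

namespace sketch {

// Whatever handler was installed before ours (possibly the user's).
std::terminate_handler previous_handler = nullptr;

// Stand-in for the job-wide abort; plain std::abort() keeps this standalone.
void default_abort() { std::abort(); }

// Hypothetical hook: an MPI build would replace this with a function
// that calls MPI_Abort(MPI_COMM_WORLD, 1).
void (*abort_all_ranks)() = default_abort;

// Call the previous handler first, as proposed in the text.  If that
// handler kills the program, the lines after it are never reached,
// which is exactly the weakness of this ordering.
void chained_terminate()
{
  if (previous_handler)
    previous_handler();
  abort_all_ranks();
  std::abort(); // a terminate handler must not return
}

// std::set_terminate() returns the handler being replaced; keep it.
void install_chained_terminate()
{
  std::terminate_handler old = std::set_terminate(chained_terminate);
  if (old != chained_terminate) // avoid chaining to ourselves on repeat calls
    previous_handler = old;
}

} // namespace sketch
```

Swapping the two calls in chained_terminate() would guarantee the
job-wide abort but would mean the user's handler never runs, so
neither ordering satisfies both parties.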

> Also, there is MPI_Comm_{get,set}_errhandler.  I would think we may
> be able to use this to get the current terminate and call it from
> our own MPI error handler??

I'll have to look at this, but I don't think it applies.  The MPI
error handler gets called when major errors in MPI functions occur,
right?  That would be orthogonal to the situation when an uncaught
exception is thrown because an internal libMesh error occurred.

>> Do nothing.  My MPI library is pretty good about figuring out that
>> when one process dies, the rest can't network write to it anymore and
>> should exit.  I'll bet other MPI libraries are just as good.  This is
>> basically what happens when there's a segfault, after all.  The "do
>> nothing" plan also appeals to my sense of laziness, so it's what I'll
>> do (or not do?) unless anyone objects.
>
> I need to think about it some more I guess...  Seems like I've been burnt by
> processes not getting killed a lot, but maybe not too much recently.

Hmm... okay, you're right.  Two weeks of uptime on my workstation and
I've apparently accumulated zombie processes from a dozen different
MPI runs.  The process attached to the terminal may die cleanly but
leave all the spawned processes behind.

I recall this having been a problem (well, irritation) for years, but
I'd never thought about how to try fixing it.  It seems like it'll be
a pain in the neck.  To get a dying process to clean up its kin, we'd
need to set a terminate handler, signal handlers for segfaults and
kill signals, and there still would be nothing we'd be able to do
about getting a "kill -9" on only one process.
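
A small sketch of the signal-handler part, assuming a POSIX system
(SIGKILL is not part of standard C++).  The names are hypothetical.
The handler here only records the signal number; a real one would have
to trigger the job-wide abort somehow, and calling MPI_Abort() directly
from a signal handler is not something I'd assume to be
async-signal-safe.

```cpp
#include <csignal>

namespace sketch {

// Last signal seen by the handler; sig_atomic_t is safe to write from one.
volatile std::sig_atomic_t last_signal = 0;

// Hypothetical handler: records the signal and returns.
void record_signal(int sig) { last_signal = sig; }

// Try to install the handler for each signal of interest and return
// how many installations succeeded.  SIGKILL is included on purpose:
// the system refuses to let it be caught.
int install_handlers(void (*handler)(int))
{
  const int signals[] = {SIGSEGV, SIGTERM, SIGINT, SIGKILL};
  int installed = 0;
  for (int sig : signals)
    if (std::signal(sig, handler) != SIG_ERR)
      ++installed;
  return installed;
}

} // namespace sketch
```

On a POSIX system the SIGKILL installation fails, which is the
"kill -9" case: the targeted process gets no chance to run any
cleanup at all.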

Perhaps you had it right after all with looking at the MPI error
handler.  All those spawned jobs should be blocked on one MPI call or
another, right?  *Those* are the processes that should be able to
figure out there's something wrong and to exit from it cleanly; forget
about the one that's already dying.  But we're already aborting on MPI
errors, so the MPI errors must never be generated to begin with.  I
don't suppose there's some MPI-standard way to help those processes
figure out that something's going wrong?
---
Roy