From: Roy S. <roy...@ic...> - 2008-03-01 20:03:56
On Sat, 1 Mar 2008, Benjamin Kirk wrote:

> std::set_terminate() returns a pointer to the function we are replacing,
> right? Is it possible to use that information?

You're right; I wasn't looking at an up-to-date webpage. We could
possibly handle the "user sets a terminate function before we can" case
by having our own terminate function call theirs first... except that
if their function kills the program, it never returns to ours so that
we can call MPI_Abort(). And since our proposed terminate() function
would be calling MPI_Abort() and killing the program, it'd be hard to
fault a user for writing a function that kills the program themselves.

> Also, there is MPI_Comm_{get,set}_errhandler. I would think we may
> be able to use this to get the current terminate and call it from
> our own MPI error handler??

I'll have to look at this, but I don't think it applies. The MPI error
handler gets called when major errors in MPI functions occur, right?
That would be orthogonal to the situation when an uncaught exception is
thrown because an internal libMesh error occurred.

>> Do nothing. My MPI library is pretty good about figuring out that
>> when one process dies, the rest can't network write to it anymore and
>> should exit. I'll bet other MPI libraries are just as good. This is
>> basically what happens when there's a segfault, after all. The "do
>> nothing" plan also appeals to my sense of laziness, so it's what I'll
>> do (or not do?) unless anyone objects.
>
> I need to think about it some more I guess... Seems like I've been burnt by
> processes not getting killed a lot, but maybe not too much recently.

Hmm... okay, you're right. Two weeks of uptime on my workstation and
I've apparently accumulated zombie processes from a dozen different MPI
runs. The process attached to the terminal may die cleanly but leave
all the spawned processes behind. I recall this having been a problem
(well, irritation) for years, but I'd never thought about how to try
fixing it.
It seems like it'll be a pain in the neck. To get a dying process to
clean up its kin, we'd need to set a terminate handler, signal handlers
for segfaults and kill signals, and there still would be nothing we'd
be able to do about getting a "kill -9" on only one process.

Perhaps you had it right after all with looking at the MPI error
handler. All those spawned jobs should be blocked on one MPI call or
another, right? *Those* are the processes that should be able to
figure out there's something wrong and to exit from it cleanly; forget
about the one that's already dying. But we're already aborting on MPI
errors, so the MPI errors must never be generated to begin with. I
don't suppose there's some MPI-standard way to help those processes
figure out that something's going wrong?
---
Roy
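
For concreteness, the terminate-handler chaining discussed in this
message could be sketched roughly as below. This is only an
illustration, not libMesh code: the namespace and the names
previous_handler, abort_all_ranks(), chained_terminate() and
install_chained_terminate() are all made up here, and abort_all_ranks()
is a stand-in for a real MPI_Abort(MPI_COMM_WORLD, 1) call so that the
sketch compiles without MPI.

```cpp
#include <cstdlib>
#include <exception>
#include <iostream>

namespace sketch {  // hypothetical namespace, not part of libMesh

// Whatever handler was installed before ours (possibly a user's).
std::terminate_handler previous_handler = nullptr;

// Stand-in for the real cleanup. In an MPI build this would be
// MPI_Abort(MPI_COMM_WORLD, 1); here it just aborts this process.
void abort_all_ranks()
{
  std::abort();
}

// Our handler: call the previous handler first, then abort every rank.
[[noreturn]] void chained_terminate()
{
  std::cerr << "Uncaught exception; terminating." << std::endl;

  // Caveat from the discussion: if the previous handler kills the
  // program itself (the usual default does), control never comes back
  // here and abort_all_ranks() below is never reached.
  if (previous_handler)
    previous_handler();

  abort_all_ranks();
  std::abort();  // not reached; keeps the [[noreturn]] promise explicit
}

// Install our handler and remember the one being replaced, which is
// what std::set_terminate() returns.
std::terminate_handler install_chained_terminate()
{
  previous_handler = std::set_terminate(chained_terminate);
  return previous_handler;
}

}  // namespace sketch
```

Calling sketch::install_chained_terminate() once during library
initialization would be the intended use. Note that it does not solve
the ordering problem described above; swapping the two calls just moves
the problem, since MPI_Abort() is not expected to return either.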