From: Erik H. <eah...@gm...> - 2005-06-22 06:01:37
On 6/21/05, Julian Seward <ju...@va...> wrote:
> I wrote a simple test program which simply consists of a
> spin-wait loop, then a bproc_move from front end to a slave
> node, and a second spin-wait loop which prints a progress
> message every second or so.
>
> The process is migrated correctly. However, after running
> on the slave for a few (20 ish?) seconds, it dies, with
> "Killed" printed. The amount of progress it makes before
> this happens varies from attempt to attempt, although it
> does not vary by much.
>
> Another ten or twenty seconds after "Killed" appears, the
> slave invariably reboots itself.

When a slave holding a process dies, the process looks like it got a
SIGKILL on the front end. There's no normal UNIX way to say something
like "the machine that process was on isn't with us anymore", so that
seemed like the next best thing to do.

> strace isn't helpful; it merely tells me the process is killed
> by SIGKILL, which is apparent anyway.
>
> Why does this happen? How can I avoid it? Given that the
> migration takes place OK, it feels like the master has asked
> the slave to reset itself as a result of some kind of timeout
> happening. When the slaves are idle they stay alive indefinitely
> with no such reboots.

I'm not sure why that's happening. Is there anything on the slave's
console? The 20-second interval sounds like the normal ping timeout
scenario between the master and a slave. Is it possible that the
slave daemon is getting starved somehow? Can you bpsh other things to
the node while your program is running?

> [I also don't understand why I can see printfs from the
> program after migration, given that "man 2 bproc_move" says
> "All open files are closed during migration."]

That's an inaccuracy, I suppose. If you don't specify any other I/O
setup, it takes stdout + stderr and feeds them to whatever the
process's original stdout was. It uses the socket that was used to
pass the process data to do this.
It's a crutch for little test programs like the one you wrote. The
I/O forwarding done by bpsh, mpirun, etc. doesn't work this way.

- Erik
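
The default stdout/stderr forwarding described above can be pictured
with ordinary POSIX calls. What follows is only a rough sketch under
stated assumptions, not BProc's actual code: a local socketpair stands
in for the socket used to move the process, a forked child stands in
for the migrated process, and run_with_forwarded_output is a
hypothetical helper name invented for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of BProc): run a child whose stdout and
 * stderr are both redirected into one end of a socket pair, and collect
 * whatever it writes from the other end.
 * Returns the number of bytes collected, or -1 on error. */
ssize_t run_with_forwarded_output(const char *msg, char *buf, size_t buflen)
{
    int sv[2];
    pid_t pid;
    size_t total = 0;
    ssize_t n;

    assert(msg != NULL && buf != NULL && buflen > 0);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: stands in for the migrated process. Its stdout and
         * stderr now both point at the socket. */
        close(sv[0]);
        dup2(sv[1], STDOUT_FILENO);
        dup2(sv[1], STDERR_FILENO);
        close(sv[1]);
        if (write(STDOUT_FILENO, msg, strlen(msg)) < 0)
            _exit(1);
        _exit(0);
    }

    /* Parent: stands in for the side holding the original stdout.
     * Read until the child closes its end or the buffer is full. */
    close(sv[1]);
    while (total < buflen - 1 &&
           (n = read(sv[0], buf + total, buflen - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';

    close(sv[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

Calling run_with_forwarded_output("progress 1\n", buf, sizeof buf)
leaves the child's message in buf; a real forwarder would then write
those bytes to the original stdout instead of returning them.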