On 6/21/05, Julian Seward <ju...@va...> wrote:
>
> I wrote a simple test program which simply consists of a
> spin-wait loop, then a bproc_move from front end to a slave
> node, and a second spin-wait loop which prints a progress
> message every second or so.
>
> The process is migrated correctly. However, after running
> on the slave for a few (20 ish?) seconds, it dies, with
> "Killed" printed. The amount of progress it makes before
> this happens varies from attempt to attempt, although it
> does not vary by much.
>
> Another ten or twenty seconds after "Killed" appears, the
> slave invariably reboots itself.
When a slave holding a process dies, the process looks like it got a
SIGKILL on the front end. There's no normal UNIX way to say something
like "the machine that process was on isn't with us anymore", so that
seemed like the next best thing to do.
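
For illustration only, here is a minimal local sketch of what "got a
SIGKILL" looks like to whatever is waiting on the process. It uses plain
Python on a single machine and no BProc at all: the explicit os.kill()
is an assumption standing in for the master deciding the node is gone,
and run_and_kill_child is a hypothetical helper name.

```python
import os
import signal
import time


def run_and_kill_child():
    """Fork a child that spins, kill it with SIGKILL, return the wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for the migrated process; loops until killed.
        while True:
            time.sleep(0.1)
    # Parent: stand-in for the front end. The kill here is only a local
    # substitute for "the node holding the process died".
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    status = run_and_kill_child()
    if os.WIFSIGNALED(status):
        print("child terminated by signal", os.WTERMSIG(status))
```

The waiter only learns that the process was terminated by SIGKILL;
nothing in the wait status can say why, which is why strace shows
nothing more useful either.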
> strace isn't helpful; it merely tells me the process is killed
> by SIGKILL, which is apparent anyway.
>
> Why does this happen? How can I avoid it? Given that the
> migration takes place OK, it feels like the master has asked
> the slave to reset itself as a result of some kind of timeout
> happening. When the slaves are idle they stay alive indefinitely
> with no such reboots.
I'm not sure why that's happening. Is there anything on the slave's
console? The 20-second interval sounds like the normal ping timeout
scenario between the master and a slave. Is it possible that the
slave daemon is getting starved somehow? Can you bpsh other things to
the node while your program is running?
> [I also don't understand why I can see printfs from the
> program after migration, given that "man 2 bproc_move" says
> "All open files are closed during migration."]
That's an inaccuracy, I suppose. If you don't specify any other I/O
setup, it takes stdout + stderr and feeds them to whatever the
process's original stdout was. It uses the socket that was used to
pass the process data to do this. It's a crutch for little test
programs like the one you wrote. The I/O forwarding done by bpsh,
mpirun, etc. doesn't work this way.
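
The general idea of pointing stdout and stderr at an already-open socket
can be sketched locally as below. This is not BProc's code and makes no
claim about how the real forwarding is implemented; it assumes a plain
socketpair() in place of the migration socket, and
run_child_with_forwarded_output is a hypothetical helper name.

```python
import os
import socket


def run_child_with_forwarded_output(message):
    """Fork a child whose stdout and stderr point at a socket, and
    return the bytes the parent receives from that socket."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for the process after migration.
        parent_sock.close()
        os.dup2(child_sock.fileno(), 1)  # stdout now goes to the socket
        os.dup2(child_sock.fileno(), 2)  # stderr now goes to the socket
        child_sock.close()
        os.write(1, message.encode())
        os._exit(0)
    # Parent: stand-in for the side that still holds the original stdout.
    child_sock.close()
    chunks = []
    while True:
        data = parent_sock.recv(4096)
        if not data:  # empty read: the child closed its end
            break
        chunks.append(data)
    parent_sock.close()
    os.waitpid(pid, 0)
    return b"".join(chunks)


if __name__ == "__main__":
    forwarded = run_child_with_forwarded_output("progress 1\n")
    # Feed the forwarded bytes to our own stdout.
    os.write(1, forwarded)
```

In this sketch the child's other descriptors are untouched; only fds 1
and 2 are redirected, which matches the "stdout + stderr only" behaviour
described above.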
- Erik