Re: [BProc] bpmaster and bpsh failing

On Thu, Dec 13, 2001 at 09:22:40PM -0500, henken wrote:

..snip..

> Iteration: 205859 on node: 1
> bpsh: invalid pid -29611
> bpsh: invalid pid -29611
> bpsh: invalid pid -29611
> bpsh: Child process exit abnormally.
> Iteration: 207649 on node: 0
> bproc_nodelist: Input/output error

This is very weird.  Since bpsh is getting that bogus PID back, it
implies that the remote process got that PID assigned to it, which
says "problem with move" to me.  A very, very weird problem though.
Maybe two procs with the same pid are trying to move to the same
node at the same time.  Either that, or data corruption, or the slave
malfunctioning in some weird way.

> I also found this in /var/log/messages:
> Dec 13 16:05:39 master /usr/sbin/bpmaster: FATAL: assoc_find: invalid pid
> -29611
>
> Am I expecting too much of bproc?

Nope.  Absolutely not.  This kind of stress testing is how you weed
out the hard-to-reproduce and harder-to-find bugs.

> I know I am being ridiculously evil in running this test script, but
> we have seen situations similar to this when our users run jobs that
> fail at the onset and they do not exit cleanly.

The bpsh errors are probably just a symptom of other stuff going
wrong.  The bpmaster crash is probably the best place to start here.
It should produce a core dump at the point where it dies with that
message.  That core file would be useful to look at.

1) Build bpmaster with debugging turned on (-g).
2) Run it with core dumps enabled (e.g. "ulimit -c unlimited" in the
   shell that starts it).
3) Run it with message tracing enabled ("-m filename").

If you can reproduce the crash with all of that, we should have a good
starting point for figuring out what happened.
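
For what it's worth, steps 2 and 3 can be wrapped in a small launcher so
the core limit is always raised before bpmaster starts.  This is only a
rough sketch: the function names and the trace file path are made up for
illustration, the bpmaster path is the one from your log line, and the
only bpmaster option used is the "-m filename" one mentioned above.

```python
import os
import resource


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def build_command(binary, trace_file):
    """Argument vector for bpmaster with message tracing (-m filename)."""
    return [binary, "-m", trace_file]


def launch(binary="/usr/sbin/bpmaster", trace_file="/tmp/bpmaster.trace"):
    """Replace this process with bpmaster, core dumps and tracing enabled.

    The default trace_file is a hypothetical location; use whatever suits you.
    """
    soft, _hard = enable_core_dumps()
    if soft == 0:
        # A hard limit of 0 cannot be raised without privileges.
        print("warning: core limit is 0, no core file will be written")
    # The resource limit is inherited across exec, so it applies to bpmaster.
    os.execv(binary, build_command(binary, trace_file))
```

Calling launch() as root from a directory you can write to should be
enough; with the default kernel settings the core file normally lands in
the current working directory of the process that died.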

I've been running your test script for a few hours here and haven't
had any trouble.

- Erik
-- 
Erik Arjan Hendriks          Printed On 100 Percent Recycled Electrons
er...@he...                   Contents may settle during shipment