From: Erik A. H. <er...@he...> - 2001-12-14 20:12:24
On Thu, Dec 13, 2001 at 09:22:40PM -0500, henken wrote:
..snip..
> Iteration: 205859 on node: 1
> bpsh: invalid pid -29611
> bpsh: invalid pid -29611
> bpsh: invalid pid -29611
> bpsh: Child process exit abnormally.
> Iteration: 207649 on node: 0
> bproc_nodelist: Input/output error

This is very weird. Since bpsh is getting that bogus PID back, it
implies that the remote process got that PID assigned to it, which says
"problem with move" to me. A very, very weird problem though. Maybe two
procs with the same pid are trying to move to the same node at the same
time. Either that, or data corruption, or the slave malfunctioning in
some weird way.

> I also found this in /var/log/messages:
> Dec 13 16:05:39 master /usr/sbin/bpmaster: FATAL: assoc_find: invalid pid
> -29611
>
> Am I expecting too much of bproc?

Nope. Absolutely not. This kind of stress testing is how you weed out
the hard-to-reproduce and harder-to-find bugs.

> I know I am being ridiculously evil in running this test script, but
> we have seen situations similar to this when our users run jobs that
> fail at the onset and they do not exit cleanly.

The bpsh errors are probably just a symptom of other stuff going wrong.
The bpmaster crash is probably the best place to start here. It should
produce a core dump at the point where it dies with that message. That
core file would be useful to look at.

1) Build bpmaster with debugging turned on (-g).
2) Run it w/ core dumps enabled.
3) Run it w/ message trace enabled ("-m filename").

If you can reproduce the crash with all of that, we should have a good
starting point for figuring out what happened. I've been running your
test script for a few hours here and haven't had any trouble.

- Erik
--
Erik Arjan Hendriks          Printed On 100 Percent Recycled Electrons
er...@he...                  Contents may settle during shipment
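
For steps 2 and 3 above, here is a minimal sketch of one way to launch a program with core dumps enabled. It is not part of bproc. The helper names are made up for illustration, and the trace file path in the usage comment is an assumption. Only the "-m filename" option and the /usr/sbin/bpmaster path come from this thread. It assumes a Unix system with Python 3, and that the hard core-size limit is not 0.

```python
import resource
import subprocess


def enable_core_dumps():
    """Raise the soft core-file size limit to the current hard limit.

    Runs in the child just before exec, so only the launched program
    is affected, not the calling shell or script.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core_dumps(argv):
    """Run argv with core dumps enabled and return its exit status.

    A negative return value means the process was killed by a signal
    (for example -6 for SIGABRT), which is the case where a core file
    may have been written.
    """
    return subprocess.run(argv, preexec_fn=enable_core_dumps).returncode


# Hypothetical usage (the trace file name is an assumption):
#   status = run_with_core_dumps(
#       ["/usr/sbin/bpmaster", "-m", "/tmp/bpmaster.trace"])
```

Where the core file ends up, and whether a program that detaches into the background keeps these limits, depends on the system configuration, so treat this as a starting point only.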