From: henken <he...@se...> - 2001-12-14 02:23:07
Hello --

I have upgraded to the ~stock kernel shipped on the clustermatic CD, and have re-run my stress test. In case they are not remembered, here are the test program and the job script:

[root@master /tmp]# more /home/henken/cvs/jobs/noop.c
#include <unistd.h>
#include <stdio.h>
int main()
{
}
[root@master /tmp]#

and the job script

[root@master /tmp]# more /home/henken/cvs/jobs/noop.sh
#!/bin/bash
JOBID=${0##/*/}
/bin/echo "JOBID:$JOBID"
NODES=`/usr/local/clubmask-0.5a2/bin/getnodes $JOBID`
/bin/echo "NODES:$NODES"
for node in $NODES; do
(
    let count=0
    while [ $count -le $1 ]; do
        /bin/echo "Iteration: $count on node: $node"
        bpsh $node /home/henken/cvs/jobs/bin/noop
        let count=count+1
    done
) &
done

I have been running these with 2 SMP nodes, which each get 2 processes, for a total of 4 instances of the while loop at the same time. I have been running with $1 (the number of iterations) around 1 million.

I have seen other kernel problems related to RH's patching of the kernel, but now with the stock kernel I am getting bpmaster, bpslave, and bpsh failures. Here are the messages I could find related to the error.

In the stdout capture from the job script:

[SNIP]
Iteration: 205859 on node: 1
bpsh: invalid pid -29611
bpsh: invalid pid -29611
bpsh: invalid pid -29611
bpsh: Child process exit abnormally.
Iteration: 207649 on node: 0
bproc_nodelist: Input/output error

I also found this in /var/log/messages:

Dec 13 16:05:39 master /usr/sbin/bpmaster: FATAL: assoc_find: invalid pid -29611

Am I expecting too much of bproc? I know I am being ridiculously evil in running this test script, but we have seen situations similar to this when our users run jobs that fail at the outset and do not exit cleanly.

Thanks for any and all help. I am fully able to send more info or test other options.
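A variant of the inner loop that stops at the first failure might make it easier to pin down the exact iteration where things go wrong. This is only a sketch: run_until_failure is a hypothetical helper, not part of bproc or clubmask, and it assumes the command being run (bpsh in this test) returns a non-zero exit status when it fails, which has not been verified here. Note that it runs exactly MAX iterations, whereas the loop above runs $1 + 1.

```shell
#!/bin/bash
# Hypothetical helper (not part of bproc or clubmask): run a command up to
# MAX times and stop at the first non-zero exit status.
# Usage: run_until_failure MAX command [args...]
run_until_failure() {
    local max=$1
    shift
    local count=0 status
    while [ "$count" -lt "$max" ]; do
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            # Report where the run stopped and pass the status back.
            echo "FAILED: iteration $count, exit status $status"
            return "$status"
        fi
        count=$((count + 1))
    done
    echo "OK: $count iterations"
}
```

Inside the per-node subshell of noop.sh this could be called as: run_until_failure "$1" bpsh "$node" /home/henken/cvs/jobs/bin/noop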
Nic
--
Nicholas Henke
Undergraduate - SEAS '02
Liniac Project - University of Pennsylvania
http://clubmask.sourceforge.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
work: 215-873-5149
cell/home: 215-681-2705
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's nothing like good food, good beer, and a bad girl.