#80 'ps --node' in parallel with tar via bash-ll causes nodedown


On a RH9 system with OpenSSI 1.0.0, running "ps --node
<nodenum>" while multiple processes started via bash-ll
are executing causes <nodenum> to be shut down.

Strangely, doing a "ping <nodenum>" shows that
<nodenum> is still reachable via network.

However, the "cluster" command reports that the additional
node is down: node 2 transitions to the DOWN state and the
nodedown event is handled by all surviving nodes.

The following can reliably reproduce the problem:

On a 2-node OpenSSI cluster, assuming the node
numbers are 1 and 2, run the following on the init node
(node 1):

$ /tmp/trytar.sh &
$ strace -f -ff -o ps1 ps --node 2

Where /tmp/trytar.sh contains the following:
----- BEGIN: /tmp/trytar.sh -----
for i in 0 1 2 3 4 5 6 7 8 9
do
strace -f -ff -o tar$$ bash-ll -c "tar cf tar$$.out /tmp/*.out" > /dev/null 2>&1 &
sleep 2;
done
----- END: /tmp/trytar.sh -----
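
The two reproduction steps above can be wrapped in a single driver. This is only a minimal sketch, assuming bash: the names NODE, DRY_RUN, run and repro are hypothetical and not part of the original report. It uses only the commands already shown, and by default it prints them instead of executing them.

```shell
#!/bin/bash
# Hypothetical driver for the reproduction steps in this report.
# NODE, DRY_RUN, run() and repro() are illustrative names only.
NODE="${NODE:-2}"        # node number passed to ps --node
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repro() {
    # Step 1: start the tar load in the background.
    echo "+ /tmp/trytar.sh &"
    if [ "$DRY_RUN" = "0" ]; then
        /tmp/trytar.sh &
    fi
    # Step 2: query the remote node's processes under strace.
    run strace -f -ff -o ps1 ps --node "$NODE"
}

repro
```

Setting DRY_RUN=0 on the init node would run the actual reproduction; the default only lists the commands, so the sketch is safe to try on a machine without OpenSSI.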


    • assigned_to: nobody --> kvaneesh
    • status: open --> closed
  • Logged In: YES

    The check-in that Laura made on the OPENSSI-RH-1-0-STABLE
    branch in kernel/cluster/ssi/vproc/procfs_subr.c, which was
    a fix for a different problem, also fixes this problem.

    I checked out the above file from the OPENSSI-RH-1-0-STABLE
    branch, built a new kernel, booted it, and reran the
    test described above. The test now succeeds.

    Closing this bug.

    • assigned_to: kvaneesh --> kishoreks