Re: Re[2]: [SSI-users] cluster trouble

Dear Kishore,

On Tue, 2004-08-31 at 14:57, SAMPATHKUMAR KISHORE KANIYAR wrote:
> Today, I tried something along the lines of what you have written.
> Yes, you are right! It causes the additional node to go down! With
> the script I have given below, you can reliably reproduce the
> bug every time.
>   
> Run the following on the init node (or node with node number 1):
>   
> $ /tmp/trytar.sh &
> $ strace -f -ff -o ps1 ps --node <node number>
>   
> Where /tmp/trytar.sh contains the following:
> ----- BEGIN: /tmp/trytar.sh -----
> for i in 0 1 2 3 4 5 6 7 8 9
> do
>     strace -f -ff -o tar$$ bash-ll -c "tar cf tar$$.out /tmp/*.out" > /dev/null 2>&1 &
>     sleep 2;
> done
> ----- END: /tmp/trytar.sh -----
>   
> I checked and found that, while that node is down, if you do a
> "ping" of that node (in your case node2), it will actually respond
> to the "ping"!
>   
> However, the "cluster" command notices that the additional node is
> down. Node 2 transitions to the DOWN state and the node-down event
> is handled by all surviving nodes!

We have encountered similar behavior. After we applied the patch from
John Byrne, our cluster stayed up for a week. To investigate what went
wrong this time, we ran quite a few tests. However, we never found a
fully reliable way to crash a node.

One way that seems to work is to run a few instances of top with very
high update rates on a second node, started from the init node using
onnode, like this:

node1$ onnode -l 2 top d0

Usually, node 2 crashes after a few seconds. Sometimes a little extra
load on node 2 made it crash sooner, and sometimes a single top was
sufficient. However, we could not reproduce the problem when starting
the top commands from node 2's virtual consoles.

> I will start working on this bug immediately. Will keep you posted.

What are your latest findings?

Thanks in advance,
Martin Jacobsson

-- 
Martin Jacobsson <m.j...@ew...>