Re: Re[2]: [SSI-users] cluster trouble
From: Martin J. <m.j...@ew...> - 2004-09-03 16:35:47
Dear Kishore,

On Tue, 2004-08-31 at 14:57, SAMPATHKUMAR KISHORE KANIYAR wrote:
> Today, I tried something along the lines of what you have written.
> Yes, you are right! It causes the additional node to go down! With
> the script I have given below, you can reliably reproduce the
> bug everytime.
>
> Run the following on the init node (or node with node number 1):
>
>   $ /tmp/trytar.sh &
>   $ strace -f -ff -o ps1 ps --node <node number>
>
> Where /tmp/trytar.sh contains the following:
> ----- BEGIN: /tmp/trytar.sh -----
> for i in 0 1 2 3 4 5 6 7 8 9
> do
>   strace -f -ff -o tar$$ bash-ll -c "tar cf tar$$.out /tmp/*.out" > /dev/null 2>&1 &
>   sleep 2;
> done
> ----- END: /tmp/trytar.sh -----
>
> I checked and found that, while that node is down, if you do a
> "ping" of that node (in your case node2), it will actually respond
> to the "ping"!
>
> However, "cluster" command notices that the additional node is
> down. node 2 transitions to DOWN state and the node down event is
> handled by all surviving nodes!

We have encountered similar behavior. After we applied the patch from
John Byrne, our cluster stayed up for a week. To investigate what went
wrong this time, we ran quite a few tests. However, we never found a
completely foolproof way of crashing a node.

One way that seems to work is to run a few instances of top with a very
high update rate on a second node, started from the init node using
onnode, like this:

  node1$ onnode -l 2 top d0

Usually, node 2 crashes after a few seconds. Sometimes a little extra
load on node 2 made it crash sooner, and sometimes a single top was
sufficient. However, we could not reproduce the problem when starting
the top commands from node 2's virtual consoles.

> I will start working on this bug immediately. Will keep you posted.

What are your latest findings?

Thanks in advance,
Martin Jacobsson

-- 
Martin Jacobsson <m.j...@ew...>