Re: [jgroups-users] Cluster coordinator problem
From: Bela B. <be...@ya...> - 2012-12-12 12:31:13
On 12/12/12 9:36 AM, Kresimir Simatovic wrote:
>
> It seems that culprit was on FD. I examined it through JMX and saw that it
> was not running (monitorRunning attribute) so coordinator shutdown was not
> detected. When manually started, new coordinator was elected and cluster
> recovered.

Hmm. The FD monitor runs only while it has a valid target to ping. This
means the monitor will stop if the member is the only one in the cluster.
As soon as another member joins, the monitor should be started again.

> I think flow was something like this:
> 1) node which was responsible to send heartbeat to coordinator has FD
> inactive - I've noticed when channel is connected to cluster that
> FD.monitorRunning = false until first view change

If this was the first member to start, then yes, this is correct.

> 2) coordinator was shut down but for some unknown reason its disconnect
> message went unnoticed so it always won in election procedure

Was it shut down gracefully? Then the view change should have changed the
ping target in FD, so that member should have started pinging someone else.

If the member crashed, then FD should have kicked in, and the member should
have been suspected and eventually excluded.

Do you have FD_SOCK in your stack? If you do, it'll kick in before FD and
will suspect and exclude a crashed member.

Are you running on UDP? If you are, why don't you switch to FD_ALL?

> 3) when new channel wanted to join, it got non-existing coordinator, message
> was dropped and join could never finish

Hmm, this shouldn't happen. Can you provide me with exact steps to
reproduce this?

--
Bela Ban, JGroups lead (http://www.jgroups.org)
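
To illustrate the point about the FD monitor only running while it has a valid ping target, here is a minimal sketch in Java. It is a simplified model, not JGroups' actual FD code: it assumes the ping target is the next member in the view (wrapping around at the end), and the names FdTargetSketch, pingTarget and monitorRunning are made up for this example.

```java
import java.util.List;

// Simplified model of neighbor-based failure detection (NOT the real
// JGroups FD implementation; all names here are hypothetical).
public class FdTargetSketch {

    // Assumption: each member pings the next member in the view,
    // wrapping around. Returns null when there is nobody to ping.
    static String pingTarget(List<String> view, String self) {
        int idx = view.indexOf(self);
        if (idx < 0 || view.size() < 2) {
            return null; // not in the view, or singleton cluster
        }
        return view.get((idx + 1) % view.size());
    }

    // In this model the monitor runs only while a ping target exists.
    static boolean monitorRunning(List<String> view, String self) {
        return pingTarget(view, self) != null;
    }

    public static void main(String[] args) {
        // First member to start: no target, so the monitor is stopped.
        System.out.println(monitorRunning(List.of("A"), "A"));      // false

        // C is last in the view, so it wraps around and pings A.
        System.out.println(pingTarget(List.of("A", "B", "C"), "C")); // A

        // After a view change that drops A, C's target becomes B.
        System.out.println(pingTarget(List.of("B", "C"), "C"));      // B
    }
}
```

In this model, a member that still shows a stopped monitor after a second member has joined would indicate that the view change never reached it, which is the kind of state worth checking when reproducing the problem.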