Re: [jgroups-users] Cluster coordinator problem

On 12/12/12 9:36 AM, Kresimir Simatovic wrote:
>
> It seems that culprit was on FD. I examined it through JMX and saw that it
> was not running (monitorRunning attribute) so coordinator shutdown was not
> detected. When manually started, new coordinator was elected and cluster
> recovered.

Hmm. The FD monitor only runs while it has a valid target to ping. This
means the monitor stops when the member is the only one in the
cluster. As soon as another member joins, the monitor should be started
again.
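
To illustrate the rule above, here is a rough sketch in plain Java. It does not use the JGroups API; the class and method names (PingTargetSketch, pingTarget, monitorShouldRun) are made up for illustration, and it assumes the simplified model in which a member pings its neighbor to the right in the view:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of FD's target selection; not JGroups code.
public class PingTargetSketch {

    // Returns the neighbor to the right of 'self' in the view (wrapping
    // around), or empty if 'self' is not in the view or is the only member.
    public static Optional<String> pingTarget(List<String> view, String self) {
        int idx = view.indexOf(self);
        if (idx < 0 || view.size() < 2) {
            return Optional.empty();
        }
        return Optional.of(view.get((idx + 1) % view.size()));
    }

    // The monitor has something to do only when there is a target to ping.
    public static boolean monitorShouldRun(List<String> view, String self) {
        return pingTarget(view, self).isPresent();
    }

    public static void main(String[] args) {
        // Single-member cluster: no target, so the monitor is stopped.
        System.out.println(monitorShouldRun(List.of("A"), "A"));
        // A second member joins: A now has a target, so the monitor runs.
        System.out.println(monitorShouldRun(List.of("A", "B"), "A"));
    }
}
```

In this model, monitorRunning being false on the first member until the first view change is what you'd expect to see.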

> I think flow was something like this:
> 1) node which was responsible to send heartbeat to coordinator has FD
> inactive - I've noticed when channel is connected to cluster that
> FD.monitorRunning = false until first view change

If this was the first member to start, then yes this is correct.

> 2) coordinator was shut down but for some unknown reason its disconnect
> message went unnoticed so it always won in election procedure

Shut down gracefully ? Then the view should have changed the target ping 
destination in FD, so that member should have started pinging someone else.

If there was a member crash, then FD should have been invoked and the 
member should have been suspected and eventually excluded. Do you have 
FD_SOCK in your stack ? Because if you do, it'll kick in before FD and 
will suspect and exclude a crashed member.

Are you running on UDP ? If you are, why don't you switch to FD_ALL ?
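
For context on the FD_ALL suggestion: the idea is that every member multicasts heartbeats and every member tracks when it last heard from each of the others, so detection does not depend on a single neighbor's monitor. A minimal sketch of that bookkeeping in plain Java follows. It is not the JGroups implementation; the class name HeartbeatTableSketch, its methods, and the timeout value are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an "everyone heartbeats to everyone" failure detector.
public class HeartbeatTableSketch {
    // Last time (in ms) a heartbeat was received from each member.
    private final Map<String, Long> lastHeard = new HashMap<>();
    private final long timeoutMs;

    public HeartbeatTableSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Record a heartbeat from 'member' received at time 'nowMs'.
    public void onHeartbeat(String member, long nowMs) {
        lastHeard.put(member, nowMs);
    }

    // Members whose last heartbeat is older than the timeout, sorted by name.
    public List<String> suspects(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeard.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        HeartbeatTableSketch table = new HeartbeatTableSketch(8000);
        table.onHeartbeat("A", 0);
        table.onHeartbeat("B", 0);
        table.onHeartbeat("B", 5000); // B keeps sending, A has gone silent
        System.out.println(table.suspects(9000));
    }
}
```

The clock is passed in explicitly so the behavior is easy to follow; a real detector would use timers and the transport's multicast.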

> 3) when new channel wanted to join, it got non-existing coordinator, message
> was dropped and join could never finish

Hmm, this shouldn't happen. Can you provide me with exact steps to 
reproduce this ?

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)