Re: [javagroups-users] JGroups 2.12.0 downcall to get IP Address returned an unexpected class inste
From: Kuns, E. <Edw...@As...> - 2011-04-30 07:53:08
> From: Bela Ban
> How do you know it wasn't null ? The instanceof code in the if clause would return false if phys_addr was null...

Whoops, you're right. I confused two different variables in my head, knowing one was not null and thinking this meant the other was not null.

So what does it mean for a cluster member to have a null physical address? I've seen this a few times now. Does it indicate that some other problem has already occurred? I've realized that the system that has reproduced this a couple of times seems to have time stability issues: the time jumps back and forth by a few seconds to several minutes, sometimes several times in a minute, and then the time is stable for hours or days. Could this stress JGroups into having a cluster member with a null physical address?

More seriously, at this same location I've also seen cases where (I'm assuming the time jumping is the root cause here) two members of a 3-member cluster saw member #1 (the previous coordinator) go away, then shortly afterwards saw it come back, but member #1 never saw any view change. What makes this surprising is that it left #1 thinking it was coordinator while the other two machines agreed that a different server was the coordinator: three members in a cluster with inconsistent ideas of who the coordinator was.

I am trying to reproduce this with JGroups logging at debug, which I don't have at the moment. With the logs that I do have (JGroups at WARN), here is basically what happens. MyMembershipListener prints human-readable output when processing an org.jgroups.MembershipListener#viewAccepted(View) callback.

The number at the far left is the server number, 1 - 3. The next number is a rough time in seconds. (I don't know how well the servers are synced to exactly the same time.) Points where it's known that a server's time jumped backwards are noted. Obviously, it's not clear where time jumped forward, but I'm sure it happened.

At first, everyone sees server #1 leave the cluster:

3 - 0.110 [MyMembershipListener] - Members left: [SERVER-01]; Previous Coordinator: SERVER-01, New Coordinator: SERVER-02
2 - 0.263 [MyMembershipListener] - Members left: [SERVER-01]; Previous Coordinator: SERVER-01, New Coordinator: SERVER-02 (me)
1 - 0.769 [Incoming-1,null|GMS] - SERVER-01-19799: not member of view [SERVER-02-41875|5] [SERVER-02-41875, SERVER-03-18967]; discarding it

Over the next 15 seconds, various complaints about missing ACKs:

2 - 2.263 [ViewHandler,MYGROUP,SERVER-02-41875|GMS] - SERVER-02-41875: failed to collect all ACKs (expected=2) for view [SERVER-02-41875|5] [SERVER-02-41875, SERVER-03-18967] after 2000ms, missing ACKs from [SERVER-02-41875, SERVER-03-18967]
1 - 7.096 [Incoming-5,null|GMS] - SERVER-01-19799: failed to collect all ACKs (expected=3) for view MergeView::[SERVER-03-18967|6] [SERVER-03-18967, SERVER-02-41875, SERVER-01-19799], subgroups=[[SERVER-02-41875|5] [SERVER-02-41875, SERVER-03-18967], [SERVER-01-19799|4] [SERVER-01-19799, SERVER-03-18967]] after 2000ms, missing ACKs from [SERVER-03-18967]
3 - 11.421 [OOB-6,null|NAKACK] - SERVER-03-18967: dropped message from SERVER-01-19799 (not in table [SERVER-02-41875, SERVER-03-18967]), view=[SERVER-02-41875|5] [SERVER-02-41875, SERVER-03-18967]
2 - 11.606 [OOB-3,null|NAKACK] - SERVER-02-41875: dropped message from SERVER-01-19799 (not in table [SERVER-02-41875, SERVER-03-18967]), view=[SERVER-02-41875|5] [SERVER-02-41875, SERVER-03-18967]

Around this time, the time on server #3 jumped backwards at least 1 second (from 12 to 11).
Around this time, the time on server #2 jumped backwards by a fraction of a second.

3 - 13.108 [Incoming-3,null|NAKACK] - SERVER-03-18967: dropped message from SERVER-01-19799 (not in table [SERVER-02-41875, SERVER-03-18967]), view=[SERVER-02-41875|5] [SERVER-02-41875, SERVER-03-18967]

Finally, servers #2 and #3 see server #1 join the cluster, but server #1 never gets a new view.

Around this time, the time on server #3 jumped backwards at least 9 seconds (from 22 to 13).

3 - 24,139 [MyMembershipListener] - Members joined: [SERVER-01]; Previous Coordinator: SERVER-02, New Coordinator: SERVER-03 (me)
2 - 24,293 [MyMembershipListener] - Members joined: [SERVER-01]; Previous Coordinator: SERVER-02, New Coordinator: SERVER-03

The time jumping back and forth on multiple members of a JGroups cluster is clearly something that needs to be fixed. However, once the time stopped jumping around, the cluster settled into a state where #2 and #3 agreed on who the coordinator was, but #1 *NEVER* saw a MembershipListener#viewAccepted call: it never saw the cluster membership change. Not when the other two decided it had left, and not when the other two decided it had rejoined. This left a cluster of three machines where one of the members disagreed on who the coordinator was.

The configuration is essentially udp.xml from 2.12.0. Other than having JGroups logs at debug (which I hope to have the next time this occurs), what more information would help to understand what happened?

Thanks
Eddie

--
Edward Kuns
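
For readers trying to follow the MyMembershipListener lines above: the listener's source is not included in this post, so the following is only a minimal sketch of how such a summary could be derived from two successive member lists. The class ViewDiff and its methods are hypothetical names, not part of JGroups, and plain strings stand in for JGroups addresses. It assumes the usual JGroups convention that the first member of a view is the coordinator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JGroups: summarizes the difference
// between the previous and the new member list of a view change.
final class ViewDiff {

    // Assumption: the first member of a view is its coordinator.
    static String coordinator(List<String> members) {
        return members.isEmpty() ? "(none)" : members.get(0);
    }

    // Builds a one-line summary similar in shape to the log lines in this post.
    // 'self' is the name of the local member, used only for the "(me)" marker.
    static String describe(List<String> oldMembers, List<String> newMembers, String self) {
        // Members present before but absent now have left.
        List<String> left = new ArrayList<>(oldMembers);
        left.removeAll(newMembers);

        // Members present now but absent before have joined.
        List<String> joined = new ArrayList<>(newMembers);
        joined.removeAll(oldMembers);

        StringBuilder sb = new StringBuilder();
        if (!left.isEmpty()) {
            sb.append("Members left: ").append(left).append("; ");
        }
        if (!joined.isEmpty()) {
            sb.append("Members joined: ").append(joined).append("; ");
        }
        String newCoord = coordinator(newMembers);
        sb.append("Previous Coordinator: ").append(coordinator(oldMembers))
          .append(", New Coordinator: ").append(newCoord);
        if (newCoord.equals(self)) {
            sb.append(" (me)");
        }
        return sb.toString();
    }
}
```

In a real listener, the two lists would come from the previous and current View passed to viewAccepted(View). A member that never receives that callback, as server #1 did not here, would never recompute this summary, which is consistent with it still believing it was coordinator.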