Re: [javagroups-users] Fwd: Leaving node cannot rejoin cluster it just left

Hey Nicolas,

Create a JIRA, attach everything you have and assign it to me!

On 11-12-07 1:48 PM, balrog of moria wrote:
> Hello again,
>
> Well, I encountered the problem once more in our production environment, 
> where we have 3 nodes: A, B and C. It is indeed the FLUSH protocol 
> that seems to be broken, in the following scenario:
>
> - A, B and C are running normally until C leaves the cluster (it 
> doesn't seem graceful but anyways...)
> - A sees using FD_SOCK that C has left (suspected) and being the 
> coordinator, attempts to start a FLUSH with A and B as participants;
> - For some odd reason, B doesn't respond to the START_FLUSH with a 
> FLUSH_COMPLETED, so A rejects the flush and sends a FLUSH_ABORT;
> - Looking at B's logs, I can see the FLUSH_ABORTs, but the debug traces 
> in FLUSH print nothing about whether A's START_FLUSH was received or, 
> if it was, whether B is indeed a participant;
> - This continues for about 15 seconds until GMS just gives up and 
> "forces" the new view with only A and B;
> - GMS /also/ fails to collect the missing ACKs from B for some odd 
> reason but nevertheless RESUMEs the FLUSH;
> - During this time, C continues to be suspected by A's FD_SOCK despite 
> the fact that a new view WITHOUT C was installed on A.
> - MERGE2 kicks in 10 seconds later detecting 2 different views with 
> merge participants A and C yielding
> Discovery results:
> [B-36404]: coord=A-20274
> [A-20274]: coord=A-20274
> which doesn't really make any sense since C should not even be running!
> - Anyhow the merge fails repeatedly since A (merge leader) isn't able 
> to gather all data from A and C (which is normal since there is no 
> physical address for C, msg was dropped)
> - B continues to not respond to the FLUSH RESUME, STOP_FLUSH and 
> START_FLUSHes
>
> I have detailed logs of A, B and C with "debug" level for all protocols, 
> but at this point I'm thinking of omitting the FLUSH protocol 
> altogether since it doesn't seem robust enough for production use, 
> even with small clusters. Before I rethink my model, I just wanted to 
> make sure that FLUSH ensures the following axioms:
>
> - Before installing a new view, all messages sent to v1 will be 
> delivered before v2 can be installed;
Yes
> - If A sends a message to v1 that is delivered to B, but A terminates 
> before it is delivered to C, will B ensure that C receives the message? 
> Is this provided by FLUSH or does NAKACK suffice?
No, NAKACK does not suffice. Let's fix FLUSH to make sure it does not happen.

Regards,
Vladimir
>
> If you're interested I can forward whatever logs I have and write up a 
> JIRA. Thanks again,
>
> Nicolas
>
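
For anyone following the thread, the coordinator-side flush round Nicolas describes (A sends START_FLUSH, waits for FLUSH_COMPLETED from every participant, sends FLUSH_ABORT otherwise, and GMS eventually forces the view) can be sketched roughly as below. This is a standalone simulation and not JGroups code: the class name, method names and the `maxAttempts` parameter are all made up for illustration, and the real protocol works with timeouts and message passing rather than a fixed set of responders.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simulation of the flush round from the report above.
// None of these names come from the JGroups API.
class FlushRoundSketch {

    enum Outcome { FLUSH_COMPLETED, FLUSH_ABORTED }

    // One round: the coordinator "sends" START_FLUSH to all participants and
    // checks who answered. 'responders' stands in for the members whose
    // FLUSH_COMPLETED reply arrived before the (assumed) timeout.
    static Outcome runRound(List<String> participants, Set<String> responders) {
        Set<String> missing = new TreeSet<>(participants);
        missing.removeAll(responders);
        // Any missing reply makes the coordinator reject the flush.
        return missing.isEmpty() ? Outcome.FLUSH_COMPLETED : Outcome.FLUSH_ABORTED;
    }

    // Retries the round up to maxAttempts times (an assumed retry limit), then
    // gives up and forces the view, mirroring what GMS did in the report.
    static String installView(List<String> participants, Set<String> responders,
                              int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (runRound(participants, responders) == Outcome.FLUSH_COMPLETED) {
                return "view installed after flush (attempt " + attempt + ")";
            }
        }
        return "view forced without completed flush";
    }

    public static void main(String[] args) {
        List<String> participants = List.of("A", "B");
        // Healthy case: both A and B reply.
        System.out.println(installView(participants, Set.of("A", "B"), 4));
        // Reported case: B never replies, so every round is aborted.
        System.out.println(installView(participants, Set.of("A"), 4));
    }
}
```

The second call corresponds to the failure in the report: once the view is forced without a completed flush, the first axiom above (all v1 messages delivered before v2 is installed) is no longer guaranteed.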