Re: [javagroups-users] Fwd: Leaving node cannot rejoin cluster it just left
From: Vladimir B. <vbl...@re...> - 2011-12-07 18:33:41
Hey Nicolas,

Create a JIRA, attach everything you have and assign it to me!

On 11-12-07 1:48 PM, balrog of moria wrote:
> Hello again,
>
> Well I encountered the problem once more on our production environment
> where we have 3 nodes A, B and C. It is indeed the FLUSH protocol
> which seems somewhat broken due to the fact of some weird instance where:
>
> - A, B and C are running normally until C leaves the cluster (it
>   doesn't seem graceful but anyways...)
> - A sees using FD_SOCK that C has left (suspected) and being the
>   coordinator, attempts to start a FLUSH with A and B as participants;
> - For some odd reason, B doesn't respond to the START_FLUSH with a
>   FLUSH_COMPLETED, so A rejects the flush and sends a FLUSH_ABORT;
> - Looking at B's logs, I can see the FLUSH_ABORTs but the debug traces
>   in FLUSH don't print anything regarding if A's START_FLUSH was
>   received and if B is indeed a participant if so;
> - This continues for about 15 seconds until GMS just gives up and
>   "forces" the new view with only A and B;
> - GMS /also/ fails to collect the missing ACKs from B for some odd
>   reason but nevertheless RESUMEs the FLUSH;
> - During this time, C continues to be suspected by A's FD_SOCK despite
>   the fact that a new view WITHOUT C was installed on A.
> - MERGE2 kicks in 10 seconds later detecting 2 different views with
>   merge participants A and C yielding
>       Discovery results:
>       [B-36404]: coord=A-20274
>       [A-20274]: coord=A-20274
>   which doesn't really make any sense since C should not even be running!
> - Anyhow the merge fails repeatedly since A (merge leader) isn't able
>   to gather all data from A and C (which is normal since there is no
>   physical address for C, msg was dropped)
> - B continues to not respond to the FLUSH RESUME, STOP_FLUSH and
>   START_FLUSHes
>
> I have detailed logs of A, B and C with "debug" level to all protocols
> but at this point I'm thinking of omitting the FLUSH protocol
> altogether since it seems it isn't robust enough for production use,
> even with small clusters. Before I rethink my model, I just wanted to
> make sure that FLUSH ensures the following axioms:
>
> - Before installing a new view, all messages sent to v1 will be
>   delivered before v2 can be installed;

Yes

> - If A sends a message to v1, is delivered to B, but terminates before
>   it is delivered to C, will B ensure that C receives the message? Is
>   this provided by FLUSH or does NAKACK suffice?

No, NAKACK does not suffice. Let's fix FLUSH to make sure it does not happen.

Regards,
Vladimir

>
> If you're interested I can forward whatever logs I have and write up a
> JIRA. Thanks again,
>
> Nicolas
>
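
The flush round described in the quoted report (coordinator A sends START_FLUSH to A and B, B never answers with FLUSH_COMPLETED, A aborts and retries until the view is forced) can be modeled roughly as below. This is only a toy sketch in plain Java, not JGroups code: the class `FlushRoundSketch`, the methods `runRound` and `viewForcedAfterRetries`, and the retry count of 3 are all hypothetical names and values chosen for illustration, and real timeouts and message passing are left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Toy model of a coordinator-driven flush round. NOT the JGroups implementation;
// all names here are assumptions made for illustration.
public class FlushRoundSketch {

    enum Outcome { FLUSH_COMPLETED, FLUSH_ABORTED }

    // The round succeeds only if every participant replied to START_FLUSH.
    static Outcome runRound(Set<String> participants, Set<String> responders) {
        Set<String> missing = new TreeSet<>(participants);
        missing.removeAll(responders);
        return missing.isEmpty() ? Outcome.FLUSH_COMPLETED : Outcome.FLUSH_ABORTED;
    }

    // Retries the round up to maxAttempts times. Returns true if no round ever
    // completed, i.e. the coordinator would have to force the new view.
    static boolean viewForcedAfterRetries(Set<String> participants,
                                          Set<String> responders,
                                          int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (runRound(participants, responders) == Outcome.FLUSH_COMPLETED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Scenario from the report: A and B are participants, only A responds.
        Set<String> participants = new TreeSet<>(Arrays.asList("A", "B"));
        Set<String> responders = new TreeSet<>(Collections.singletonList("A"));

        System.out.println(runRound(participants, responders));
        System.out.println(viewForcedAfterRetries(participants, responders, 3));
    }
}
```

In this model a single silent participant is enough to abort every round, which matches the symptom in the report; it says nothing about why B stays silent, which is what the logs attached to the JIRA would need to show.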