Re: [javagroups-users] Possible to merge fix for JGRP-136 to 2.2.8??
From: Lenny P. <len...@or...> - 2005-12-28 16:54:31
Hi Bela,

After reading the VIEW_SYNC code, I realized that we're actually hitting a slightly different problem. Here is the scenario I'm seeing:

A: New coordinator doesn't take over
------------------------------------
- Group is {A,B,C}, A is coordinator
- A leaves gracefully by calling Channel.close() and attempts to multicast view V2 {B,C}
- Neither B nor C receives V2
- A terminates, which means that retransmission of A's V2 to B and C stops (B and C never receive V2)
- B's view is still V1 {A,B,C}
- C's view is still V1 {A,B,C}
- B does *not* become the new coordinator!

Problem:
- A attempts to rejoin the group (*by creating a new channel and protocol stack instance*); both B and C reply to the discovery request with coord=A

  NOTE: VM instance A never really terminates. A leaves the group by calling Channel.close() and then rejoins the group by creating a new channel and protocol stack instance.

- Now the strange thing is that A then attempts to send a JOIN request to the old A address and loops forever, as it never receives a response. I keep seeing:

  [WARN] ClientGmsImpl - -join(<old A address>) failed, retrying

  The odd thing is that I'm using TCP and am not seeing any exception when connecting to the old A address.

This is the same as if:
- A new member D wants to join; both B and C reply to the discovery request with coord=A
- If D sends a JOIN request to A, it will time out and receive no response
--> In any case, D will continue looping forever

My configuration is:

TCP(sock_conn_timeout=200;start_port=7800):
TCPPING(initial_hosts=localhost[7800];port_range=5;timeout=3000;num_initial_members=1;up_thread=true;down_thread=true):
FD_SOCK(up_thread=false;down_thread=false):
VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
pbcast.STABLE(desired_avg_gossip=0;stability_delay=1000;down_thread=false;digest_timeout=120000;max_bytes=1250000):
pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=true;down_thread=true;up_thread=true):
FC(max_credits=5000000;min_threshold=0.25;down_thread=false):

It doesn't seem that VIEW_SYNC would help in this scenario. Please let me know if you need more clarification. I can attempt to send logs if needed.

Thanks!

--Lenny

At 11:09 PM 12/27/2005, Bela Ban wrote:
>You should be able to simply take VIEW_SYNC from 2.2.9 and place it below
>GMS. That said, this bug is very rare,
>and if you didn't encounter it in 2.2.8 testing, you should be okay
>
>Lenny Phan wrote:
>
>>Hi Bela,
>>
>>I've run into the "Reliable View Dissemination" bug in 2.2.8. I see that
>>it has been fixed in 2.2.9
>>but we're just about to release our product and have done extensive QA of
>>it using 2.2.8.
>>
>>Is it possible to patch 2.2.8 with the fix for JGRP-136?
>>
>>
>>Thanks!
>>
>>--Lenny
>>
>
>--
>Bela Ban
>Lead JGroups / JBossCache
>callto://belaban
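
For readers following the thread, the stale-view scenario reported in this message can be sketched as a small, self-contained Java simulation. This is only an illustration under stated assumptions: it does not use the JGroups API, and the class and method names (StaleViewDemo, coordinatorOf, discoveryResponses) are hypothetical. It models just two things taken from the report: the coordinator is the first member of a view, and each surviving member answers a discovery request with the coordinator of its own current view.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT JGroups code.
public class StaleViewDemo {

    // Assumption: the coordinator is the first member listed in a view.
    static String coordinatorOf(List<String> view) {
        return view.get(0);
    }

    // Each live member reports the coordinator of the view it currently holds.
    static Map<String, String> discoveryResponses(Map<String, List<String>> viewsByMember) {
        Map<String, String> responses = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : viewsByMember.entrySet()) {
            responses.put(entry.getKey(), coordinatorOf(entry.getValue()));
        }
        return responses;
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("A", "B", "C"); // view before A leaves
        List<String> v2 = Arrays.asList("B", "C");      // view A tried to multicast

        // Case from the report: V2 was lost, so B and C still hold V1.
        Map<String, List<String>> stale = new LinkedHashMap<>();
        stale.put("B", v1);
        stale.put("C", v1);
        System.out.println("V2 lost:      " + discoveryResponses(stale));

        // Expected case: V2 was delivered, so B is the new coordinator.
        Map<String, List<String>> fresh = new LinkedHashMap<>();
        fresh.put("B", v2);
        fresh.put("C", v2);
        System.out.println("V2 delivered: " + discoveryResponses(fresh));
    }
}
```

In the "V2 lost" case both responders name A, an address that no longer belongs to any running member, which is consistent with the joiner (the restarted A, or a new member D) retrying its JOIN request indefinitely.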