Re: [javagroups-users] Possible to merge fix for JGRP-136 to 2.2.8??

Hi Bela,

After reading the VIEW_SYNC code, I realized that we're actually hitting a 
slightly different problem.
Here is the scenario I'm seeing:

A: New coordinator doesn't take over
------------------------------------
- Group is {A,B,C}, A is coordinator
- A leaves gracefully by calling Channel.close() and attempts to multicast 
view V2 {B,C}
- Neither B nor C receives V2
- A terminates, which means that retransmission of A's V2 to B and C stops 
(B and C never receive V2)
- B's view is still V1 {A,B,C}
- C's view is still V1 {A,B,C}
- B does *not* become the new coordinator!
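
To make the scenario concrete, here is a minimal sketch in plain Java (it is 
not JGroups code). It assumes the usual convention that the first member of 
a view acts as coordinator; the class and method names are made up for 
illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CoordinatorSketch {
    /** Assumption: the first member of a view is its coordinator. */
    public static String coordinatorOf(List<String> view) {
        if (view.isEmpty()) {
            throw new IllegalArgumentException("empty view");
        }
        return view.get(0);
    }

    /** A member only takes over if it is first in its *own* current view. */
    public static boolean isCoordinator(String self, List<String> view) {
        return self.equals(coordinatorOf(view));
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("A", "B", "C"); // stale view held by B and C
        List<String> v2 = Arrays.asList("B", "C");      // view that was never delivered

        // With the stale V1, B still believes A is the coordinator.
        System.out.println("V1: coord=" + coordinatorOf(v1)
                + ", B takes over=" + isCoordinator("B", v1));
        // Had V2 arrived, B would have become coordinator.
        System.out.println("V2: coord=" + coordinatorOf(v2)
                + ", B takes over=" + isCoordinator("B", v2));
    }
}
```

As long as B keeps V1 installed, nothing in this model ever makes B take 
over, which matches what I observe.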

Problem:

- A attempts to rejoin group (*by creating a new channel and protocol stack 
instance*), both B and C reply to the discovery request with
   coord=A

NOTE:  VM instance A never really terminates.  A leaves the group by 
calling Channel.close() and then rejoins the group by
             creating a new channel and protocol stack instance.

- Now the strange thing is that A then attempts to send a JOIN request to 
the old A address and loops forever, as it never receives a response.
    I keep seeing:

   [WARN] ClientGmsImpl - -join(<old A address>) failed, retrying

The odd thing is that I'm using TCP and am not seeing any exception when 
connecting to the old A address.

This is the same as if:

- A new member D wants to join, both B and C reply to the discovery request 
with coord=A
- If D sends a JOIN request to A, it will time out and receive no response
   --> In any case, D will continue looping forever
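
The loop can be modelled with a small sketch in plain Java. This is only my 
mental model, not the actual ClientGmsImpl code: I am assuming the joiner 
picks the coordinator reported most often in the discovery replies and 
retries when the JOIN times out. All names here are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class JoinLoopSketch {
    /** Picks the coordinator named most often in the discovery replies (assumed rule). */
    public static String mostFrequent(List<String> replies) {
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (String coord : replies) {
            int n = votes.merge(coord, 1, Integer::sum);
            if (best == null || n > votes.get(best)) {
                best = coord;
            }
        }
        return best;
    }

    /**
     * Simulates the join loop. Returns the attempt number on which the JOIN
     * was answered, or -1 if every attempt up to maxAttempts timed out.
     */
    public static int join(List<String> discoveryReplies,
                           Set<String> liveMembers,
                           int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String coord = mostFrequent(discoveryReplies);
            if (liveMembers.contains(coord)) {
                return attempt; // the coordinator is alive and answers the JOIN
            }
            // Otherwise the JOIN times out: "join(coord) failed, retrying".
            // The replies never change, so the next attempt picks the same coord.
        }
        return -1;
    }
}
```

With replies of coord=A from both B and C and live members {B,C}, join() 
returns -1 for any maxAttempts, i.e. without a bound it would retry forever. 
The bound is only there so the sketch terminates.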

My configuration is:

TCP(sock_conn_timeout=200;start_port=7800):
TCPPING(initial_hosts=localhost[7800];port_range=5;timeout=3000;num_initial_members=1;up_thread=true;down_thread=true):
FD_SOCK(up_thread=false;down_thread=false):
VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
pbcast.STABLE(desired_avg_gossip=0;stability_delay=1000;down_thread=false;digest_timeout=120000;max_bytes=1250000):
pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=true;down_thread=true;up_thread=true):
FC(max_credits=5000000;min_threshold=0.25;down_thread=false):

It doesn't seem that VIEW_SYNC would help in this scenario.

Please let me know if you need more clarification.  I can attempt to send 
logs if needed.

Thanks!

--Lenny

At 11:09 PM 12/27/2005, Bela Ban wrote:
>You should be able to simply take VIEW_SYNC from 2.2.9 and place it below 
>GMS. That said, this bug is very rare,
>and if you didn't encounter it in 2.2.8 testing, you should be okay.
>
>Lenny Phan wrote:
>
>>Hi Bela,
>>
>>I've run into the "Reliable View Dissemination" bug in 2.2.8.  I see that 
>>it has been fixed in 2.2.9
>>but we're just about to release our product and have done extensive QA of 
>>it using 2.2.8.
>>
>>Is it possible to patch 2.2.8 with the fix for JGRP-136?
>>
>>
>>Thanks!
>>
>>--Lenny
>>
>
>--
>Bela Ban
>Lead JGroups / JBossCache
>callto://belaban
>