[javagroups-users] JGroups 2.8.0: cannot rejoin cluster after coordinator has been shut down
From: BZimmermann <das...@gm...> - 2011-06-29 14:11:09
Hello, we are using JGroups 2.8.0 with Tomcat 5.5.27 on x64 Linux. Unfortunately, we are stuck with JDK 1.5 and Tomcat 5.5. Our cluster has 9 members and is still growing. The nodes are hosted on several ESX hosts, so we cannot use IP multicast.

Sometimes we have to shut down our Tomcats, and that is where the problems start: when we shut down the current coordinator (usually we don't check beforehand which node that is; see the sketch below the config), the remaining nodes do not elect a new coordinator. And when we start the old coordinator again, it neither becomes the new coordinator nor rejoins the existing cluster.

Our config:
------------------------------------------------------------------------------------------------------
<!--
    TCP based stack, with flow control and message bundling. This is usually used when IP
    multicasting cannot be used in a network, e.g. because it is disabled (routers discard
    multicast). Note that TCP.bind_addr and TCPPING.initial_hosts should be set, possibly via
    system properties, e.g. -Djgroups.bind_addr=192.168.5.2 and
    -Djgroups.tcpping.initial_hosts=192.168.5.2[7800]
    author: Bela Ban
    version: $Id: tcp.xml,v 1.40 2009/12/18 09:28:30 belaban Exp $
-->
<config>
    <TCP bind_port="12345"
         bind_interface_str="eth0"
         loopback="true"
         recv_buf_size="${tcp.recv_buf_size:20M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         discard_incompatible_packets="true"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer.num_threads="4"
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>
    <TCPPING timeout="2000"
             initial_hosts="swcm3350[12345],swcm3351[12345],swcm3352[12345],swcm3353[12345],swcm3354[12345],swcm3355[12345],swcm3356[12345],swcm3357[12345],swcm3358[12345]"
             port_range="1"
             num_initial_members="9"/>
    <MERGE2 min_interval="5000" max_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <BARRIER/>
    <pbcast.NAKACK discard_delivered_msgs="true"
                   retransmit_timeout="100,200,300,600,1200,2400,4800"
                   use_mcast_xmit="false"
                   gc_lag="0"/>
    <UNICAST timeout="300,600,1200"/>
    <pbcast.STABLE max_bytes="400K"
                   stability_delay="1000"
                   desired_avg_gossip="50000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"/>
    <FC max_credits="2M" min_threshold="0.10"/>
    <FRAG2 frag_size="60K"/>
    <pbcast.STREAMING_STATE_TRANSFER/>
</config>
------------------------------------------------------------------------------------------------------
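Since we normally don't check which node is the coordinator before a planned shutdown, here is a minimal sketch of how that check could look with the JGroups API (the coordinator is by convention the first member of the current view). The config file name "tcp.xml" and the cluster name "cps-cluster" are placeholders, not our real values:

------------------------------------------------------------------------------------------------------
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.View;

public class CoordinatorCheck {
    public static void main(String[] args) throws Exception {
        // "tcp.xml" is the stack shown above; "cps-cluster" is a placeholder
        // cluster name -- substitute whatever the application really uses.
        JChannel channel = new JChannel("tcp.xml");
        channel.connect("cps-cluster");

        View view = channel.getView();
        // JGroups lists the coordinator as the first member of the view
        Address coord = view.getMembers().get(0);
        boolean localIsCoord = coord.equals(channel.getAddress());

        System.out.println("view                      : " + view);
        System.out.println("coordinator               : " + coord);
        System.out.println("local node is coordinator : " + localIsCoord);

        channel.close();
    }
}
------------------------------------------------------------------------------------------------------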
In this case, the node swcm3353 was the coordinator that we restarted. The log output from the restarted node while it tries to rejoin:

------------------------------------------------------------------------------------------------------
Thread-CPS-1001-2008924061-http-8080-1_id0 WARN [org.jgroups.protocols.pbcast.GMS] - join(swcm3353.ourdomain.com-42400) sent to swcm3353.ourdomain.com-53388 timed out (after 3000 ms), retrying
Thread-CPS-1001-2008924061-http-8080-1_id0 DEBUG [org.jgroups.protocols.pbcast.GMS] - initial_mbrs are [
    own_addr=swcm3358.ourdomain.com-54056, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3358.ourdomain.com-54056, physical_addrs=10.248.234.197:12345,
    own_addr=swcm3353.ourdomain.com-29887, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-29887, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3354.ourdomain.com-60551, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3354.ourdomain.com-60551, physical_addrs=10.248.234.176:12345,
    own_addr=swcm3353.ourdomain.com-52778, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-52778, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3355.ourdomain.com-61006, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3355.ourdomain.com-61006, physical_addrs=10.248.234.189:12345,
    own_addr=swcm3356.ourdomain.com-63935, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3356.ourdomain.com-63935, physical_addrs=10.248.234.195:12345,
    own_addr=swcm3353.ourdomain.com-56155, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-56155, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3352.ourdomain.com-6160, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3352.ourdomain.com-6160, physical_addrs=10.248.232.39:12345,
    own_addr=swcm3350.ourdomain.com-3234, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3350.ourdomain.com-3234, physical_addrs=10.248.232.37:12345,
    own_addr=swcm3353.ourdomain.com-15186, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-15186, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3357.ourdomain.com-31306, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3357.ourdomain.com-31306, physical_addrs=10.248.234.196:12345,
    own_addr=swcm3351.ourdomain.com-60129, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3351.ourdomain.com-60129, physical_addrs=10.248.232.38:12345]
Thread-CPS-1001-2008924061-http-8080-1_id0 DEBUG [org.jgroups.protocols.pbcast.GMS] - election results: {swcm3353.ourdomain.com-53388=8}
Thread-CPS-1001-2008924061-http-8080-1_id0 DEBUG [org.jgroups.protocols.pbcast.GMS] - sending handleJoin(swcm3353.ourdomain.com-42400) to swcm3353.ourdomain.com-53388
------------------------------------------------------------------------------------------------------

Any idea for this case?

Thanks in advance
Bent