[javagroups-users] JGroups 2.8.0: cannot rejoin cluster after coordinator has been shut down
From: BZimmermann <das...@gm...> - 2011-06-29 14:11:09
Hello, we are using JGroups 2.8.0 with Tomcat 5.5.27 on x64 Linux. Unfortunately, we are stuck with JDK 1.5 and Tomcat 5.5. Our cluster has 9 members and is still growing. The nodes are hosted on several ESX hosts, so we cannot use IP multicast.

Sometimes we have to shut down our Tomcats, and that is where the problems start: when we shut down the current coordinator (usually we don't check beforehand which node that is; see the sketch below the config), the remaining nodes do not elect a new coordinator. And when we start the old coordinator again, it neither becomes the new coordinator nor rejoins the existing cluster.

Our config:
------------------------------------------------------------------------------------------------------
<!--
    TCP based stack, with flow control and message bundling. This is usually used when IP
    multicasting cannot be used in a network, e.g. because it is disabled (routers discard
    multicast). Note that TCP.bind_addr and TCPPING.initial_hosts should be set, possibly via
    system properties, e.g. -Djgroups.bind_addr=192.168.5.2 and
    -Djgroups.tcpping.initial_hosts=192.168.5.2[7800]
    author: Bela Ban
    version: $Id: tcp.xml,v 1.40 2009/12/18 09:28:30 belaban Exp $
-->
<config>
    <TCP bind_port="12345"
         bind_interface_str="eth0"
         loopback="true"
         recv_buf_size="${tcp.recv_buf_size:20M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         discard_incompatible_packets="true"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer.num_threads="4"
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="10"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>
    <TCPPING timeout="2000"
             initial_hosts="swcm3350[12345],swcm3351[12345],swcm3352[12345],swcm3353[12345],swcm3354[12345],swcm3355[12345],swcm3356[12345],swcm3357[12345],swcm3358[12345]"
             port_range="1"
             num_initial_members="9"/>
    <MERGE2 min_interval="5000" max_interval="10000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <BARRIER/>
    <pbcast.NAKACK discard_delivered_msgs="true"
                   retransmit_timeout="100,200,300,600,1200,2400,4800"
                   use_mcast_xmit="false"
                   gc_lag="0"/>
    <UNICAST timeout="300,600,1200"/>
    <pbcast.STABLE max_bytes="400K"
                   stability_delay="1000"
                   desired_avg_gossip="50000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"/>
    <FC max_credits="2M" min_threshold="0.10"/>
    <FRAG2 frag_size="60K"/>
    <pbcast.STREAMING_STATE_TRANSFER/>
</config>
------------------------------------------------------------------------------------------------------
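Since we normally don't check which node is the coordinator before a planned shutdown, here is a minimal sketch of how that check could look with the JGroups API (the coordinator is by convention the first member of the current view). The config file name "tcp.xml" and the cluster name "cps-cluster" are placeholders, not our real values:

------------------------------------------------------------------------------------------------------
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.View;

public class CoordinatorCheck {
    public static void main(String[] args) throws Exception {
        // "tcp.xml" is the stack shown above; "cps-cluster" is a placeholder
        // cluster name -- substitute whatever the application really uses.
        JChannel channel = new JChannel("tcp.xml");
        channel.connect("cps-cluster");

        View view = channel.getView();
        // JGroups lists the coordinator as the first member of the view
        Address coord = view.getMembers().get(0);
        boolean localIsCoord = coord.equals(channel.getAddress());

        System.out.println("view                      : " + view);
        System.out.println("coordinator               : " + coord);
        System.out.println("local node is coordinator : " + localIsCoord);

        channel.close();
    }
}
------------------------------------------------------------------------------------------------------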
In this case, the node swcm3353 was the coordinator that we restarted. The log output from the restarted node while it tries to rejoin:

------------------------------------------------------------------------------------------------------
Thread-CPS-1001-2008924061-http-8080-1_id0 WARN [org.jgroups.protocols.pbcast.GMS] - join(swcm3353.ourdomain.com-42400) sent to swcm3353.ourdomain.com-53388 timed out (after 3000 ms), retrying
Thread-CPS-1001-2008924061-http-8080-1_id0 DEBUG [org.jgroups.protocols.pbcast.GMS] - initial_mbrs are [
    own_addr=swcm3358.ourdomain.com-54056, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3358.ourdomain.com-54056, physical_addrs=10.248.234.197:12345,
    own_addr=swcm3353.ourdomain.com-29887, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-29887, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3354.ourdomain.com-60551, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3354.ourdomain.com-60551, physical_addrs=10.248.234.176:12345,
    own_addr=swcm3353.ourdomain.com-52778, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-52778, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3355.ourdomain.com-61006, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3355.ourdomain.com-61006, physical_addrs=10.248.234.189:12345,
    own_addr=swcm3356.ourdomain.com-63935, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3356.ourdomain.com-63935, physical_addrs=10.248.234.195:12345,
    own_addr=swcm3353.ourdomain.com-56155, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-56155, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3352.ourdomain.com-6160, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3352.ourdomain.com-6160, physical_addrs=10.248.232.39:12345,
    own_addr=swcm3350.ourdomain.com-3234, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3350.ourdomain.com-3234, physical_addrs=10.248.232.37:12345,
    own_addr=swcm3353.ourdomain.com-15186, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3353.ourdomain.com-15186, physical_addrs=10.248.234.175:12345,
    own_addr=swcm3357.ourdomain.com-31306, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3357.ourdomain.com-31306, physical_addrs=10.248.234.196:12345,
    own_addr=swcm3351.ourdomain.com-60129, view id=[swcm3353.ourdomain.com-53388|29], is_server=true, is_coord=false, logical_name=swcm3351.ourdomain.com-60129, physical_addrs=10.248.232.38:12345]
Thread-CPS-1001-2008924061-http-8080-1_id0 DEBUG [org.jgroups.protocols.pbcast.GMS] - election results: {swcm3353.ourdomain.com-53388=8}
Thread-CPS-1001-2008924061-http-8080-1_id0 DEBUG [org.jgroups.protocols.pbcast.GMS] - sending handleJoin(swcm3353.ourdomain.com-42400) to swcm3353.ourdomain.com-53388
------------------------------------------------------------------------------------------------------

Any idea for this case?

Thanks in advance
Bent