[javagroups-users] Occasional join/merge timeouts
From: MVLeor <li...@ma...> - 2008-10-28 17:13:40
Hi JGroups,

We're running a cluster with 18 members. At first everything seems OK, but when a node joins or re-joins, it sometimes succeeds right away and sometimes keeps failing forever (IPs have been changed):

28 Oct 2008 17:42:28,159-[DEBUG ] , [determineCoord ] - election results: {aa.bb.cc.31:7800=3}
28 Oct 2008 17:42:28,159-[DEBUG ] , [join ] - sending handleJoin(aa.bb.cc.26:7875) to aa.bb.cc.31:7800
28 Oct 2008 17:42:31,159-[WARN ] , [join ] - join(aa.bb.cc.26:7875) sent to aa.bb.cc.31:7800 timed out (after 3000 ms), retr
28 Oct 2008 17:42:31,160-[DEBUG ] , [join ] - initial_mbrs are [[own_addr=aa.bb.cc.38:7800, coord_addr=aa.bb.cc.31:7800, is_s
28 Oct 2008 17:42:31,160-[DEBUG ] , [determineCoord ] - election results: {aa.bb.cc.31:7800=5}
28 Oct 2008 17:42:31,161-[DEBUG ] , [join ] - sending handleJoin(aa.bb.cc.26:7875) to aa.bb.cc.31:7800
28 Oct 2008 17:42:31,159-[WARN ] , [join ] - join(aa.bb.cc.26:7875) sent to aa.bb.cc.31:7800 timed out (after 3000 ms), retr
:
:
and so on repeatedly.

Also, nodes occasionally seem to leave the cluster (the channel gets reset).

We were wondering whether the configuration is OK for a cluster of this size, and of course whether there are any other suggestions?

All nodes are configured as follows:

<config>
    <TCP bind_addr="aa.bb.cc.23"
         start_port="7875"
         persistent_ports="true"
         loopback="true"
         recv_buf_size="20000000"
         send_buf_size="640000"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         use_incoming_packet_handler="true"
         enable_bundling="true"
         use_send_queues="false"
         sock_conn_timeout="300"
         skip_suspected_members="true"
         use_concurrent_stack="true"
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="25"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="Run"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="Run"/>
    <MPING timeout="4000" receive_on_all_interfaces="true" mcast_addr="228.8.8.8" mcast_port="60666" ip_ttl="8" num_initial_members="2" num_ping_requests="1"/>
    <MERGE2 max_interval="10000" min_interval="5000"/>
    <FD_SOCK/>
    <FD timeout="15000" max_tries="5" shun="true"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <pbcast.NAKACK use_mcast_xmit="false" gc_lag="50" retransmit_timeout="600,1200,2400,4800" discard_delivered_msgs="true"/>
    <UNICAST timeout="1200,2400,3600"/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>
    <VIEW_SYNC avg_send_interval="60000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true" view_bundling="true" reject_join_from_existing_member="true"/>
    <FC max_credits="2000000" min_threshold="0.10"/>
    <FRAG2 frag_size="60000"/>
    <pbcast.STATE_TRANSFER/>
    <pbcast.FLUSH timeout="10000"/>
</config>

--
View this message in context: http://www.nabble.com/Occasional-join-merge-timeouts-tp20211671p20211671.html
Sent from the JGroups - General mailing list archive at Nabble.com.