Re: [javagroups-users] Occasional join/merge timeouts
From: Dima G. <dim...@ma...> - 2008-10-29 20:03:39
Hi,

Can anyone comment on this? We are running 2.6.5 GA in a production environment, and this issue causes us to ... scream :-)

There is also another issue I've noticed. When nodes are configured with reject_join_from_existing_member="true" at the GMS level, the join fails with an exception on some nodes, while the joining nodes are unaware of that and keep trying to join forever. I have log files showing that this behavior can cause infinite loops that make the processes run out of memory very quickly.

Thanks in advance.

Regards,
Dima Gutzeit.

MVLeor wrote:
> Hi JGroups,
>
> We're running a cluster with 18 members. At first it all seems OK.
> It seems that when a node joins or re-joins, it sometimes succeeds right
> away and sometimes keeps failing forever:
>
>
> (IPs have been changed)
>
> 28 Oct 2008 17:42:28,159-[DEBUG ] , [determineCoord ]
> - election results: {aa.bb.cc.31:7800=3}
> 28 Oct 2008 17:42:28,159-[DEBUG ] , [join ]
> - sending handleJoin(aa.bb.cc.26:7875) to aa.bb.cc.31:7800
> 28 Oct 2008 17:42:31,159-[WARN ] , [join ]
> - join(aa.bb.cc.26:7875) sent to aa.bb.cc.31:7800 timed out (after 3000 ms),
> retr
> 28 Oct 2008 17:42:31,160-[DEBUG ] , [join ]
> - initial_mbrs are [[own_addr=aa.bb.cc.38:7800, coord_addr=aa.bb.cc.31:7800,
> is_s
> 28 Oct 2008 17:42:31,160-[DEBUG ] , [determineCoord ]
> - election results: {aa.bb.cc.31:7800=5}
> 28 Oct 2008 17:42:31,161-[DEBUG ] , [join ]
> - sending handleJoin(aa.bb.cc.26:7875) to aa.bb.cc.31:7800
> 28 Oct 2008 17:42:31,159-[WARN ] , [join ]
> - join(aa.bb.cc.26:7875) sent to aa.bb.cc.31:7800 timed out (after 3000 ms),
> retr
> :
> :
> and so on repeatedly.
>
>
> Also, occasionally nodes seem to leave the cluster (the channel gets reset).
> We were wondering if the configuration is OK for this size of cluster, and
> of course if there are any other suggestions ?
>
> All nodes are configured as follows:
> <config>
>   <TCP
>     bind_addr="aa.bb.cc.23"
>     start_port="7875"
>     persistent_ports="true"
>     loopback="true"
>     recv_buf_size="20000000"
>     send_buf_size="640000"
>     discard_incompatible_packets="true"
>     max_bundle_size="64000"
>     max_bundle_timeout="30"
>     use_incoming_packet_handler="true"
>     enable_bundling="true"
>     use_send_queues="false"
>     sock_conn_timeout="300"
>     skip_suspected_members="true"
>
>     use_concurrent_stack="true"
>
>     thread_pool.enabled="true"
>     thread_pool.min_threads="1"
>     thread_pool.max_threads="25"
>     thread_pool.keep_alive_time="5000"
>     thread_pool.queue_enabled="false"
>     thread_pool.queue_max_size="100"
>     thread_pool.rejection_policy="Run"
>
>     oob_thread_pool.enabled="true"
>     oob_thread_pool.min_threads="1"
>     oob_thread_pool.max_threads="8"
>     oob_thread_pool.keep_alive_time="5000"
>     oob_thread_pool.queue_enabled="false"
>     oob_thread_pool.queue_max_size="100"
>     oob_thread_pool.rejection_policy="Run"/>
>   <MPING timeout="4000" receive_on_all_interfaces="true"
>     mcast_addr="228.8.8.8" mcast_port="60666" ip_ttl="8" num_initial_members="2"
>     num_ping_requests="1"/>
>   <MERGE2 max_interval="10000" min_interval="5000"/>
>   <FD_SOCK/>
>   <FD timeout="15000" max_tries="5" shun="true"/>
>   <VERIFY_SUSPECT timeout="1500"/>
>   <pbcast.NAKACK use_mcast_xmit="false" gc_lag="50"
>     retransmit_timeout="600,1200,2400,4800" discard_delivered_msgs="true"/>
>   <UNICAST timeout="1200,2400,3600"/>
>   <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
>     max_bytes="400000"/>
>   <VIEW_SYNC avg_send_interval="60000"/>
>   <pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true"
>     view_bundling="true" reject_join_from_existing_member="true"/>
>   <FC max_credits="2000000" min_threshold="0.10"/>
>   <FRAG2 frag_size="60000"/>
>   <pbcast.STATE_TRANSFER/>
>   <pbcast.FLUSH timeout="10000"/>
> </config>