[jgroups-dev] TIMED_WAITING threads on JGroup messages makes application wait indefinitely
From: Development i. <jav...@li...> - 2018-05-24 12:35:44
Hi,

Usually during discount/sale periods, the threads responsible for sending JGroups messages end up in TIMED_WAITING state, more precisely at:

    java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000005d2968450> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
        at org.jgroups.util.CreditMap.decrement(CreditMap.java:157)
        at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:102)

It looks like the nodes cannot keep up with the volume of messages, so not enough credits are replenished between the nodes for new messages to be sent.

During peak periods we run roughly 17-20 AWS EC2 instances in our cluster. One EC2 instance is dedicated to batch processing, and on a few occasions we receive huge files that trigger a big load of messages. We have around 10 nodes serving user traffic and a few more nodes for administration purposes. All of these nodes communicate with each other via JGroups over TCP (at the time we migrated to AWS there was a constraint that only TCP could be used; we are now exploring ways to move to UDP), and we are using JGroups version 3.4.1.

Given the current infrastructure, what would be the recommended JGroups TCP configuration? We feel it is good practice to optimise our configuration.
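For anyone less familiar with credit-based flow control: MFC/UFC make a sender acquire "credits" (bytes) before sending, and receivers replenish credits after processing, so a slow receiver throttles the sender into a timed wait, which is exactly the TIMED_WAITING state in the dump above. A minimal sketch of that mechanism, using a plain `Semaphore` as a stand-in (this is illustrative only, not JGroups code; `CreditSketch` and its values are made up):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of credit-based flow control (not JGroups code).
// Senders acquire credits before sending; receivers release credits back
// once they have processed the messages. When receivers fall behind, the
// sender parks in a timed wait, like CreditMap.decrement() in the dump.
public class CreditSketch {
    static final int MAX_CREDITS = 1000;              // analogous to max_credits
    static final Semaphore credits = new Semaphore(MAX_CREDITS);

    // Sender side: wait up to timeoutMs for enough credits; false = timed out.
    static boolean send(int msgBytes, long timeoutMs) throws InterruptedException {
        return credits.tryAcquire(msgBytes, timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Receiver side: replenish credits after processing msgBytes.
    static void replenish(int msgBytes) {
        credits.release(msgBytes);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(send(800, 10));   // enough credits -> true
        System.out.println(send(800, 10));   // only 200 left -> times out, false
        replenish(800);                      // receiver catches up
        System.out.println(send(800, 10));   // credits restored -> true
    }
}
```

So the blocked threads are a symptom, not the bug: the cluster is telling you the receivers (or the network) cannot absorb the burst, and tuning has to either enlarge the credit pool or speed up the consumers.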
Our configuration is as follows:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
    <TCP loopback="true"
         recv_buf_size="${tcp.recv_buf_size:20M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         discard_incompatible_packets="true"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"
         thread_pool.enabled="true"
         thread_pool.min_threads="10"
         thread_pool.max_threads="40"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="5"
         oob_thread_pool.max_threads="20"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="10000"
         oob_thread_pool.rejection_policy="discard"
         bind_addr="${hybris.jgroups.bind_addr}"
         bind_port="${hybris.jgroups.bind_port}" />
    <JDBC_PING connection_driver="${hybris.database.driver}"
               connection_password="${hybris.database.password}"
               connection_username="${hybris.database.user}"
               connection_url="${hybris.database.url}"
               initialize_sql="${hybris.jgroups.schema}"
               datasource_jndi_name="${hybris.datasource.jndi.name}" />
    <MERGE2 min_interval="10000" max_interval="30000" />
    <FD_SOCK />
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500" discard_delivered_msgs="true" />
    <UNICAST />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M" />
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" />
    <UFC max_credits="20M" min_threshold="0.6" />
    <MFC max_credits="20M" min_threshold="0.6" />
    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />
</config>

Based on this configuration, do you have any recommendations for changes that would give us better throughput?

Thanks in advance,
Simeon
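Since the blocking happens in MFC's credit decrement, the settings most directly involved are the `MFC`/`UFC` lines already present in the config above. As a hedged sketch only (the numbers below are illustrative assumptions, not tested recommendations; they must be sized against heap usage and actual message rates), the usual direction of adjustment is to enlarge the credit pool so senders have more headroom before they park:

```xml
<!-- Illustrative values only: a larger max_credits pool lets senders
     burst further before blocking in CreditMap.decrement(); it does not
     fix a receiver that is persistently slower than the sender. -->
<UFC max_credits="40M" min_threshold="0.6" />
<MFC max_credits="40M" min_threshold="0.6" />
```

Note this only buys headroom for bursts (like the batch node's huge files); if consumers are chronically slower than producers, larger credits just delay the same stall.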