Thread: [jgroups-dev] TIMED_WAITING threads on JGroups messages make the application wait indefinitely
From: Development i. <jav...@li...> - 2018-05-24 10:02:26
Hi,

Usually in discount/sale periods, the threads that take care of sending JGroups messages end up in the TIMED_WAITING state, more precisely here:

java.lang.Thread.State: TIMED_WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for <0x00000005d2968450> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
  at org.jgroups.util.CreditMap.decrement(CreditMap.java:157)
  at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:102)

It looks as if there are more messages than the nodes can handle, and not enough credits are being sent back between the nodes for new messages to be processed.
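For context, the behaviour we think we are hitting is credit-based flow control along the lines of the rough sketch below. This is only an illustration of the mechanism as we understand it, not the actual org.jgroups.util.CreditMap implementation, and all names in it are made up:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only, NOT the real JGroups CreditMap: the sender spends
// credits per message and parks (TIMED_WAITING) when they run out, until the
// receivers replenish them.
public class CreditSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private long credits;

    public CreditSketch(long maxCredits) {
        this.credits = maxCredits;
    }

    // Called before a message is sent; blocks (with a timeout) while no
    // credits are left. This is where the threads in our dumps are parked.
    public boolean decrement(long msgSize, long timeoutMs) throws InterruptedException {
        lock.lock();
        try {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (credits < msgSize) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0)
                    return false; // gave up waiting for credits
                creditsAvailable.await(remaining, TimeUnit.MILLISECONDS);
            }
            credits -= msgSize;
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Called when a receiver sends credits back; wakes up blocked senders.
    public void replenish(long amount) {
        lock.lock();
        try {
            credits += amount;
            creditsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }
}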
During peak periods we have roughly 17-20 AWS EC2 instances in our cluster. One EC2 instance is dedicated to batch processing, and on a few occasions we receive huge files that trigger a big load of messages. We have around 10 nodes serving user traffic and a few more nodes for administration purposes. All of these nodes communicate with each other via JGroups over TCP (at the time we migrated to AWS there was a constraint that only allowed TCP, and we are exploring ways to move to UDP now), and we are using JGroups version 3.4.1.

Having said that, with the current infrastructure, what would be the recommended JGroups TCP configuration? We feel it is good practice to optimise our configuration. Our configuration is as follows:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">

    <TCP loopback="true"
         recv_buf_size="${tcp.recv_buf_size:20M}"
         send_buf_size="${tcp.send_buf_size:640K}"
         discard_incompatible_packets="true"
         max_bundle_size="64K"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="10"
         timer.keep_alive_time="3000"
         timer.queue_max_size="500"
         thread_pool.enabled="true"
         thread_pool.min_threads="10"
         thread_pool.max_threads="40"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="5"
         oob_thread_pool.max_threads="20"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="10000"
         oob_thread_pool.rejection_policy="discard"
         bind_addr="${hybris.jgroups.bind_addr}"
         bind_port="${hybris.jgroups.bind_port}" />

    <JDBC_PING connection_driver="${hybris.database.driver}"
               connection_password="${hybris.database.password}"
               connection_username="${hybris.database.user}"
               connection_url="${hybris.database.url}"
               initialize_sql="${hybris.jgroups.schema}"
               datasource_jndi_name="${hybris.datasource.jndi.name}"/>

    <MERGE2 min_interval="10000" max_interval="30000" />
    <FD_SOCK />
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500"
                   discard_delivered_msgs="true" />

    <UNICAST />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M" />
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true" />

    <UFC max_credits="20M" min_threshold="0.6" />
    <MFC max_credits="20M" min_threshold="0.6" />

    <FRAG2 frag_size="60K" />
    <pbcast.STATE_TRANSFER />

</config>

Based on this configuration, do you have any recommendations for modifying anything here to get better throughput?

Thanks in advance,
Simeon

--
Sent from: http://jgroups.1086181.n5.nabble.com/JGroups-Dev-f6604.html
From: Development i. <jav...@li...> - 2018-05-28 11:19:00
I'm afraid I don't support such an old version (it is 5 years old); see [1] for details. Running out of credits can have a number of causes, e.g. application threads blocking on the receivers, excessive GC, or exhausted thread pools (this can be checked with probe.sh).

I highly recommend upgrading to the latest stable 3.6.x or 4.0.x version [2]. Then copy the tcp.xml shipped with that version and modify it to fit your environment, e.g. replace TCPPING with JDBC_PING. I also see that you, for example, still use UNICAST and NAKACK instead of UNICAST3 and NAKACK2 in your config...

[1] https://developer.jboss.org/wiki/Support
[2] https://sourceforge.net/projects/javagroups/files/JGroups/
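To look at the thread pools and the credits, a probe run along these lines should work (assuming enable_diagnostics has been left at its default on the transport; adjust the protocol names to whatever is actually in your stack):

probe.sh jmx=TCP   # transport attributes, including thread pool usage
probe.sh jmx=MFC   # multicast flow control, e.g. remaining credits
probe.sh jmx=UFC   # unicast flow control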
--
Bela Ban | http://www.jgroups.org