
jgroups 3.6.2 - many threads blocked in TransferQueueBundler.send() on the BlockingQueue<Message> queue

Started by toddler on 2015-04-15 - last updated 2015-04-27
  • toddler

    toddler - 2015-04-15

    Hi: I am looking for some help or suggestions...

    We are using jgroups 3.6.2 in an application that extends RpcDispatcher and mostly performs remote asynchronous procedure calls.

    We've experienced an issue where many threads (54) become blocked in org.jgroups.protocols.TP$TransferQueueBundler.send(TP.java:2686)
    waiting on a lock.

    Questions:
    Are there any known issues related to this, and is there a workaround or solution we can use?
    From the thread dump(s) it is NOT obvious who is holding this lock/resource.
    I'm confused as to why the thread dump does not show what is holding this lock; does anyone know how I can determine what is holding it?

    Generic top-part of the stack trace:

    2015-04-12 04:25:00
    "RMI TCP Connection(34)-135.238.116.149" daemon prio=10 tid=0x0000000007b27800 nid=0x4e18 waiting on condition [0x00007f6142fda000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000458c238c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
    at org.jgroups.protocols.TP$TransferQueueBundler.send(TP.java:2686)
    at org.jgroups.protocols.TP.send(TP.java:1872)
    at org.jgroups.protocols.TP._send(TP.java:1518)
    at org.jgroups.protocols.TP.down(TP.java:1475)

    grep "LinkedBlockingQueue.put" EmsThreadDump.2015-04-12_04-40-00.104.log_td_0 | wc -l
    54
    grep "TransferQueueBundler.send" EmsThreadDump.2015-04-12_04-40-00.104.log_td_0 | wc -l
    54
    grep "458c238c8" EmsThreadDump.2015-04-12_04-40-00.104.log_td_0 | wc -l
    54

    Full thread example:
    "Incoming-4,5620SAM_Auxiliary_SAM_13_0_R2REL_27_D_5620sam,lteroyal-34858135.238.116.149[12800] Hostname: lteroyal" prio=10 tid=0x0000000005517800 nid=0xcbf4 waiting on condition [0x00007f6143970000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000458c238c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
    at org.jgroups.protocols.TP$TransferQueueBundler.send(TP.java:2686)
    at org.jgroups.protocols.TP.send(TP.java:1872)
    at org.jgroups.protocols.TP._send(TP.java:1518)
    at org.jgroups.protocols.TP.down(TP.java:1475)
    at org.jgroups.protocols.Discovery.down(Discovery.java:435)
    at org.jgroups.protocols.TCPPING.down(TCPPING.java:103)
    at org.jgroups.protocols.MERGE3.down(MERGE3.java:257)
    at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:356)
    at org.jgroups.protocols.FD.down(FD.java:305)
    at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:92)
    at org.jgroups.protocols.BARRIER.down(BARRIER.java:132)
    at org.jgroups.protocols.pbcast.NAKACK2.down(NAKACK2.java:589)
    at org.jgroups.protocols.UNICAST3.down(UNICAST3.java:633)
    at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:347)
    at org.jgroups.protocols.pbcast.GMS.down(GMS.java:1116)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:353)
    at org.jgroups.protocols.FRAG2.down(FRAG2.java:136)
    at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:202)
    at org.jgroups.protocols.ENCRYPT.encryptAndSend(ENCRYPT.java:850)
    at org.jgroups.protocols.ENCRYPT.down(ENCRYPT.java:800)
    at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1038)
    at org.jgroups.JChannel.down(JChannel.java:791)
    at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:693)
    at org.jgroups.blocks.RequestCorrelator.sendUnicastRequest(RequestCorrelator.java:205)
    at org.jgroups.blocks.UnicastRequest.sendRequest(UnicastRequest.java:43)
    at org.jgroups.blocks.Request.execute(Request.java:83)
    at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:407)
    at org.jgroups.blocks.RpcDispatcher.callRemoteMethod(RpcDispatcher.java:258)
    at com.timetra.nms.server.multiserver.core.common.Controller.callAsynchMethodOnCluster(Controller.java:1057)
    at com.timetra.nms.server.multiserver.core.common.Controller.callAsynchMethodOnCluster(Controller.java:1046)
    at com.timetra.nms.server.multiserver.core.common.Controller.sendPongMessage(Controller.java:635)
    at com.timetra.nms.server.multiserver.core.common.Controller.ping(Controller.java:619)
    - locked <0x0000000517771f68> (a org.jgroups.View)
    at sun.reflect.GeneratedMethodAccessor3109.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:286)
    at com.timetra.nms.server.multiserver.core.common.Controller.handle(Controller.java:994)
    at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:472)
    at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:377)
    at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:250)
    at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:675)
    at org.jgroups.JChannel.up(JChannel.java:761)
    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1033)
    at org.jgroups.protocols.ENCRYPT.up(ENCRYPT.java:463)
    at org.jgroups.stack.Protocol.up(Protocol.java:420)
    at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:447)
    at org.jgroups.stack.Protocol.up(Protocol.java:420)
    at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
    at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1087)
    at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886)
    at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:790)
    at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
    at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:652)
    at org.jgroups.protocols.BARRIER.up(BARRIER.java:152)
    at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
    at org.jgroups.protocols.FD.up(FD.java:253)
    at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:297)
    at org.jgroups.protocols.MERGE3.up(MERGE3.java:288)
    at org.jgroups.protocols.Discovery.up(Discovery.java:291)
    at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
    at org.jgroups.protocols.TP$MyHandler.run(TP.java:1796)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

     
  • Bela Ban

    Bela Ban - 2015-04-16

    The TransferQueueBundler's queue is bounded and serviced by a single thread. If it is full, then writes block until there is enough space to add a message to the TQB again. It can block, for example, if you use TCP and the TCP write blocks because the window size is 0. This can happen if the receiver is blocked on something (e.g. app code) and cannot receive TCP messages.
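    To make the mechanism concrete, here is a minimal, self-contained Java sketch (not JGroups internals; the capacity and names are invented). Producers park in LinkedBlockingQueue.put(), exactly as in the dumps above, once a bounded queue fills up because its single drainer thread has stalled:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class BoundedQueueBlockingDemo {
        // Small capacity so the queue fills up quickly (illustrative only)
        static final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(4);

        public static void main(String[] args) {
            // Single drainer thread, analogous to the bundler thread; here it is
            // artificially stalled, as if the underlying TCP write were blocked.
            new Thread(() -> {
                try {
                    TimeUnit.MINUTES.sleep(10);   // simulate a blocked send
                    queue.take();
                } catch (InterruptedException ignored) { }
            }, "drainer").start();

            // Producers: four put() calls succeed, the rest park inside
            // LinkedBlockingQueue.put(), just like the 54 threads in the dump.
            for (int i = 0; i < 8; i++) {
                final int n = i;
                new Thread(() -> {
                    try {
                        queue.put("msg-" + n);    // blocks once the queue is full
                        System.out.println("queued msg-" + n);
                    } catch (InterruptedException ignored) { }
                }, "producer-" + n).start();
            }
        }
    }

    In the real stack the drainer is the TransferQueueBundler thread, and the stall would come from a blocked TCP write rather than a sleep.
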
    I see that you invoke an RPC in Controller.handle(). I hope that RPC is non-blocking, or else you might run into a distributed deadlock. I guess checking the stack traces on all nodes would give you more insight here.
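    For reference, a fire-and-forget call with the JGroups 3.x RpcDispatcher API might look roughly like the sketch below (the remote method name handlePong and its signature are hypothetical, not taken from your Controller). Note that an asynchronous call still travels down the transport and can block in TransferQueueBundler.send() if the queue is full; it just never waits for a response:

    import org.jgroups.Address;
    import org.jgroups.blocks.MethodCall;
    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.blocks.RpcDispatcher;

    public class AsyncRpcSketch {
        // Fire-and-forget RPC: GET_NONE means the caller never waits for a reply,
        // so it cannot take part in a request/response distributed deadlock.
        static void sendPong(RpcDispatcher dispatcher, Address target) throws Exception {
            MethodCall call = new MethodCall("handlePong",              // hypothetical remote method
                                             new Object[]{"pong"},
                                             new Class[]{String.class});
            dispatcher.callRemoteMethod(target, call, RequestOptions.ASYNC());
        }
    }
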
    Is this reproducible?

     
  • toddler

    toddler - 2015-04-18

    Thanks for your response... I found a problem in our app code.

    [We tried a few times, but were not able to reproduce the issue.]

    It turns out that, buried deep in the logs, there was a StackOverflowError caused by our app code. When a callRemoteMethod() is attempted and we get a "SuspectedException", our code would keep retrying the same call (recursively, with no limit) against a different cluster member. In this scenario, two other nodes were (or should have been) cluster members, and it seems we kept getting the "SuspectedException" for both of them, which eventually ended in the StackOverflowError. I've changed the code so we only retry a limited number of times; we need to do more testing to make sure it doesn't occur again.
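    For illustration only (our real Controller code is not shown here; the candidate list, the names and the cap below are just illustrative), the fix boils down to replacing the unbounded recursion with a capped, iterative retry loop, roughly:

    import java.util.List;

    import org.jgroups.Address;
    import org.jgroups.SuspectedException;
    import org.jgroups.blocks.MethodCall;
    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.blocks.RpcDispatcher;

    public class CappedRetryCaller {
        private static final int MAX_ATTEMPTS = 3;   // hypothetical retry cap

        // Try at most MAX_ATTEMPTS candidates instead of recursing without limit
        // every time a SuspectedException comes back.
        static Object callWithRetry(RpcDispatcher dispatcher, List<Address> candidates,
                                    MethodCall call, RequestOptions opts) throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
                Address target = candidates.get(attempt % candidates.size());
                try {
                    return dispatcher.callRemoteMethod(target, call, opts);
                } catch (SuspectedException e) {
                    last = e;   // member suspected; move on to the next candidate
                }
            }
            throw last;         // give up after the cap instead of overflowing the stack
        }
    }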

    I suspect the StackOverflowError left the JVM in a weird state... and somehow resulted in the TransferQueueBundler's queue being full...

    What is "interesting" is that, looking only at the stack traces of the available thread dumps, I do see the single "TransferQueueBundler" thread servicing the queue... but it seems to be waiting for messages to be put into the queue (so it can send them), while other threads are waiting for room to become available in the queue (stack trace in the previous post). It may also be that I don't understand how this code is supposed to work...

    I observe the following TransferQueueBundler stack traces... note that they are 2 days apart (we are missing thread dumps; this is all I have).

    2015-04-10 15:40:00
    "TransferQueueBundler,5620SAM_Auxiliary_SAM_13_0_R2REL_27_D_5620sam,lteroyal-34858135.238.116.149[12800] Hostname: lteroyal" prio=10 tid=0x00000000027c4800 nid=0x94c0 waiting on condition [0x00007f615dd4d000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000047597fe80> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2694)
    at java.lang.Thread.run(Thread.java:745)

    Both the tid (thread ID) and the nid in the above and below thread dumps are the same, so I assume it's the same thread...
    [but the "parking to wait for" values are different for some reason?].

    2015-04-12 04:25:00
    "TransferQueueBundler,5620SAM_Auxiliary_SAM_13_0_R2REL_27_D_5620sam,lteroyal-34858135.238.116.149[12800] Hostname: lteroyal" prio=10 tid=0x00000000027c4800 nid=0x94c0 waiting on condition [0x00007f615dd4d000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000455abda10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2694)
    at java.lang.Thread.run(Thread.java:745)

    Thanks again for your help...

     
  • Bela Ban

    Bela Ban - 2015-04-18

    I don't recommend reacting to SuspectedExceptions. Those are just indications that some member might get excluded soon; you only get confirmation through a view. So I'd rather handle view changes than suspicions.
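    A minimal sketch of that approach, assuming the JGroups 3.x ReceiverAdapter API (the println calls are placeholders for whatever re-routing the application needs): act only on viewAccepted() and treat suspect() as informational.

    import org.jgroups.Address;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class ViewAwareReceiver extends ReceiverAdapter {
        @Override
        public void viewAccepted(View view) {
            // Authoritative membership change: only now is a member really gone,
            // so this is the place to redirect pending work to the new membership.
            System.out.println("new view: " + view.getMembers());
        }

        @Override
        public void suspect(Address suspected) {
            // Just a hint that 'suspected' may be excluded soon; do not retry here.
            System.out.println("suspected (no action): " + suspected);
        }
    }

    Install it with channel.setReceiver(new ViewAwareReceiver()).
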
    Re the stack trace: this is a mystery to me! It looks like this is the same thread (nid and tid match), but the first time the thread is waiting on condition (LinkedBlockingQueue.notEmpty) <0x000000047597fe80> and the second time on <0x0000000455abda10>. This makes no sense to me, as disconnecting and reconnecting a channel would also create a new thread in TransferQueueBundler.start().

     
  • Bela Ban

    Bela Ban - 2015-04-20

    OK, so I asked the in-house experts and their responses are (summarized):

    JVM Thread Ids can get reused. Ref: https://gist.github.com/rednaxelafx/843622#java-level-thread-id
    The native thread Id usage depends on the OS.
    In Linux, a process can reuse a thread Id after the thread terminates. Ref: http://man7.org/linux/man-pages/man3/pthread_self.3.html#NOTES

    This means that the same tid/nid doesn't necessarily mean it is the same thread.

    Here's the Hotspot code (from OpenJDK 8) which prints the lock ID:

    static void print_locked_object_class_name(outputStream* st, Handle obj,
                                                const char* lock_state) {
      if (obj.not_null()) {
        st->print("\t- %s <" INTPTR_FORMAT "> ", lock_state, (address)obj());
        if (obj->klass() == SystemDictionary::Class_klass()) {
          Klass* target_klass = java_lang_Class::as_Klass(obj());
          st->print_cr("(a java.lang.Class for %s)",
                       InstanceKlass::cast(target_klass)->external_name());
        } else {
          Klass* k = obj->klass();
          st->print_cr("(a %s)", k->external_name());
        }
      }
    }
    

    So it's just the heap address of a Java object, and not the return value of System.identityHashCode() or something like that. As a result, comparisons between lock IDs are only meaningful within the same thread dump; between different thread dumps, garbage collection may change the addresses of the lock objects involved.

    This means that the differing condition object IDs may be caused by GC moving things around, as the condition ID is just the address of the condition object in memory.

    Conclusion: it's entirely possible that the 2 thread dumps show 2 different threads.

     
  • toddler

    toddler - 2015-04-27

    Thanks Bela for the follow up.

    I fully suspect it was a different thread in the thread dump taken two days later. In some failure scenarios we do close and then re-open the channel, so I assume that occurred and thus it is a different thread [with a different queue].

    Todd

     
