JGroups / Discussion / javagroups-users: RpcDispatcher locked indefinitely

Hi,

I’ve been trying to get an inter-server communication system working for a while now, without much success. Maybe someone can give me a hint as to what I’m doing wrong…

The communication system is meant to allow distributed method calls between different nodes in a cluster (J2EE application in JBoss 4) without coding any network-specific logic or even being aware of the cluster at the application level. So I developed a framework that dynamically injects the network (JGroups) code into simple interface-like classes (let's call them "ports") via byte code injection. These ports are developed to perform some "local" action without caring about JGroups at all. The injected code is then able to transparently forward method calls to the corresponding ports of one or more remote cluster nodes (based on certain criteria), which simply perform their "local" action and return the result over the network. This was done using the RpcDispatcher (and a MemberStateListener to keep track of the cluster nodes). In theory, everything works great.
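A minimal sketch of the forwarding idea, under loud assumptions: it uses a java.lang.reflect.Proxy in place of byte code injection, and a hypothetical Transport interface in place of the RpcDispatcher. The names PortProxyDemo, FilePort, Transport and remotePort are made up for illustration and are not part of the actual framework or of JGroups.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class PortProxyDemo {
    /** Hypothetical port: plain application logic, no network code. */
    public interface FilePort {
        int store(String name, byte[] content);
    }

    /** Hypothetical stand-in for the RPC layer (an RpcDispatcher in the real system). */
    public interface Transport {
        Object invokeRemote(String portName, String methodName, Object[] args) throws Exception;
    }

    /** Creates a proxy that forwards each call on the port interface to the transport. */
    @SuppressWarnings("unchecked")
    public static <T> T remotePort(Class<T> portType, Transport transport) {
        InvocationHandler handler = (proxy, method, args) ->
                transport.invokeRemote(portType.getSimpleName(), method.getName(), args);
        return (T) Proxy.newProxyInstance(
                portType.getClassLoader(), new Class<?>[] {portType}, handler);
    }

    public static void main(String[] args) {
        // The "local" action as it would run on the remote node.
        FilePort local = (name, content) -> content.length;
        // Loopback transport: calls the local port directly instead of going over the network.
        Transport loopback = (port, method, callArgs) ->
                local.store((String) callArgs[0], (byte[]) callArgs[1]);
        FilePort remote = remotePort(FilePort.class, loopback);
        System.out.println(remote.store("a.txt", new byte[42]));
    }
}
```

The caller only sees the FilePort interface; swapping the loopback transport for a real RPC layer would not change the calling code.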

Unfortunately, a number of problems still hold me back. First I tried opening a separate JChannel (each with its own RpcDispatcher) for every port because I assumed that would yield the best performance. Besides the fact that application startup was unacceptably slow due to the group discovery process necessary for each and every port, I was not able to get things working in a stable way. With just two cluster nodes sharing about 15 channels, the system usually kept working flawlessly for about half an hour before crashing. The more cluster nodes I added, the faster the whole system crashed. And I mean it: as soon as the first error occurred in one of the nodes, the whole JGroups communication went down on all nodes, printing an error to the console (see first error message below). Frequently adding and removing cluster nodes also led to crashes. Below are a number of the most common errors I got:

07:10:25,616 ERROR [Configurator] an instance of TP$ProtocolAdapter could not be created. Please check that it implements interface Protocol and that is has a public empty constructor !
07:10:25,618 ERROR [JChannel] org.jgroups.ChannelException: failed to open channel

ERROR org.jgroups.logging.Log4JLogImpl.error(Log4JLogImpl.java:61) - uncaught exception in Thread[DiagnosticsHandler,null,5,JGroups] (thread group=org.jgroups.util.Util$1[name=JGroups,maxpri=10] )
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:810)
at java.util.HashMap$KeyIterator.next(HashMap.java:845)
at org.jgroups.protocols.TP$DiagnosticsHandler.handleDiagnosticProbe(TP.java:1781)
at org.jgroups.protocols.TP$DiagnosticsHandler.run(TP.java:1758)
at java.lang.Thread.run(Thread.java:717)

ERROR [jgroups] uncaught exception in Thread[DiagnosticsHandler,192.168.20.16:7800,5,JGroups] (thread group=org.jgroups.util.Util$1[name=JGroups,maxpri=10] )
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
at java.util.HashMap$KeyIterator.next(HashMap.java:828)
at org.jgroups.protocols.TP$DiagnosticsHandler.handleDiagnosticProbe(TP.java:1707)
at org.jgroups.protocols.TP$DiagnosticsHandler.run(TP.java:1684)
at java.lang.Thread.run(Thread.java:619)

16:55:45,812 ERROR [UDP] failed sending message to 192.168.20.16:7800 (60 bytes)
java.lang.Exception: dest=/192.168.20.16:7800 (63 bytes)
at org.jgroups.protocols.UDP._send(UDP.java:212)
at org.jgroups.protocols.UDP.sendToSingleMember(UDP.java:181)
at org.jgroups.protocols.TP.doSend(TP.java:1105)
at org.jgroups.protocols.TP.send(TP.java:1088)
at org.jgroups.protocols.TP.down(TP.java:907)
at org.jgroups.protocols.TP$ProtocolAdapter.down(TP.java:1810)
at org.jgroups.protocols.Discovery.down(Discovery.java:363)
at org.jgroups.protocols.MERGE2.down(MERGE2.java:169)
at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:333)
at org.jgroups.protocols.FD$Monitor.run(FD.java:539)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.SocketException: No buffer space available (maximum connections reached?): Datagram send failed
at java.net.PlainDatagramSocketImpl.send(Native Method)
at java.net.DatagramSocket.send(DatagramSocket.java:612)
at org.jgroups.protocols.UDP._send(UDP.java:208)
... 18 more

To be honest, I don't understand any of those error messages ;-) Due to my lack of understanding I tried lots of different stack configurations (including the standard stacks that come with the distributions) and different JGroups versions (2.7, 2.8b and current 2.8 from CVS), but none of that made much of a difference. I tried shared and non-shared transports, with no real difference.

Now I tried getting rid of the separate channels and implemented an RPC multiplexer providing virtual channels on top of a single JChannel and one RpcDispatcher. That made quite a big difference. JGroups doesn't print errors to the console anymore and everything is a lot more stable than before. Currently it seems that only one problem remains: at some point, one of the cluster nodes transmits a series of files to another node (1 target address, GroupRequest.GET_FIRST, 15s timeout, single RPC call per file) and JGroups simply never returns from one of the successfully processed RPC calls (which does not even have a return value), being stuck in ...

caller's side:

"ICP1" daemon prio=10 tid=0x000000002ce25000 nid=0x125c waiting on condition [0x00000000377df000..0x00000000377df810]
    java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x0000000024b45ee0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
    at org.jgroups.blocks.GroupRequest.collectResponses(GroupRequest.java:512)
    at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:266)
    at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:231)
    at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:576)
    at org.jgroups.blocks.RpcDispatcher.callRemoteMethod(RpcDispatcher.java:323)
    ...

callee's side:

"Incoming-1,CH,192.168.20.16:7800" prio=6 tid=0x00000000323e8000 nid=0x20c waiting on condition [0x000000003884f000..0x000000003884fae0]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000257f8a40> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)

This behaviour is perfectly reproducible; it always happens when transmitting the same file. If I alter the protocol stack by adding or removing a protocol, the application locks up at another point (when another file is transmitted). If I replace the file that's "causing" the problem with a smaller one, one or two additional files can be transmitted before JGroups stops responding. This looks as if a buffer is running out of space, and modifying the stack only has a small impact on the number of bytes that can be processed. Does anyone have any idea what all that means or what I could try? I also had this problem with the previous implementation using multiple channels, btw.

The only idea I currently have is that JGroups internally somehow depends on the exact number of bytes of overhead that class MethodCall adds to each message. I'm using a custom subclass of MethodCall that adds some logic and two additional fields. That's the only "modification" made to JGroups and I really didn't expect that to cause any trouble.

Any help would be highly appreciated.

Regards,
momo

First off, instead of writing your own multiplexer, you could have used a shared transport. However, of course, a custom multiplexer is always OK, too...
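For readers following along, a custom multiplexer of the kind described can be quite small. The sketch below is only an illustration under stated assumptions, not momo's code and not a JGroups API: RpcMultiplexer, Envelope, register and dispatch are hypothetical names, and the single real channel is left out entirely. Each message carries the name of its virtual channel, and the one real dispatcher hands it to the handler registered under that name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class RpcMultiplexer {
    /** Envelope sent over the single shared channel: virtual channel name plus payload. */
    public static final class Envelope {
        final String virtualChannel;
        final Object payload;

        public Envelope(String virtualChannel, Object payload) {
            this.virtualChannel = virtualChannel;
            this.payload = payload;
        }
    }

    // One handler per virtual channel; concurrent map because ports may register at any time.
    private final Map<String, Function<Object, Object>> handlers = new ConcurrentHashMap<>();

    public void register(String virtualChannel, Function<Object, Object> handler) {
        handlers.put(virtualChannel, handler);
    }

    /** Called by the single real dispatcher for each incoming envelope. */
    public Object dispatch(Envelope envelope) {
        Function<Object, Object> handler = handlers.get(envelope.virtualChannel);
        if (handler == null) {
            throw new IllegalArgumentException(
                    "no handler for virtual channel " + envelope.virtualChannel);
        }
        return handler.apply(envelope.payload);
    }

    public static void main(String[] args) {
        RpcMultiplexer mux = new RpcMultiplexer();
        mux.register("files", payload -> ((byte[]) payload).length);
        System.out.println(mux.dispatch(new Envelope("files", new byte[10])));
    }
}
```

Note that all virtual channels then share the same incoming thread pool, so one slow handler can hold up the others.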

Next, the stack dump of the callee is truncated; can you post the entire trace? It looks like your message is stuck waiting to be handled because there is currently no thread available to process it. Try increasing your thread pool size and/or making your RPC OOB. For example:
thread_pool.enabled="true"
thread_pool.min_threads="10"
thread_pool.max_threads="80"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="1000"
thread_pool.rejection_policy="discard"

The min_threads and max_threads values were changed. You might also look into disabling the queue.
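Why the queue matters can be shown with plain JDK classes. The sketch below does not use JGroups at all; it only relies on the documented behaviour of java.util.concurrent.ThreadPoolExecutor (the class visible in the callee's stack trace), which starts threads beyond the core size only when its queue refuses a task. The class name PoolQueueDemo, the helper threadsStarted and the small pool sizes are made up for the demonstration, and treating a SynchronousQueue as the equivalent of a disabled queue is an assumption about how the transport maps that setting.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolQueueDemo {
    /** Submits `tasks` blocking tasks and reports how many threads the pool started for them. */
    public static int threadsStarted(BlockingQueue<Runnable> queue, int min, int max, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                min, max, 5000, TimeUnit.MILLISECONDS, queue,
                new ThreadPoolExecutor.DiscardPolicy());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            // Each task blocks until released, like a handler stuck in a long call.
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int started = pool.getPoolSize();
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return started;
    }

    public static void main(String[] args) {
        // Bounded queue: only the core threads run; the other tasks wait in the queue.
        System.out.println(threadsStarted(new LinkedBlockingQueue<>(1000), 2, 8, 6));
        // No queue capacity: every task that finds no idle thread starts a new one, up to max.
        System.out.println(threadsStarted(new SynchronousQueue<>(), 2, 8, 6));
    }
}
```

With a queue of 1000 entries, max_threads is effectively never reached, so a message can sit behind blocked handlers; raising min_threads or removing the queue avoids that.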

Third: if this is reproducible, and the above 2 points don't fix the issue, can you submit a small test program which reproduces it? I could then easily look at the hang and fix it. Might be as simple as a config change...

Fourth: I don't think adding 2 fields to MethodCall causes the bug.

Fifth: I prefer the mailing list (dev or user) to this forum, because quoting is much easier...
Cheers,
