Hi,
I’ve been trying to get an inter-server communication system working for a while now, without much success. Maybe someone can give me a hint as to what I’m doing wrong…
The communication system is meant to allow distributed method calls between different nodes in a cluster (J2EE application in JBoss 4) without coding any network-specific logic or even being aware of the cluster at the application level. So I developed a framework that dynamically injects the network (JGroups) code into simple interface-like classes (let's call them "ports") via bytecode injection. These ports are written to perform some "local" action without caring about JGroups at all. The injected code is then able to transparently forward method calls to the corresponding ports of one or more remote cluster nodes (based on certain criteria), which simply perform their "local" action and return the result over the network. This is done using the RpcDispatcher (and a MemberStateListener to keep track of the cluster nodes). In theory, everything works great.
Unfortunately, a number of problems still hold me back. First I tried opening a separate JChannel (each with its own RpcDispatcher) for every port because I assumed that would yield the best performance. Besides the fact that application startup was unacceptably slow due to the group discovery process necessary for each and every port, I was not able to get things working stably. With just two cluster nodes sharing about 15 channels, the system usually worked flawlessly for about half an hour before crashing. The more cluster nodes I added, the faster the whole system crashed. And I mean it: as soon as the first error occurred in one of the nodes, the whole JGroups communication went down on all nodes, printing an error to the console (see the first error message below). Frequently adding and removing cluster nodes also led to crashes. Below are some of the most common errors I got:
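For readers who want a concrete picture of the "port" idea, here is a minimal sketch. It is not the framework described above: it swaps bytecode injection for a plain `java.lang.reflect.Proxy`, and the names `GreetingPort`, `LocalGreetingPort`, `Transport` and `PortProxies` are hypothetical. The `Transport` is an in-process stand-in for whatever the network layer (e.g. an RpcDispatcher) would do.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

/** Hypothetical "port": plain local logic, no networking code. */
interface GreetingPort {
    String greet(String name);
}

/** Local implementation that only performs its "local" action. */
final class LocalGreetingPort implements GreetingPort {
    private final String nodeName;

    LocalGreetingPort(String nodeName) {
        this.nodeName = nodeName;
    }

    @Override
    public String greet(String name) {
        return "Hello " + name + " from " + nodeName;
    }
}

/** Stand-in for the network layer; a real system would send an RPC here. */
interface Transport {
    Object invoke(String methodName, Class<?>[] paramTypes, Object[] args) throws Exception;
}

final class PortProxies {
    private PortProxies() {}

    /** Returns a port instance whose calls are all handed to the transport. */
    @SuppressWarnings("unchecked")
    static <T> T forwarding(Class<T> portType, Transport transport) {
        InvocationHandler handler = (proxy, method, args) ->
                transport.invoke(method.getName(), method.getParameterTypes(), args);
        return (T) Proxy.newProxyInstance(
                portType.getClassLoader(), new Class<?>[] {portType}, handler);
    }

    /** In-process "remote node": looks the method up on the port interface and runs it locally. */
    static Transport loopbackTo(Class<?> portType, Object target) {
        return (methodName, paramTypes, args) -> {
            Method m = portType.getMethod(methodName, paramTypes);
            m.setAccessible(true);
            return m.invoke(target, args);
        };
    }
}
```

Application code would only ever see `GreetingPort`; whether the call runs locally or is forwarded is decided by which instance it is handed.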
07:10:25,616 ERROR [Configurator] an instance of TP$ProtocolAdapter could not be created. Please check that it implements interface Protocol and that is has a public empty constructor !
07:10:25,618 ERROR [JChannel] org.jgroups.ChannelException: failed to open channel
ERROR org.jgroups.logging.Log4JLogImpl.error(Log4JLogImpl.java:61) - uncaught exception in Thread[DiagnosticsHandler,null,5,JGroups] (thread group=org.jgroups.util.Util$1[name=JGroups,maxpri=10] )
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:810)
at java.util.HashMap$KeyIterator.next(HashMap.java:845)
at org.jgroups.protocols.TP$DiagnosticsHandler.handleDiagnosticProbe(TP.java:1781)
at org.jgroups.protocols.TP$DiagnosticsHandler.run(TP.java:1758)
at java.lang.Thread.run(Thread.java:717)
ERROR [jgroups] uncaught exception in Thread[DiagnosticsHandler,192.168.20.16:7800,5,JGroups] (thread group=org.jgroups.util.Util$1[name=JGroups,maxpri=10] )
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
at java.util.HashMap$KeyIterator.next(HashMap.java:828)
at org.jgroups.protocols.TP$DiagnosticsHandler.handleDiagnosticProbe(TP.java:1707)
at org.jgroups.protocols.TP$DiagnosticsHandler.run(TP.java:1684)
at java.lang.Thread.run(Thread.java:619)
16:55:45,812 ERROR [UDP] failed sending message to 192.168.20.16:7800 (60 bytes)
java.lang.Exception: dest=/192.168.20.16:7800 (63 bytes)
at org.jgroups.protocols.UDP._send(UDP.java:212)
at org.jgroups.protocols.UDP.sendToSingleMember(UDP.java:181)
at org.jgroups.protocols.TP.doSend(TP.java:1105)
at org.jgroups.protocols.TP.send(TP.java:1088)
at org.jgroups.protocols.TP.down(TP.java:907)
at org.jgroups.protocols.TP$ProtocolAdapter.down(TP.java:1810)
at org.jgroups.protocols.Discovery.down(Discovery.java:363)
at org.jgroups.protocols.MERGE2.down(MERGE2.java:169)
at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:333)
at org.jgroups.protocols.FD$Monitor.run(FD.java:539)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.SocketException: No buffer space available (maximum connections reached?): Datagram send failed
at java.net.PlainDatagramSocketImpl.send(Native Method)
at java.net.DatagramSocket.send(DatagramSocket.java:612)
at org.jgroups.protocols.UDP._send(UDP.java:208)
... 18 more
To be honest, I don't understand any of those error messages ;-) Due to my lack of understanding I tried lots of different stack configurations (including the standard stacks that come with the distributions) and different JGroups versions (2.7, 2.8b and the current 2.8 from CVS), but none of that made much of a difference. I tried shared and non-shared transports, with no real difference.
Now I tried getting rid of the separate channels and implemented an RPC multiplexer providing virtual channels on top of a single JChannel and one RpcDispatcher. That made quite a big difference. JGroups doesn't print errors to the console anymore and everything is a lot more stable than before. Currently it seems that only one problem remains: at some point, one of the cluster nodes transmits a series of files to another node (1 target address, GroupRequest.GET_FIRST, 15s timeout, a single RPC call per file) and JGroups simply never returns from one of the successfully processed RPC calls (which doesn't even have a return value), being stuck in ...
caller's side:
"ICP1" daemon prio=10 tid=0x000000002ce25000 nid=0x125c waiting on condition [0x00000000377df000..0x00000000377df810]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000024b45ee0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
at org.jgroups.blocks.GroupRequest.collectResponses(GroupRequest.java:512)
at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:266)
at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:231)
at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:576)
at org.jgroups.blocks.RpcDispatcher.callRemoteMethod(RpcDispatcher.java:323)
...
callee's side:
"Incoming-1,CH,192.168.20.16:7800" prio=6 tid=0x00000000323e8000 nid=0x20c waiting on condition [0x000000003884f000..0x000000003884fae0]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000257f8a40> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
This behaviour is perfectly reproducible: it always happens when transmitting the same file. If I alter the protocol stack by adding or removing a protocol, the application locks up at another point (when another file is transmitted). If I replace the file that's "causing" the problem with a smaller one, one or two additional files can be transmitted before JGroups stops responding. It looks as if some buffer is running out of space and modifying the stack only slightly changes the number of bytes that can be processed. Does anyone have any idea what all that means or what I could try? I also had this problem with the previous implementation using multiple channels, btw.
The only idea I currently have is that JGroups internally somehow depends on the exact number of bytes of overhead that the MethodCall class adds to each message. I'm using a custom subclass of MethodCall that adds some logic and two additional fields. That's the only "modification" made to JGroups and I really didn't expect that to cause any trouble.
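To make the multiplexer idea mentioned above concrete, here is a rough sketch of what "virtual channels on top of a single channel" can look like. It is only an illustration under assumptions, not the implementation from this post: `Envelope` and `RpcMultiplexer` are hypothetical names, and the single physical channel is reduced to whoever calls `dispatch`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical envelope: the virtual channel id travels with every payload. */
final class Envelope {
    final String channelId;
    final Object payload;

    Envelope(String channelId, Object payload) {
        this.channelId = channelId;
        this.payload = payload;
    }
}

/** Routes envelopes arriving on one physical channel to per-port handlers. */
final class RpcMultiplexer {
    private final Map<String, Function<Object, Object>> handlers = new ConcurrentHashMap<>();

    /** Registers the handler for one virtual channel; ids must be unique. */
    void register(String channelId, Function<Object, Object> handler) {
        if (handlers.putIfAbsent(channelId, handler) != null) {
            throw new IllegalStateException("duplicate virtual channel: " + channelId);
        }
    }

    void unregister(String channelId) {
        handlers.remove(channelId);
    }

    /** Called by the single receiver for every incoming envelope. */
    Object dispatch(Envelope envelope) {
        Function<Object, Object> handler = handlers.get(envelope.channelId);
        if (handler == null) {
            throw new IllegalArgumentException("no handler for virtual channel: " + envelope.channelId);
        }
        return handler.apply(envelope.payload);
    }
}
```

The trade-off is that group discovery happens once for the shared channel, while all virtual channels then share the same receiver threads.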
Any help would be highly appreciated.
Regards,
momo
Meanwhile, I've been able to reduce the test to one simple file and also found a way to get JGroups to block on other machines as well. Now, how should I submit the test project? The forum doesn't allow attachments, and the e-mail I sent to the mailing list was rejected with "You are not allowed to post to this mailing list, and your message has been automatically rejected."
Cheers,
momo
First off, instead of writing your own multiplexer, you could have used a shared transport. However, of course, a custom multiplexer is always OK, too...
Next, the stack dump of the callee is truncated; can you post the entire trace? It looks like your message is stuck waiting to be handled, because there is currently no thread available to process it. Try increasing your thread pool size and/or making your RPC OOB. For example:
thread_pool.enabled="true"
thread_pool.min_threads="10"
thread_pool.max_threads="80"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="1000"
thread_pool.rejection_policy="discard"
The min_threads and max_threads values were changed. You might also look into disabling the queue.
Third: if this is reproducible, and the above 2 points don't fix the issue, can you submit a small test program which reproduces it? I could then easily look at the hang and fix it. Might be as simple as a config change...
Fourth: I don't think adding 2 fields to MethodCall causes the bug.
Fifth: I prefer the mailing list (dev or user) to this forum, because quoting is much easier...
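The thread pool point above can be illustrated with a small standalone sketch. It assumes, purely for illustration, that the transport's pool behaves like a plain `java.util.concurrent.ThreadPoolExecutor` and that the `thread_pool.*` attributes map onto its constructor arguments roughly as shown; `DiscardingPoolDemo` is a hypothetical name and no JGroups code is involved. It shows how a small pool with a bounded queue and a discard policy silently drops work while all threads are busy, so a caller would wait for a response that never comes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class DiscardingPoolDemo {
    private DiscardingPoolDemo() {}

    /**
     * Submits tasks that all block until every task has been submitted,
     * then returns how many of them actually ran.
     */
    static int runBlockedTasks(int minThreads, int maxThreads, int queueSize, int submitted) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                minThreads, maxThreads, 5000, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.DiscardPolicy()); // rejected tasks vanish silently
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger executed = new AtomicInteger();
        for (int i = 0; i < submitted; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // simulate a handler that is busy for a while
                    executed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        release.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pool did not terminate");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pool", e);
        }
        return executed.get();
    }

    public static void main(String[] args) {
        System.out.println("small pool ran " + runBlockedTasks(1, 2, 2, 6) + " of 6 tasks");
        System.out.println("larger pool ran " + runBlockedTasks(10, 80, 1000, 6) + " of 6 tasks");
    }
}
```

With 1 to 2 threads and a queue of 2, only four blocked tasks fit (two running, two queued) and the rest are discarded; with the larger settings all six run.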
Cheers,
Hi Bela,
thank you for your reply! I'm on vacation for a few days, so sorry for my late answer. I'll be back in the office next week to try what you suggested and check my settings. I'll let you know what happens.
Cheers
momo
P.S. The mailing list would have been my first choice, but there was some hint that it was read-only or for developers only. Maybe I just missed something...
Hi Bela,
tracking down the described problem turned out to be very difficult and I had a lot of other projects, so sorry for my late reply. I tried what you proposed but that didn't make any difference. So I've been trying to create a simple test case showing the problem but was never able to reproduce the error - until today. I just realized that the machines I have been developing and running the test on never run into the problem. But adding one of our test "servers" as a test node showed the blocking I described in my original post within seconds. I will still need some days to simplify the test and remove code that has no impact (currently the test still uses the whole communication system I developed for our product), but I will provide the test as soon as possible, hoping you can reproduce the error in your environment.
Meanwhile, are you aware of any problem related to Windows XP 64 bit (the machine "causing" the error runs that OS) or any system settings or hardware setup that might possibly cause such trouble? The other machines I used in my testing - that never caused any trouble - run Windows XP 32 bit, Windows Server 2003 32 bit, Windows 7 RC 32 bit or Windows 7 RC 64 bit.
Cheers,
momo
I recall a user had issues with Vista 64 bit, but IIRC that was in conjunction with other (32 bit) boxes in the same cluster.
For OSes that I don't have, the only way to reproduce this is on images (e.g. EC2 or VMWare) with that OS installed. So if you have such an image, and a reproducible use case, then I could take a look...
Please submit it to JIRA and attach your program and config to the case. JIRA is at https://jira.jboss.org/jira/browse/JGRP.
Regarding the mailing list: you need to subscribe; this is a closed group and only subscribers can post to it. This helps to prevent spam.