We are running into a problem with JGroups. After running for a couple of days, the application runs out of Java heap memory. In the heap dumps we see large objects of type "array of org.jgroups.Message" holding more than 40 MB each, all of them held by org.jgroups.util.RetransmitTable. What could be causing this? We are on 3.6.11.Final. Does anyone know how to fix it?
RetransmitTable is not used in 3.6.11; I suspect you're using an
outdated configuration (NAKACK instead of NAKACK2).
If you want someone to look into this, you need to post
- Configuration
- Stack trace / thread dump
- How to reproduce if possible
On 13/12/16 05:55, sageroger wrote:
--
Bela Ban, JGroups lead (http://www.jgroups.org)
Bela,
Thanks for the response.
Here is the configuration:
<config clusterName="MyTest">
    <UDP mcast_addr="228.8.8.8" mcast_port="8888" ip_ttl="64" ip_mcast="true"
         mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
         ucast_send_buf_size="150000" ucast_recv_buf_size="80000" loopback="false"/>
    <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
    <MERGE2 min_interval="10000" max_interval="20000"/>
    <FD shun="true" up_thread="true" down_thread="true"/>
    <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
    <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" up_thread="false" down_thread="false"/>
    <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
    <UNICAST timeout="600,1200,2400" down_thread="false"/>
    <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
    <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
    <pbcast.STATE_TRANSFER up_thread="false" down_thread="false"/>
</config>
We do not have a stack trace, but in heap analysis we can see the data structures holding the memory. This appears to occur in heavy traffic situations.
Another question we have is: we do not need to have FIFO semantics in our case. Can we disable FIFO (which will probably eliminate the need for retransmits)? If so how can we do that?
The config you posted is most definitely NOT a 3.6.x config, as attributes like shun or up_thread were eliminated decades ago!
Hmm.. the reason we are using an old config file is that we just upgraded from an ancient version to the latest version, and we thought the new version would work fine with the old config. Where can we find a good template for the new config file? Do you have any comments on my previous question about disabling FIFO? Thanks!
On 14/12/16 21:37, sageroger wrote:
No, it doesn't. With the config you posted, a 3.6.11 system will not
even start! So either you don't use the config you posted, or you
don't use 3.6.11...
Look in the ./conf dir of the src code for examples:
https://github.com/belaban/JGroups/tree/3.6/conf
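For orientation, here is a rough sketch of how the stack posted earlier in this thread might map onto 3.6-era protocol names. It is an untested assumption modeled on the general shape of the sample UDP configuration in that directory, not a copy of it. The multicast address, port, TTL, timeouts and fragment size are simply carried over from the old config, the failure detection (FD_SOCK, FD_ALL) and flow control (MFC, UFC) entries are additions, and every attribute should be checked against the shipped examples before use.

```xml
<!-- Hypothetical 3.6-style stack; values are carried over from the old config, not tuned. -->
<config xmlns="urn:org:jgroups">
    <UDP mcast_addr="228.8.8.8" mcast_port="8888" ip_ttl="64"/>
    <PING/>
    <!-- MERGE3 takes the place of MERGE2 -->
    <MERGE3 min_interval="10000" max_interval="20000"/>
    <!-- FD with shun is gone; FD_SOCK plus FD_ALL is assumed here as a replacement -->
    <FD_SOCK/>
    <FD_ALL/>
    <VERIFY_SUSPECT timeout="1500"/>
    <!-- NAKACK2 replaces NAKACK, so RetransmitTable is no longer involved -->
    <pbcast.NAKACK2/>
    <!-- UNICAST3 replaces UNICAST -->
    <UNICAST3/>
    <pbcast.STABLE desired_avg_gossip="20000"/>
    <pbcast.GMS join_timeout="5000" print_local_addr="true"/>
    <!-- Flow control is an addition; it limits how far a fast sender can run ahead of receivers -->
    <UFC/>
    <MFC/>
    <!-- FRAG2 replaces FRAG -->
    <FRAG2 frag_size="8192"/>
    <pbcast.STATE_TRANSFER/>
</config>
```

Note that none of the removed attributes (shun, up_thread, down_thread, gc_lag, retransmit_timeout, num_initial_members) appear in this sketch.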
What do you mean by disabling FIFO? No retransmission of lost messages?
No ordering guarantees?
--
Bela Ban, JGroups lead (http://www.jgroups.org)
Thanks for the conf link. By not needing FIFO I meant we need retransmission of lost messages but no ordering guarantees.
You could send your messages as OOB messages then, which are unordered
but lossless.
I suggest finding out the root cause first though, as this might also
affect OOB messages.
On 15/12/16 20:10, sageroger wrote:
--
Bela Ban, JGroups lead (http://www.jgroups.org)