I'm using JGroups 4.0.15.Final to implement a cluster application. Everything works fine when I start up the two nodes: they get connected and start working. But when I kill (SIGKILL) one node and restart it, the nodes won't join together anymore. I've tried to find the problem for hours but I just can't find it. Sometimes the nodes get connected again after several connection attempts, but that can take from a few minutes to over an hour. I think it has something to do with the reincarnation problem, but I'm not sure.
This is my current configuration:
import org.jgroups.protocols.*;
import org.jgroups.protocols.pbcast.*;
import org.jgroups.stack.Protocol;

public class JGroupsConfiguration {

    final Protocol[] PROTOCOL_STACK = {
        // Transport
        new TCP()
            .setValue("bind_port", EngineConfiguration.get().clusterSrcPort().get())
            .setValue("max_bundle_size", 64000)
            .setValue("sock_conn_timeout", 300)
            .setValue("thread_pool_min_threads", 0)
            .setValue("thread_pool_max_threads", 20)
            .setValue("thread_pool_keep_alive_time", 3000),
        // Discovery via a static list of initial hosts
        new TCPPING()
            .setValue("async_discovery", false)
            .setValue("port_range", 0)
            .setValue("send_cache_on_join", true)
            .setValue("return_entire_cache", true),
        // Merges sub-clusters back together after a split
        new MERGE3()
            .setValue("min_interval", 10000)
            .setValue("max_interval", 30000),
        // Failure detection and verification of suspected members
        new FD()
            .setValue("timeout", 3000)
            .setValue("max_tries", 3),
        new VERIFY_SUSPECT()
            .setValue("timeout", 1500),
        new BARRIER(),
        // Reliable, ordered multicast
        new NAKACK2()
            .setValue("use_mcast_xmit", false)
            .setValue("discard_delivered_msgs", true),
        // Reliable, ordered unicast
        new UNICAST3()
            // Disables retransmitting of queued messages to reconnected nodes.
            .setValue("max_retransmit_time", 0),
        // Garbage collection of messages delivered by all members
        new STABLE()
            .setValue("desired_avg_gossip", 50000)
            .setValue("max_bytes", 4000000),
        // Group membership (joins, leaves, view installation)
        new GMS()
            .setValue("print_local_addr", true)
            .setValue("join_timeout", 2000),
        // Fragmentation of large messages
        new FRAG3()
            .setValue("frag_size", 60000),
        // State transfer
        new STATE()
    };
}
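For completeness, the stack is turned into a channel roughly like this (a minimal sketch, not my actual startup code; the class name ClusterNode and the cluster name "my-cluster" are just placeholders):

import org.jgroups.JChannel;

public class ClusterNode {
    public static void main(String[] args) throws Exception {
        // Build the channel from the programmatic stack above and join the cluster.
        // "my-cluster" is a placeholder; both nodes must use the same cluster name.
        JChannel channel = new JChannel(new JGroupsConfiguration().PROTOCOL_STACK);
        channel.connect("my-cluster");
    }
}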
I've attached the logs of the two nodes (NODE_1.log, NODE_2.log) with TRACE enabled on org.jgroups.protocols. On NODE_2, the address of NODE_1 is included in TCPPING.initial_hosts.
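On that note, the initial hosts are set programmatically on TCPPING, roughly like this (a sketch with example addresses and ports; it assumes TCPPING exposes a setInitialHosts(...) setter in this version, which is not shown in my configuration above):

import java.util.Arrays;
import java.util.List;
import org.jgroups.PhysicalAddress;
import org.jgroups.protocols.TCPPING;
import org.jgroups.stack.IpAddress;

public class InitialHostsExample {
    public static void main(String[] args) throws Exception {
        // Example addresses/ports only; assumes TCPPING.setInitialHosts(...) is available.
        List<PhysicalAddress> hosts = Arrays.asList(
            new IpAddress("1.1.1.1", 7800),
            new IpAddress("1.1.1.2", 7800));
        TCPPING ping = new TCPPING();
        ping.setInitialHosts(hosts);
    }
}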
Note: There is one special aspect of my application: I'm using a custom socket factory for the TCP protocol. This factory creates encrypted sockets with some special requirements. But the connections seem to be fine and the sockets work. The only thing I noticed is that after NODE_2 was verified as dead, there are still connection attempts on its socket by the TCP protocol (see NODE_1.log line 142).
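For reference, the custom factory is wired into the transport roughly like this (a sketch; DefaultSocketFactory only stands in for my real encrypted factory, and I'm assuming the setSocketFactory(...) setter on the transport protocol):

import org.jgroups.protocols.TCP;
import org.jgroups.util.DefaultSocketFactory;
import org.jgroups.util.SocketFactory;

public class SocketFactoryWiring {
    public static void main(String[] args) {
        // DefaultSocketFactory is only a stand-in for the custom encrypted factory;
        // assumes setSocketFactory(SocketFactory) is available on the protocol in this version.
        SocketFactory factory = new DefaultSocketFactory();
        TCP tcp = new TCP();
        tcp.setSocketFactory(factory);
    }
}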
I hope someone can help me with this because I'm out of ideas.
I'm still trying to find a solution for this problem. I'm currently analyzing the TRACE logs of the TCP protocol but still can't pinpoint the reason why the two nodes need several minutes to merge again.
When I restart NODE_2 while both nodes (NODE_1 = 1.1.1.1, NODE_2 = 1.1.1.2) are connected and everything is working fine, I can see on both sides that a connection is established:
NODE_1:
NODE_2:
After this, it's like both are stuck in a loop trying to sort something out. While NODE_2 is trying to join (sending JOIN(NODE_2) to NODE_1), NODE_1 doesn't seem to notice it and I don't see the GMS protocol involved in the log of NODE_1. I also see a lot of messages asking for the first sequence number over and over again (UNICAST3 NODE_2 --> SEND_FIRST_SEQNO(NODE_1 (flags=1)) | (Log4J2LogImpl.java:56)). This goes on for a while (the time varies from a few seconds to sometimes hours) until the join / merge suddenly works:
NODE_1:
NODE_2:
After this everything works fine again until one of the nodes is restarted. I'm really clueless here and need some help. Does anyone have any idea why NODE_1 isn't starting the join / merge earlier? Thanks for any help.
Note: I've also attached both log files.
It turned out that the problem was the UNICAST3 protocol. The protocol tries to re-send every queued message to a reconnected cluster member before that member is actually allowed to merge back into the cluster. This was the reason for all the SEQNO messages between the nodes. After we removed the protocol from the stack it works fine, because our application doesn't need this feature.
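Concretely, the only change was dropping UNICAST3 from the programmatic stack, roughly like this (a sketch; all the setValue(...) calls from the original configuration are omitted for brevity):

import org.jgroups.protocols.*;
import org.jgroups.protocols.pbcast.*;
import org.jgroups.stack.Protocol;

public class JGroupsConfigurationWithoutUnicast3 {
    // Same stack as before, minus UNICAST3 (setValue(...) calls omitted for brevity).
    final Protocol[] PROTOCOL_STACK = {
        new TCP(),
        new TCPPING(),
        new MERGE3(),
        new FD(),
        new VERIFY_SUSPECT(),
        new BARRIER(),
        new NAKACK2(),
        // UNICAST3 removed: unicast messages lose the ordering/retransmission guarantees it provided
        new STABLE(),
        new GMS(),
        new FRAG3(),
        new STATE()
    };
}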
I doubt this is the root cause; killing a node and restarting it has worked for decade(s)... If you come up with a reproducer (including code, configuration and instructions to reproduce), I'll take a look.
Note that if you remove UNICAST3, you will destroy ordering guarantees for unicast messages [1].

[1] http://www.jgroups.org/manual4/index.html#UNICAST3