Menu

GMS Join fails after node reconnects

Help
2019-01-05
2019-02-27
  • Jürgen Westerkamp

    Hi,

    I'm using JGroups 4.0.15.Final to implement a cluster application. Everything works fine when I startup the two nodes, they got connected and start working. But when I kill (SIGKILL) one node and restart it, the nodes wont join together anymore. I've tried to find the problem for hours but I just can't find it. Sometimes the nodes got connected again after several connection attempts but that can take from few minutes to over an hour. I think it has something to do with the reincarnation problem but I'm not sure.

    This is my current configuration:

    public class JGroupsConfiguration {
    
        final Protocol[] PROTOCOL_STACK = {
                new TCP()
                        .setValue("bind_port", EngineConfiguration.get().clusterSrcPort().get())
                        .setValue("max_bundle_size", 64000)
                        .setValue("sock_conn_timeout", 300)
                        .setValue("thread_pool_min_threads", 0)
                        .setValue("thread_pool_max_threads", 20)
                        .setValue("thread_pool_keep_alive_time", 3000),
                new TCPPING()
                        .setValue("async_discovery", false)
                        .setValue("port_range", 0)
                        .setValue("send_cache_on_join", true)
                        .setValue("return_entire_cache", true),
                new MERGE3()
                        .setValue("min_interval", 10000)
                        .setValue("max_interval", 30000),
                new FD()
                        .setValue("timeout", 3000)
                        .setValue("max_tries", 3),
                new VERIFY_SUSPECT()
                        .setValue("timeout", 1500),
                new BARRIER(),
                new NAKACK2()
                        .setValue("use_mcast_xmit", false)
                        .setValue("discard_delivered_msgs", true),
                new UNICAST3()
                        // Disables retransmitting of queued messages to reconnected nodes.
                        .setValue("max_retransmit_time", 0),
                new STABLE()
                        .setValue("desired_avg_gossip", 50000)
                        .setValue("max_bytes", 4000000),
                new GMS()
                        .setValue("print_local_addr", true)
                        .setValue("join_timeout", 2000),
                new FRAG3()
                        .setValue("frag_size", 60000),
                new STATE()
        };
    }
    

    I've attached the logs of the two nodes (NODE_1.log, NODE_2.log) with enabled trace on org.jgroups.protocols. On the NODE_2 the address of the NODE_1 is included in the TCPPING.initial_hosts.

    Note: Their is one specialty in my application because I'm using a custom socket factory for the TCP protocol. This factory creates encrypted sockets with some special requirements. But the connections seem to be fine and the socket works. The only thing I noticed is that after NODE_2 was verified as dead their are still connection attempts on its socket by the TCP protocol (see NODE_1.log line 142).

    I hope anyone can help me with this because I'm out of ideas.

     

    Last edit: Jürgen Westerkamp 2019-01-05
  • Jürgen Westerkamp

     

    Last edit: Jürgen Westerkamp 2019-01-05
  • Jürgen Westerkamp

    Hi,

    I still try to find a solution for this problem. I'm currently analyzing the TRACE logs for the TCP protocol but still can't point out the reason why the two nodes need several minutes to merge again.

    When I restart the NODE_2 while both (NODE_1 = 1.1.1.1, NODE_2 = 1.1.1.2) are connected and everything is working fine. On both sides I see that a connection is established:
    NODE_1:

    2019-01-17 09:51:55.089 [TQ-Bundl |                  ] TRACE TCP                      - 1.1.1.1:7800: failed connecting to 1.1.1.2:7800: java.net.SocketException: Failed to connect to '/1.1.1.2:7800' with Socket #1041583140. ConnectException: Connection refused (Connection refused)
    2019-01-17 09:51:55.089 [TQ-Bundl |                  ] TRACE TCP                      - 1.1.1.1:7800: removed connection to 1.1.1.2:7800 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.089 [TQ-Bundl |                  ] TRACE TCP                      - JGRP000029: NODE_1 (flags=1): failed sending message to 1.1.1.2:7800 (117 bytes): java.net.SocketException: Failed to connect to '/1.1.1.2:7800' with Socket #1041583140. ConnectException: Connection refused (Connection refused), headers: TCPPING: [GET_MBRS_REQ cluster=CLUSTER initial_discovery=false], TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    ----------------------------
    NODE_2 is comes back online
    ----------------------------
    2019-01-17 09:51:55.791 [TcpServe |                  ] TRACE TCP                      - 1.1.1.1:7800: accepted connection from 1.1.1.2:7800 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.798 [jgrp-29, |                  ] TRACE TCP                      - NODE_1 (flags=1): received [dst: 1.1.1.1:7800, src: NODE_2 (2 headers), size=50 bytes, flags=OOB|DONT_BUNDLE|INTERNAL], headers are TCPPING: [GET_MBRS_REQ cluster=CLUSTER initial_discovery=true], TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.798 [jgrp-29, |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to 1.1.1.2:7800, src=NODE_1 (flags=1), headers are MERGE3: INFO: view_id=[NODE_1 (flags=1)|2], logical_name=NODE_1, physical_addr=1.1.1.1:7800, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.798 [jgrp-29, |                  ] TRACE TCPPING                  - NODE_1 (flags=1): received GET_MBRS_REQ from NODE_2, sending response NODE_1 (flags=1), name=NODE_1, addr=1.1.1.1:7800, coord | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.799 [jgrp-29, |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to NODE_2, src=NODE_1 (flags=1), headers are TCPPING: [GET_MBRS_RSP cluster=null initial_discovery=false], TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.852 [jgrp-29, |                  ] TRACE TCP                      - NODE_1 (flags=1): received [dst: NODE_1 (flags=1), src: NODE_2 (3 headers), size=0 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[JOIN_REQ]: mbr=NODE_2, UNICAST3: DATA, seqno=1, first, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.853 [jgrp-29, |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) <-- DATA(NODE_2: #1, conn_id=0, first) | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:56.207 [jgrp-29, |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) --> ACK(NODE_2: #6) | (Log4J2LogImpl.java:56)  
    

    NODE_2:

    2019-01-17 09:51:48.998 [cc-clust |                  ] DEBUG Configurator             - set property TCP.bind_addr to default value /fe80:0:0:0:250:56ff:feb4:3785%eth2 | (Log4J2LogImpl.java:66) 
    2019-01-17 09:51:49.000 [cc-clust |                  ] DEBUG Configurator             - set property TCP.diagnostics_addr to default value /ff0e:0:0:0:0:0:75:75 | (Log4J2LogImpl.java:66) 
    2019-01-17 09:51:49.004 [cc-clust |                  ] DEBUG TCP                      - thread pool min/max/keep-alive: 0/20/3000 use_fork_join=false, internal pool: 0/4/30000 (2 cores available) | (Log4J2LogImpl.java:71) 
    2019-01-17 09:51:49.011 [cc-clust |                  ] TRACE NAKACK2                  - null: set max_xmit_req_size from 0 to 511600 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:49.012 [cc-clust |                  ] TRACE UNICAST3                 - null: set max_xmit_req_size from 0 to 511600 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.535 [cc-clust |                  ] TRACE STABLE                   - NODE_2: stable task started | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.553 [cc-clust |                  ] TRACE TCP                      - joined /ff0e:0:0:0:0:0:75:75:7500 on eth2 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.554 [cc-clust |                  ] TRACE TCP                      - joined /ff0e:0:0:0:0:0:75:75:7500 on eth1 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.555 [cc-clust |                  ] TRACE TCP                      - joined /ff0e:0:0:0:0:0:75:75:7500 on eth0 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.555 [cc-clust |                  ] TRACE TCP                      - joined /ff0e:0:0:0:0:0:75:75:7500 on lo | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.581 [cc-clust |                  ] TRACE TCPPING                  - NODE_2: sending discovery request to 1.1.1.1:7800 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.582 [cc-clust |                  ] TRACE TCP                      - NODE_2: sending msg to 1.1.1.1:7800, src=NODE_2, headers are TCPPING: [GET_MBRS_REQ cluster=CLUSTER initial_discovery=true], TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.591 [TQ-Bundl |                  ] TRACE TCP                      - 1.1.1.2:7800: connecting to 1.1.1.1:7800 | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.823 [jgrp-1,s |                  ] TRACE TCP                      - NODE_2: received [dst: NODE_2, src: 0c8f19dc-b0da-7fbd-3dae-993890717623 (flags=1) (2 headers), size=52 bytes, flags=OOB|DONT_BUNDLE|INTERNAL], headers are TCPPING: [GET_MBRS_RSP cluster=null initial_discovery=false], TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.826 [jgrp-2,s |                  ] TRACE TCP                      - NODE_2: received [dst: 1.1.1.2:7800, src: 0c8f19dc-b0da-7fbd-3dae-993890717623 (flags=1) (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: INFO: view_id=[0c8f19dc-b0da-7fbd-3dae-993890717623 (flags=1)|2], logical_name=NODE_1, physical_addr=1.1.1.1:7800, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.830 [jgrp-1,s |                  ] TRACE TCPPING                  - NODE_2: received GET_MBRS_RSP from NODE_1 (flags=1): NODE_1 (flags=1), name=NODE_1, addr=1.1.1.1:7800, coord | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.832 [cc-clust |                  ] TRACE GMS                      - NODE_2: discovery took 262 ms, members: 1 rsps (1 coords) [done] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.835 [cc-clust |                  ] DEBUG GMS                      - NODE_2: sending JOIN(NODE_2) to NODE_1 (flags=1) | (Log4J2LogImpl.java:71) 
    2019-01-17 09:51:55.844 [cc-clust |                  ] TRACE UNICAST3                 - NODE_2: created sender window for NODE_1 (flags=1) (conn-id=0) | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:55.848 [cc-clust |                  ] TRACE UNICAST3                 - NODE_2 --> DATA(NODE_1 (flags=1): #1, conn_id=0, first) | (Log4J2LogImpl.java:46) 
    2019-01-17 09:51:55.848 [cc-clust |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_1 (flags=1), src=NODE_2, headers are GMS: GmsHeader[JOIN_REQ]: mbr=NODE_2, UNICAST3: DATA, seqno=1, first, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:56.210 [jgrp-1,s |                  ] TRACE TCP                      - NODE_2: received [dst: NODE_2, src: NODE_1 (flags=1) (2 headers), size=0 bytes, flags=INTERNAL], headers are UNICAST3: ACK, seqno=6, ts=6, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 09:51:56.211 [jgrp-1,s |                  ] TRACE UNICAST3                 - NODE_2 <-- ACK(NODE_1 (flags=1): #6, conn-id=0, ts=6) | (Log4J2LogImpl.java:56) 
    

    After this its like both are in a loop to handle out something. While NODE_2 is trying to join (sending JOIN(NODE_2) to NODE_1) the NODE_1 doesn't seem to notice it and I don't see the GMS protocol involved in the LOG of NODE_1. I also see a lot of messages asking for the first sequence number over and over again (UNICAST3 NODE_2 --> SEND_FIRST_SEQNO(NODE_1 (flags=1)) | (Log4J2LogImpl.java:56) ). This is going on for a while (the time varies from few seconds to sometimes hours) until the join / merge suddenly works:
    NODE_1:

    2019-01-17 10:11:36.043 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to NODE_2, src=NODE_1 (flags=1), headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::21, UNICAST3: DATA, seqno=19750, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.544 [jgrp-125 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) --> XMIT(NODE_2: #19750) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.544 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to NODE_2, src=NODE_1 (flags=1), headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::21, UNICAST3: DATA, seqno=19750, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.547 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): received [dst: NODE_1 (flags=1), src: NODE_2 (2 headers), size=0 bytes, flags=OOB], headers are UNICAST3: SEND_FIRST_SEQNO, seqno=0, ts=581, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.547 [jgrp-125 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) <-- SEND_FIRST_SEQNO(NODE_2) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.547 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to NODE_2, src=NODE_1 (flags=1), headers are UNICAST3: DATA, seqno=19725, first, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.716 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): received [dst: NODE_1 (flags=1), src: NODE_2 (2 headers), size=0 bytes, flags=INTERNAL], headers are UNICAST3: ACK, seqno=19725, ts=582, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.716 [jgrp-125 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) <-- ACK(NODE_2: #19725, conn-id=0, ts=582) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.296 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): received [dst: NODE_1 (flags=1), src: NODE_2 (2 headers), size=0 bytes, flags=INTERNAL], headers are MERGE3: VIEW_REQ: , logical_name=null, physical_addr=null, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.296 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to NODE_2, src=NODE_1 (flags=1), headers are MERGE3: VIEW_RSP: , logical_name=null, physical_addr=null, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.300 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): received [dst: NODE_1 (flags=1), src: NODE_2 (3 headers), size=23 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=NODE_2::22, UNICAST3: DATA, seqno=32, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.300 [jgrp-125 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) <-- DATA(NODE_2: #32, conn_id=0) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.301 [jgrp-125 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1): delivering NODE_2#32 | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.301 [jgrp-125 |                  ] DEBUG STABLE                   - suspending message garbage collection | (Log4J2LogImpl.java:66) 
    2019-01-17 10:11:37.302 [jgrp-125 |                  ] DEBUG STABLE                   - NODE_1 (flags=1): resume task started, max_suspend_time=220000 | (Log4J2LogImpl.java:71) 
    2019-01-17 10:11:37.302 [jgrp-125 |                  ] TRACE GMS                      - NODE_1 (flags=1): got merge request from NODE_2, merge_id=NODE_2::22, mbrs=[NODE_1 (flags=1)] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.303 [jgrp-125 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) --> DATA(NODE_2: #19751, conn_id=0) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.303 [jgrp-125 |                  ] TRACE TCP                      - NODE_1 (flags=1): sending msg to NODE_2, src=NODE_1 (flags=1), headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::22, UNICAST3: DATA, seqno=19751, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.307 [jgrp-127 |                  ] TRACE TCP                      - NODE_1 (flags=1): received message batch of 1 messages from NODE_2 | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.310 [jgrp-127 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1) <-- DATA(NODE_2: #33 - #33) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.310 [jgrp-127 |                  ] TRACE UNICAST3                 - NODE_1 (flags=1): delivering #33 - #33 (1 messages) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.313 [jgrp-127 |                  ] TRACE GMS                      - NODE_1 (flags=1): mcasting view MergeView::[NODE_1 (flags=1)|25] (2) [NODE_1 (flags=1), NODE_2], 2 subgroups: [NODE_1 (flags=1)|3] (2) [NODE_1 (flags=1), NODE_2], [NODE_2|24] (1) [NODE_2] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.314 [jgrp-127 |                  ] DEBUG NAKACK2                  - 
    [NODE_1 (flags=1) mergeDigest()]
    existing digest:  NODE_1 (flags=1): [3 (3)], NODE_2: [0 (0)]
    new digest:       NODE_1 (flags=1): [3 (3)], NODE_2: [21 (21)]
    resulting digest: NODE_1 (flags=1): [3 (3)], NODE_2: [21 (21)] | (Log4J2LogImpl.java:66)
    2019-01-17 10:11:37.314 [jgrp-127 |                  ] DEBUG GMS                      - NODE_1 (flags=1): installing view MergeView::[NODE_1 (flags=1)|25] (2) [NODE_1 (flags=1), NODE_2], 2 subgroups: [NODE_1 (flags=1)|3] (2) [NODE_1 (flags=1), NODE_2], [NODE_2|24] (1) [NODE_2] | (Log4J2LogImpl.java:71) 
    

    NODE_2:

    2019-01-17 10:11:36.547 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2 --> SEND_FIRST_SEQNO(NODE_1 (flags=1)) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.547 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_1 (flags=1), src=NODE_2, headers are UNICAST3: SEND_FIRST_SEQNO, seqno=0, ts=581, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.550 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: received message batch of 1 messages from NODE_1 (flags=1) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.550 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2: created receiver window for NODE_1 (flags=1) at seqno=#19725 for conn-id=0 | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.553 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2 <-- DATA(NODE_1 (flags=1): #19725 - #19725) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.553 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2: delivering #19725 - #19725 (1 messages) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:36.713 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2 --> ACK(NODE_1 (flags=1): #19725) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:36.713 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_1 (flags=1), src=NODE_2, headers are UNICAST3: ACK, seqno=19725, ts=582, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.296 [jgrp-174 |                  ] TRACE MERGE3                   - NODE_2: merge participants are [NODE_2, NODE_1 (flags=1)] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.296 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_1 (flags=1), src=NODE_2, headers are MERGE3: VIEW_REQ: , logical_name=null, physical_addr=null, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.299 [jgrp-177 |                  ] TRACE TCP                      - NODE_2: received [dst: NODE_2, src: NODE_1 (flags=1) (2 headers), size=67 bytes, flags=INTERNAL], headers are MERGE3: VIEW_RSP: , logical_name=null, physical_addr=null, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.299 [jgrp-174 |                  ] DEBUG GMS                      - NODE_2: I will be the merge leader. Starting the merge task. Views: {NODE_1 (flags=1)=[NODE_1 (flags=1)|3] (2) [NODE_1 (flags=1), NODE_2], NODE_2=MergeView::[NODE_2|24] (1) [NODE_2], 1 subgroups: [NODE_2|23] (1) [NODE_2]} | (Log4J2LogImpl.java:71) 
    2019-01-17 10:11:37.300 [MergeTas |                  ] DEBUG STABLE                   - suspending message garbage collection | (Log4J2LogImpl.java:66) 
    2019-01-17 10:11:37.300 [MergeTas |                  ] DEBUG STABLE                   - NODE_2: resume task started, max_suspend_time=220000 | (Log4J2LogImpl.java:71) 
    2019-01-17 10:11:37.300 [MergeTas |                  ] DEBUG GMS                      - NODE_2: merge task NODE_2::22 started with 2 participants | (Log4J2LogImpl.java:71) 
    2019-01-17 10:11:37.300 [MergeTas |                  ] TRACE GMS                      - NODE_2: sending MERGE_REQ to [NODE_1 (flags=1), NODE_2] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.300 [MergeTas |                  ] TRACE UNICAST3                 - NODE_2 --> DATA(NODE_1 (flags=1): #32, conn_id=0) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.300 [MergeTas |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_1 (flags=1), src=NODE_2, headers are GMS: GmsHeader[MERGE_REQ]: merge_id=NODE_2::22, UNICAST3: DATA, seqno=32, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.301 [MergeTas |                  ] TRACE UNICAST3                 - NODE_2 --> DATA(NODE_2: #64, conn_id=1) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.301 [MergeTas |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_2, src=NODE_2, headers are GMS: GmsHeader[MERGE_REQ]: merge_id=NODE_2::22, UNICAST3: DATA, seqno=64, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.301 [MergeTas |                  ] TRACE TCP                      - NODE_2: looping back message [dst: NODE_2, src: NODE_2 (3 headers), size=22 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=NODE_2::22, UNICAST3: DATA, seqno=64, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.301 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: received [dst: NODE_2, src: NODE_2 (3 headers), size=22 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=NODE_2::22, UNICAST3: DATA, seqno=64, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.302 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2 <-- DATA(NODE_2: #64, conn_id=1) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.302 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2: delivering NODE_2#64 | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.302 [jgrp-174 |                  ] TRACE GMS                      - NODE_2: got merge request from NODE_2, merge_id=NODE_2::22, mbrs=[NODE_2] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.303 [jgrp-174 |                  ] TRACE UNICAST3                 - NODE_2 --> DATA(NODE_2: #65, conn_id=1) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.303 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_2, src=NODE_2, headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::22, UNICAST3: DATA, seqno=65, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.303 [jgrp-174 |                  ] TRACE TCP                      - NODE_2: looping back message [dst: NODE_2, src: NODE_2 (3 headers), size=51 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::22, UNICAST3: DATA, seqno=65, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.304 [jgrp-179 |                  ] TRACE UNICAST3                 - NODE_2: delivering #65 - #65 (1 messages) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.304 [jgrp-179 |                  ] TRACE GMS                      - NODE_2: got merge response from NODE_2, merge_id=NODE_2::22, merge data is sender=NODE_2, view=[NODE_2|24] (1) [NODE_2], digest=NODE_2: [21 (21)] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.304 [jgrp-177 |                  ] TRACE TCP                      - NODE_2: received [dst: NODE_2, src: NODE_2 (3 headers), size=51 bytes, flags=OOB|INTERNAL, transient_flags=OOB_DELIVERED], headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::22, UNICAST3: DATA, seqno=65, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.304 [jgrp-177 |                  ] TRACE UNICAST3                 - NODE_2 <-- DATA(NODE_2: #65, conn_id=1) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.305 [jgrp-177 |                  ] TRACE TCP                      - NODE_2: received [dst: NODE_2, src: NODE_1 (flags=1) (3 headers), size=53 bytes, flags=OOB|INTERNAL], headers are GMS: GmsHeader[MERGE_RSP]merge_id=NODE_2::22, UNICAST3: DATA, seqno=19751, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.305 [jgrp-177 |                  ] TRACE UNICAST3                 - NODE_2 <-- DATA(NODE_1 (flags=1): #19751, conn_id=0) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.305 [jgrp-177 |                  ] TRACE UNICAST3                 - NODE_2: delivering NODE_1 (flags=1)#19751 | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.306 [jgrp-177 |                  ] TRACE GMS                      - NODE_2: got merge response from NODE_1 (flags=1), merge_id=NODE_2::22, merge data is sender=NODE_1 (flags=1), view=[NODE_1 (flags=1)|3] (1) [NODE_1 (flags=1)], digest=NODE_1 (flags=1): [3 (3)] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.306 [MergeTas |                  ] TRACE GMS                      - NODE_2: collected 2 merge response(s) in 6 ms | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.307 [MergeTas |                  ] TRACE GMS                      - NODE_2: consolidated view=MergeView::[NODE_1 (flags=1)|25] (2) [NODE_1 (flags=1), NODE_2], 2 subgroups: [NODE_1 (flags=1)|3] (2) [NODE_1 (flags=1), NODE_2], MergeView::[NODE_2|24] (1) [NODE_2], 1 subgroups: [NODE_2|23] (1) [NODE_2]
    consolidated digest=NODE_1 (flags=1): [3 (3)], NODE_2: [21 (21)] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.307 [MergeTas |                  ] DEBUG GMS                      - NODE_2: installing merge view [NODE_1 (flags=1)|25] (2 members) in 2 coords | (Log4J2LogImpl.java:71) 
    2019-01-17 10:11:37.307 [MergeTas |                  ] TRACE UNICAST3                 - NODE_2 --> DATA(NODE_1 (flags=1): #33, conn_id=0) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.307 [MergeTas |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_1 (flags=1), src=NODE_2, headers are GMS: GmsHeader[INSTALL_MERGE_VIEW], UNICAST3: DATA, seqno=33, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.307 [MergeTas |                  ] TRACE UNICAST3                 - NODE_2 --> DATA(NODE_2: #66, conn_id=1) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.307 [MergeTas |                  ] TRACE TCP                      - NODE_2: sending msg to NODE_2, src=NODE_2, headers are GMS: GmsHeader[INSTALL_MERGE_VIEW], UNICAST3: DATA, seqno=66, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.308 [MergeTas |                  ] TRACE TCP                      - NODE_2: looping back message [dst: NODE_2, src: NODE_2 (3 headers), size=108 bytes], headers are GMS: GmsHeader[INSTALL_MERGE_VIEW], UNICAST3: DATA, seqno=66, conn_id=1, TP: [cluster_name=CLUSTER] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.308 [MergeTas |                  ] DEBUG GMS                      - NODE_2: merge NODE_2::22 took 9 ms | (Log4J2LogImpl.java:71) 
    2019-01-17 10:11:37.309 [jgrp-179 |                  ] TRACE TCP                      - NODE_2: received message batch of 1 messages from NODE_2 | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.309 [jgrp-179 |                  ] TRACE UNICAST3                 - NODE_2 <-- DATA(NODE_2: #66 - #66) | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.309 [jgrp-179 |                  ] TRACE UNICAST3                 - NODE_2: delivering #66 - #66 (1 messages) | (Log4J2LogImpl.java:46) 
    2019-01-17 10:11:37.310 [jgrp-179 |                  ] TRACE GMS                      - NODE_2: mcasting view MergeView::[NODE_1 (flags=1)|25] (2) [NODE_1 (flags=1), NODE_2], 2 subgroups: [NODE_1 (flags=1)|3] (2) [NODE_1 (flags=1), NODE_2], [NODE_2|24] (1) [NODE_2] | (Log4J2LogImpl.java:56) 
    2019-01-17 10:11:37.310 [jgrp-179 |                  ] DEBUG NAKACK2                  - 
    [NODE_2 mergeDigest()]
    existing digest:  NODE_2: [21 (21)]
    new digest:       NODE_1 (flags=1): [3 (3)], NODE_2: [21 (21)]
    resulting digest: NODE_1 (flags=1): [3 (3)], NODE_2: [21 (21)] | (Log4J2LogImpl.java:66) 
    2019-01-17 10:11:37.310 [jgrp-179 |                  ] DEBUG GMS                      - NODE_2: installing view MergeView::[NODE_1 (flags=1)|25] (2) [NODE_1 (flags=1), NODE_2], 2 subgroups: [NODE_1 (flags=1)|3] (2) [NODE_1 (flags=1), NODE_2], [NODE_2|24] (1) [NODE_2] | (Log4J2LogImpl.java:71) 
    

    After this everything is working fine again until one node is restarted or something. I'm really clueless here and need some help. Does anyone has any idea why the NODE_1 isn't starting the join / merge earlier? Thanks for any help.

    Note: I've also attached both log files.

     

    Last edit: Jürgen Westerkamp 2019-01-18
  • Jürgen Westerkamp

    Hi,

    it turned out that the problem was the UNICAST3 protocol. The protocol tries to re-send every message to a reconnected cluster member before they are actually allowed to be merged as member into the cluster. This was the reason for all the SEQNO messages between the nodes. After we removed the protocol from the stack it works fine because in our application we don't need this feature.

     
  • Bela Ban

    Bela Ban - 2019-02-27

    I doubt this is the root cause; killing a node and restarting it has worked for decade(s)... If you come up with a reproducer (including code, configuration and instructions to reproduce), I'll take a look.
    Note that if you remove UNICAST3, you will destroy ordering guaranteesfor unicast messages.

    [1] http://www.jgroups.org/manual4/index.html#UNICAST3

     

Log in to post a comment.