#2014 Rebooted controller not detected in TCP

4.7.2
fixed
None
defect
dtm
-
major
2016-09-15
2016-09-08
Jonas Arndt
No

OS environment:

Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
4.4.7 kernel
Network eth0, bonded, OVS (I have tried all of them and the problem is there in all configurations)

In 20% of cases, a "reboot -f" on controller2 is not detected or acted on. The mds.log contains:

Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>

Still, there is nothing in the syslog indicating that controller2 has left the cluster. This is for TCP.
When the node comes back online (without OpenSAF being started), controller1 finally notices and fails over the apps.

When the reboot is not detected, the TCP keepalives stop and the connection goes into retransmits instead. I have attached two tshark sessions captured on controller1, covering the traffic between controller1 and controller2. The failed detection is captured in "ctrl2_failed_detection.trc" and a working detection in "ctrl2_working.trc". I have also attached all logs from /var/log/opensaf and the syslog (all from controller1).

It appears to me that we are hitting something similar to "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"

// Jonas

2 Attachments

Related

Tickets: #2014
Wiki: ChangeLog-4.7.2
Wiki: ChangeLog-5.0.1

Discussion

  • A V Mahesh (AVM)

    • status: unassigned --> assigned
    • assigned_to: A V Mahesh (AVM)
    • Component: unknown --> dtm
    • Part: lib --> -
    • Priority: critical --> major
     
  • A V Mahesh (AVM)

    Can you please provide details of your cluster environment (OS / VM / container)?

     
  • A V Mahesh (AVM)

    >> It appears to me that we are hitting something similar to "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"

    Have you customized the configuration below in /etc/opensaf/dtmd.conf?

    In the case above, the disconnection is detected via the keepalive timer (idle time = 40 sec, 4 probes, probe interval = 10 sec).

    ==============================================================

    # so_keepalive: Enable sending of keep-alive messages on connection-oriented
    # sockets. Expects an integer boolean flag
    # Note that without this set none of the tcp options will matter
    DTM_SKEEPALIVE=1

    #
    # tcp_keepalive_time: The time (in seconds) the connection needs to remain
    # idle before TCP starts sending keepalive probes
    # Optional
    DTM_TCP_KEEPIDLE_TIME=2

    ==============================================================
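
As a rough worked example of the keepalive figures quoted above (idle time 40 sec, 4 probes, 10 sec between probes): an idle connection to a dead peer is dropped after approximately idle + probes × interval seconds. The helper name below is hypothetical and not part of OpenSAF. Note that this bound only holds while the connection is idle; once unacknowledged data is in flight, the retransmission timer takes over, which matches the behaviour reported in this ticket.

```c
/* Hypothetical helper (not an OpenSAF function): approximate time, in
 * seconds, before TCP keepalive gives up on an idle connection whose
 * peer has vanished. */
static unsigned int keepalive_detect_seconds(unsigned int idle_s,
                                             unsigned int probes,
                                             unsigned int interval_s)
{
    /* The first probe is sent after idle_s, then one probe follows per
     * interval_s; the connection is dropped once `probes` probes have
     * gone unanswered. */
    return idle_s + probes * interval_s;
}
```

With the figures from the Stack Overflow case, keepalive_detect_seconds(40, 4, 10) gives 80 seconds.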

     
  • Anders Widell

    Anders Widell - 2016-09-09

    tcp_retries2 is a global configuration parameter that affects the whole node. We shouldn't assume that OpenSAF is the only user of TCP on the node, so we should not rely on changing this parameter. TCP_USER_TIMEOUT can be set per socket, so if it works it would be the preferred solution.
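
A minimal sketch of the per-socket approach, assuming Linux. This is not the attached patch; the helper name is hypothetical, and the fallback value 18 is the Linux option number for TCP_USER_TIMEOUT, needed only when the system headers do not define it.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallback for headers that lack the constant; 18 is the Linux value. */
#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18
#endif

/* Hypothetical helper (not the OpenSAF patch): ask the kernel to abort
 * the connection on `fd` if transmitted data stays unacknowledged for
 * more than `timeout_ms` milliseconds. Returns 0 on success, or -1 with
 * errno set by setsockopt(). */
static int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```

Because the option is set on one file descriptor, other TCP users on the node keep the system-wide tcp_retries2 behaviour.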

     
  • Jonas Arndt

    Jonas Arndt - 2016-09-09

    Agreed about the global nature of tcp_retries2. These parameters can be set on a socket level as well, right? Once we have the right parameter we should apply it on the sockets and not in /etc/sysctl.conf.
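
For illustration, the keepalive parameters do have per-socket equivalents on Linux (SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT). The sketch below shows how they could be applied to a single socket; the helper name and argument order are assumptions, not OpenSAF code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on one socket and override the
 * system-wide tcp_keepalive_* defaults for that socket only.
 * Returns 0 on success, -1 on the first failing setsockopt(). */
static int set_socket_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

These options only govern an idle connection; they do not shorten the retransmission phase described in the ticket.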

     
  • A V Mahesh (AVM)

    I agree. The article "Improving HA Failures with TCP Timeouts" gives more details on the TCP_USER_TIMEOUT socket option: http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html (please see the commit message for the TCP_USER_TIMEOUT socket option).

     
  • A V Mahesh (AVM)

    • Attachments has changed:

    Diff:

    --- old
    +++ new
    @@ -1 +1,2 @@
     logs.tgz (84.1 kB; application/x-compressed-tar)
    +tcp_user_timeout_2014.patch (5.5 kB; application/octet-stream)
    
     
  • A V Mahesh (AVM)

    Even though I have a Linux kernel newer than 2.6.37 (3.0.13-0.27-default), my system's <netinet/tcp.h> and <linux/tcp.h> somehow do not contain the
    #define for the TCP_USER_TIMEOUT socket option.

    So can someone please test the attached tcp_user_timeout_2014.patch and let me know the results/observations.

    Try tuning DTM_TCP_USER_TIMEOUT=1500 to higher and lower values in /etc/opensaf/dtmd.conf and test again.
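
To make the tuning knob concrete, here is a small sketch of how a line such as DTM_TCP_USER_TIMEOUT=1500 could be turned into a number. The parser below is hypothetical (the real dtmd.conf handling in the patch may differ), and it assumes the value is in milliseconds, as for the TCP_USER_TIMEOUT socket option.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser (the real dtmd.conf handling may differ): if
 * `line` has the form DTM_TCP_USER_TIMEOUT=<number>, store the number
 * in *timeout_ms (assumed to be milliseconds) and return 0; otherwise
 * return -1 and leave *timeout_ms untouched. */
static int parse_user_timeout(const char *line, unsigned int *timeout_ms)
{
    static const char key[] = "DTM_TCP_USER_TIMEOUT=";
    const size_t key_len = sizeof(key) - 1;
    char *end = NULL;
    unsigned long value;

    if (strncmp(line, key, key_len) != 0)
        return -1;                      /* some other setting */
    value = strtoul(line + key_len, &end, 10);
    if (end == line + key_len)
        return -1;                      /* no digits after '=' */
    *timeout_ms = (unsigned int)value;
    return 0;
}
```

A parsed value would then be passed to setsockopt() with TCP_USER_TIMEOUT on each DTM socket.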

     
  • Anders Widell

    Anders Widell - 2016-09-12

    The constant TCP_USER_TIMEOUT is not part of LSB, so we will anyhow need to add the following in our code:

    #ifndef TCP_USER_TIMEOUT
    #define TCP_USER_TIMEOUT 18
    #endif
    

    We should bump the minimum required Linux version to 3.18 after introducing this fix, since the TCP_USER_TIMEOUT feature didn't work properly in earlier Linux versions according to the article Improving HA Failures with TCP Timeouts you referred to.
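
A minimum-version requirement like this could be checked at startup. The sketch below compares a kernel release string against a required major.minor; the helper name is hypothetical, and the release string is assumed to come from somewhere like the `release` field filled in by uname(2).

```c
#include <stdio.h>

/* Hypothetical helper: return 1 if a kernel release string such as
 * "4.4.7" or "3.0.13-0.27-default" is at least want_major.want_minor,
 * 0 if it is older, and -1 if the string cannot be parsed. */
static int kernel_at_least(const char *release, int want_major, int want_minor)
{
    int major = 0, minor = 0;

    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    if (major != want_major)
        return major > want_major;
    return minor >= want_minor;
}
```

With the releases mentioned in this ticket, "4.4.7" passes a 3.18 check while "3.0.13-0.27-default" does not.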

     
  • Jonas Arndt

    Jonas Arndt - 2016-09-12

    Tested the patch and ended up with split brain after 4th reboot. Both controllers think they are active while they can ping each other perfectly fine. I will try to reproduce and collect logs

     
  • A V Mahesh (AVM)

    Tested the patch and ended up with split brain after 4th reboot. Both controllers think they are active while they can ping each other perfectly fine. I will try to reproduce and collect logs

    Can you please elaborate on the test sequence in which you end up with split brain:

    1) Do all nodes in the cluster now detect the rebooted controller (Lost contact with 'SC-2') within
    1.5 seconds?
    2) Without this patch you are not able to see Lost contact with 'SC-2' on any node within 1.5
    seconds; what is the current behavior?
    3) Does the split brain occur after the rebooted controller has rejoined (reboot -f)?
    4) Does the split brain occur after reboot -f is issued on the controller, without it going for a reboot?

     
  • A V Mahesh (AVM)

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,3 +1,10 @@
    +OS environment:
    +
    +    Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
    +    4.4.7 kernel
    +    Network eth0, bonded, OVS (I have tried all of them and the problem is there in all configurations)
    +
    +
     In 20% of the cases a "reboot -f" on  controller2 is not detected and acted on. What is in the mds.log is .....
    
     Sep  7  6:44:23.918566 osafamfd[41365] ERR  |MDS_SND_RCV: Adest=<0x00000000,1>
    
     
  • Jonas Arndt

    Jonas Arndt - 2016-09-13

    I actually need to do more tests. From the patch's point of view I think it is looking good. The split brain seems to be related to OVS bringing up the port with a new MAC address every time. I have run some tests on eth0 (without OVS) and have not been able to reproduce the split brain. Note that with TIPC as the transport the split brain also never happens, even with OVS. I will run some more tests today and get back with a conclusion.

    The split brain is coming after "reboot -f" on controller2 when it tries to join the cluster after coming up after the reboot. After that the two controllers run next to each other both active and there is no reboot.

    The detection of reboot seems to always be there now, so the patch definitely fixed that.

     
  • Anders Widell

    Anders Widell - 2016-09-13

    Maybe your split-brain problems could be related to the ticket [#2030] that I just filed on DTM?

     

    Related

    Tickets: #2030

  • Jonas Arndt

    Jonas Arndt - 2016-09-13

    Anders, it is possible. I am seeing the same entry in my system when I get the split-brain.

    After I fixed the MAC in OVS the problem went away though.

     
  • A V Mahesh (AVM)

    • status: assigned --> review
    • Milestone: 4.7.2 --> 5.0.1
     
  • A V Mahesh (AVM)

    The split-brain is a different issue, and we have ticket #2030 to debug the split-brain case,
    so I have published the patch for this ticket.

     
  • Jonas Arndt

    Jonas Arndt - 2016-09-14

    Mahesh,

    Can we get this back-ported to 4.7.x as well?

    Cheers,

    // Jonas

     
    • A V Mahesh (AVM)

      Hi Jonas,

      OK, I just pushed it; please test once on 4.7:

      ============================================================

      branch: opensaf-4.7.x
      parent: 8043:4a8a00097561
      user: A V Mahesh mahesh.valla@oracle.com
      date: Thu Sep 15 10:50:31 2016 +0530
      summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

      ============================================================

      -AVM

       

      Related

      Tickets: #2014

  • A V Mahesh (AVM)

    • status: review --> fixed
    • Milestone: 5.0.1 --> 4.7.2
     
  • A V Mahesh (AVM)

    changeset: 8066:afddc603adcb
    branch: opensaf-4.7.x
    parent: 8043:4a8a00097561
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Sep 15 10:50:31 2016 +0530
    summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

    changeset: 8067:efeaffca9483
    branch: opensaf-5.0.x
    parent: 8049:28129451fd38
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Sep 15 10:52:03 2016 +0530
    summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

    changeset: 8068:87a09d9164d3
    branch: opensaf-5.1.x
    parent: 8065:019e617955ef
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Sep 15 10:52:32 2016 +0530
    summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

    changeset: 8069:b30d5e33e50c
    tag: tip
    parent: 8064:99410ba8cc21
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Sep 15 10:52:49 2016 +0530
    summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

     

    Related

    Tickets: #2014

