OS environment:
Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
4.4.7 kernel
Network: eth0, bonded, and OVS (I have tried all of them and the problem is present in all configurations)
In about 20% of the cases a "reboot -f" on controller2 is not detected and acted upon. This is what appears in mds.log:
Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Still, there is nothing in the syslog indicating that controller2 has left the cluster. This is with TCP as the transport.
When the node comes back online (without OpenSAF being started), controller1 finally notices and fails over the applications.
When the reboot is not detected, the TCP keep-alives stop and the connection goes into retransmits instead. I have attached two tshark sessions captured on controller1, capturing traffic between controller1 and controller2. The failed reboot detection is captured in "ctrl2_failed_detection.trc" and a working detection in "ctrl2_working.trc". I have also attached all logs in /var/log/opensaf and the syslog (all from controller1).
It appears to me that we are hitting something similar to http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect
// Jonas
Can you please provide your cluster environment details (OS / VM / container)?
Have you configured the settings below in /etc/opensaf/dtmd.conf?
In the above case, disconnection is detected via the keepalive timer (idle time = 40 sec, 4 probes, probe interval = 10 sec), i.e. detection can take up to roughly 80 seconds.
==============================================================
# so_keepalive: Enable sending of keep-alive messages on connection-oriented
# sockets. Expects an integer boolean flag
# Note that without this set, none of the TCP options will matter
DTM_SKEEPALIVE=1
#
# tcp_keepalive_time: The time (in seconds) the connection needs to remain
# idle before TCP starts sending keepalive probes
# Optional
DTM_TCP_KEEPIDLE_TIME=2
==============================================================
Maybe we can use TCP_USER_TIMEOUT, as suggested here:
http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux
tcp_retries2 is a global configuration parameter that affects the whole node. We shouldn't assume that OpenSAF is the only user of TCP on the node, so we should not rely on changing this parameter. TCP_USER_TIMEOUT can be set per socket, so if it works it would be the preferred solution.
Agreed about the global nature of tcp_retries2. These parameters can be set at the socket level as well, right? Once we have the right parameter we should apply it on the sockets and not in /etc/sysctl.conf.
I agree. The article "Improving HA Failures with TCP Timeouts" gives more details on the TCP_USER_TIMEOUT socket option: http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html (please see the TCP_USER_TIMEOUT commit message referenced there).
Even though I have a Linux kernel > 2.6.37 (3.0.13-0.27-default), somehow my system's <netinet/tcp.h> and <linux/tcp.h> don't have the #define for the TCP_USER_TIMEOUT socket option. So can someone please test the attached tcp_user_timeout_2014.patch and let me know the results/observations? Try tuning DTM_TCP_USER_TIMEOUT=1500 to higher and lower values in /etc/opensaf/dtmd.conf and testing.
The constant TCP_USER_TIMEOUT is not part of LSB, so we will in any case need to add the following in our code:
We should bump the minimum required Linux version to 3.18 after introducing this fix, since the TCP_USER_TIMEOUT feature didn't work properly in earlier Linux versions, according to the article "Improving HA Failures with TCP Timeouts" that you referred to.
Tested the patch and ended up with a split brain after the 4th reboot. Both controllers think they are active, while they can ping each other perfectly fine. I will try to reproduce and collect logs.
Can you please elaborate on which test sequence ends up in split brain:
1) Do all nodes in the cluster now detect the rebooted controller ("Lost contact with 'SC-2'") within 1.5 seconds?
2) Without this patch, were you unable to see "Lost contact with 'SC-2'" on any node within 1.5 seconds? What is the current behaviour?
3) Does the split-brain case occur after the rebooted controller rejoins (after "reboot -f")?
4) Does the split-brain case occur after issuing "reboot -f" on a controller without it actually going down for reboot?
I actually need to do more tests. From the patch's point of view I think it is looking good. The split brain seems to be related to OVS bringing up the port with a new MAC address every time. I have run some tests on eth0 (without OVS) and have not been able to reproduce the split brain. Note that with TIPC as the transport the split brain never happens, even with OVS. I will run some more tests today and get back with a conclusion.
The split brain occurs after "reboot -f" on controller2, when it tries to join the cluster after coming back up. After that, the two controllers run next to each other, both active, and there is no reboot.
The detection of reboot seems to always be there now, so the patch definitely fixed that.
Maybe your split-brain problems could be related to the ticket [#2030] that I just filed on DTM?
Related Tickets: #2030

Anders, it is possible. I am seeing the same entry in my system when I get the split brain.
After I fixed the MAC in OVS the problem went away though.
The split brain is a different issue, and we have ticket #2030 to debug the split-brain case, so I have published the patch for this ticket.
Mahesh,
Can we get this back-ported to 4.7.x as well?
Cheers,
// Jonas
Hi Jonas,
Ok, I just pushed it; please test once on 4.7:
============================================================
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
============================================================
-AVM
On 9/15/2016 12:08 AM, Jonas Arndt wrote:
Related Tickets: #2014

changeset: 8066:afddc603adcb
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8067:efeaffca9483
branch: opensaf-5.0.x
parent: 8049:28129451fd38
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:52:03 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8068:87a09d9164d3
branch: opensaf-5.1.x
parent: 8065:019e617955ef
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:52:32 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8069:b30d5e33e50c
tag: tip
parent: 8064:99410ba8cc21
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:52:49 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]