OS environment:
Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
4.4.7 kernel
Network: eth0, bonded, and OVS (I have tried all of them and the problem is present in all configurations)
In about 20% of the cases a "reboot -f" on controller2 is not detected and acted upon. This is what appears in mds.log:
Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
Still, there is nothing in the syslog indicating that controller2 has left the cluster. This is with TCP as the transport.
When the node comes back online (without OpenSAF being started), controller1 finally notices and fails over the applications.
When the reboot is not detected, the TCP keep-alives stop and the connection goes into retransmits instead. I have attached two tshark sessions captured on controller1, capturing traffic between controller1 and controller2. The failed reboot detection is captured in "ctrl2_failed_detection.trc" and a working detection in "ctrl2_working.trc". I have also attached all logs in /var/log/opensaf and the syslog (all from controller1).
It appears to me that we are hitting something similar to http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect
// Jonas
Can you please provide your cluster environment details (OS / VM / container)?
Have you configured the settings below in /etc/opensaf/dtmd.conf?
In the above case, disconnection is detected via the keepalive timer (idle time = 40 sec, 4 probes, probe interval = 10 sec), i.e. detection can take up to roughly 80 seconds.
==============================================================
# so_keepalive: Enable sending of keep-alive messages on connection-oriented
# sockets. Expects an integer boolean flag
# Note that without this set, none of the TCP options will matter
DTM_SKEEPALIVE=1
#
# tcp_keepalive_time: The time (in seconds) the connection needs to remain
# idle before TCP starts sending keepalive probes
# Optional
DTM_TCP_KEEPIDLE_TIME=2
==============================================================
Maybe we can use TCP_USER_TIMEOUT, as suggested here:
http://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux
tcp_retries2 is a global configuration parameter that affects the whole node. We shouldn't assume that OpenSAF is the only user of TCP on the node, so we should not rely on changing this parameter. TCP_USER_TIMEOUT can be set per socket, so if it works it would be the preferred solution.
Agreed about the global nature of tcp_retries2. These parameters can be set at the socket level as well, right? Once we have the right parameter we should apply it on the sockets and not in /etc/sysctl.conf.
I agree. The article "Improving HA Failures with TCP Timeouts" gives more details on the TCP_USER_TIMEOUT socket option: http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html (please see the TCP_USER_TIMEOUT commit message referenced there).
Even though I have a Linux kernel > 2.6.37 (3.0.13-0.27-default), somehow my system's <netinet/tcp.h> and <linux/tcp.h> don't have the #define for the TCP_USER_TIMEOUT socket option. So can someone please test the attached tcp_user_timeout_2014.patch and let me know the results/observations? Try tuning DTM_TCP_USER_TIMEOUT=1500 to higher and lower values in /etc/opensaf/dtmd.conf and testing.
The constant TCP_USER_TIMEOUT is not part of LSB, so we will in any case need to add the following in our code:
We should bump the minimum required Linux version to 3.18 after introducing this fix, since the TCP_USER_TIMEOUT feature didn't work properly in earlier Linux versions, according to the article "Improving HA Failures with TCP Timeouts" that you referred to.
Tested the patch and ended up with a split brain after the 4th reboot. Both controllers think they are active, while they can ping each other perfectly fine. I will try to reproduce and collect logs.
Can you please elaborate on which test sequence ends up in split brain:
1) Do all nodes in the cluster now detect the rebooted controller ("Lost contact with 'SC-2'") within 1.5 seconds?
2) Without this patch, were you unable to see "Lost contact with 'SC-2'" on any node within 1.5 seconds? What is the current behaviour?
3) Does the split-brain case occur after the rebooted controller rejoins (after "reboot -f")?
4) Does the split-brain case occur after issuing "reboot -f" on a controller without it actually going down for reboot?
I actually need to do more tests. From the patch's point of view I think it is looking good. The split brain seems to be related to OVS bringing up the port with a new MAC address every time. I have run some tests on eth0 (without OVS) and have not been able to reproduce the split brain. Note that with TIPC as the transport the split brain never happens, even with OVS. I will run some more tests today and get back with a conclusion.
The split brain occurs after "reboot -f" on controller2, when it tries to join the cluster after coming back up. After that, the two controllers run next to each other, both active, and there is no reboot.
The detection of reboot seems to always be there now, so the patch definitely fixed that.
Maybe your split-brain problems could be related to the ticket [#2030] that I just filed on DTM?
Related Tickets: #2030

Anders, it is possible. I am seeing the same entry in my system when I get the split brain.
After I fixed the MAC in OVS the problem went away though.
The split brain is a different issue, and we have ticket #2030 to debug the split-brain case, so I have published the patch for this ticket.
Mahesh,
Can we get this back-ported to 4.7.x as well?
Cheers,
// Jonas
Hi Jonas,
Ok, I just pushed it; please test once on 4.7:
============================================================
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
============================================================
-AVM
On 9/15/2016 12:08 AM, Jonas Arndt wrote:
Related Tickets: #2014

changeset: 8066:afddc603adcb
branch: opensaf-4.7.x
parent: 8043:4a8a00097561
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:50:31 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8067:efeaffca9483
branch: opensaf-5.0.x
parent: 8049:28129451fd38
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:52:03 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8068:87a09d9164d3
branch: opensaf-5.1.x
parent: 8065:019e617955ef
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:52:32 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]
changeset: 8069:b30d5e33e50c
tag: tip
parent: 8064:99410ba8cc21
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Sep 15 10:52:49 2016 +0530
summary: dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]