
#3280 dtm: loss of TCP connection requires node reboot

Milestone: future
Status: unassigned
Owner: None
Type: enhancement
Component: dtm
Severity: major
Blocker: False
Updated: 2022-01-23
Created: 2021-08-27
Private: No

Sometimes we see loss of the TCP connection between payloads, or between a controller and payloads, in the cluster.
Example: if we have 2 controllers and 10 payloads (PL-3 to PL-10), we see TCP connection loss between PL-4 and PL-5, while PL-4's connections to the other payloads remain established.
We also see connection loss between PL-7 and SC-2, while PL-7's connections to the other nodes remain established. This results in a PL-7 reboot when a controller failover happens, i.e. SC-1 fails and SC-2 takes the Active role: PL-7 thinks there was only a single controller in the cluster, so it reboots.

This can be reproduced by adding an iptables rule that drops the packets.
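For example (port 6700 is the dtm port mentioned later in this thread; the chain and match here are illustrative, adjust them to the actual setup):

    # simulate the connection loss by dropping incoming dtm traffic
    iptables -A INPUT -p tcp --dport 6700 -j DROP
    # delete the rule again to restore connectivity
    iptables -D INPUT -p tcp --dport 6700 -j DROP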

So, the expected behavior is that dtmd on PL-4/PL-5 retries the connection a few times before declaring the node down; a sketch of the idea follows below.
The only drawback of this approach is that it will delay the application failover time, or even the controller failover time.
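A minimal sketch of the retry idea (this is not the actual dtmd code; the address, retry count, and delay below are illustrative only, and 6700 is the dtm port mentioned later in this thread):

    /* Sketch: retry a TCP connect a bounded number of times before
     * declaring the peer down. Illustrative only, not dtmd code. */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Returns a connected fd, or -1 once the peer is declared down. */
    static int connect_with_retry(const char *ip, uint16_t port,
                                  int max_attempts, unsigned delay_sec)
    {
        for (int attempt = 1; attempt <= max_attempts; ++attempt) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0)
                return -1;

            struct sockaddr_in addr = {0};
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            if (inet_pton(AF_INET, ip, &addr.sin_addr) == 1 &&
                connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
                return fd; /* connection regained */

            close(fd);
            sleep(delay_sec); /* back off before the next attempt */
        }
        return -1; /* only now declare the node down */
    }

    int main(void)
    {
        int fd = connect_with_retry("10.0.0.4", 6700, 3, 2);
        if (fd < 0)
            printf("peer declared down after retries\n");
        else
            close(fd);
        return 0;
    }

Note that each retry adds up to delay_sec plus the connect timeout to the detection time, which is exactly the failover-delay drawback noted above.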

Any suggestions on this?


Discussion

  • Mohan Kanakam - 2021-09-06

    Hi Gary/Minh,
    Any thoughts on this?
    Thanks

     
    • Alex Jones - 2021-09-08

      Hi Mohan,

      There is some code in there already which will continuously send out the
      initial ping to all nodes, so that if a connection is lost it will be
      regained. This code is disabled by default, but I think there is a
      configuration option to set the interval at which you want the pings to
      be sent.

      Is this what you are looking for?

      Alex
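
      (For illustration, enabling this might look like the following in
      dtmd.conf. The parameter names below are hypothetical placeholders,
      not the actual option names; check the shipped dtmd.conf for the real
      ones. This assumes the usual OpenSAF shell-style export format.)

          # Keep sending the discovery ping after initial cluster formation,
          # so a lost TCP connection is re-established (hypothetical name)
          export DTM_CONTINUOUS_DISCOVERY=1
          # Interval in seconds between discovery pings (hypothetical name)
          export DTM_DISCOVERY_INTERVAL=5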


      • Mohan Kanakam - 2021-09-09

        Hi Alex,
        Thanks for your help. I did find the option to enable the broadcast after initial discovery.
        Below are my testing observations. Please kindly comment.

        Case 1 (SC-1, SC-2):
        1) Initially both nodes are connected, as Active and Standby.
        2) I drop the packets on port 6700 by adding an iptables rule; the 2 nodes then each become Active individually.
        3) I delete the iptables rule; both nodes then try to connect to each other, and both get rebooted because a split brain is detected.

        Case 2 (SC-1, PL-3):
        1) Initially both nodes are connected, as Active controller and payload.
        2) I drop the packets on port 6700 by adding an iptables rule; the payload then gets rebooted while the Active controller stays up.

        I also observed that all the databases are cleaned up at the controllers, so how will a reconnect help? For example, ntfs deletes the NTF agent information if the node hosting the agent leaves the cluster; if that node rejoins, the NTF agent will not re-register, so there is a mismatch in the cluster. The same holds true for EDS. In case a payload goes down, the controller deletes all the information about its applications, and even if it reconnects, the AMF applications will not re-register.

        So I am not sure in which case the fix for #2522 will help.

         
      • Mohan Kanakam - 2021-09-16

        Hi Alex,
        Any update on this?
        Thanks.

         
      • Mohan Kanakam - 2021-10-07

        Hi Alex,
        Any update on this? Please share.
        Thanks.

         
  • Minh Hon Chau - 2021-09-06

    Hi Mohan,
    It's expected that a node reboots if it separates from the others, in order to maintain consistency. The loss may last just a second, but things keep going on at the payloads in the meantime, e.g. AMF component assignments, which would later be out of sync with the other nodes.

     
    • Mohan Kanakam - 2021-09-09

      Hi Minh,
      Thanks for your comment.
      Because of a network glitch, the nodes are sometimes both up but detect each other as down.
      We would like to see some resolution for this that does not require a reboot.
      Thanks.

       
  • Gary Lee - 2021-09-14
    • Milestone: 5.21.09 --> 5.21.12
     
  • Gary Lee - 2022-01-23
    • Milestone: 5.22.01 --> future
     
