
#1036 mds: IMMND restarts because of out of order messages

4.5.0
fixed
None
defect
mds
-
4.5
major
2014-10-02
2014-09-02
No

Sep 2 05:16:57 SLES-SLOT-2 osafimmnd[21492]: WA MESSAGE:81414 OUT OF ORDER my highest processed:81412, exiting

Recreation steps:

  1. The problem is reproduced when 100 switchovers are done.

  2. Immediately when a failover is done, an out-of-order message is observed.

  3. Because of the out-of-order message, the new-active IMMND restarted.

From there on:
Sep 2 05:17:04 SLES-SLOT-2 osafimmnd[10230]: WA Sync MESSAGE:81639 OUT OF ORDER my highest processed:81637
Sep 2 05:17:09 SLES-SLOT-2 osafimmnd[10254]: WA Sync MESSAGE:81893 OUT OF ORDER my highest processed:81891
Sep 2 05:17:16 SLES-SLOT-2 osafimmnd[10275]: WA Sync MESSAGE:82114 OUT OF ORDER my highest processed:82112
Sep 2 05:17:20 SLES-SLOT-2 osafimmnd[10295]: WA Sync MESSAGE:82335 OUT OF ORDER my highest processed:82333

  4. Because of constant IMMND restarts at the time of sync, CLM got TRY_AGAIN and the node went for a reboot:
    Sep 2 05:17:16 SLES-SLOT-2 osafimmd[10478]: NO Node 2020f request sync sync-pid:10295 epoch:0
    Sep 2 05:17:17 SLES-SLOT-2 osafclmd[10521]: ER saImmOiInitialize_2 failed 6, exiting
    Sep 2 05:17:17 SLES-SLOT-2 osafamfnd[10550]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
    Sep 2 05:17:17 SLES-SLOT-2 osafamfnd[10550]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
    Sep 2 05:17:17 SLES-SLOT-2 osafamfnd[10550]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60

Related

Tickets: #1036
Tickets: #1072

Discussion

  • Neelakanta Reddy

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    -
     Sep  2 05:16:57 SLES-SLOT-2 osafimmnd[21492]: WA MESSAGE:81414 OUT OF ORDER my highest processed:81412, exiting
    
     Recreation steps:
    
     
  • Neelakanta Reddy

    Attaching the controller logs.

     
  • Anders Bjornerstedt

    As can be seen, there appear to be dropped messages.
    The dropped messages are sent from the IMMD to the IMMNDs as MDS broadcast.

    So in some sense this can be seen as an overload symptom.
    The main question is whether the new MDS broadcast using TIPC broadcast has
    lower throughput than the older multicast implementation.

    Also relevant is how the test is constructed.
    Are you pushing requests at as high a frequency as accepted?
    The fevs mechanism has a flow control mechanism that kicks in when there
    are too many unanswered fevs messages outstanding. Perhaps the very fact
    that TIPC broadcast is faster allows the test to push more messages per
    unit time over fevs, but increases the risk of local TIPC buffer overflow
    at some nodes?
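
    To illustrate the kind of flow control described above, here is a minimal
    sketch (hypothetical names and limit, not the actual OpenSAF fevs code):
    a sender tracks how many fevs messages it has sent but not yet seen come
    back on the broadcast, and holds off new sends once that count reaches a
    limit.

        /* Illustrative sketch only -- hypothetical names, not the OpenSAF code. */
        #include <stdbool.h>
        #include <stdint.h>

        #define FEVS_MAX_PENDING 16  /* assumed limit on outstanding fevs messages */

        struct fevs_flow {
            uint64_t sent;      /* fevs messages handed to the IMMD for broadcast */
            uint64_t received;  /* fevs messages seen back on the broadcast */
        };

        /* Returns true if another fevs message may be sent now,
         * false if the sender should back off (e.g. return TRY_AGAIN). */
        static bool fevs_may_send(const struct fevs_flow *f)
        {
            return (f->sent - f->received) < FEVS_MAX_PENDING;
        }

        static void fevs_on_sent(struct fevs_flow *f)     { f->sent++; }
        static void fevs_on_received(struct fevs_flow *f) { f->received++; }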

     
  • Anders Bjornerstedt

    That should have been "MDS broadcast using TIPC broadcast has higher
    throughput than the older implementation".

    And this is not a test; it is a sync client seeing dropped messages.
    That IMMND restarts and will sync again.
    The CLM restart due to TRY_AGAIN on imm-handle-initialize is a bug in CLM,
    unless it has retried for 60 seconds (the max allowed sync time).
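
    To illustrate the retry being described, a minimal sketch (assumed API
    version and back-off interval; not the actual CLM code): keep retrying
    saImmOiInitialize_2() while it returns SA_AIS_ERR_TRY_AGAIN (value 6, as
    in the log above), for up to roughly 60 seconds.

        /* Illustrative sketch only -- not the actual CLM code. */
        #include <unistd.h>
        #include <saAis.h>
        #include <saImmOi.h>

        SaAisErrorT init_oi_with_retry(SaImmOiHandleT *handle,
                                       const SaImmOiCallbacksT_2 *callbacks)
        {
            SaAisErrorT rc;
            int waited_ms = 0;

            do {
                SaVersionT version = {'A', 2, 11};  /* assumed IMM OI API version */
                rc = saImmOiInitialize_2(handle, callbacks, &version);
                if (rc != SA_AIS_ERR_TRY_AGAIN)
                    break;
                usleep(500 * 1000);                 /* back off for 500 ms */
                waited_ms += 500;
            } while (waited_ms < 60000);            /* ~max allowed sync time */

            return rc;
        }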

     
  • A V Mahesh (AVM)

    • status: unassigned --> assigned
    • assigned_to: A V Mahesh (AVM)
     
  • A V Mahesh (AVM)

    • status: assigned --> review
     
  • Anders Bjornerstedt

    The main question for the current review of this ticket is whether it is
    possible to reproduce the repeated IMMND restarts reported in the creation
    of this ticket.

    Another question is whether this test has been run on older releases without any problems.

    I am pretty sure that it has always been possible to overload the system by generating enough traffic.

    The IMMND has the flow control mechanism to reduce the risk of overloading the central bottleneck of the IMMD.
    But after the MDS-TIPC multicast enhancement, that bottleneck in the IMMD has been reduced a lot.
    The effect of that is increased pressure on the fevs receivers, the IMMNDs receiving the fevs from the IMMD.

    If I remember correctly, there was also some optimization of the sync to
    increase the sync-batch size etc.
    Perhaps we went too far?
    A burst of large messages is more likely than normal to overload the TIPC
    receiver, resulting in TIPC discarding some messages, resulting in this
    MESSAGE OUT OF ORDER symptom.

    The problem with this ticket is that it has no single or absolute solution.
    There is no one problem to solve.

    The problem is also not the fact of getting an out-of-order message.
    If this ticket points to a problem, it would be that this message-out-of-order condition can now happen too often and even cyclically.

    If that is still the case, then we perhaps need to partly rewind/undo some
    of the imm-sync optimizations that were done in 4.5.

    But the first thing to test (as part of the ongoing review) should be to
    redo the test that generated this ticket.

    Is it still "too easy" to get repeated IMMND restarts during sync?

     
    • A V Mahesh (AVM)

      Maybe after optimizing sync-data MBCSV checkpointing (#952) we will see reduced overflow.

      https://sourceforge.net/p/opensaf/tickets/952/

       
      • Anders Bjornerstedt

        Yes, if the sync tends to cause out-of-order fevs only at the SC standby, then that would help.
        But if you also see the out-of-order message at payloads, then it would not help.

        /AndersBj


      • A V Mahesh (AVM)

        We can also reduce some overflow by optimizing the FinalizeSync message.
        While debugging I also observed that in some scenarios FinalizeSync contains
        committed CCB data, which I think does not need to be sent/synced to the peer
        IMMD and IMMND as part of the FinalizeSync message.

         
  • Hans Feldt - 2014-10-02

    One thing that strikes me is that there are always 2 messages missing. To me that indicates a logical problem rather than overload. But I might be wrong...

     
  • Anders Bjornerstedt

    The fevs message order is at the IMM level, so that would mean the logical
    problem was at the IMM level.
    But it is also an intermittent problem.

    The logic for sending (broadcasting) fevs in the IMMD and for receiving fevs
    at the IMMND is pretty linear and simple.

    I think the reason for there being 2 messages lost each time here has to
    do with the sync being ongoing. The sync (optimized in 4.5) generates much
    larger messages than in previous releases. Probably two such sync-batch
    messages (each one being one fevs message) are enough to "easily" overflow
    the TIPC receive buffer. So TIPC discards two messages each time this
    happens, in this scenario.
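
    To illustrate the receive-side order check that produces the log lines in
    this ticket, a minimal sketch (hypothetical names, not the actual IMMND
    code): the receiver expects strictly consecutive fevs message numbers, and
    any gap left by dropped messages triggers the "OUT OF ORDER" path.

        /* Illustrative sketch only -- not the actual IMMND code. */
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        static uint64_t highest_processed;  /* last fevs message number handled */

        static void fevs_receive(uint64_t msg_no)
        {
            if (msg_no != highest_processed + 1) {
                /* Gap detected, e.g. 81414 arriving when 81412 was the highest. */
                fprintf(stderr,
                        "WA MESSAGE:%llu OUT OF ORDER my highest processed:%llu\n",
                        (unsigned long long)msg_no,
                        (unsigned long long)highest_processed);
                exit(EXIT_FAILURE);  /* leads to an IMMND restart in the real daemon */
            }
            highest_processed = msg_no;
            /* ... process the message ... */
        }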

     
    • A V Mahesh (AVM)

      In this bug case, the IMMND received out-of-order messages because the
      MDS logic itself did not send messages to a few IMMNDs; that is fixed now.

       
      • Anders Bjornerstedt

        Ok!

        The log severity in this message:

        Sep 2 05:16:57 SLES-SLOT-2 osafimmnd[21492]: WA MESSAGE:81414 OUT OF ORDER my highest processed:81412, exiting

        should be ER (error) and not WA (warning), since the immnd exits in the next statement. I will write a minor ticket on imm for that.

        /AndersBj


  • A V Mahesh (AVM)

    • status: review --> fixed
     
  • A V Mahesh (AVM)

    changeset: 5970:1e1498aecf23
    branch: opensaf-4.5.x
    parent: 5968:8adcf25b25a4
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Oct 02 17:58:20 2014 +0530

    changeset: 5971:dc1a4d4b82c0
    tag: tip
    parent: 5969:ead18326c13b
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Oct 02 17:59:34 2014 +0530

     
  • Hans Feldt - 2014-10-02

    We have bumped TIPC importance to HIGH (from default LOW) in our deployment. This gives much more buffering in kernel space by TIPC. So maybe the problem comes with default TIPC importance settings.
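
    For reference, raising the importance on a TIPC socket is done with the
    TIPC_IMPORTANCE socket option; a minimal sketch, assuming a Linux kernel
    with TIPC support (error handling kept short):

        /* Minimal sketch: raise TIPC message importance on a socket. */
        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        #ifndef SOL_TIPC
        #define SOL_TIPC 271   /* from <linux/socket.h>, in case libc headers lack it */
        #endif

        int main(void)
        {
            int sd = socket(AF_TIPC, SOCK_RDM, 0);
            if (sd < 0) {
                perror("socket(AF_TIPC)");
                return 1;
            }

            int importance = TIPC_HIGH_IMPORTANCE;  /* default is TIPC_LOW_IMPORTANCE */
            if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE,
                           &importance, sizeof(importance)) < 0) {
                perror("setsockopt(TIPC_IMPORTANCE)");
                return 1;
            }
            return 0;
        }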

     
