
#1036 mds: IMMND restarts because of out of order messages

4.5.0
fixed
None
defect
mds
-
4.5
major
2014-10-02
2014-09-02
No

Sep 2 05:16:57 SLES-SLOT-2 osafimmnd[21492]: WA MESSAGE:81414 OUT OF ORDER my highest processed:81412, exiting

Recreation steps:

  1. The problem is reproduced when 100 switchovers are done.

  2. Immediately when a failover is done, an out-of-order message is observed.

  3. Because of the out-of-order message, the new-active IMMND restarted.

From there on:
Sep 2 05:17:04 SLES-SLOT-2 osafimmnd[10230]: WA Sync MESSAGE:81639 OUT OF ORDER my highest processed:81637
Sep 2 05:17:09 SLES-SLOT-2 osafimmnd[10254]: WA Sync MESSAGE:81893 OUT OF ORDER my highest processed:81891
Sep 2 05:17:16 SLES-SLOT-2 osafimmnd[10275]: WA Sync MESSAGE:82114 OUT OF ORDER my highest processed:82112
Sep 2 05:17:20 SLES-SLOT-2 osafimmnd[10295]: WA Sync MESSAGE:82335 OUT OF ORDER my highest processed:82333

  4. Because of constant IMMND restarts at the time of sync, CLM got TRY_AGAIN and the node went for a reboot:
    Sep 2 05:17:16 SLES-SLOT-2 osafimmd[10478]: NO Node 2020f request sync sync-pid:10295 epoch:0
    Sep 2 05:17:17 SLES-SLOT-2 osafclmd[10521]: ER saImmOiInitialize_2 failed 6, exiting
    Sep 2 05:17:17 SLES-SLOT-2 osafamfnd[10550]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
    Sep 2 05:17:17 SLES-SLOT-2 osafamfnd[10550]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
    Sep 2 05:17:17 SLES-SLOT-2 osafamfnd[10550]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60

Related

Tickets: #1036
Tickets: #1072

Discussion

  • Neelakanta Reddy

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    -
     Sep  2 05:16:57 SLES-SLOT-2 osafimmnd[21492]: WA MESSAGE:81414 OUT OF ORDER my highest processed:81412, exiting
    
     Recreation steps:
    
     
  • Neelakanta Reddy

    Attaching the controller logs.

     
  • Anders Bjornerstedt

    As can be seen, there appear to be dropped messages.
    The dropped messages are sent from the IMMD to the IMMNDs as MDS broadcast.

    So in some sense this can be seen as an overload symptom.
    The main question is whether the new MDS broadcast using TIPC broadcast has
    lower throughput than the older multicast implementation.

    Also relevant is how the test is constructed.
    Are you pushing requests at as high a frequency as accepted?
    The fevs mechanism has a flow control mechanism that kicks in when there
    are too many unanswered fevs messages outstanding. Perhaps the very fact
    that TIPC broadcast is faster allows the test to push more messages per
    unit time over fevs, but increases the risk of local TIPC buffer overflow
    at some nodes?
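
    To illustrate the kind of flow control described above, here is a minimal
    sketch (hypothetical names and limit, not the actual OpenSAF fevs code):
    a sender tracks how many fevs messages it has sent but not yet seen come
    back on the broadcast, and holds off new sends once that count reaches a
    limit.

        /* Illustrative sketch only -- hypothetical names, not the OpenSAF code. */
        #include <stdbool.h>
        #include <stdint.h>

        #define FEVS_MAX_PENDING 16  /* assumed limit on outstanding fevs messages */

        struct fevs_flow {
            uint64_t sent;      /* fevs messages handed to the IMMD for broadcast */
            uint64_t received;  /* fevs messages seen back on the broadcast */
        };

        /* Returns true if another fevs message may be sent now,
         * false if the sender should back off (e.g. return TRY_AGAIN). */
        static bool fevs_may_send(const struct fevs_flow *f)
        {
            return (f->sent - f->received) < FEVS_MAX_PENDING;
        }

        static void fevs_on_sent(struct fevs_flow *f)     { f->sent++; }
        static void fevs_on_received(struct fevs_flow *f) { f->received++; }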

     
  • Anders Bjornerstedt

    That should have been "MDS broadcast using TIPC broadcast has higher
    throughput than the older implementation".

    And this is not a test; it is a sync client seeing dropped messages.
    That IMMND restarts and will sync again.
    The CLM restart due to TRY_AGAIN on imm-handle-initialize is a bug in CLM,
    unless it has retried for 60 seconds (the max allowed sync time).
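
    To illustrate the retry being described, a minimal sketch (assumed API
    version and back-off interval; not the actual CLM code): keep retrying
    saImmOiInitialize_2() while it returns SA_AIS_ERR_TRY_AGAIN (value 6, as
    in the log above), for up to roughly 60 seconds.

        /* Illustrative sketch only -- not the actual CLM code. */
        #include <unistd.h>
        #include <saAis.h>
        #include <saImmOi.h>

        SaAisErrorT init_oi_with_retry(SaImmOiHandleT *handle,
                                       const SaImmOiCallbacksT_2 *callbacks)
        {
            SaAisErrorT rc;
            int waited_ms = 0;

            do {
                SaVersionT version = {'A', 2, 11};  /* assumed IMM OI API version */
                rc = saImmOiInitialize_2(handle, callbacks, &version);
                if (rc != SA_AIS_ERR_TRY_AGAIN)
                    break;
                usleep(500 * 1000);                 /* back off for 500 ms */
                waited_ms += 500;
            } while (waited_ms < 60000);            /* ~max allowed sync time */

            return rc;
        }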

     
  • A V Mahesh (AVM)

    • status: unassigned --> assigned
    • assigned_to: A V Mahesh (AVM)
     
  • A V Mahesh (AVM)

    • status: assigned --> review
     
  • Anders Bjornerstedt

    The main question for the current review of this ticket is whether it is
    possible to reproduce the repeated IMMND restarts reported in the creation
    of this ticket.

    Another question is whether this test has been run on older releases without any problems.

    I am pretty sure that it has always been possible to overload the system by generating enough traffic.

    The IMMND has the flow control mechanism to reduce the risk of overloading the central bottleneck of the IMMD.
    But after the MDS-TIPC multicast enhancement, that bottleneck in the IMMD has been reduced a lot.
    The effect of that is increased pressure on the fevs receivers, the IMMNDs receiving the fevs from the IMMD.

    If I remember correctly, there was also some optimization of the sync to
    increase the sync-batch size etc.
    Perhaps we went too far?
    A burst of large messages is more likely than normal to overload the TIPC
    receiver, resulting in TIPC discarding some messages, resulting in this
    MESSAGE OUT OF ORDER symptom.

    The problem with this ticket is that it has no single or absolute solution.
    There is no one problem to solve.

    The problem is also not the fact of getting an out-of-order message.
    If this ticket points to a problem, it would be that this message-out-of-order condition can now happen too often and even cyclically.

    If that is still the case, then we perhaps need to partly rewind/undo some
    of the imm-sync optimizations that were done in 4.5.

    But the first thing to test (as part of the ongoing review) should be to
    redo the test that generated this ticket.

    Is it still "too easy" to get repeated IMMND restarts during sync?

     
    • A V Mahesh (AVM)

      Maybe after optimizing sync-data MBCSV checkpointing (#952) we will see reduced overflow.

      https://sourceforge.net/p/opensaf/tickets/952/

       
      • Anders Bjornerstedt

        Yes, if the sync tends to cause out-of-order fevs only at the SC standby, then that would help.
        But if you also see the out-of-order message at payloads, then it would not help.

        /AndersBj


      • A V Mahesh (AVM)

        We can also reduce some overflow by optimizing the FinalizeSync message.
        While debugging I also observed that in some scenarios FinalizeSync contains
        committed CCB data, which I think does not need to be sent/synced to the peer
        IMMD and IMMND as part of the FinalizeSync message.

         
  • Hans Feldt - 2014-10-02

    One thing that strikes me is that there are always 2 messages missing. To me that indicates a logical problem rather than overload. But I might be wrong...

     
  • Anders Bjornerstedt

    The fevs message order is at the IMM level, so that would mean the logical
    problem was at the IMM level.
    But it is also an intermittent problem.

    The logic for sending (broadcasting) fevs in the IMMD and for receiving fevs
    at the IMMND is pretty linear and simple.

    I think the reason for there being 2 messages lost each time here has to
    do with the sync being ongoing. The sync (optimized in 4.5) generates much
    larger messages than in previous releases. Probably two such sync-batch
    messages (each one being one fevs message) are enough to "easily" overflow
    the TIPC receive buffer. So TIPC discards two messages each time this
    happens, in this scenario.
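
    To illustrate the receive-side order check that produces the log lines in
    this ticket, a minimal sketch (hypothetical names, not the actual IMMND
    code): the receiver expects strictly consecutive fevs message numbers, and
    any gap left by dropped messages triggers the "OUT OF ORDER" path.

        /* Illustrative sketch only -- not the actual IMMND code. */
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        static uint64_t highest_processed;  /* last fevs message number handled */

        static void fevs_receive(uint64_t msg_no)
        {
            if (msg_no != highest_processed + 1) {
                /* Gap detected, e.g. 81414 arriving when 81412 was the highest. */
                fprintf(stderr,
                        "WA MESSAGE:%llu OUT OF ORDER my highest processed:%llu\n",
                        (unsigned long long)msg_no,
                        (unsigned long long)highest_processed);
                exit(EXIT_FAILURE);  /* leads to an IMMND restart in the real daemon */
            }
            highest_processed = msg_no;
            /* ... process the message ... */
        }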

     
    • A V Mahesh (AVM)

      In this bug case, the IMMND received out-of-order messages because the
      MDS logic itself did not send messages to a few IMMNDs; that is fixed now.

       
      • Anders Bjornerstedt

        Ok!

        The log severity in this message:

        Sep 2 05:16:57 SLES-SLOT-2 osafimmnd[21492]: WA MESSAGE:81414 OUT OF ORDER my highest processed:81412, exiting

        should be ER (error) and not WA (warning), since the immnd exits in the next statement. I will write a minor ticket on imm for that.

        /AndersBj


  • A V Mahesh (AVM)

    • status: review --> fixed
     
  • A V Mahesh (AVM)

    changeset: 5970:1e1498aecf23
    branch: opensaf-4.5.x
    parent: 5968:8adcf25b25a4
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Oct 02 17:58:20 2014 +0530

    changeset: 5971:dc1a4d4b82c0
    tag: tip
    parent: 5969:ead18326c13b
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Oct 02 17:59:34 2014 +0530

     
  • Hans Feldt - 2014-10-02

    We have bumped TIPC importance to HIGH (from default LOW) in our deployment. This gives much more buffering in kernel space by TIPC. So maybe the problem comes with default TIPC importance settings.
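
    For reference, raising the importance on a TIPC socket is done with the
    TIPC_IMPORTANCE socket option; a minimal sketch, assuming a Linux kernel
    with TIPC support (error handling kept short):

        /* Minimal sketch: raise TIPC message importance on a socket. */
        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        #ifndef SOL_TIPC
        #define SOL_TIPC 271   /* from <linux/socket.h>, in case libc headers lack it */
        #endif

        int main(void)
        {
            int sd = socket(AF_TIPC, SOCK_RDM, 0);
            if (sd < 0) {
                perror("socket(AF_TIPC)");
                return 1;
            }

            int importance = TIPC_HIGH_IMPORTANCE;  /* default is TIPC_LOW_IMPORTANCE */
            if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE,
                           &importance, sizeof(importance)) < 0) {
                perror("setsockopt(TIPC_IMPORTANCE)");
                return 1;
            }
            return 0;
        }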

     
