
#1605 smfd: campaign not correctly terminated after failed SI-SWAP

5.0.2
fixed
None
defect
smf
d
4.6
minor
2016-10-10
2015-11-19
No

Scenario:
1) smfd orders an SI-SWAP to continue the campaign execution on the other controller.
2) Before the swap is executed, an IMM object "SmfRestartIndicator" is created to signal to smf on the new controller that the campaign restart was initiated by smf (spontaneous restarts always fail a campaign in executing state).
3) When the new controller comes up, smf checks for the existence of the object. If it is present, the restart is OK; if not, the campaign fails. If OK, the "SmfRestartIndicator" object is removed.
4) In this case the new controller fails to start very early, before smf has started. smf never gets a chance to remove the object.
5) AMF orders a switchback to the first controller.
6) smf starts up on the "old" controller once again. Since the "SmfRestartIndicator" is still there, smf thinks the restart was ordered by smf and tries to continue the campaign execution, which fails (in the wrong way, e.g. a core dump).

Todo: find a mechanism that lets smf detect that the "SmfRestartIndicator" is an old one and treat this case as if the object does not exist. Make sure the new solution is backward compatible.

The campaign continues at:
file: SmfUpgradeCampaign.cc, method:SmfUpgradeCampaign::continueExec()

The restart indicator is handled in:
file: SmfUpgradeCampaign.cc, method:SmfUpgradeCampaign::checkSmfRestartIndicator()

Related

Tickets: #1605
Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

  • Mathi Naickan

    Mathi Naickan - 2015-12-15

    I see two possible options for this

    (a) Change the DN name format and concatenate/suffix the node_name to the SmfRestartIndicator DN name.
    (b) Change the SMF's ImplementerName string by concatenating/suffixing the node_name. Before becoming implementer, if this object exists, call AccessorGet() for the SmfRestartIndicator PRTO object and read the 'SaImmAttrImplementerName' attribute.

    Thus it will help determine whether the object was originally created by smfd running on the same/self node (old controller) or on the peer controller node.

     
    • Mathi Naickan

      Mathi Naickan - 2015-12-17

      Just as a note, I tried Option (b) and it seems to work. (Also, SMF doesn't perform a classimplementerset() on this, but the implementerName is still available for reading because it's a PRTO.)

      The question would then be: should this change be adopted in older releases too?

      Thanks,

      Mathi.

      From: Mathi Naickan [mailto:mathi-naickan@users.sf.net]
      Sent: Tuesday, December 15, 2015 3:31 PM
      To: opensaf-tickets@lists.sourceforge.net
      Subject: [tickets] [opensaf:tickets] #1605 smfd: campaign not correctly terminated after failed SI-SWAP

      [tickets:#1605] smfd: campaign not correctly terminated after failed SI-SWAP (http://sourceforge.net/p/opensaf/tickets/1605/)

      Status: unassigned
      Milestone: 5.0.FC
      Created: Thu Nov 19, 2015 12:26 PM UTC by Ingvar Bergström
      Last Updated: Thu Nov 19, 2015 12:26 PM UTC
      Owner: nobody

  • Ingvar Bergström

    Very important: will it also work for an upgrade, i.e. the first swap, when the "new" code must handle a restart indicator created by the "old" code?

    Since it is regarded as a fault, all maintained versions shall be updated.

     
  • Mathi Naickan

    Mathi Naickan - 2016-01-06
    • status: unassigned --> accepted
    • assigned_to: Mathi Naickan
     
  • Mathi Naickan

    Mathi Naickan - 2016-01-08
    • Milestone: 5.0.FC --> 4.6.2
     
  • Mathi Naickan

    Mathi Naickan - 2016-01-13

    Option B, changing the implementer name, will not work!

    The attached patch changes the DN name of the restart indicator object.
    The scenario mentioned in this ticket can be tested only after the first si-swap succeeds, i.e. after this fix (or a similar one) is in the code.

     
  • Mathi Naickan

    Mathi Naickan - 2016-01-13

    Also, I just realised that there is a headless-cluster-related scenario involved in the background of this ticket. That makes it very difficult for me to reproduce the issue, too!

     
  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 4.6.2 --> 4.7.2
     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Alex Jones

    Alex Jones - 2016-10-05
    • assigned_to: Mathi Naickan --> Alex Jones
     
  • Alex Jones

    Alex Jones - 2016-10-06
    • status: accepted --> review
     
  • Alex Jones

    Alex Jones - 2016-10-10
    • status: review --> fixed
     
  • Alex Jones

    Alex Jones - 2016-10-10

    changeset: 8200:721e05de1401
    tag: tip
    parent: 8197:69f86b9bddb0
    user: Alex Jones ajones@genband.com
    date: Mon Oct 10 15:30:37 2016 -0400
    summary: smfd: handle failed middleware si-swap [#1605]

    changeset: 8199:a37e89393eea
    branch: opensaf-5.1.x
    parent: 8196:88f6b4d6e234
    user: Alex Jones ajones@genband.com
    date: Mon Oct 10 15:30:37 2016 -0400
    summary: smfd: handle failed middleware si-swap [#1605]

    changeset: 8198:6f89139a3134
    branch: opensaf-5.0.x
    parent: 8195:967e479b7c42
    user: Alex Jones ajones@genband.com
    date: Mon Oct 10 15:30:37 2016 -0400
    summary: smfd: handle failed middleware si-swap [#1605]

     

    Related

    Tickets: #1605

