#1353 smf: step undoing is in progress forever until cluster reset

5.18.09
fixed
None
defect
smf
-
4.6 FC
major
False
2018-09-29
2015-04-28
No

Test description:
1. A rolling middleware upgrade campaign (4.5 -> 4.6) is run.
2. On one of the nodes being upgraded (PL-4), the new RPMs (4.6) are left empty, so the node comes up without an OpenSAF installation.
3. The step rollback takes approximately two hours before the campaign is reported as EXECUTION_FAILED.
4. The syslog of SC-1 is attached; an excerpt follows.

Apr 24 18:36:55 SLES1 osafamfd[2289]: NO Node 'PL-4' left the cluster
Apr 24 18:36:55 SLES1 osafimmnd[2237]: NO Implementer connected: 47 (MsgQueueService132111) <2280, 2010f>
Apr 24 18:36:55 SLES1 osafimmnd[2237]: NO Implementer locally disconnected. Marking it as doomed 47 <2280, 2010f> (MsgQueueService132111)
Apr 24 18:36:55 SLES1 osafimmnd[2237]: NO Implementer disconnected 47 <2280, 2010f> (MsgQueueService132111)
Apr 24 18:36:58 SLES1 kernel: [ 172.812065] TIPC: Resetting link <1.1.1:eth0-1.1.4:eth0>, peer not responding
Apr 24 18:36:58 SLES1 kernel: [ 172.812071] TIPC: Lost link <1.1.1:eth0-1.1.4:eth0> on network plane A
Apr 24 18:36:58 SLES1 kernel: [ 172.812075] TIPC: Lost contact with <1.1.4>
Apr 24 18:37:15 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster
Apr 24 18:37:36 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster

-------------------


Apr 24 20:36:00 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster
Apr 24 20:36:22 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster
Apr 24 20:36:44 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster
Apr 24 20:37:06 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster
Apr 24 20:37:28 SLES1 osafsmfd[2318]: NO Failed to get node dest for clm node safNode=PL-4,safCluster=myClmCluster
Apr 24 20:37:28 SLES1 osafsmfd[2318]: NO no node destination found whitin time limit for node safAmfNode=PL-4,safAmfCluster=myAmfCluster
Apr 24 20:37:28 SLES1 osafsmfd[2318]: NO no node destination found for node safAmfNode=PL-4,safAmfCluster=myAmfCluster
Apr 24 20:37:28 SLES1 osafsmfd[2318]: ER Failed to online install old bundles
Apr 24 20:37:28 SLES1 osafsmfd[2318]: ER Step undoing failed
Apr 24 20:37:28 SLES1 osafsmfd[2318]: NO Step safSmfStep=0004 in procedure safSmfProc=OpenSAF-upgrade failed, step result 5
Apr 24 20:37:28 SLES1 osafsmfd[2318]: NO CAMP: Procedure safSmfProc=OpenSAF-upgrade returned FAILED
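
The "approximately two hours" figure can be checked against the excerpt above: the first "Failed to get node dest" retry is logged at 18:37:15 and the step finally fails at 20:37:28. A minimal sketch of that arithmetic follows; the helper names `toSeconds` and `retrySpanSeconds` are made up for illustration and are not part of OpenSAF.

```cpp
#include <cstdio>

// Hypothetical helper: converts an "HH:MM:SS" syslog time-of-day into
// seconds since midnight. Returns -1 if the string does not parse.
int toSeconds(const char* hhmmss) {
  int h = 0, m = 0, s = 0;
  if (std::sscanf(hhmmss, "%d:%d:%d", &h, &m, &s) != 3) return -1;
  return h * 3600 + m * 60 + s;
}

// Hypothetical helper: span in seconds between two same-day timestamps.
int retrySpanSeconds(const char* first, const char* last) {
  return toSeconds(last) - toSeconds(first);
}
```

For the two timestamps above, `retrySpanSeconds("18:37:15", "20:37:28")` gives 7213 seconds, i.e. 2 hours and 13 seconds of retrying before SMF gave up.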

2 Attachments

Related

Tickets: #1353
Wiki: ChangeLog-5.18.09

Discussion

  • Anders Bjornerstedt

    • Milestone: future --> 4.6.1
     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.6.1 --> 4.6.2
     
  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 4.6.2 --> 4.7.2
     
  • Madhurika Koppula

    • summary: smf: two hours is spent on step undoing state --> smf: step undoing is in progress forever until cluster reset
    • Attachments has changed:

    Diff:

    --- old
    +++ new
    @@ -1 +1,2 @@
    +1353.tgz (475.2 kB; application/octet-stream)
     messages_step_undo (111.1 kB; application/octet-stream)
    
     
  • Madhurika Koppula

    Steps to reproduce:

    1) Run a middleware upgrade campaign (5.0 -> 5.1).
    2) On the node being upgraded (SC-2), the new RPMs (5.1) are left empty, so the node comes up without an OpenSAF installation.

    Observations:

    1) Node SC-2 rebooted for the upgrade.
    2) Because SC-2 did not rejoin within 10 minutes, SMF initiated step undoing and started rolling back the node reboot step.

    Syslog snippet:

    Sep 12 13:13:30 SLES-M-SLOT-1 osafamfd[2528]: NO Node 'SC-2' left the cluster
    Sep 12 13:13:33 SLES-M-SLOT-1 osaffmd[2467]: NO Node Down event for node id 2020f:
    Sep 12 13:13:33 SLES-M-SLOT-1 osaffmd[2467]: NO Current role: ACTIVE
    Sep 12 13:13:33 SLES-M-SLOT-1 osaffmd[2467]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60

    Sep 12 13:23:38 SLES-M-SLOT-1 osafsmfd[2583]: NO SmfUpgradeStep::nodeReboot: the following nodes has not been correctly rebooted
    Sep 12 13:23:38 SLES-M-SLOT-1 osafsmfd[2583]: NO Node safAmfNode=SC-2,safAmfCluster=myAmfCluster
    Sep 12 13:23:38 SLES-M-SLOT-1 osafsmfd[2583]: ER Fails to reboot node safAmfNode=SC-2,safAmfCluster=myAmfCluster
    Sep 12 13:23:38 SLES-M-SLOT-1 osafsmfd[2583]: ER Step execution failed, Try undoing the step

    Sep 12 13:23:38 SLES-M-SLOT-1 osafsmfd[2583]: NO SmfStepStateUndoing::execute start undoing step.
    Sep 12 13:23:38 SLES-M-SLOT-1 osafsmfd[2583]: NO STEP: Rolling back node reboot step safSmfStep=0002,safSmfProc=OpenSAF-upgrade,safSmfCampaign=UpgradeCampaign_7.0_7.1,safApp=safSmfService

    3) Step undoing remains in progress indefinitely, until the cluster is reset.

    Attachments:

    1) Syslog and SMF traces of both controllers.

     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
  • Thuan Tran

    Thuan Tran - 2018-09-25
    • status: unassigned --> accepted
    • assigned_to: Thuan
    • Blocker: --> False
     
  • Thuan Tran

    Thuan Tran - 2018-09-25
    • Milestone: future --> 5.18.09
     
  • Thuan Tran

    Thuan Tran - 2018-09-25
    • status: accepted --> review
     
  • Gary Lee

    Gary Lee - 2018-09-29

    commit e08db1131c686779e418fe1514deaecf666bf776
    Author: thuan.tran <thuan.tran@dektech.com.au>
    Date: Fri Sep 28 09:29:29 2018 +0000

    smf: campaign is executing forever until cluster reset [#1353]

    The function getNodeDestination() reset elapsedTime to zero cause
    the node reboot timeout at waitForNodeDestination() never reach.
    If scenario that node reboot cannot come back then campaign is stuck
    in executing forever until cluster reset.
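
    The failure mode described in the commit message can be illustrated with a minimal sketch. This is not the OpenSAF source: `lookupNodeDest`, `waitForNodeDest`, `WaitResult`, the 2-second poll interval and the `maxPolls` safety cap are all assumptions made for illustration, and time is simulated rather than slept. The only point is that a callee which zeroes the caller's elapsed-time counter keeps the timeout check from ever firing.

```cpp
// Hypothetical outcome of the wait loop (not an OpenSAF type).
enum class WaitResult { kFound, kTimedOut, kStillWaiting };

// Hypothetical stand-in for a node-destination lookup. The node never
// rejoins in this scenario, so the lookup always fails.
// resetElapsed == true models the bug: the callee zeroes the caller's counter.
bool lookupNodeDest(int* elapsedTime, bool resetElapsed) {
  if (resetElapsed && elapsedTime != nullptr) {
    *elapsedTime = 0;  // bug: discards the wait time accumulated by the caller
  }
  return false;  // node destination not found
}

// Polls until the lookup succeeds or timeoutSec of simulated time has passed.
// maxPolls is a safety cap for this sketch only, so the buggy variant
// terminates here instead of spinning forever.
WaitResult waitForNodeDest(int timeoutSec, int intervalSec,
                           bool resetElapsed, int maxPolls) {
  int elapsedTime = 0;
  for (int polls = 0; polls < maxPolls; ++polls) {
    if (lookupNodeDest(&elapsedTime, resetElapsed)) return WaitResult::kFound;
    if (elapsedTime >= timeoutSec) return WaitResult::kTimedOut;
    elapsedTime += intervalSec;  // simulated sleep between polls
  }
  return WaitResult::kStillWaiting;  // timeout was never reached
}
```

    With a 600-second timeout (the 10-minute window from the reproduction steps) and a 2-second interval, `waitForNodeDest(600, 2, false, 100000)` returns `kTimedOut`, while `waitForNodeDest(600, 2, true, 100000)` exhausts the cap and returns `kStillWaiting`, which mirrors the "executing forever" symptom.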

     

    Related

    Tickets: #1353

  • Gary Lee

    Gary Lee - 2018-09-29
    • status: review --> fixed
     