
#1334 Standby reboot delayed when failover is triggered while standby sync is in progress

4.5.2
fixed
None
defect
amf
d
4.6FC
major
2015-05-05
2015-04-22
Srikanth R
No

Changeset : 6377

Issue : Out of sync (failed over) new active controller should go for immediate reboot

During failover, if the standby controller is OUT OF SYNC and cannot be promoted to active, amfnd should reboot the node immediately. Instead, the node rebooted only after about 180 seconds. In this scenario the cold sync could not be completed, hence the standby was out of sync.

Apr 22 21:02:45 CONTROLLER-2 osaffmd[5534]: NO Current role: STANDBY
Apr 22 21:02:45 CONTROLLER-2 osaffmd[5534]: Rebooting OpenSAF NodeId = 131343 EE Name = ,

Apr 22 21:02:45 CONTROLLER-2 osaffmd[5534]: NO Controller Failover: Setting role to ACTIVE
Apr 22 21:02:45 CONTROLLER-2 osafrded[5525]: NO RDE role set to ACTIVE

Apr 22 21:02:45 CONTROLLER-2 osafamfd[5610]: NO FAILOVER StandBy --> Active
Apr 22 21:02:45 CONTROLLER-2 osafamfd[5610]: ER FAILOVER StandBy --> Active FAILED, Standby OUT OF SYNC
Apr 22 21:02:45 CONTROLLER-2 osafamfd[5610]: ER avd_role_change role change failure

Apr 22 21:05:43 CONTROLLER-2 osafamfnd[5620]: ER AMF director unexpectedly crashed
Apr 22 21:05:43 CONTROLLER-2 osafamfnd[5620]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
Apr 22 21:05:43 CONTROLLER-2 opensaf_reboot: Rebooting local node; timeout=60
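The delay can be read off the syslog excerpt above. The following is a minimal sketch that computes it from the two relevant timestamps; the helper names are illustrative, and the year 2015 is an assumption taken from the ticket's creation date, since syslog lines do not carry a year.

```python
from datetime import datetime

# Timestamps copied from the syslog excerpt above (syslog omits the year).
FAILOVER_FAILED = "Apr 22 21:02:45"  # osafamfd: FAILOVER StandBy --> Active FAILED
REBOOT_STARTED = "Apr 22 21:05:43"   # osafamfnd: Rebooting OpenSAF NodeId = 131599


def parse_syslog_time(stamp: str, year: int = 2015) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' syslog timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def delay_seconds(start: str, end: str) -> int:
    """Return the whole number of seconds between two syslog timestamps."""
    delta = parse_syslog_time(end) - parse_syslog_time(start)
    return int(delta.total_seconds())


print(delay_seconds(FAILOVER_FAILED, REBOOT_STARTED))  # 178
```

The result, 178 seconds, matches the "about 180 seconds" reported in the issue description.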

In a similar scenario, the fmd process rebooted the node as soon as it detected that the standby was not ready to take the active role.

Apr 22 21:58:10 CONTROLLER-1 osaffmd[5516]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Failover occurred, but this node is not yet ready, OwnNodeId = 131343, SupervisionTime = 60

Related

Tickets: #1334
Tickets: #1842
Wiki: ChangeLog-4.5.2
Wiki: ChangeLog-4.6.1

Discussion

  • Srikanth R

    Srikanth R - 2015-04-23

    Attaching the syslog.

    To clarify the intention of the ticket:

    The scenario is to intentionally produce an OUT_OF_SYNC standby and observe how this standby node behaves when it gets promoted to active.

     
  • Nagendra Kumar

    Nagendra Kumar - 2015-04-23

    Analysis:
    In case of failover, fm reboots its own node if Amf has not assigned a CSI to it (csi_assigned is false). In this scenario, while the standby controller was coming up, the active Amfd had already sent the SUSI to the Amfnd on that node, and Amfnd had assigned the role to fmd.

    Apr 22 21:02:34 CONTROLLER-2 osafamfnd[5620]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
    Apr 22 21:02:34 CONTROLLER-2 osafamfnd[5620]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'

    However, the standby Amfd had not yet completed its cold sync when the node failover happened.
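The window described in this analysis can be sketched as a small model. This is a simplified illustration, not OpenSAF code: `StandbyState` and both function names are hypothetical, and only the `csi_assigned` flag is taken from the analysis above.

```python
from dataclasses import dataclass


@dataclass
class StandbyState:
    csi_assigned: bool        # has Amf assigned the CSI to fmd on this node?
    cold_sync_complete: bool  # has the standby Amfd finished its cold sync?


def fm_reboots_on_failover(state: StandbyState) -> bool:
    """fm reboots its own node on failover only when no CSI is assigned to it."""
    return not state.csi_assigned


def amfd_can_become_active(state: StandbyState) -> bool:
    """The standby Amfd can take the active role only after cold sync completes."""
    return state.cold_sync_complete


# The window hit in this ticket: assignment already done, cold sync still running.
window = StandbyState(csi_assigned=True, cold_sync_complete=False)
print(fm_reboots_on_failover(window))  # False
print(amfd_can_become_active(window))  # False
```

Both checks return False for this state: fm sees an assignment and does not reboot, while Amfd cannot complete the role change, so nothing reboots the node promptly.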

    Amf can trigger the reboot in this case if fm does not handle it.

    Suggestions?

    Thanks
    -Nagu

     
  • Mathi Naickan

    Mathi Naickan - 2015-04-23

    Changed the slogan, because this is not a case of "out of sync" but of "sync in progress".

     
  • Mathi Naickan

    Mathi Naickan - 2015-04-23
    • summary: OUT_OF_SYNC (failed over) new active controller should go for immediate reboot --> Standby reboot delayed when failover is triggered while standby sync is in progress
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-04-24
    • status: unassigned --> accepted
    • assigned_to: Nagendra Kumar
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-04-27
    • Milestone: 4.4.2 --> 4.5.2
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-04-27
    • status: accepted --> review
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-04-27

    The floated patch makes sure that the standby Amfd completes its cold sync before it sends the nid response. Only after Amfd responds to nid does nid start Amfnd, and then the other services come up. So, with the patch, if the standby Amfd is in the middle of cold sync and the active controller reboots, Fmd will reboot the node, because it will not have an Amf assignment until then.
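The effect of the reordering can be illustrated with a small model of the standby startup sequence. This is a sketch under stated assumptions: the step names and both functions are hypothetical, and the unpatched ordering is inferred from the syslog in this ticket, where the assignment to fmd was logged before the cold sync had finished.

```python
from typing import List


def startup_sequence(patched: bool) -> List[str]:
    """Return the simplified order of startup events on the standby controller."""
    if patched:
        # With the patch, Amfd answers nid only after cold sync is done.
        return ["amfd_cold_sync_done", "amfd_nid_response",
                "amfnd_start", "fmd_csi_assigned"]
    # Without the patch, Amfd answers nid while cold sync is still running.
    return ["amfd_nid_response", "amfnd_start",
            "fmd_csi_assigned", "amfd_cold_sync_done"]


def fmd_reboots_if_failover_during_cold_sync(patched: bool) -> bool:
    """Fmd reboots the node on failover only if it has no Amf assignment yet."""
    steps = startup_sequence(patched)
    assigned_during_sync = (steps.index("fmd_csi_assigned")
                            < steps.index("amfd_cold_sync_done"))
    return not assigned_during_sync


print(fmd_reboots_if_failover_during_cold_sync(patched=False))  # False
print(fmd_reboots_if_failover_during_cold_sync(patched=True))   # True
```

With the patched ordering, a failover that arrives during cold sync always finds Fmd without an assignment, so Fmd reboots the node instead of leaving it to the delayed Amfnd reboot.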

     
  • Nagendra Kumar

    Nagendra Kumar - 2015-04-27
    • Component: unknown --> amf
    • Part: - --> d
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-05-05

    changeset: 6511:9b3ea213edb3
    branch: opensaf-4.4.x
    parent: 6507:a9f0e7afef52
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue May 05 17:06:34 2015 +0530
    summary: amfd: respond to nid only after initialization is completed [#1334]

    changeset: 6512:81e237b5b6cc
    branch: opensaf-4.5.x
    parent: 6508:aba458b4cb6a
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue May 05 17:07:04 2015 +0530
    summary: amfd: respond to nid only after initialization is completed [#1334]

    changeset: 6513:5366987cdf2d
    branch: opensaf-4.6.x
    parent: 6509:0b2a391068f9
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue May 05 17:07:13 2015 +0530
    summary: amfd: respond to nid only after initialization is completed [#1334]

    changeset: 6514:58a11403b3dc
    tag: tip
    parent: 6510:08382ad144ea
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue May 05 17:07:21 2015 +0530
    summary: amfd: respond to nid only after initialization is completed [#1334]

    [staging:9b3ea2]
    [staging:81e237]
    [staging:536698]
    [staging:58a114]

     

    Related

    Tickets: #1334

  • Nagendra Kumar

    Nagendra Kumar - 2015-05-05
    • status: review --> fixed
     
