Changeset : 6377
Issue : Out of sync (failed over) new active controller should go for immediate reboot
During failover, if the standby controller is OUT OF SYNC and cannot be promoted to active, amfnd should reboot the node immediately. Instead, the node rebooted only after about 180 seconds. In this scenario, cold sync could not be completed, hence the out-of-sync state.
Apr 22 21:02:45 CONTROLLER-2 osaffmd[5534]: NO Current role: STANDBY
Apr 22 21:02:45 CONTROLLER-2 osaffmd[5534]: Rebooting OpenSAF NodeId = 131343 EE Name = ,
Apr 22 21:02:45 CONTROLLER-2 osaffmd[5534]: NO Controller Failover: Setting role to ACTIVE
Apr 22 21:02:45 CONTROLLER-2 osafrded[5525]: NO RDE role set to ACTIVE
Apr 22 21:02:45 CONTROLLER-2 osafamfd[5610]: NO FAILOVER StandBy --> Active
Apr 22 21:02:45 CONTROLLER-2 osafamfd[5610]: ER FAILOVER StandBy --> Active FAILED, Standby OUT OF SYNC
Apr 22 21:02:45 CONTROLLER-2 osafamfd[5610]: ER avd_role_change role change failure
Apr 22 21:05:43 CONTROLLER-2 osafamfnd[5620]: ER AMF director unexpectedly crashed
Apr 22 21:05:43 CONTROLLER-2 osafamfnd[5620]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
Apr 22 21:05:43 CONTROLLER-2 opensaf_reboot: Rebooting local node; timeout=60
In a similar scenario, the fmd process rebooted the node when it detected that the standby was not ready to take the active role.
Apr 22 21:58:10 CONTROLLER-1 osaffmd[5516]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Failover occurred, but this node is not yet ready, OwnNodeId = 131343, SupervisionTime = 60
Tickets: #1334
Tickets: #1842
Wiki: ChangeLog-4.5.2
Wiki: ChangeLog-4.6.1
Attaching the syslog.
To clarify the intention of the ticket:
The scenario is to intentionally produce an OUT_OF_SYNC standby and observe how this standby node behaves when it gets promoted to active.
Analysis:
In case of failover, fm reboots its own node if Amf has not assigned a csi to it (csi_assigned is false). In this scenario, while the standby controller was coming up, the active Amfd had already sent the SUSI to the Amfnd of the upcoming node, and Amfnd had assigned the role to fmd.
Apr 22 21:02:34 CONTROLLER-2 osafamfnd[5620]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Apr 22 21:02:34 CONTROLLER-2 osafamfnd[5620]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
But the standby Amfd had yet to complete the cold sync when the node failover happened.
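The gap described in this analysis can be modelled roughly as follows. This is a simplified sketch, not OpenSAF code: the class, the function names and the result strings are hypothetical, and only the csi_assigned flag and the two checks (fm looks at the assignment, amfd looks at cold sync) come from the ticket.

```python
from dataclasses import dataclass


@dataclass
class StandbyState:
    csi_assigned: bool    # fmd has received its assignment through amfnd
    cold_sync_done: bool  # standby amfd has finished cold sync with the active


def fmd_on_failover(state):
    """fm reboots its own node only if no csi has been assigned to it."""
    return "reboot" if not state.csi_assigned else "set_active"


def amfd_role_change(state):
    """amfd refuses the standby-to-active change while it is out of sync."""
    return "active" if state.cold_sync_done else "failed_out_of_sync"


def failover(state):
    """Combine both checks in the order seen in the syslog."""
    if fmd_on_failover(state) == "reboot":
        return "immediate reboot by fmd"
    if amfd_role_change(state) == "failed_out_of_sync":
        return "role change failed, no immediate reboot"
    return "promoted to active"


# The case from this ticket: assignment done, cold sync still pending.
print(failover(StandbyState(csi_assigned=True, cold_sync_done=False)))
```

In this model the only state that is not handled promptly is the one where fmd already holds its assignment while amfd is still syncing, which is the state the logs show.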
Amf can trigger a reboot in this case if fm does not handle it.
Suggestion ??
Thanks
-Nagu
Changed the slogan, because this is not a case of "out of sync" but a case of "sync in progress".
The floated patch makes sure that the standby Amfd completes its cold sync before it sends the nid response. After Amfd responds to nid, nid starts Amfnd, and then the other services come up. So, with the patch, if the standby Amfd is in the middle of cold sync and the active controller reboots, Fmd will reboot the node because it will not yet have an Amf assignment.
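The effect of the reordering can be illustrated with a toy timeline. This is a sketch under assumptions: the event names are invented, the unpatched ordering is just the interleaving seen in the logs of this ticket (assignment before cold sync completion), and the real startup involves more services than shown here.

```python
def startup_events(patched):
    """Order of standby startup events, with and without the patch (assumed)."""
    if patched:
        # amfd answers nid only after cold sync, so amfnd starts later.
        return ["amfd_cold_sync_done", "amfd_nid_response",
                "amfnd_started", "fmd_csi_assigned"]
    # Interleaving observed in this ticket: assignment before cold sync ends.
    return ["amfd_nid_response", "amfnd_started",
            "fmd_csi_assigned", "amfd_cold_sync_done"]


def outcome_if_active_dies_after(events, n):
    """Outcome when the active controller fails after the first n events."""
    seen = set(events[:n])
    if "fmd_csi_assigned" not in seen:
        return "fmd reboots node"
    if "amfd_cold_sync_done" not in seen:
        return "role change fails"
    return "promoted"


def outcomes(patched):
    """Outcome for every possible failure point in the timeline."""
    events = startup_events(patched)
    return [outcome_if_active_dies_after(events, n)
            for n in range(len(events) + 1)]


print(outcomes(patched=False))
print(outcomes(patched=True))
```

With the patched ordering, no failure point in this model leads to a failed role change: the node is either rebooted by fmd or promoted.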
changeset: 6511:9b3ea213edb3
branch: opensaf-4.4.x
parent: 6507:a9f0e7afef52
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue May 05 17:06:34 2015 +0530
summary: amfd: respond to nid only after initialization is completed [#1334]
changeset: 6512:81e237b5b6cc
branch: opensaf-4.5.x
parent: 6508:aba458b4cb6a
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue May 05 17:07:04 2015 +0530
summary: amfd: respond to nid only after initialization is completed [#1334]
changeset: 6513:5366987cdf2d
branch: opensaf-4.6.x
parent: 6509:0b2a391068f9
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue May 05 17:07:13 2015 +0530
summary: amfd: respond to nid only after initialization is completed [#1334]
changeset: 6514:58a11403b3dc
tag: tip
parent: 6510:08382ad144ea
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue May 05 17:07:21 2015 +0530
summary: amfd: respond to nid only after initialization is completed [#1334]
[staging:9b3ea2]
[staging:81e237]
[staging:536698]
[staging:58a114]