Changeset : 6490
Issue : Quiesced controller failed to promote back to Active ( amfd asserted : ImplementerClear failed 5 )
-> Opensafd is brought up on 5 nodes in the cluster.
-> This issue is observed while verifying #707.
-> Invoked middleware si-swap operation ( SC-1 is the active and SC-2 is the standby)
-> Rebooted the old standby controller SC-2 using reboot -f command.
-> The quiesced controller SC-1 failed to promote back to active.
Apr 30 12:07:44 CONTROLLER-1 osafamfd[2214]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
Apr 30 12:07:44 CONTROLLER-1 osafamfnd[2224]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 30 12:07:44 CONTROLLER-1 osafimmnd[2158]: NO Implementer locally disconnected. Marking it as doomed 312 <496, 2010f> (safSmfService)
Apr 30 12:07:44 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 307 <328, 2010f> (safMsgGrpService)
.....
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 319 (safCheckPointService) <0, 2020f>
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 320 (safLckService) <0, 2020f>
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 321 (safEvtService) <0, 2020f>
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 322 (safClmService) <0, 2020f>
Apr 30 12:07:47 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 323 (safSmfService) <0, 2020f>
Apr 30 12:07:47 CONTROLLER-1 osafamfnd[2224]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 30 12:07:47 CONTROLLER-1 osafamfnd[2224]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 30 12:07:47 CONTROLLER-1 osafntfimcnd[2876]: NO exiting on signal 15
Apr 30 12:07:47 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 313 <503, 2010f> (@OpenSafImmReplicatorA)
Apr 30 12:07:47 CONTROLLER-1 osafamfd[2214]: NO Controller switch over initiated
Apr 30 12:07:47 CONTROLLER-1 osafamfd[2214]: NO ROLE SWITCH Active --> Quiesced
Apr 30 12:07:47 CONTROLLER-1 osafrded[2129]: NO RDE role set to QUIESCED
Apr 30 12:07:47 CONTROLLER-1 osafimmnd[2158]: NO Implementer (applier) connected: 324 (@OpenSafImmReplicatorA) <608, 2010f>
Apr 30 12:07:47 CONTROLLER-1 osafntfimcnd[3025]: NO Started
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: NO Node Down event for node id 2020f:
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA Director Service in NOACTIVE state - fevs replies pending:1 fevs highest processed:10375
Apr 30 12:07:50 CONTROLLER-1 kernel: [ 464.308117] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth3>, peer not responding
Apr 30 12:07:50 CONTROLLER-1 kernel: [ 464.308126] TIPC: Lost link <1.1.1:eth0-1.1.2:eth3> on network plane A
Apr 30 12:07:50 CONTROLLER-1 kernel: [ 464.308132] TIPC: Lost contact with <1.1.2>
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: NO Current role: QUIESCED
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO Skipping re-send of fevs message 10374 since it has recently been resent.
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO Skipping re-send of fevs message 10375 since it has recently been resent.
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA DISCARD DUPLICATE FEVS message:10374
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA Error code 2 returned for message type 57 - ignoring
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA DISCARD DUPLICATE FEVS message:10375
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA Error code 2 returned for message type 57 - ignoring
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Global discard node received for nodeId:2020f pid:2642
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 316 <0, 2020f(down)> (MsgQueueService131599)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 314 <0, 2020f(down)> (@safAmfService2020f)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 318 <0, 2020f(down)> (safLogService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 322 <0, 2020f(down)> (safClmService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 317 <0, 2020f(down)> (safMsgGrpService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 319 <0, 2020f(down)> (safCheckPointService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 320 <0, 2020f(down)> (safLckService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 321 <0, 2020f(down)> (safEvtService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 323 <0, 2020f(down)> (safSmfService)
Apr 30 12:07:50 CONTROLLER-1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: NO Controller Failover: Setting role to ACTIVE
Apr 30 12:07:50 CONTROLLER-1 osafrded[2129]: NO RDE role set to ACTIVE
Apr 30 12:07:50 CONTROLLER-1 osafntfd[2181]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osaflogd[2168]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osafclmd[2195]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO ellect_coord invoke from rda_callback ACTIVE
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO Coord re-elected, resides at 2010f
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO This IMMND re-elected coord redundantly, failover ?
Apr 30 12:07:50 CONTROLLER-1 osaflogd[2168]: WA read_logsv_configuration(). All attributes could not be read
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 325 (safLogService) <3, 2010f>
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 326 (safClmService) <5, 2010f>
Apr 30 12:07:58 CONTROLLER-1 osafamfd[2214]: ER FAILOVER Active --> Quiesced FAILED, ImplementerClear failed 5
Apr 30 12:07:58 CONTROLLER-1 osafamfd[2214]: role.cc:671: avd_mds_qsd_role_evh: Assertion '0' failed.
Apr 30 12:07:58 CONTROLLER-1 osafamfnd[2224]: ER AMF director unexpectedly crashed
Apr 30 12:07:58 CONTROLLER-1 osafamfnd[2224]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60
Apr 30 12:07:58 CONTROLLER-1 opensaf_reboot: Rebooting local node; timeout=60
Apr 30 12:07:58 CONTROLLER-1 osafimmnd[2158]: NO Implementer locally disconnected. Marking it as doomed 306 <10, 2010f> (safAmfService)
Apr 30 12:07:58 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 306 <10, 2010f> (safAmfService)
Below is the backtrace :
Traces of AMF and syslog are attached. IMM traces are of huge sizes. If required, appropriate logs at that time shall be shared.
Amfd timed out in immutil_saImmOiImplementerClear (10 sec) at SC-1:
Apr 30 12:07:48.356058 osafamfd [2214:role.cc:0651] >> avd_mds_qsd_role_evh
Apr 30 12:07:58.366509 osafamfd [2214:role.cc:0667] ER FAILOVER Active --> Quiesced FAILED, ImplementerClear failed 5
Need to check immsv logs for the lost request. Please upload immnd and immd traces during the above time period for SC-1.
Thanks
-Nagu
Last edit: Nagendra Kumar 2015-05-04
branch: opensaf-4.5.x
parent: 6590:d775d8fb7951
user: Nagendra Kumarnagendra.k@oracle.com
date: Wed May 27 16:14:56 2015 +0530
summary: amfd: try again if ImplementerClear times out [#1360]
changeset: 6594:447615c80905
branch: opensaf-4.6.x
parent: 6591:05d5ba64ae8a
user: Nagendra Kumarnagendra.k@oracle.com
date: Wed May 27 16:15:39 2015 +0530
summary: amfd: try again if ImplementerClear times out [#1360]
changeset: 6595:e163e5eebcb7
tag: tip
parent: 6592:17406d1e43d3
user: Nagendra Kumarnagendra.k@oracle.com
date: Wed May 27 16:15:47 2015 +0530
summary: amfd: try again if ImplementerClear times out [#1360]
[staging:49e69a]
[staging:447615]
[staging:e163e5]
Related
Tickets:
#1360