Menu

#1360 Quiesced controller failed to promote back to Active ( amfd asserted : ImplementerClear failed 5 )

4.5.2
fixed
None
defect
amf
d
4.6 GA
major
2015-05-27
2015-04-30
Srikanth R
No

Changeset : 6490

Issue : Quiesced controller failed to promote back to Active ( amfd asserted : ImplementerClear failed 5 )

-> Opensafd is brought up on 5 nodes in the cluster.

-> This issue is observed while verifying #707.

-> Invoked middleware si-swap operation ( SC-1 is the active and SC-2 is the standby)

-> Rebooted the old standby controller SC-2 using reboot -f command.

-> The quiesced controller SC-1 failed to promote back to active.

Apr 30 12:07:44 CONTROLLER-1 osafamfd[2214]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
Apr 30 12:07:44 CONTROLLER-1 osafamfnd[2224]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 30 12:07:44 CONTROLLER-1 osafimmnd[2158]: NO Implementer locally disconnected. Marking it as doomed 312 <496, 2010f> (safSmfService)
Apr 30 12:07:44 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 307 <328, 2010f> (safMsgGrpService)
.....
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 319 (safCheckPointService) <0, 2020f>
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 320 (safLckService) <0, 2020f>
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 321 (safEvtService) <0, 2020f>
Apr 30 12:07:46 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 322 (safClmService) <0, 2020f>
Apr 30 12:07:47 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 323 (safSmfService) <0, 2020f>
Apr 30 12:07:47 CONTROLLER-1 osafamfnd[2224]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 30 12:07:47 CONTROLLER-1 osafamfnd[2224]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 30 12:07:47 CONTROLLER-1 osafntfimcnd[2876]: NO exiting on signal 15
Apr 30 12:07:47 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 313 <503, 2010f> (@OpenSafImmReplicatorA)
Apr 30 12:07:47 CONTROLLER-1 osafamfd[2214]: NO Controller switch over initiated
Apr 30 12:07:47 CONTROLLER-1 osafamfd[2214]: NO ROLE SWITCH Active --> Quiesced
Apr 30 12:07:47 CONTROLLER-1 osafrded[2129]: NO RDE role set to QUIESCED
Apr 30 12:07:47 CONTROLLER-1 osafimmnd[2158]: NO Implementer (applier) connected: 324 (@OpenSafImmReplicatorA) <608, 2010f>
Apr 30 12:07:47 CONTROLLER-1 osafntfimcnd[3025]: NO Started
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: NO Node Down event for node id 2020f:
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA Director Service in NOACTIVE state - fevs replies pending:1 fevs highest processed:10375
Apr 30 12:07:50 CONTROLLER-1 kernel: [ 464.308117] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth3>, peer not responding
Apr 30 12:07:50 CONTROLLER-1 kernel: [ 464.308126] TIPC: Lost link <1.1.1:eth0-1.1.2:eth3> on network plane A
Apr 30 12:07:50 CONTROLLER-1 kernel: [ 464.308132] TIPC: Lost contact with <1.1.2>
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: NO Current role: QUIESCED
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO Skipping re-send of fevs message 10374 since it has recently been resent.
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO Skipping re-send of fevs message 10375 since it has recently been resent.
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA DISCARD DUPLICATE FEVS message:10374
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA Error code 2 returned for message type 57 - ignoring
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA DISCARD DUPLICATE FEVS message:10375
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: WA Error code 2 returned for message type 57 - ignoring
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Global discard node received for nodeId:2020f pid:2642
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 316 <0, 2020f(down)> (MsgQueueService131599)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 314 <0, 2020f(down)> (@safAmfService2020f)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 318 <0, 2020f(down)> (safLogService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 322 <0, 2020f(down)> (safClmService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 317 <0, 2020f(down)> (safMsgGrpService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 319 <0, 2020f(down)> (safCheckPointService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 320 <0, 2020f(down)> (safLckService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 321 <0, 2020f(down)> (safEvtService)
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 323 <0, 2020f(down)> (safSmfService)
Apr 30 12:07:50 CONTROLLER-1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Apr 30 12:07:50 CONTROLLER-1 osaffmd[2138]: NO Controller Failover: Setting role to ACTIVE
Apr 30 12:07:50 CONTROLLER-1 osafrded[2129]: NO RDE role set to ACTIVE
Apr 30 12:07:50 CONTROLLER-1 osafntfd[2181]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osaflogd[2168]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osafclmd[2195]: NO ACTIVE request
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO ellect_coord invoke from rda_callback ACTIVE
Apr 30 12:07:50 CONTROLLER-1 osafimmd[2148]: NO Coord re-elected, resides at 2010f
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO This IMMND re-elected coord redundantly, failover ?
Apr 30 12:07:50 CONTROLLER-1 osaflogd[2168]: WA read_logsv_configuration(). All attributes could not be read
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 325 (safLogService) <3, 2010f>
Apr 30 12:07:50 CONTROLLER-1 osafimmnd[2158]: NO Implementer connected: 326 (safClmService) <5, 2010f>
Apr 30 12:07:58 CONTROLLER-1 osafamfd[2214]: ER FAILOVER Active --> Quiesced FAILED, ImplementerClear failed 5
Apr 30 12:07:58 CONTROLLER-1 osafamfd[2214]: role.cc:671: avd_mds_qsd_role_evh: Assertion '0' failed.
Apr 30 12:07:58 CONTROLLER-1 osafamfnd[2224]: ER AMF director unexpectedly crashed
Apr 30 12:07:58 CONTROLLER-1 osafamfnd[2224]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60
Apr 30 12:07:58 CONTROLLER-1 opensaf_reboot: Rebooting local node; timeout=60
Apr 30 12:07:58 CONTROLLER-1 osafimmnd[2158]: NO Implementer locally disconnected. Marking it as doomed 306 <10, 2010f> (safAmfService)
Apr 30 12:07:58 CONTROLLER-1 osafimmnd[2158]: NO Implementer disconnected 306 <10, 2010f> (safAmfService)

Below is the backtrace :

0 0x00007ff43242eb55 in raise () from /lib64/libc.so.6

1 0x00007ff432430131 in abort () from /lib64/libc.so.6

2 0x00007ff43421637a in __osafassert_fail () from /usr/lib64/libopensaf_core.so.0

3 0x00000000004428ba in avd_mds_qsd_role_evh(cl_cb_tag, avd_evt_tag) ()

4 0x00000000004333ce in process_event(cl_cb_tag, avd_evt_tag) () at main.cc:810

5 0x000000000040786a in main () at main.cc:710

Traces of AMF and syslog are attached. IMM traces are of huge sizes. If required, appropriate logs at that time shall be shared.

1 Attachments

Related

Tickets: #1360
Wiki: ChangeLog-4.5.2
Wiki: ChangeLog-4.6.1

Discussion

  • Nagendra Kumar

    Nagendra Kumar - 2015-05-04
    • status: unassigned --> assigned
    • assigned_to: Nagendra Kumar
    • Component: unknown --> amf
    • Milestone: future --> 4.5.2
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-05-04

    Amfd timed out in immutil_saImmOiImplementerClear (10 sec) at SC-1:

    Apr 30 12:07:48.356058 osafamfd [2214:role.cc:0651] >> avd_mds_qsd_role_evh
    Apr 30 12:07:58.366509 osafamfd [2214:role.cc:0667] ER FAILOVER Active --> Quiesced FAILED, ImplementerClear failed 5

    Need to check immsv logs for the lost request. Please upload immnd and immd traces during the above time period for SC-1.

    Thanks
    -Nagu

     

    Last edit: Nagendra Kumar 2015-05-04
  • Nagendra Kumar

    Nagendra Kumar - 2015-05-08
    • status: assigned --> accepted
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-05-12
    • status: accepted --> review
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-05-27
    • status: review --> fixed
    • Part: - --> d
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-05-27

    branch: opensaf-4.5.x
    parent: 6590:d775d8fb7951
    user: Nagendra Kumarnagendra.k@oracle.com
    date: Wed May 27 16:14:56 2015 +0530
    summary: amfd: try again if ImplementerClear times out [#1360]

    changeset: 6594:447615c80905
    branch: opensaf-4.6.x
    parent: 6591:05d5ba64ae8a
    user: Nagendra Kumarnagendra.k@oracle.com
    date: Wed May 27 16:15:39 2015 +0530
    summary: amfd: try again if ImplementerClear times out [#1360]

    changeset: 6595:e163e5eebcb7
    tag: tip
    parent: 6592:17406d1e43d3
    user: Nagendra Kumarnagendra.k@oracle.com
    date: Wed May 27 16:15:47 2015 +0530
    summary: amfd: try again if ImplementerClear times out [#1360]

    [staging:49e69a]
    [staging:447615]
    [staging:e163e5]

     

    Related

    Tickets: #1360


Log in to post a comment.