
#2393 IMMD crashed on Active as IMMND restarted on Active, in a cluster having a single controller and payload

Milestone: 5.2.RC2
Status: invalid
Owner: nobody
Labels: None
Type: defect
Component: immd
Version: 5.2.RC1
Severity: minor
Updated: 2017-03-24
Created: 2017-03-23
Creator: Ritu Raj
Private: No

Environment details

OS: SUSE 64-bit
Changeset: 8701 (5.2.RC1)
2-node setup (1 controller and 1 payload)

Summary

IMMD crashed on Active as IMMND restarted on Active, in a cluster having a single controller and payload.

Steps followed & Observed behaviour

  1. Bring up the cluster with 1 controller and 1 payload
  2. Kill IMMND on the active controller
  3. Observed that IMMD crashed on the active controller (SC-1), due to which the payload also got rebooted

** Issue observed when there is only one controller **
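For quick triage, a small shell helper can check a syslog file for the reload signature reported here. This is an illustrative sketch: the helper name `check_imm_reload` is invented, and the syslog path differs per distribution.

```shell
# check_imm_reload FILE
# Succeeds if FILE contains the IMMD reload signature seen in this ticket.
# Helper name is invented for illustration.
check_imm_reload() {
    grep -qE 'Failed to find candidate for new IMMND coordinator|IMM RELOAD with NO persistent back end' "$1"
}

# Example usage (log path is an assumption, adjust per distribution):
#   check_imm_reload /var/log/messages && echo "IMMD reload detected"
```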

Syslog
SC-1:::

Mar 23 11:06:12 SO-SLOT-1 osafamfnd[2213]: NO 'safSu=SC-1,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
Mar 23 11:06:12 SO-SLOT-1 osafamfnd[2213]: NO Restarting a component of 'safSu=SC-1,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
Mar 23 11:06:12 SO-SLOT-1 osafamfnd[2213]: NO 'safComp=IMMND,safSu=SC-1,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Mar 23 11:06:12 SO-SLOT-1 osafsmfd[2235]: WA DispatchOiCallback: saImmOiDispatch() Fail 'SA_AIS_ERR_BAD_HANDLE (9)'
Mar 23 11:06:12 SO-SLOT-1 osafntfimcnd[2181]: NO saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
Mar 23 11:06:12 SO-SLOT-1 osafimmd[2138]: WA IMMND coordinator at 2010f apparently crashed => electing new coord
Mar 23 11:06:12 SO-SLOT-1 osafimmd[2138]: ER Failed to find candidate for new IMMND coordinator (ScAbsenceAllowed:0 RulingEpoch:2
Mar 23 11:06:12 SO-SLOT-1 osafimmd[2138]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
Mar 23 11:06:12 SO-SLOT-1 osafimmd[2138]: ER IMM RELOAD with NO persistent back end => ensure cluster restart by IMMD exit at both SCs, exiting
Mar 23 11:06:12 SO-SLOT-1 osafamfnd[2213]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 23 11:06:12 SO-SLOT-1 osafamfnd[2213]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 23 11:06:12 SO-SLOT-1 osafamfnd[2213]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 23 11:06:12 SO-SLOT-1 opensaf_reboot: Rebooting local node; timeout=60

PL-3:::
Mar 23 11:06:21 SO-SLOT-3 osafimmnd[2280]: ER IMMND forced to restart on order from IMMD, exiting
Mar 23 11:06:21 SO-SLOT-3 osafamfnd[2290]: NO 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
Mar 23 11:06:21 SO-SLOT-3 osafamfnd[2290]: NO Restarting a component of 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
Mar 23 11:06:21 SO-SLOT-3 osafamfnd[2290]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Mar 23 11:06:21 SO-SLOT-3 osafimmnd[2755]: mkfifo already exists: /var/lib/opensaf/osafimmnd.fifo File exists
Mar 23 11:06:21 SO-SLOT-3 osafimmnd[2755]: Started
Mar 23 11:06:26 SO-SLOT-3 osafamfnd[2290]: WA AMF director unexpectedly crashed
Mar 23 11:06:26 SO-SLOT-3 osafamfnd[2290]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131855, SupervisionTime = 60

Traces:
The traces on the Active show 'Failed to find candidate for new IMMND coordinator', after which the Active IMMD has to restart the IMMSv:

Mar 23 11:06:12.535325 osafimmd [2138:src/imm/immd/immd_evt.c:2638] T5 Received IMMND service event
Mar 23 11:06:12.535349 osafimmd [2138:src/imm/immd/immd_evt.c:2741] T5 PROCESS MDS EVT: NCSMDS_DOWN, my PID:2138
Mar 23 11:06:12.535451 osafimmd [2138:src/imm/immd/immd_evt.c:2748] T5 NCSMDS_DOWN => local IMMND down
Mar 23 11:06:12.535463 osafimmd [2138:src/imm/immd/immd_evt.c:2763] T5 IMMND DOWN PROCESS detected by IMMD
Mar 23 11:06:12.535475 osafimmd [2138:src/imm/immd/immd_proc.c:0618] >> immd_process_immnd_down
Mar 23 11:06:12.535483 osafimmd [2138:src/imm/immd/immd_proc.c:0621] T5 immd_process_immnd_down pid:2149 on-active:1 cb->immnd_coord:2010f
Mar 23 11:06:12.535503 osafimmd [2138:src/imm/immd/immd_proc.c:0628] WA IMMND coordinator at 2010f apparently crashed => electing new coord
Mar 23 11:06:12.535516 osafimmd [2138:src/imm/immd/immd_proc.c:0204] >> immd_proc_elect_coord
Mar 23 11:06:12.535536 osafimmd [2138:src/imm/immd/immd_proc.c:0320] ER **Failed to find candidate for new IMMND coordinator** (ScAbsenceAllowed:0 RulingEpoch:2
Mar 23 11:06:12.535542 osafimmd [2138:src/imm/immd/immd_proc.c:0322] << immd_proc_elect_coord
Mar 23 11:06:12.535547 osafimmd [2138:src/imm/immd/immd_proc.c:0059] >> immd_proc_immd_reset
Mar 23 11:06:12.535560 osafimmd [2138:src/imm/immd/immd_proc.c:0062] ER **Active IMMD has to restart the IMMSv. All IMMNDs will restart**
Mar 23 11:06:12.535567 osafimmd [2138:src/imm/immd/immd_mbcsv.c:0044] >> immd_mbcsv_sync_update
Mar 23 11:06:12.535574 osafimmd [2138:src/mbc/mbcsv_api.c:0773] >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
Mar 23 11:06:12.535582 osafimmd [2138:src/mbc/mbcsv_api.c:0803] TR svc_id:42, pwe_hdl:65549
Mar 23 11:06:12.535587 osafimmd [2138:src/mbc/mbcsv_api.c:0807] T1 No STANDBY peers found yet
Mar 23 11:06:12.535593 osafimmd [2138:src/mbc/mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
Mar 23 11:06:12.535598 osafimmd [2138:src/imm/immd/immd_mbcsv.c:0062] << immd_mbcsv_sync_update
Mar 23 11:06:12.535604 osafimmd [2138:src/imm/immd/immd_mds.c:0762] >> immd_mds_bcast_send
Mar 23 11:06:12.535610 osafimmd [2138:src/imm/common/immsv_evt.c:5422] T8 Sending:  IMMND_EVT_D2ND_RESET to 0
Mar 23 11:06:12.535868 osafimmd [2138:src/imm/immd/immd_mds.c:0782] << immd_mds_bcast_send
Mar 23 11:06:12.535917 osafimmd [2138:src/imm/immd/immd_proc.c:0104] ER IMM RELOAD with NO persistent back end => ensure cluster restart by IMMD exit at both SCs, exiting

Note:

  1. Syslogs of the Active controller and the payload attached
  2. IMMND and IMMD traces attached
2 Attachments

Discussion

  • Anders Bjornerstedt

    Unless this ticket describes a system that has been configured to allow headless/SC-absence, the above is expected behavior and this ticket is invalid.

    I see no mention of headless/SC-absence.

    The cluster has to reload because the IMMND at a payload cannot take on the role of coordinator IMMND in a normal configuration.

  • Anders Bjornerstedt

    Note also that the IMMD does not "crash", it exits.

    Mar 23 11:06:12 SO-SLOT-1 osafimmd[2138]: ER IMM RELOAD with NO persistent back end => ensure cluster restart by IMMD exit at both SCs, exiting

  • Zoran Milinkovic

    • status: unassigned --> invalid
  • Zoran Milinkovic

    This is expected behavior when SC absence is not allowed.
    Even if there were more payloads, a cluster reboot would be initiated due to the absence of IMMNDs on the controllers.

    SC absence is set to 0 (not allowed). This can be seen in the message:
    Mar 23 11:06:12 SO-SLOT-1 osafimmd[2138]: ER Failed to find candidate for new IMMND coordinator (ScAbsenceAllowed:0 RulingEpoch:2
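For reference, OpenSAF's headless (SC-absence) feature is enabled through the IMMD configuration file. A sketch of the setting follows; the timeout value of 900 seconds and the file path are illustrative and may vary per installation.

```shell
# /etc/opensaf/immd.conf (path may vary by installation)
# Allow the cluster to run without system controllers for up to 900 s.
# Leaving this unset or 0 gives the behavior seen in this ticket:
# no coordinator candidate on a payload, so the cluster reloads.
export IMMSV_SC_ABSENCE_ALLOWED=900
```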

     

