#1500 AMF: Fails to repair failed SU when failed component has attribute saAmfCompDisableRestart=1

4.6.2
fixed
None
defect
amf
nd
4.7.M0
minor
2016-01-21
2015-09-24
Quyen Dao
No

Steps to reproduce
Load the attached model
Set saAmfCompDisableRestart=1 on the component in SU1
Change the component CLC-CLI script to return 1 for cleanup
Trigger a component termination failure by killing the component in SU1
Repair SU1
-> Result: SU1 fails to repair.

command log
root@PL-3:~# immcfg -a saAmfCompDisableRestart=1 safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
root@PL-3:~# amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=ENABLED(1)
saAmfSUPresenceState=INSTANTIATED(3)
saAmfSUReadinessState=IN-SERVICE(2)
root@PL-3:~# pkill amf_demo
root@PL-3:~# amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=DISABLED(2)
saAmfSUPresenceState=TERMINATION-FAILED(7)
saAmfSUReadinessState=OUT-OF-SERVICE(1)
root@PL-3:~#
root@PL-3:~# amf-adm repaired safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
root@PL-3:~# amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=DISABLED(2)
saAmfSUPresenceState=TERMINATION-FAILED(7)
saAmfSUReadinessState=OUT-OF-SERVICE(1)
root@PL-3:~#

syslog
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO saAmfCompDisableRestart changed to 1 for 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:05 PL-3 osafimmnd[389]: NO Ccb 3 COMMITTED (immcfg_PL-3_641)
Sep 24 20:14:05 PL-3 amf_demo[585]: exiting (caught term signal)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO saAmfCompDisableRestart is true for 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO recovery action 'comp restart' escalated to 'comp failover'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO saAmfSUFailover is true for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 1)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'suFailover'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'suFailover'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Terminating components of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'(abruptly & unordered)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Reason:'Exec of script success, but script exits with non-zero status'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Exit code: 1
Sep 24 20:14:05 PL-3 osafamfnd[417]: WA 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
Sep 24 20:14:14 PL-3 osafamfnd[417]: ER ncsmds_api for 0 FAILED, dest=2030f00000249

Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Repair request for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATION_FAILED => UNINSTANTIATED
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State UNINSTANTIATED => INSTANTIATING
Sep 24 20:14:22 PL-3 amf_demo[734]: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATING => INSTANTIATED
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' STANDBY to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 amf_demo[734]: saAmfHealthcheckStart FAILED - 14
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO saAmfCompDisableRestart is true for 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO recovery action 'comp restart' escalated to 'comp failover'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO saAmfSUFailover is true for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 2)
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'suFailover'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'suFailover'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Terminating components of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'(abruptly & unordered)
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Reason:'Exec of script success, but script exits with non-zero status'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Exit code: 1
Sep 24 20:14:22 PL-3 osafamfnd[417]: WA 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED

1 Attachments

Related

Tickets: #1500
Wiki: ChangeLog-4.6.2

Discussion

  • Praveen

    Praveen - 2015-10-13

    I could not reproduce this issue on the latest changeset. Attached are the AMF traces after a successful repair of the SU.
    Please share the AMF traces from the time when the issue was observed.

     
  • Quyen Dao

    Quyen Dao - 2015-10-13

    I could reproduce the problem on the latest changeset (6994:092665e82e11). Please find attached the test log, syslog and trace.

     
  • Minh Hon Chau

    Minh Hon Chau - 2015-10-13
    • status: unassigned --> accepted
    • assigned_to: Minh Hon Chau
     
  • Minh Hon Chau

    Minh Hon Chau - 2015-10-20

    This issue can be reproduced on the 4.5 branch, which does not have the admin-restart implementation.
    I have tested 2 scenarios:
    1- Test case 1
    . Load a 2N model with saAmfSGCompRestartMax=0, saAmfSGSuRestartMax=0
    . Change the clc cleanup script to return 1
    . Killing amf_demo once escalates to su_failover
    . Change the clc cleanup script to return 0
    . Unable to repair the failed SU
    2- Test case 2
    . Load a 2N model with saAmfSGCompRestartMax=0, saAmfSGSuRestartMax=0
    . Change the clc cleanup script to return 0
    . Killing amf_demo once escalates to su_failover
    . Change the clc cleanup script to return 0
    . The failed SU is repaired successfully

    In test case 1, the repair of the SU fails because avnd_comp_clc_terming_cleanfail_hdler() doesn't delete the comp info (avnd_comp_curr_info_del()), while in test case 2, avnd_comp_clc_terming_cleansucc_hdler() does call avnd_comp_curr_info_del().

    If avnd_comp_curr_info_del() is not called, the comp gets the following error when it starts again:
    Oct 21 09:43:41 PL-4 amf_demo[716]: saAmfHealthcheckStart FAILED - 14

     
  • Minh Hon Chau

    Minh Hon Chau - 2015-10-20
     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> 4.6.2
     
  • Minh Hon Chau

    Minh Hon Chau - 2015-12-15
    • status: accepted --> review
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-01-21
    • status: review --> fixed
     
  • Praveen

    Praveen - 2016-01-21

    https://sourceforge.net/p/opensaf/mailman/message/34695694/

    changeset: 7242:40764da009d0
    parent: 7239:79f73818ea89
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Thu Jan 21 10:52:44 2016 +0530
    summary: amfnd: delete comp_curr_info if comp fails into TERMINATION_FAILED [#1500]

    changeset: 7241:fe628aa13d14
    branch: opensaf-4.7.x
    parent: 7238:120d4ca82e39
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Thu Jan 21 10:52:14 2016 +0530
    summary: amfnd: delete comp_curr_info if comp fails into TERMINATION_FAILED [#1500]

    changeset: 7240:fb3727aebc20
    branch: opensaf-4.6.x
    parent: 7237:7e128bc2d76d
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Thu Jan 21 10:51:49 2016 +0530
    summary: amfnd: delete comp_curr_info if comp fails into TERMINATION_FAILED [#1500]

     
