Steps to reproduce
Load the attached model
Set saAmfCompDisableRestart=1 on the component in SU1
Change the component's CLC-CLI script to return 1 for cleanup
Trigger a component termination failure by killing the component in SU1
Repair SU1
-> Result: SU1 fails to repair.
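Step 4 only requires the cleanup command to exit non-zero. The following is a hypothetical sketch, not the stock amf_demo script. It assumes the CLC-CLI command is invoked with the operation name (for example `cleanup`) as its first argument. `CLEANUP_RC_FILE` is an invented knob that lets the cleanup exit code be switched between 1 and 0 without editing the script.

```sh
#!/bin/sh
# Hypothetical CLC-CLI wrapper for the AmfDemo component (not the stock
# OpenSAF sample). Assumption: it is called with the CLC operation name as $1.

# Hypothetical knob: the cleanup exit code is read from this file so it can be
# flipped between 1 (force TERMINATION-FAILED) and 0 without editing the script.
CLEANUP_RC_FILE=${CLEANUP_RC_FILE:-/tmp/amf_demo_cleanup_rc}

amf_demo_clc() {
    case "$1" in
        cleanup)
            # Killing leftover processes is omitted here; only the reported
            # result matters. Default to failure (1), as in step 4.
            cleanup_rc=1
            if [ -r "$CLEANUP_RC_FILE" ]; then
                cleanup_rc=$(cat "$CLEANUP_RC_FILE")
            fi
            return "$cleanup_rc"
            ;;
        *)
            # Other CLC operations (instantiate etc.) are out of scope here.
            echo "unsupported operation: $1" >&2
            return 2
            ;;
    esac
}

# Dispatch only when called with arguments, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    amf_demo_clc "$@"
    exit $?
fi
```

For example, `echo 1 > /tmp/amf_demo_cleanup_rc` forces the TERMINATION-FAILED path, and `echo 0 > /tmp/amf_demo_cleanup_rc` restores a successful cleanup before `amf-adm repaired` is issued.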
Command log
root@PL-3:~# immcfg -a saAmfCompDisableRestart=1 safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
root@PL-3:~# amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=ENABLED(1)
saAmfSUPresenceState=INSTANTIATED(3)
saAmfSUReadinessState=IN-SERVICE(2)
root@PL-3:~# pkill amf_demo
root@PL-3:~# amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=DISABLED(2)
saAmfSUPresenceState=TERMINATION-FAILED(7)
saAmfSUReadinessState=OUT-OF-SERVICE(1)
root@PL-3:~#
root@PL-3:~# amf-adm repaired safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
root@PL-3:~# amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=DISABLED(2)
saAmfSUPresenceState=TERMINATION-FAILED(7)
saAmfSUReadinessState=OUT-OF-SERVICE(1)
root@PL-3:~#
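The repair check in the command log above can be scripted by reading the attributes that `amf-state su all` prints. This is a minimal sketch, assuming the `attribute=VALUE(number)` line format shown above, optionally indented. `su_attr` and `su_repaired` are hypothetical helper names, and "repaired" is taken to mean ENABLED and INSTANTIATED, as in the first amf-state output.

```sh
#!/bin/sh
# Print the symbolic value of one attribute from `amf-state su all <DN>`
# output read on stdin, e.g. "saAmfSUPresenceState=TERMINATION-FAILED(7)"
# yields "TERMINATION-FAILED".
su_attr() {
    sed -n "s/^[[:space:]]*$1=\([A-Z_-]*\)(.*\$/\1/p"
}

# Succeed (status 0) only if the SU is back in the healthy state seen before
# the fault: operationally ENABLED and INSTANTIATED.
su_repaired() {
    out=$(cat)    # read amf-state output once, then query it twice
    oper=$(printf '%s\n' "$out" | su_attr saAmfSUOperState)
    pres=$(printf '%s\n' "$out" | su_attr saAmfSUPresenceState)
    [ "$oper" = ENABLED ] && [ "$pres" = INSTANTIATED ]
}
```

Usage after the repair command: `amf-state su all safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 | su_repaired && echo repaired || echo "repair failed"`. With the output above it reports a failed repair.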
Syslog
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO saAmfCompDisableRestart changed to 1 for 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:05 PL-3 osafimmnd[389]: NO Ccb 3 COMMITTED (immcfg_PL-3_641)
Sep 24 20:14:05 PL-3 amf_demo[585]: exiting (caught term signal)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO saAmfCompDisableRestart is true for 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO recovery action 'comp restart' escalated to 'comp failover'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO saAmfSUFailover is true for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 1)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'suFailover'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'suFailover'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Terminating components of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'(abruptly & unordered)
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Reason:'Exec of script success, but script exits with non-zero status'
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO Exit code: 1
Sep 24 20:14:05 PL-3 osafamfnd[417]: WA 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
Sep 24 20:14:05 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
Sep 24 20:14:14 PL-3 osafamfnd[417]: ER ncsmds_api for 0 FAILED, dest=2030f00000249
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Repair request for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATION_FAILED => UNINSTANTIATED
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State UNINSTANTIATED => INSTANTIATING
Sep 24 20:14:22 PL-3 amf_demo[734]: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATING => INSTANTIATED
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' STANDBY to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 amf_demo[734]: saAmfHealthcheckStart FAILED - 14
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO saAmfCompDisableRestart is true for 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO recovery action 'comp restart' escalated to 'comp failover'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO saAmfSUFailover is true for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Performing failover of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' (SU failover count: 2)
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' recovery action escalated from 'componentRestart' to 'suFailover'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'suFailover'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Terminating components of 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'(abruptly & unordered)
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Reason:'Exec of script success, but script exits with non-zero status'
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO Exit code: 1
Sep 24 20:14:22 PL-3 osafamfnd[417]: WA 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
Sep 24 20:14:22 PL-3 osafamfnd[417]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State TERMINATING => TERMINATION_FAILED
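The presence-state sequence in the syslog (TERMINATION_FAILED => UNINSTANTIATED => INSTANTIATING => INSTANTIATED, then back to TERMINATION_FAILED after the second cleanup failure) can be extracted for one entity with a small filter. This is a sketch, assuming the syslog line format shown above. `presence_transitions` is a hypothetical helper, and the DN is pasted unescaped into a sed pattern, so it must not contain `/` or regex metacharacters.

```sh
#!/bin/sh
# Print "<timestamp> FROM => TO" for every Presence State change logged for
# the exact DN in $1; syslog text is read from stdin. Lines for other
# entities (e.g. the component DN, which merely ends with the SU DN) are
# skipped because the DN must directly follow the opening quote.
presence_transitions() {
    sed -n "s/^\([A-Z][a-z]* *[0-9]* [0-9:]*\) .*'$1' Presence State \([A-Z_]*\) => \([A-Z_]*\)\$/\1 \2 => \3/p"
}
```

For example, `presence_transitions 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' < /var/log/syslog` (the log path varies by distribution) shows that the SU re-enters TERMINATION_FAILED within the same second as the repair.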
I could not reproduce this issue on the latest changeset. Attached are AMF traces taken after a successful repair of the SU.
Please share AMF traces from the time the issue was observed.
I could reproduce the problem on the latest changeset (6994:092665e82e11). Please find attached the test log, syslog and trace.
This issue can also be reproduced on the 4.5 branch, which does not have the admin-restart implementation.
I tested two scenarios:
1- Test case 1
. Load a 2N model with saAmfSGCompRestartMax=0 and saAmfSGSuRestartMax=0
. Change the CLC cleanup script to return 1
. Kill amf_demo once; the recovery escalates to su_failover
. Change the CLC cleanup script to return 0
. Repairing the failed SU fails
2- Test case 2
. Load a 2N model with saAmfSGCompRestartMax=0 and saAmfSGSuRestartMax=0
. Change the CLC cleanup script to return 0
. Kill amf_demo once; the recovery escalates to su_failover
. Change the CLC cleanup script to return 0
. Repairing the failed SU succeeds
In test case 1, the SU repair fails because avnd_comp_clc_terming_cleanfail_hdler() does not delete the component's current info (it does not call avnd_comp_curr_info_del()), whereas in test case 2 avnd_comp_clc_terming_cleansucc_hdler() does call avnd_comp_curr_info_del().
If avnd_comp_curr_info_del() is not called, the component gets the following error when it starts again (14 is SA_AIS_ERR_EXIST):
Oct 21 09:43:41 PL-4 amf_demo[716]: saAmfHealthcheckStart FAILED - 14
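The analysis above can be illustrated with a toy model. This is not amfnd code. `curr_info`, `healthcheck_start`, `terminate_comp` and `run_scenario` are hypothetical names. The model assumes that the stale per-component info is what makes the re-instantiated component's saAmfHealthcheckStart return 14 (SA_AIS_ERR_EXIST). The `fixed` mode mimics the changesets below, which delete comp_curr_info when the component falls into TERMINATION_FAILED.

```sh
#!/bin/sh
# Toy model (hypothetical, not OpenSAF code) of the stale comp info bug.
SA_AIS_OK=1
SA_AIS_ERR_EXIST=14

curr_info=""    # non-empty while the model keeps per-component info

# Registering a healthcheck while old info is still present is rejected.
healthcheck_start() {
    if [ -n "$curr_info" ]; then
        hc_rc=$SA_AIS_ERR_EXIST
    else
        curr_info=registered
        hc_rc=$SA_AIS_OK
    fi
}

# $1: cleanup exit code, $2: "fixed" or "unfixed".
# Cleanup success always deletes the info (the cleansucc path); cleanup
# failure deletes it only in "fixed" mode (the cleanfail path after #1500).
terminate_comp() {
    if [ "$1" -eq 0 ] || [ "$2" = fixed ]; then
        curr_info=""
    fi
}

# Instantiate, kill/clean up with exit code $1, repair, and print the
# healthcheck-start result seen by the re-instantiated component.
run_scenario() {
    curr_info=""
    healthcheck_start
    terminate_comp "$1" "$2"
    healthcheck_start
    echo "$hc_rc"
}
```

`run_scenario 1 unfixed` corresponds to test case 1 and prints 14. `run_scenario 0 unfixed` (test case 2) and `run_scenario 1 fixed` both print 1 (SA_AIS_OK).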
https://sourceforge.net/p/opensaf/mailman/message/34695694/
changeset: 7242:40764da009d0
parent: 7239:79f73818ea89
user: Minh Hon Chau <minh.chau@dektech.com.au>
date: Thu Jan 21 10:52:44 2016 +0530
summary: amfnd: delete comp_curr_info if comp fails into TERMINATION_FAILED [#1500]
changeset: 7241:fe628aa13d14
branch: opensaf-4.7.x
parent: 7238:120d4ca82e39
user: Minh Hon Chau <minh.chau@dektech.com.au>
date: Thu Jan 21 10:52:14 2016 +0530
summary: amfnd: delete comp_curr_info if comp fails into TERMINATION_FAILED [#1500]
changeset: 7240:fb3727aebc20
branch: opensaf-4.6.x
parent: 7237:7e128bc2d76d
user: Minh Hon Chau <minh.chau@dektech.com.au>
date: Thu Jan 21 10:51:49 2016 +0530
summary: amfnd: delete comp_curr_info if comp fails into TERMINATION_FAILED [#1500]
Related tickets: #1500