AMF currently performs a fail-over recovery action even though a component is in the termination-failed presence state. This can lead to severe inconsistencies for the application. The specification also clearly states, in 4.8, how this should work:
"If the component and any of its contained components (for a container component)
were assigned the active HA state for some component service instances when the
CLEANUP command was executed, and semantics of the redundancy model of its
enclosing service group guarantee that at a point in time only one component can be
in the active HA state for a given component service instance, the failure to terminate
that component prevents the Availability Management Framework from assigning to
another component the active HA state for these component service instances (and
by the same token prevents the assignment of the active HA state to other service
units for the service instances that contain the involved CSIs). In this case, the service
instances will stay unassigned until an administrative action is performed to terminate
the failed component."
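As a rough illustration of the rule quoted above, the decision can be sketched as a small predicate. This is only a reading of the quoted paragraph, not OpenSAF code: `Component`, `Presence`, `may_reassign_active` and the `single_active_guaranteed` flag are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Presence(Enum):
    INSTANTIATED = auto()
    TERMINATION_FAILED = auto()


@dataclass
class Component:
    name: str
    presence: Presence
    was_active: bool  # held the active HA state when CLEANUP was executed


def may_reassign_active(failed: Component, single_active_guaranteed: bool) -> bool:
    """Return True if the active HA state may be given to another component.

    Hypothetical helper: single_active_guaranteed stands for "the redundancy
    model guarantees at most one active component per CSI".
    """
    if failed.presence is not Presence.TERMINATION_FAILED:
        # Cleanup worked, so the usual recovery may proceed.
        return True
    # Cleanup failed: the old component may still be running as active, so a
    # second active assignment is only allowed if the model tolerates it.
    return not (failed.was_active and single_active_guaranteed)
```

Under this reading, a term-failed active component in a model that guarantees a single active assignment blocks the fail-over, and the SI stays unassigned until an administrator terminates the failed component.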
This can be tested by running the AMF 2N sa-aware sample app with the cleanup script modified to do "exit 1", which gives this effect when the active component is killed:
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'componentRestart'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Reason:'Exec of script success, but script exits with non-zero status'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Exit code: 1
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Component Failover trigerred for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1': Failed component: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATION_FAILED
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' QUIESCED to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' QUIESCED to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro amf_demo[11620]: CSI Set - HAState Active for all assigned CSIs
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Removing 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Removed 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
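The cleanup failure at the top of this log (the script is executed successfully but exits with a non-zero status) can be illustrated in isolation. The sketch below is not how amfnd runs the CLEANUP command; `run_cleanup` is a hypothetical helper, and a Python one-liner stands in for a cleanup script modified to do "exit 1".

```python
from __future__ import annotations

import subprocess
import sys


def run_cleanup(argv: list[str]) -> tuple[bool, str]:
    """Run a cleanup command and classify the outcome (hypothetical helper)."""
    try:
        result = subprocess.run(argv, check=False)
    except OSError as exc:
        # The command could not be executed at all.
        return False, f"exec of script failed: {exc}"
    if result.returncode != 0:
        # Exec succeeded, but the script itself reported failure.
        return False, f"script exited with non-zero status {result.returncode}"
    return True, "cleanup succeeded"


# Stand-in for a cleanup script that ends with "exit 1".
failing_cleanup = [sys.executable, "-c", "import sys; sys.exit(1)"]
ok, reason = run_cleanup(failing_cleanup)
print(ok, reason)
```

The point is only that a non-zero exit status is a cleanup failure in its own right, which is what moves the component to TERMINATION_FAILED in the log.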
This will change the behaviour of the application, so it looks like an enhancement to me.
While testing #542, #538 was observed in one case: the component moves to TERMINATION_FAILED because the cleanup script returns a failure, and AMFD then tries to fail over the assignments by giving the quiesced HA state to the faulted SU.
The "behaviour" that would change does not work anyway. That is what #573 was about. Instead of fixing this broken, incorrect behaviour we should do what the specification says.
Some design thoughts:
Last edit: Nagendra Kumar 2013-10-22
I have some patches for supporting saAmfNodeFailfastOnTerminationFailure
including:
- amfnd reading the value at startup
- amfnd acting on it if any comp enters TERM-FAILED
- amfd support for handling changes of saAmfNodeFailfastOnTerminationFailure
- amfnd support for handling changes of saAmfNodeFailfastOnTerminationFailure
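The node-side steps listed above (read the value at startup, react to changes, act when a component enters TERM-FAILED) could be sketched roughly as follows. This is an assumption-laden illustration, not the patch itself: `NodeDirector`, its methods and the action names are hypothetical, and only the attribute name saAmfNodeFailfastOnTerminationFailure comes from this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class NodeDirector:
    # Mirrors saAmfNodeFailfastOnTerminationFailure, read once at startup.
    failfast_on_termination_failure: bool = False
    # Recorded decisions; a real node director would act instead of logging.
    actions: list = field(default_factory=list)

    def on_attribute_change(self, value: bool) -> None:
        """Apply a runtime change of the attribute (hypothetical hook)."""
        self.failfast_on_termination_failure = value

    def on_component_presence(self, comp: str, state: str) -> None:
        """React to a component presence state change."""
        if state != "TERMINATION_FAILED":
            return
        if self.failfast_on_termination_failure:
            # Failing the node fast is one way to guarantee that the
            # term-failed component is really gone.
            self.actions.append(("node_failfast", comp))
        else:
            # Otherwise assignments stay blocked until an administrator
            # terminates the failed component.
            self.actions.append(("await_admin_repair", comp))
```

A short usage example under the same assumptions:

```python
nd = NodeDirector(failfast_on_termination_failure=False)
nd.on_component_presence("safComp=AmfDemo", "TERMINATION_FAILED")
nd.on_attribute_change(True)
nd.on_component_presence("safComp=AmfDemo", "TERMINATION_FAILED")
```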
I think this is an important first step.
New patches distributed today, please review asap for a chance to be part of the pending releases!
changeset: 5080:391010d84726
branch: opensaf-4.3.x
parent: 5077:23ed04837d5d
user: Hans Feldt hans.feldt@ericsson.com
date: Tue Mar 25 15:02:45 2014 +0100
summary: avd: allow modification of node repair attributes [#538]
changeset: 5081:fda126666841
branch: opensaf-4.3.x
user: Hans Feldt hans.feldt@ericsson.com
date: Tue Mar 25 15:02:46 2014 +0100
summary: avd: reboot node when term-failed SU [#538]
changeset: 5082:f96a9b16283c
branch: opensaf-4.3.x
user: Hans Feldt hans.feldt@ericsson.com
date: Tue Mar 25 15:02:47 2014 +0100
summary: avd: auto clear comp cleanup failed alarm [#538]
changeset: 5083:56984476d492
branch: opensaf-4.4.x
parent: 5078:0697fcffea42
user: Hans Feldt osafdevel@gmail.com
date: Fri Feb 28 08:54:48 2014 +0100
summary: amfd: allow modification of node repair attributes [#538]
changeset: 5084:ba77a529c8f1
branch: opensaf-4.4.x
user: Hans Feldt osafdevel@gmail.com
date: Fri Feb 28 08:54:49 2014 +0100
summary: amfd: reboot node when term-failed SU [#538]
changeset: 5085:560e4191aa5d
branch: opensaf-4.4.x
user: Hans Feldt osafdevel@gmail.com
date: Fri Feb 28 08:54:51 2014 +0100
summary: amfd: auto clear comp cleanup failed alarm [#538]
changeset: 5086:4ca3719141cc
parent: 5079:50e35f8bbaa4
user: Hans Feldt osafdevel@gmail.com
date: Fri Feb 28 08:54:48 2014 +0100
summary: amfd: allow modification of node repair attributes [#538]
changeset: 5087:f728668c6556
user: Hans Feldt osafdevel@gmail.com
date: Fri Feb 28 08:54:49 2014 +0100
summary: amfd: reboot node when term-failed SU [#538]
changeset: 5088:e071b7342c87
tag: tip
user: Hans Feldt osafdevel@gmail.com
date: Fri Feb 28 08:54:51 2014 +0100
summary: amfd: auto clear comp cleanup failed alarm [#538]
We need to conclude this TR. I published a patch for the remaining issue that needs to be finalized. I will rebase and repost it.