
#538 AMF: fail-over assignments despite comps in TERM-FAILED state

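Milestone: future
Status: assigned
Owner: Praveen
Labels: None
Type: defect
Component: amf
Part: nd
Version: 4.2
Severity: major
Updated: 2016-10-13
Created: 2013-08-09
Creator: Hans Feldt
Private: No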

AMF currently performs the fail-over recovery action even though a component is in the termination-failed presence state. This can lead to severe inconsistencies for the application. Section 4.8 of the specification also clearly states how this should work:

"If the component and any of its contained components (for a container component)
were assigned the active HA state for some component service instances when the
CLEANUP command was executed, and semantics of the redundancy model of its
enclosing service group guarantee that at a point in time only one component can be
in the active HA state for a given component service instance, the failure to terminate
that component prevents the Availability Management Framework from assigning to
another component the active HA state for these component service instances (and
by the same token prevents the assignment of the active HA state to other service
units for the service instances that contain the involved CSIs). In this case, the service
instances will stay unassigned until an administrative action is performed to terminate
the failed component."
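
The rule quoted above can be read as a simple predicate. The following is only a minimal sketch of that reading, not OpenSAF code; the names `Presence`, `Component` and `may_reassign_active` are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Presence(Enum):
    INSTANTIATED = auto()
    TERMINATION_FAILED = auto()


@dataclass
class Component:
    name: str
    presence: Presence
    # True if the component was active for some CSI when CLEANUP was executed
    had_active_at_cleanup: bool


def may_reassign_active(comp: Component, single_active_guaranteed: bool) -> bool:
    """Return True if the active HA state of comp's CSIs may be given to another component.

    single_active_guaranteed models "the redundancy model guarantees that only
    one component can be active for a given CSI at a point in time".
    """
    blocked = (
        comp.presence is Presence.TERMINATION_FAILED
        and comp.had_active_at_cleanup
        and single_active_guaranteed
    )
    return not blocked
```

With a component that was active at cleanup and is now in `TERMINATION_FAILED`, the predicate returns `False` when the single-active guarantee holds, which corresponds to the service instances staying unassigned until an administrative action is performed.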

This can be reproduced by running the AMF 2N sa-aware sample app with the cleanup script modified to do "exit 1". Killing the active component then gives this result:

Aug 9 08:40:01 Vostro osafamfnd[11307]: NO 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' faulted due to 'avaDown' : Recovery is 'componentRestart'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Cleanup of 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' failed
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Reason:'Exec of script success, but script exits with non-zero status'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Exit code: 1
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Component Failover trigerred for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1': Failed component: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATION_FAILED
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' QUIESCED to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' QUIESCED to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro amf_demo[11620]: CSI Set - HAState Active for all assigned CSIs
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Removing 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Aug 9 08:40:01 Vostro osafamfnd[11307]: NO Removed 'safSi=AmfDemo,safApp=AmfDemo1' from 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
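
When repeating this test, a log like the one above can be checked mechanically. The sketch below is a rough heuristic, assuming the message wording shown in this ticket ("Cleanup of '...' failed", "Assigning '...' ACTIVE to '...'"); the function names are hypothetical, and the DN handling is naive (it splits on the first comma and ignores escaped commas).

```python
import re

CLEANUP_FAILED = re.compile(r"Cleanup of '(?P<comp>[^']+)' failed")
ASSIGNING_ACTIVE = re.compile(r"Assigning '(?P<si>[^']+)' ACTIVE to '(?P<su>[^']+)'")


def parent_dn(dn):
    """Drop the first RDN, e.g. comp DN -> SU DN, SU DN -> SG DN."""
    return dn.split(",", 1)[1]


def find_violations(lines):
    """Return (failed_su, si, new_su) for each ACTIVE assignment to another SU
    of the same SG that is logged after a cleanup failure."""
    failed_su_by_sg = {}
    violations = []
    for line in lines:
        m = CLEANUP_FAILED.search(line)
        if m:
            su = parent_dn(m.group("comp"))
            failed_su_by_sg[parent_dn(su)] = su
            continue
        m = ASSIGNING_ACTIVE.search(line)
        if m:
            target = m.group("su")
            failed = failed_su_by_sg.get(parent_dn(target))
            if failed is not None and failed != target:
                violations.append((failed, m.group("si"), target))
    return violations
```

Fed the lines of the log above, this reports one violation: the ACTIVE assignment to SU2 after the cleanup of the component in SU1 failed.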

Related

Tickets: #538
Wiki: ChangeLog-4.3.3
Wiki: ChangeLog-4.4.1

Discussion

  • Nagendra Kumar

    Nagendra Kumar - 2013-08-12

    This will change the behaviour of the application, so it looks like an enhancement to me.

     
  • Praveen

    Praveen - 2013-08-21

    While testing #542, #538 was observed in one case, when a component moved to TERMINATION_FAILED due to a failure returned by the cleanup script. AMFD tries to fail over the assignments by giving the quiesced HA state to the faulted SU.

     
  • Praveen

    Praveen - 2013-10-09
    • Type: defect --> enhancement
     
  • Hans Feldt

    Hans Feldt - 2013-10-09
    • Type: enhancement --> defect
     
  • Hans Feldt

    Hans Feldt - 2013-10-09

    The "behaviour" that will change does not work anyway. That is what #573 was about. Instead of fixing this broken, incorrect behaviour we should do what the specification says.

     
  • Nagendra Kumar

    Nagendra Kumar - 2013-10-22

    Some design thoughts:

    1. When an SU goes to the term-failed state, Amfnd should inform Amfd so that it can take corrective action. All other components in the SU, except the component in the term-failed state, should be cleaned up.
    2. Amfd can send a quiesced or remove assignment to Amfnd.
    3. Amfnd shouldn’t process the assignment, as the SU is in the term-failed state, and should store the SUSI message in a buffer.
    4. The state held by Amfd and Amfnd for that particular SU will be frozen as it is.
    5. When a repair action is performed on the term-failed SU (even when the SG is unstable), Amfd should allow the admin operation.
    6. Amfd should send the admin operation to Amfnd to instantiate the SU.
    7. Amfnd should instantiate the SU, then process the buffered SUSI command and send the SUSI response back to Amfd.
    8. If the SU faulted into the term-failed state while an assignment was in progress, Amfd shouldn’t send any further SUSI assignment to Amfnd when Amfnd reports the SU term failure. Rather, when the repair admin command reaches Amfnd, it should enable the SU, then complete the in-progress SUSI assignment and respond to Amfd.
    9. The above points stop the SG when an SU goes to the term-failed state; after the admin command is performed, the SG resumes from the state in which it was stopped. It is like suspending the SG for that SU for any operation other than the admin repair command.
    10. Faults can happen in other assigned/unassigned SUs, or in the nodes hosting those SUs, while the SG state is suspended for the term-failed SU. The action taken on the other SUs will be as per the SG FSM state.
    11. The faulted SU holds the same state when repaired. That means the SU is treated as in service for all SG FSM purposes.
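
The buffering idea in points 3 and 7 can be sketched as a toy model. This is only an illustration of the proposed message flow, assuming a single SU and string-valued SUSI messages; `SuAgent` and its methods are hypothetical names, not amfnd code.

```python
from collections import deque


class SuAgent:
    """Toy stand-in for the node director's handling of one SU (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.term_failed = False
        self.pending = deque()  # SUSI messages buffered while term-failed (point 3)
        self.responses = []     # replies that would be sent back to the director

    def on_term_failed(self):
        self.term_failed = True

    def on_susi(self, msg):
        if self.term_failed:
            # Do not process the assignment now, just remember it.
            self.pending.append(msg)
        else:
            self.responses.append(("done", msg))

    def on_admin_repair(self):
        # Point 7: bring the SU back, then replay buffered messages in order.
        self.term_failed = False
        while self.pending:
            self.responses.append(("done", self.pending.popleft()))
```

In this model a QUIESCED message sent to a term-failed SU produces no response until `on_admin_repair()` runs, which matches the "suspended SG" behaviour described in point 9.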
     

    Last edit: Nagendra Kumar 2013-10-22
  • Hans Feldt

    Hans Feldt - 2014-01-28

    I have some patches for supporting saAmfNodeFailfastOnTerminationFailure
    including:
    - amfnd reading the value at startup
    - amfnd acting on it if any comp enters TERM-FAILED
    - amfd support for handling changes of saAmfNodeFailfastOnTerminationFailure
    - amfnd support for handling changes of saAmfNodeFailfastOnTerminationFailure

    I think this is an important first step.
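
As a rough illustration of the steps listed above (value read at startup, runtime changes, reaction to TERM-FAILED), the logic could look like the sketch below. Only the attribute name saAmfNodeFailfastOnTerminationFailure comes from this ticket; `NodeConfig`, `on_comp_term_failed` and the returned strings are hypothetical, and what happens when the attribute is false is deliberately left out.

```python
class NodeConfig:
    """Holds the node's saAmfNodeFailfastOnTerminationFailure value (sketch)."""

    def __init__(self, failfast_on_termination_failure: bool):
        # Value as read at startup.
        self.failfast_on_termination_failure = failfast_on_termination_failure

    def apply_change(self, new_value: bool) -> None:
        # Runtime modification of the attribute.
        self.failfast_on_termination_failure = new_value


def on_comp_term_failed(cfg: NodeConfig) -> str:
    """Pick the node-level reaction when any component enters TERM-FAILED."""
    if cfg.failfast_on_termination_failure:
        return "node-failfast"
    return "no-node-action"
```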

     
  • Hans Feldt

    Hans Feldt - 2014-01-28
    • status: unassigned --> assigned
     
  • Hans Feldt

    Hans Feldt - 2014-01-28
    • assigned_to: Hans Feldt
     
  • Hans Feldt

    Hans Feldt - 2014-01-30
    • status: assigned --> review
     
  • Hans Feldt

    Hans Feldt - 2014-02-28

    New patches distributed today, please review asap for a chance to be part of the pending releases!

     
  • Hans Feldt

    Hans Feldt - 2014-03-25

    changeset: 5080:391010d84726
    branch: opensaf-4.3.x
    parent: 5077:23ed04837d5d
    user: Hans Feldt hans.feldt@ericsson.com
    date: Tue Mar 25 15:02:45 2014 +0100
    summary: avd: allow modification of node repair attributes [#538]

    changeset: 5081:fda126666841
    branch: opensaf-4.3.x
    user: Hans Feldt hans.feldt@ericsson.com
    date: Tue Mar 25 15:02:46 2014 +0100
    summary: avd: reboot node when term-failed SU [#538]

    changeset: 5082:f96a9b16283c
    branch: opensaf-4.3.x
    user: Hans Feldt hans.feldt@ericsson.com
    date: Tue Mar 25 15:02:47 2014 +0100
    summary: avd: auto clear comp cleanup failed alarm [#538]

    changeset: 5083:56984476d492
    branch: opensaf-4.4.x
    parent: 5078:0697fcffea42
    user: Hans Feldt osafdevel@gmail.com
    date: Fri Feb 28 08:54:48 2014 +0100
    summary: amfd: allow modification of node repair attributes [#538]

    changeset: 5084:ba77a529c8f1
    branch: opensaf-4.4.x
    user: Hans Feldt osafdevel@gmail.com
    date: Fri Feb 28 08:54:49 2014 +0100
    summary: amfd: reboot node when term-failed SU [#538]

    changeset: 5085:560e4191aa5d
    branch: opensaf-4.4.x
    user: Hans Feldt osafdevel@gmail.com
    date: Fri Feb 28 08:54:51 2014 +0100
    summary: amfd: auto clear comp cleanup failed alarm [#538]

    changeset: 5086:4ca3719141cc
    parent: 5079:50e35f8bbaa4
    user: Hans Feldt osafdevel@gmail.com
    date: Fri Feb 28 08:54:48 2014 +0100
    summary: amfd: allow modification of node repair attributes [#538]

    changeset: 5087:f728668c6556
    user: Hans Feldt osafdevel@gmail.com
    date: Fri Feb 28 08:54:49 2014 +0100
    summary: amfd: reboot node when term-failed SU [#538]

    changeset: 5088:e071b7342c87
    tag: tip
    user: Hans Feldt osafdevel@gmail.com
    date: Fri Feb 28 08:54:51 2014 +0100
    summary: amfd: auto clear comp cleanup failed alarm [#538]

     


  • Hans Feldt

    Hans Feldt - 2014-03-25
    • status: review --> accepted
    • Milestone: future --> 4.3.3
     
  • Hans Feldt

    Hans Feldt - 2014-06-05

    We need to conclude this TR. I published a patch for the remaining issue that needs to be finalized. I will rebase and repost it.

     
  • Hans Feldt

    Hans Feldt - 2014-06-09
    • status: accepted --> review
     
  • Hans Feldt

    Hans Feldt - 2014-09-03
    • status: review --> unassigned
    • assigned_to: Hans Feldt --> nobody
     
  • Anders Widell

    Anders Widell - 2014-10-07
    • Milestone: 4.3.3 --> 4.4.2
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-03-24
    • Milestone: 4.4.2 --> future
     
  • Anders Bjornerstedt

    • Milestone: future --> 4.5.2
     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> 4.6.2
     
  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 4.6.2 --> 4.7.2
     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Praveen

    Praveen - 2016-10-13
    • status: unassigned --> assigned
    • assigned_to: Praveen
    • Part: - --> nd
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
