#1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover

4.5.2
fixed
None
defect
amf
d
major
2015-07-24
2014-09-17
No

This ticket is in essence a continuation of ticket #1078

http://sourceforge.net/p/opensaf/tickets/1078/

In switchover, the new standby fails to attach as AMFD applier. It retries
this for a limited time (45 seconds or so), but finally gives up and the AMFD
standby restarts.

In ticket 1078 the blockage was actually caused by a bug because the lingering
CCB was in that case not interfering with AMF data (data monitored by the
AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
for #1078.

But this ticket tracks the case of true interference. The very same symptom
can be achieved by creating a CCB that modifies an AMF object and then lingers.
An si-swap done in this setup will result in the new standby rebooting after
it gives up retrying.

The new active AMFD is doing the very same thing, failing to set itself
as OI 'saAmfService' because of the interfering CCB. But the crashed standby
AMFD triggers the restart of that SC, which triggers a sync, which aborts the
CCB, removing the blockage for the new active AMFD.

Note that this scenario is not totally unrealistic. An operator starts to
build a CCB, forgets about it, and then performs an si-swap. That will cause
an SC restart. Not good.

While a good NBI frontend should buffer the CCB and only send it to the system
when the operator does his/her high level apply, we cannot rely on that.

I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
invoking saImmOmCcbApply. I then invoked this on one node:

immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
inside immcfg itself and aborting the CCB before the scenario can complete.

Shortly after invoking the above, I ordered an si-swap from another shell/node:

immadm -o 7 safSi=SC-2N,safApp=OpenSAF

The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
attach as long as there is an active, non-empty CCB that contains operations
on AMF objects.
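
The blocking condition can be modelled in a few lines. This is only a sketch and not OpenSAF code: the `Ccb` class, `AMF_CLASS_PREFIXES` and `implementer_set_blocked` are hypothetical names made up for illustration, and it assumes that AMF-owned objects can be recognised by a class name prefix.

```python
from dataclasses import dataclass, field

# Assumption for illustration only: AMF-owned classes share a name prefix.
AMF_CLASS_PREFIXES = ("SaAmf",)


@dataclass
class Ccb:
    """Hypothetical stand-in for a CCB as seen by the IMM service."""
    ccb_id: int
    active: bool = True
    # Class names of the objects touched by the operations in this CCB.
    touched_classes: list = field(default_factory=list)


def touches_amf(ccb):
    """True if at least one operation in the CCB targets an AMF class."""
    return any(name.startswith(AMF_CLASS_PREFIXES) for name in ccb.touched_classes)


def implementer_set_blocked(ccbs):
    """True if any active, non-empty CCB contains operations on AMF objects."""
    return any(c.active and bool(c.touched_classes) and touches_amf(c) for c in ccbs)
```

An empty CCB, a terminated CCB, or a CCB that only touches non-AMF objects does not block in this model; only the combination described above does.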

The first level of solution, in my opinion, is that both AMFDs should retry
forever (in a separate thread, assumed to be the case already) to attach as
implementer/applier. A notification should be sent periodically
to inform the operator or whoever is listening that there is a lingering
AMF related CCB that should be terminated (aborted or committed by the user).
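
A minimal sketch of such a retry loop follows. It is not the AMFD implementation: `attach_with_retry`, the `try_attach` and `notify` callbacks, and the interval values are all assumptions chosen for illustration. The loop is meant to run in its own thread so that the "eternal" task is isolated from the rest of the service.

```python
import time

TRY_AGAIN = "TRY_AGAIN"
OK = "OK"


def attach_with_retry(try_attach, notify, retry_interval=0.5, notify_every=10,
                      sleep=time.sleep):
    """Call try_attach() until it returns something other than TRY_AGAIN.

    notify(failed_attempts) is called after every notify_every failed
    attempts, so a listener learns that a lingering CCB is in the way.
    Returns (final_return_code, number_of_failed_attempts).
    """
    failed = 0
    while True:
        rc = try_attach()
        if rc != TRY_AGAIN:
            return rc, failed
        failed += 1
        if failed % notify_every == 0:
            notify(failed)
        sleep(retry_interval)
```

The `sleep` parameter is injectable only so that the loop can be exercised without real delays.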

Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
an admin-operation for this purpose. The active AMFD could invoke this admop
to trigger the immsv to clear all non-critical CCBs. It should do this if it
ends up in the implementer-set TRY_AGAIN loop, preferably after it has waited
for a while. Adding such an admin-operation to the immsv and implementing
its use in AMF should probably be seen as two enhancements.
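
The "after it has waited for a while" part could look roughly like this. The sketch only decides when the request to clear non-critical CCBs should be sent; the admin-operation itself is the proposal above and is not modelled. `EscalationPolicy` and its method names are hypothetical.

```python
class EscalationPolicy:
    """Decide when a blocked implementer-set should escalate.

    Hypothetical helper: it tells the caller, at most once per blockage,
    that the grace period has passed and the clearing request may be sent.
    """

    def __init__(self, grace_period_s):
        self.grace_period_s = grace_period_s
        self.blocked_since = None   # time of the first TRY_AGAIN in this blockage
        self.requested = False      # True once the request has been signalled

    def on_try_again(self, now_s):
        """Record a TRY_AGAIN at time now_s; return True if it is time to escalate."""
        if self.blocked_since is None:
            self.blocked_since = now_s
        waited = now_s - self.blocked_since
        if waited >= self.grace_period_s and not self.requested:
            self.requested = True
            return True
        return False

    def on_attached(self):
        """Reset the state after implementer-set finally succeeded."""
        self.blocked_since = None
        self.requested = False
```

Signalling only once per blockage keeps the retry loop from flooding the immsv with requests while it continues to retry.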

The really thorny issue is that there can be blocked critical CCBs.
These are CCBs where the immsv is waiting on the result of commit from PBE.
The probability is low that there is both a critical CCB stuck and that it
contains AMF object operations, but it can happen. Such a system is in ANY CASE
stuck in its CCB processing so the AMF should wait indefinitely here.
Currently the system should cluster restart after some time. Not good.
The immsv can not clear critical CCBs by itself. The only option is to
use the admin-op (already implemented) for emergency disablement of PBE.

To summarize: This defect ticket is only concerned with the problem of the AMF
rebooting its standby when this scenario occurs. This should be changed to
eternal wait with periodic notifications. The AMF service is functioning but
cannot process configuration changes on its data while in this state.
That is not a fatal condition and so should not be escalated to SC restart.

The problem of how to clear the interfering CCB can be solved in many ways.
A short term alternative (a hack solution) is for the AMF to reboot a payload.
That would also trigger a sync, clearing all non-critical CCBs.

Related

Tickets: #1105
Wiki: ChangeLog-4.5.2
Wiki: ChangeLog-4.6.1

Discussion

  • Nagendra Kumar

    Nagendra Kumar - 2014-09-18

    Well, AMF as an HA provider can't wait eternally. AMF does some of its IMM operations in a separate thread, but that is also not a suitable solution for an HA provider. As AMF has to deal with IMM in each flow, AMF need not wait eternally.

    Even rebooting the standby SC is fine as it doesn't harm HA.

    Hence and hereby, I don't find relevance of the issue in this ticket.

     
    • Anders Bjornerstedt

      Rebooting an SC always harms SA.

      This is by definition so since the cluster becomes one-safe (single point of failure in the remaining SC).

      I am of course not saying that AMF as an entity shall wait forever in providing service.
      All I am saying is that the AMF should keep trying to attach as OI/Applier forever.
      The OI/applier initialize must be done either in a separate thread, or as a recurrent
      realtime single try with the task parked after each retry (coroutine solution).
      This so that the "eternal" task is isolated.
      I assume this is the case already.

      In the meantime the AMF is fully functional with one exception. It cannot process CCB modifications on
      the imm-objects owned by the AMF-OI.

      But I repeat that this is not a fatal condition and should not be allowed to compromise HA,
      which an SC restart always does.

      /AndersBj



      [tickets:#1105] http://sourceforge.net/p/opensaf/tickets/1105 AMFD: New standby crashes if blocked on becoming applier

      Status: unassigned
      Milestone: 4.3.3
      Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
      Last Updated: Wed Sep 17, 2014 06:18 PM UTC
      Owner: nobody


      • Anders Bjornerstedt

        HA is a statistical property.

        It can only be truly evaluated by recording the availability history of a system.
        But one can predict if an operation will impact HA by analyzing the degree of increased
        vulnerability that the operation causes.

        Basically it is (at least) the MTBF of a single SC that becomes relevant for the duration of the SC restart.
        This instead of the MTB2F (mean time between double failure which is by definition smaller).
        But realistically the risk increases more than that because a failover is in itself
        a complex operation and thus an increased risk of complications => impact on HA statistics.
        Thus a "solution" requiring SC restart will reduce the MTB2F and so it should be avoided
        when possible. In this case it is definitely possible.
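
To put a rough number on the exposure during an SC restart, one can compute the chance that the single remaining SC fails inside the restart window. The sketch below assumes an exponential failure model, which is my simplification and not something stated in this ticket; the function name and any numbers fed into it are purely illustrative.

```python
import math


def p_failure_during_window(window_s, mtbf_s):
    """Probability that one SC fails within window_s seconds.

    Assumes exponentially distributed time to failure with mean mtbf_s
    (an illustrative model, not measured OpenSAF data).
    """
    # 1 - exp(-t/MTBF), written with expm1 for accuracy at small t/MTBF.
    return -math.expm1(-window_s / mtbf_s)
```

For a window that is short compared to the MTBF the result is close to window_s / mtbf_s, so the added risk grows about linearly with how long the cluster stays one-safe. As noted above, the real increase is likely larger because the failover is itself a complex operation.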

        /AndersBj



  • Anders Bjornerstedt

    For the switchover case there is an alternative to "eternal wait" on
    setting OI/applier. This is for the active AMFD to reject a switchover
    if there is currently an active CCB modifying AMF data.

    The AMFD must know if this is the case since it is the OI for that data.

     
  • Anders Bjornerstedt

    For the failover case, the new active AMFD really must wait eternally
    on implementer-set, preferably in combination with actions directed
    at resolving the issue, such as the proposed admin-op on imm
    (enhancement #1107).

    The "alternative" of a cluster restart is not an alternative.

     
  • Anders Widell

    Anders Widell - 2014-10-07
    • Milestone: 4.3.3 --> 4.4.2
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-03-24
    • Milestone: 4.4.2 --> future
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-06-17
    • status: unassigned --> accepted
    • assigned_to: Nagendra Kumar
    • Milestone: future --> 4.5.2
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-06-17

    Here is what I will go with:
    1. AMF will return TRY_AGAIN for the SI-SWAP admin op when any CCB is in progress. AMF will also set the "error string" appropriately.
    2. AMF will return TRY_AGAIN in response to a CCB when an SI-SWAP is in progress. AMF will also set the "error string".
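
The two rules above amount to mutual exclusion between SI-SWAP and CCB handling. A minimal sketch, assuming a single in-memory flag per activity; the `AmfDirector` class, its method names and the error strings are hypothetical and do not reflect the actual AMFD code:

```python
TRY_AGAIN = "TRY_AGAIN"
OK = "OK"


class AmfDirector:
    """Toy model of the two rejection rules (hypothetical names)."""

    def __init__(self):
        self.ccb_in_progress = False
        self.si_swap_in_progress = False

    def on_si_swap_admin_op(self):
        """Rule 1: reject SI-SWAP while any CCB is in progress."""
        if self.ccb_in_progress:
            return TRY_AGAIN, "SI-SWAP rejected: a CCB is in progress"
        self.si_swap_in_progress = True
        return OK, ""

    def on_ccb_operation(self):
        """Rule 2: reject CCB operations while an SI-SWAP is in progress."""
        if self.si_swap_in_progress:
            return TRY_AGAIN, "CCB rejected: SI-SWAP is in progress"
        self.ccb_in_progress = True
        return OK, ""

    def on_ccb_finished(self):
        """Called when the CCB is applied or aborted."""
        self.ccb_in_progress = False

    def on_si_swap_finished(self):
        """Called when the switchover has completed."""
        self.si_swap_in_progress = False
```

Whichever activity starts first wins, and the other gets TRY_AGAIN with an explanatory string until the first one has finished.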

    Also #1108 and #1111 will be closed.

    Thanks,
    -Nagu

     

    Last edit: Nagendra Kumar 2015-06-26
    • Anders Bjornerstedt

      Hi

      Fix (1) fixes the problem reported in #1111 (#1111 is an enhancement).
      Fix (2) is only a partial fix: it fixes #1105 for the si-swap case. Not sure about the failover case.

      Ticket #1108 is also an enhancement that will speed up the progress of any si-swap or failover that has problems
      setting OI (or applier).
      I see enhancement #1108 as still a valid enhancement even after we have this proposed fix for #1105.
      The fix proposed in #1108 is also trivial to implement. Just send the admin-op request asynchronously.
      No need to wait on a response.

      /AndersBj

      From: Nagendra Kumar [mailto:nagendra-k@users.sf.net]
      Sent: den 17 juni 2015 12:46
      To: [opensaf:tickets]
      Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked on becoming applier

      Here is what I go along:
      1. Amf will return TRY_AGAIN for the SI-SWAP admin op when any ccb is going on. And AMF will also set the "error string" appropriately.
      2. AMF will return TRY_AGAIN in response of CCB, when SI-SWAP is in progress and AMF will also set the "error string".

      Also #1108 and #1111 will be closed.

      Thanks,
      -Nagu


      [tickets:#1105]http://sourceforge.net/p/opensaf/tickets/1105 AMFD: New standby crashes if blocked on becoming applier

      Status: accepted
      Milestone: 4.5.2
      Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
      Last Updated: Wed Jun 17, 2015 09:48 AM UTC
      Owner: Nagendra Kumar

      This ticket is in essence a continuation of ticket #1078

      http://sourceforge.net/p/opensaf/tickets/1078/http://sourceforge.net/p/opensaf/tickets/1078

      In switchover, the new standby fails to attach as AMFD applier. It retries
      this for a limited time (45 seconds os so), but finally gives up and AMFD standby
      restarts.

      In ticket 1078 the blockage was actually caused by a bug because the lingering
      CCB was in that case not interfering with AMF data (data monitored by the
      AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
      for #1078.

      But this ticket tracks the case of true interference. The very same symptom
      can be acheived by creating a CCB that modifies an AMF object and then lingers.
      An si-swap done in this setup will result in the new standby rebooting after
      it gives up in retrying.

      The new active AMFD is doing the very same thing, failing to set itself
      as OI 'saAmfService' becaue of the interfering CCB. But the crashed standby
      AMFD triggers the restart of that SC, which triggers a sync, which aborts the
      CCB removing the blockage for the new active AMFD.

      Note that this scenario is not totally unrealistic. An operator starts to
      build a CCB. Forgets about it and then performs an si-swap. That will cause
      an SC restart. Not good.

      While a good NBI frontend should buffer the ccb and only send it to the system
      when the operator does his/her high level apply, we can not rely on that.

      I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
      invoking the saImmOmCcbApply. Then invoked this on one node:

      immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \ safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

      The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
      inside immcfg itself and aborting the CCB before the scenario can complete.

      Quickly after invoking the above I order an si-swap from another shell/node:

      immadm -o 7 safSi=SC-2N,safApp=OpenSAF

      The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
      attach as long as there is an active, non-empty ccb, that contains operations
      on AMF objects.

      The first level of solution in my opinion is that both AMFDs should retry
      forever (in a separate thread assumed to be the case already) to attach as
      implementer/applier. A notification should be sent periodically
      to inform the operator or whomever is listening that thre is a lingering
      AMF related CCB that should be terminated (aborted or committed by the user).

      Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
      an admin-operation for this purpose. The active AMFD could invoke this admop
      to trigger the immsv to clear all non-critical CCBs. It should do this if it
      ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
      for a while. Adding such an admin-operation to the immsv and implementing
      its use in AMF should probably be seen as two enhacnements.

      The really thorny issue is that there can be blocked critical CCBs.
      These are CCBs where the immsv is waiting on the result of commit from PBE.
      The probability is low that there is both a critical CCB stuck and that it
      contains AMF object operations, but it can happen. Such a system is in ANY CASE
      stuck in its CCB processing so the AMF should wait indefinitely here.
      Currently the system should cluster restart after some time. Not good.
      The immsv can not clear critical CCBs by itself. The only option is to
      use the admin-op (already implemented) for emergency disablement of PBE.

      To summarize: This defect ticket is only concerned with the problem of the AMF
      rebooting its standby when this scenario occurs. This should be changed to
      eternal wait with periodic notifications. The AMF service is functioning but
      can not process configuration changes on its data while in this state.
      That is not a fatal condition and so should not be esclated to SC restart.

      The problem of how to clear the interfering CCB can be solved in many ways.
      A short term alternative (a hack solution) is for the AMF to reboot a payload.
      That would also trigger a sync clearing all non-critical CCBs.
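
A minimal sketch of the "retry forever with periodic notifications" behaviour proposed above. It is not AMFD code: `try_attach`, `notify` and the `notify_every` interval are hypothetical stand-ins for the implementer/applier set call and the notification mechanism, and the string constants only mimic the AIS return codes.

```python
import itertools

TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"  # stand-in for the AIS return code
OK = "SA_AIS_OK"


def attach_with_retry(try_attach, notify, sleep, notify_every=5):
    """Retry try_attach() for as long as it returns TRY_AGAIN.

    Never gives up on TRY_AGAIN. Calls notify(attempt) on every
    `notify_every`-th failed attempt so that an operator can be told
    about the lingering CCB. Returns (attempts, final_return_code).
    """
    for attempt in itertools.count(1):
        rc = try_attach()
        if rc != TRY_AGAIN:
            return attempt, rc  # success, or an error the caller must handle
        if attempt % notify_every == 0:
            notify(attempt)
        sleep(1.0)


def make_blocked_attach(blocked_attempts):
    """Hypothetical attach call that is blocked for a number of attempts."""
    remaining = [blocked_attempts]

    def try_attach():
        if remaining[0] > 0:
            remaining[0] -= 1
            return TRY_AGAIN
        return OK

    return try_attach


notifications = []
attempts, rc = attach_with_retry(
    make_blocked_attach(12),     # CCB lingers for 12 attempts
    notifications.append,        # record instead of sending a notification
    sleep=lambda seconds: None,  # no real waiting in the example
)
```

The point of the design is that the blocked attach is treated as a degraded but non-fatal state: the loop has no exit on TRY_AGAIN, only a periodic side effect.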



      • Anders Bjornerstedt

        On 06/17/2015 01:28 PM, Anders Bjornerstedt wrote:

        Hi

        Fix (1) fixes the problem reported in #1111 (#1111 is an enhancement).
        Fix (2) is only a partial fix: it fixes #1105 for the si-swap
        case. Not sure about the failover case.

        I just reproduced problem #1105 for the fail-over case (not switchover).
        To do so only requires a CCB that lingers, say for 240 seconds before
        applying, and that the PBE (IMMND coord) resides at SC standby before
        failover. If the PBE (IMMND coord) resides at active before failover
        then it has to re-attach at standby at failover, and since the PBE
        invokes the admin-op for aborting non-critical CCBs when re-attaching,
        the AMF is in that case saved by the PBE. But that will be roughly 50%
        of the failover cases.

        If the PBE does not need restart at failover, because it already resided
        at old-standby-new-active, then the AMFD old-standby-new-active is not
        saved by the PBE and will reboot, resulting in CLUSTER RELOAD.

        So I claim that to really fix #1105, not just for the si-swap
        interference problem but also for the fail-over case, you really need
        the fix for #1108. There are of course alternatives to a fix of type
        #1108. But why not take that one when we have it instead of inventing
        yet another way, or delaying indefinitely becoming AMF-OI?

        The solution of the AMFD invoking an admin-op on the IMM was earlier
        "rejected" with the motivation that such a solution was "proprietary".
        While "proprietary" is not the correct word for describing a mechanism
        that is public and part of an open-source implementation, I guess the
        complaint was that the solution was OpenSAF specific. But I don't get
        what the problem would be with the internals of OpenSAF being OpenSAF
        specific.

        /AndersBj

        Ticket #1108 is also an enhancement that will speed up the progress of
        any si-swap or failover that has problems
        setting OI (or applier).
        I see enhancement #1108 as still a valid enhancement even after we
        have this proposed fix for #1105.
        The fix proposed in #1108 is also trivial to implement. Just send the
        admin-op request asynchronously.
        No need to wait on a response.

        /AndersBj
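
A minimal sketch of the #1108 idea described above, assuming a stand-in `FakeImm` class: once the attach has been stuck in TRY_AGAIN for a grace period, send the request to abort non-critical CCBs once, do not wait for a reply, and keep retrying. The class, its method names, `grace_attempts` and `process_delay` are all hypothetical; this is not the IMM API.

```python
TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"  # stand-in for the AIS return code
OK = "SA_AIS_OK"


class FakeImm:
    """Stand-in for the IMM: attach is blocked while a non-critical CCB lingers."""

    def __init__(self, process_delay=2):
        self.lingering_ccb = True
        self.abort_requests = 0
        self._pending = None  # attach attempts left until the request takes effect
        self._process_delay = process_delay

    def send_abort_request(self):
        """Fire-and-forget: queue the request and return immediately."""
        self.abort_requests += 1
        if self._pending is None:
            self._pending = self._process_delay

    def try_attach(self):
        # Model the asynchronous request being processed some time later.
        if self._pending is not None:
            if self._pending == 0:
                self.lingering_ccb = False
            else:
                self._pending -= 1
        return TRY_AGAIN if self.lingering_ccb else OK


def attach(imm, grace_attempts=3):
    """Retry until attached; after a grace period ask once for a CCB abort."""
    attempts = 0
    requested = False
    while True:
        attempts += 1
        if imm.try_attach() == OK:
            return attempts
        if attempts >= grace_attempts and not requested:
            imm.send_abort_request()  # no reply is awaited
            requested = True


imm = FakeImm()
attempts = attach(imm)
```

Because the retry loop is already polling for the only outcome that matters (the attach succeeding), a reply to the abort request would add nothing, which is why the request can be sent asynchronously.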

        From: Nagendra Kumar [mailto:nagendra-k@users.sf.net]
        Sent: den 17 juni 2015 12:46
        To: [opensaf:tickets]
        Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked
        on becoming applier

        Here is what I will go with:
        1. AMF will return TRY_AGAIN for the SI-SWAP admin op when any CCB is
        in progress. AMF will also set the "error string" appropriately.
        2. AMF will return TRY_AGAIN in response to a CCB when an SI-SWAP is in
        progress, and AMF will also set the "error string".

        Also #1108 and #1111 will be closed.

        Thanks,
        -Nagu
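
The two points above amount to mutual exclusion between CCB processing and SI-SWAP. A minimal sketch, assuming a hypothetical `AmfGate` class; the method names and the error-string texts are illustrative only and are not taken from the AMF implementation.

```python
TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"  # stand-in for the AIS return code
OK = "SA_AIS_OK"


class AmfGate:
    """Tracks open CCBs and a running SI-SWAP; each rejects the other."""

    def __init__(self):
        self.open_ccbs = set()
        self.si_swap_in_progress = False

    def ccb_operation(self, ccb_id):
        """Point 2: reject CCB operations while an SI-SWAP is in progress."""
        if self.si_swap_in_progress:
            return TRY_AGAIN, "SI-SWAP in progress"
        self.open_ccbs.add(ccb_id)
        return OK, ""

    def ccb_finished(self, ccb_id):
        """Called when the CCB has been applied or aborted."""
        self.open_ccbs.discard(ccb_id)

    def si_swap(self):
        """Point 1: reject the SI-SWAP admin op while any CCB is in progress."""
        if self.open_ccbs:
            return TRY_AGAIN, "CCB in progress"
        self.si_swap_in_progress = True
        return OK, ""

    def si_swap_done(self):
        self.si_swap_in_progress = False


gate = AmfGate()
first_ccb = gate.ccb_operation(ccb_id=1)     # accepted, CCB now open
swap_while_ccb = gate.si_swap()              # rejected: CCB 1 still open
gate.ccb_finished(1)
swap_after_ccb = gate.si_swap()              # accepted
ccb_while_swap = gate.ccb_operation(ccb_id=2)  # rejected: swap running
```

Note that this only covers CCBs the AMF has already seen operations for; it does not by itself address a critical CCB stuck waiting on the PBE.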


        [tickets:#1105]
        http://sourceforge.net/p/opensaf/tickets/1105
        AMFD: New standby crashes if blocked on becoming applier

        Status: accepted
        Milestone: 4.5.2
        Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
        Last Updated: Wed Jun 17, 2015 09:48 AM UTC
        Owner: Nagendra Kumar




  • Anders Bjornerstedt

    • summary: AMFD: New standby crashes if blocked on becoming applier --> AMFD: New standby crashes if blocked on becoming applier - both failover and switchover
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-06-22

    For non-critical CCBs, ticket #1391 will take care of it.
    For critical CCBs, AMF should be OK to wait a little when the PBE delays the response.

    So I will go ahead and implement the two points mentioned above as part of #1105, and the others will be closed.

    Thanks
    -Nagu

     

    Last edit: Nagendra Kumar 2015-06-22
    • Anders Bjornerstedt

      For critical CCBs the wait can be indefinite since the delay can be due to problems on the file system.

      The AMF should not block a failover just because it can not attach as OI.
      There is no inherent functional dependence of the AMF failover mechanism on the AMF OI being available.
      Any such dependency is unnecessary and an impediment to service availability.

      /AndersBj
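
A minimal sketch of that decoupling, assuming a hypothetical `Amfd` class: the role change completes immediately, and the OI attach keeps retrying in a separate thread until the blockage goes away. The names and the use of a `threading.Event` to simulate the lingering CCB are illustrative, not AMFD internals.

```python
import threading


class Amfd:
    """Role change does not wait for the OI attach to succeed."""

    def __init__(self, try_attach):
        self.role = "STANDBY"
        self.is_oi = False
        self._try_attach = try_attach  # returns True once attach succeeds
        self._worker = None

    def become_active(self):
        self.role = "ACTIVE"  # failover proceeds regardless of OI state
        self._worker = threading.Thread(target=self._attach_loop, daemon=True)
        self._worker.start()

    def _attach_loop(self):
        # Retry without a limit; each failed call models a TRY_AGAIN reply.
        while not self._try_attach():
            pass
        self.is_oi = True

    def wait_attached(self, timeout):
        self._worker.join(timeout)


# blocked.set() simulates the lingering CCB being aborted or committed.
blocked = threading.Event()
amfd = Amfd(try_attach=lambda: blocked.wait(timeout=0.01))

amfd.become_active()
role_during_block = amfd.role
oi_during_block = amfd.is_oi

blocked.set()
amfd.wait_attached(timeout=2.0)
```

While the attach is blocked the service is active but cannot process configuration changes on its data, which matches the "not a fatal condition" description earlier in the ticket.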


  • Nagendra Kumar

    Nagendra Kumar - 2015-06-26
    • status: accepted --> review
     
  • Nagendra Kumar

    Nagendra Kumar - 2015-06-29

    Hi Anders,
    Thanks for your comments. But IMM needs to take time-bound action in that case. It can't wait long for a response.

    Thanks
    -Nagu

     
    • Anders Bjornerstedt

      It is impossible for the IMM to do anything about a critical CCB until the PBE re-attaches, and the PBE can only re-attach
      when the file system is available. So the IMM can not force the issue here. It can neither abort nor commit the CCB,
      since this would have a 50/50 chance of diverging from the PBE representation of the CCB. A cluster restart may very
      well happen before the file system comes back; in fact a cluster restart may be caused by the long absence of the
      file system. That is in fact what happens here.

      As far as the AMF is concerned, since neither the old active nor the old-standby-new-active can have received any apply
      or abort callback in this case, the AMF should act as if it is still waiting for the commit of the CCB, i.e. as if the
      before-image for the CCB is what is valid, and it is.

      /AndersBj


      • Anders Bjornerstedt

        Even if the AMF re-reads its entire AMF model from IMM during the PBE absence, it will see the state
        of the AMF model without that CCB being committed. IMM-ram only commits the CCB after the PBE has
        returned and responded on the outcome of the CCB.

        The good news is that the IMM is behaving entirely transactionally with CCBs.
        The only bad news is that the AMF currently does not wish to follow the transactional model (relative to imm data) during failover.

        /AndersBj
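
A toy model of the before-image behaviour described above, assuming a hypothetical `ImmRam` class. It only illustrates the visibility rule (readers see the committed state until the PBE has answered); the attribute values are illustrative, and nothing here reflects the real IMM data structures.

```python
class ImmRam:
    """In-memory store where a critical CCB stays invisible until the PBE replies."""

    def __init__(self, objects):
        self._committed = dict(objects)
        self._critical = {}  # ccb_id -> changes awaiting the PBE verdict

    def apply(self, ccb_id, changes):
        """The CCB becomes critical: handed to the PBE, not yet visible."""
        self._critical[ccb_id] = dict(changes)

    def read(self, name):
        """Readers always get the committed state, i.e. the before-image."""
        return self._committed[name]

    def pbe_reply(self, ccb_id, committed):
        """Only the PBE's verdict decides whether the changes become visible."""
        changes = self._critical.pop(ccb_id)
        if committed:
            self._committed.update(changes)


imm = ImmRam({"saAmfCtDefDisableRestart": 1})  # illustrative initial value

imm.apply(ccb_id=7, changes={"saAmfCtDefDisableRestart": 0})
seen_while_critical = imm.read("saAmfCtDefDisableRestart")  # PBE still silent

imm.pbe_reply(ccb_id=7, committed=True)
seen_after_commit = imm.read("saAmfCtDefDisableRestart")

imm.apply(ccb_id=8, changes={"saAmfCtDefDisableRestart": 1})
imm.pbe_reply(ccb_id=8, committed=False)  # PBE reports the CCB as aborted
seen_after_abort = imm.read("saAmfCtDefDisableRestart")
```

This is why an AMFD that re-reads its model during a failover can safely continue on the before-image: no reader can observe a half-committed CCB.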



        [tickets:#1105] http://sourceforge.net/p/opensaf/tickets/1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover

        Status: review
        Milestone: 4.5.2
        Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
        Last Updated: Fri Jun 26, 2015 09:34 AM UTC
        Owner: Nagendra Kumar

        This ticket is in essence a continuation of ticket #1078

        http://sourceforge.net/p/opensaf/tickets/1078/

        In switchover, the new standby fails to attach as AMFD applier. It retries
        this for a limited time (45 seconds or so), but finally gives up and AMFD standby
        restarts.

        In ticket 1078 the blockage was actually caused by a bug because the lingering
        CCB was in that case not interfering with AMF data (data monitored by the
        AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
        for #1078.

        But this ticket tracks the case of true interference. The very same symptom
        can be achieved by creating a CCB that modifies an AMF object and then lingers.
        An si-swap done in this setup will result in the new standby rebooting after
        it gives up retrying.

        The new active AMFD is doing the very same thing, failing to set itself
        as OI 'saAmfService' becaue of the interfering CCB. But the crashed standby
        AMFD triggers the restart of that SC, which triggers a sync, which aborts the
        CCB removing the blockage for the new active AMFD.

        Note that this scenario is not totally unrealistic. An operator starts to
        build a CCB. Forgets about it and then performs an si-swap. That will cause
        an SC restart. Not good.

        While a good NBI frontend should buffer the ccb and only send it to the system
        when the operator does his/her high level apply, we can not rely on that.

        I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
        invoking the saImmOmCcbApply. Then invoked this on one node:

        immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
        safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN

        The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
        inside immcfg itself and aborting the CCB before the scenario can complete.

        Quickly after invoking the above I order an si-swap from another shell/node:

        immadm -o 7 safSi=SC-2N,safApp=OpenSAF

        The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
        attach as long as there is an active, non-empty ccb, that contains operations
        on AMF objects.

        The first level of solution in my opinion is that both AMFDs should retry
        forever (in a separate thread, assumed to be the case already) to attach as
        implementer/applier. A notification should be sent periodically
        to inform the operator or whoever is listening that there is a lingering
        AMF-related CCB that should be terminated (aborted or committed by the user).
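
The retry-forever behaviour with periodic notifications proposed here can be sketched roughly as follows. This is a minimal Python model, not AMFD code: `try_attach`, `notify`, `retry_interval` and `notify_every` are hypothetical names standing in for the IMM implementer/applier set call and for whatever notification mechanism the AMF would actually use.

```python
import time
from typing import Callable

OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def attach_with_retry(
    try_attach: Callable[[], str],
    notify: Callable[[int], None],
    retry_interval: float = 0.5,
    notify_every: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry until the attach succeeds; return the number of attempts made.

    try_attach is a hypothetical stand-in for the implementer/applier set
    call and returns OK or TRY_AGAIN. There is deliberately no upper bound
    on the number of retries, so the caller never gives up and restarts.
    """
    attempts = 0
    while True:
        attempts += 1
        result = try_attach()
        if result == OK:
            return attempts
        if result != TRY_AGAIN:
            raise RuntimeError(f"unexpected result: {result}")
        # Every notify_every-th failed attempt, tell the operator that a
        # lingering AMF-related CCB is still blocking the attach.
        if attempts % notify_every == 0:
            notify(attempts)
        sleep(retry_interval)
```

The `sleep` parameter is injectable only so that the loop can be exercised without real delays; a real daemon would run this in its own thread, as the paragraph above assumes.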

        Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
        an admin-operation for this purpose. The active AMFD could invoke this admop
        to trigger the immsv to clear all non-critical CCBs. It should do this if it
        ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
        for a while. Adding such an admin-operation to the immsv and implementing
        its use in AMF should probably be seen as two enhancements.
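
A rough sketch of that escalation, under the assumption that such an admin-operation exists: the active AMFD stays in the TRY_AGAIN loop for a grace period and then asks the immsv, once, to clear non-critical CCBs. The names `try_attach`, `abort_noncritical_ccbs` and `grace_attempts` are hypothetical and only illustrate the control flow.

```python
import time
from typing import Callable, Tuple

OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def attach_with_escalation(
    try_attach: Callable[[], str],
    abort_noncritical_ccbs: Callable[[], None],
    grace_attempts: int = 20,
    retry_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bool]:
    """Retry the attach; after a grace period, request a CCB cleanup once.

    Returns (attempts, escalated). abort_noncritical_ccbs is a hypothetical
    stand-in for invoking an immsv admin-operation that aborts all
    non-critical CCBs.
    """
    attempts = 0
    escalated = False
    while True:
        attempts += 1
        result = try_attach()
        if result == OK:
            return attempts, escalated
        if result != TRY_AGAIN:
            raise RuntimeError(f"unexpected result: {result}")
        # Only escalate after having waited for a while, and only once.
        if attempts >= grace_attempts and not escalated:
            abort_noncritical_ccbs()
            escalated = True
        sleep(retry_interval)
```

If the blocking CCB is critical, the cleanup request has no effect and the loop simply keeps waiting, which matches the indefinite wait argued for in this ticket.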

        The really thorny issue is that there can be blocked critical CCBs.
        These are CCBs where the immsv is waiting on the result of commit from PBE.
        The probability is low that there is both a critical CCB stuck and that it
        contains AMF object operations, but it can happen. Such a system is in ANY CASE
        stuck in its CCB processing so the AMF should wait indefinitely here.
        Currently the system should cluster restart after some time. Not good.
        The immsv can not clear critical CCBs by itself. The only option is to
        use the admin-op (already implemented) for emergency disablement of PBE.
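
The distinction between the cases can be summarised as a small decision function. This is only an illustrative model of the reasoning above; the `Ccb` record, its fields and the `Action` names are hypothetical and do not correspond to immsv data structures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Action(Enum):
    ATTACH = "no interfering CCB; implementer/applier set can succeed"
    ABORT_NONCRITICAL = "ask immsv to abort non-critical CCBs, keep retrying"
    WAIT = "wait indefinitely; only emergency PBE disablement can unblock"


@dataclass(frozen=True)
class Ccb:
    ccb_id: int
    touches_amf: bool  # contains operations on AMF objects
    critical: bool     # immsv is waiting for the commit result from PBE


def next_action(active_ccbs: Iterable[Ccb]) -> Action:
    """Pick what a blocked AMFD could do, given the currently active CCBs."""
    blocking = [c for c in active_ccbs if c.touches_amf]
    if not blocking:
        return Action.ATTACH
    if any(c.critical for c in blocking):
        # immsv cannot clear a critical CCB by itself.
        return Action.WAIT
    return Action.ABORT_NONCRITICAL
```

For example, `next_action([Ccb(2, touches_amf=True, critical=False)])` yields `Action.ABORT_NONCRITICAL`, while adding a critical CCB that also touches AMF objects turns the answer into `Action.WAIT`.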

        To summarize: This defect ticket is only concerned with the problem of the AMF
        rebooting its standby when this scenario occurs. This should be changed to
        eternal wait with periodic notifications. The AMF service is functioning but
        can not process configuration changes on its data while in this state.
        That is not a fatal condition and so should not be escalated to SC restart.

        The problem of how to clear the interfering CCB can be solved in many ways.
        A short term alternative (a hack solution) is for the AMF to reboot a payload.
        That would also trigger a sync, clearing all non-critical CCBs.





         


  • Anders Bjornerstedt

    The patch submitted for review

    https://sourceforge.net/p/opensaf/mailman/message/34243280/
    

    looks OK from a correctness point of view as I see it. But there is an
    incompleteness issue as I understand it. The ticket reports a problem for
    both si-swap and failover. I don't see that the patch fixes the failover
    variant of the problem. The problem is that a failover cannot be "rejected"
    the way that an SI-swap request can be.

     
  • Nagendra Kumar

    Nagendra Kumar - 2015-07-02

    Hi Anders,
    Thanks for your review. The patch review mentions that it doesn't handle the fail-over case.

    But the fail-over case could be sorted out if Imm returned from the OI call immediately and marked Amfd as implementer. Doing so would not cause a problem for Amf, since it can accept the apply after becoming active and process it.

    Making Amfd hang in the Imm OI call hampers Imm service availability, so Imm could share some responsibility for it.

    Thanks
    -Nagu

     
  • Minh Hon Chau

    Minh Hon Chau - 2015-07-22

    I have been testing the patch floated for review on the default branch; it works fine in the si-swap case.
    I am also trying to test failover. Below are the steps I did:
    On SC-2 (standby):
    - immcfg -t 150 -m -a saAmfCtDefDisableRestart=0 safVersion=4.0.0,safCompType=OpenSafCompTypeIMMND
    - immcfg --ccb-apply (sleeps 60s, using the hacked immcfg)
    On SC-1 (active), issue reboot

    I have seen the CCB aborted on SC-2

    Jul 22 15:40:51 SC-2 osafimmnd[428]: NO Received: immadm -o 202 safRdn=immManagement,safApp=safImmService
    Jul 22 15:40:52 SC-2 osafimmnd[428]: NO CCB 2 aborted by: immadm -o 202 safRdn=immManagement,safApp=safImmService
    Jul 22 15:40:52 SC-2 osafimmnd[428]: WA Timeout while waiting for implementer, aborting ccb:2
    Jul 22 15:40:52 SC-2 osafimmnd[428]: NO Ccb 2 ABORTED (immcfg_SC-2_711)
    Jul 22 15:40:52 SC-2 osafimmnd[428]: WA >>s_info->to_svc == 0<< reply context destroyed before this reply could be made
    Jul 22 15:40:52 SC-2 osafimmnd[428]: WA Failed to send response to agent/client over MDS
    Jul 22 15:40:52 SC-2 osafimmnd[428]: NO Implementer connected: 13 (safAmfService) <11, 2020f>
    Jul 22 15:40:52 SC-2 osafamfd[480]: NO Node 'SC-1' left the cluster
    Jul 22 15:40:52 SC-2 osafamfd[480]: NO FAILOVER StandBy --> Active DONE!

    and then amfd-SC1 successfully attaches as applier.

    I am not sure whether this means the interference problem is solved.

     
  • Nagendra Kumar

    Nagendra Kumar - 2015-07-22

    Hi Minh,
    The patch solves the issue for switchover only, as mentioned in the patch review description. In the failover case it may not work (maybe the CCB is not related to Amf objects, so it may be working fine).

    Thanks
    -Nagu

     
  • Nagendra Kumar

    Nagendra Kumar - 2015-07-24
    • status: review --> fixed
     
