OpenSAF / Tickets / #1105 AMFD: New standby crashes if blocked on becoming applier

Nagendra Kumar - 2014-09-18

Well, AMF as an HA provider can't wait eternally. AMF does perform some of its IMM operations in a separate thread, but that is also not a suitable solution for an HA provider. Since AMF has to deal with IMM in each flow, AMF need not wait eternally.

Even rebooting the standby SC is fine, as it doesn't harm HA.

Hence I don't see the relevance of the issue raised in this ticket.

- Anders Bjornerstedt - 2014-09-18
  
  Rebooting an SC always harms HA.
  
  This is by definition so since the cluster becomes one-safe (single point of failure in the remaining SC).
  
  I am of course not saying that AMF as an entity shall wait forever in providing service.
  All I am saying is that the AMF should keep trying to attach as OI/Applier forever.
  The OI/applier initialization must be done either in a separate thread, or as a recurrent
  realtime single try with the task parked after each retry (coroutine solution).
  This is so that the "eternal" task is isolated.
  I assume this is the case already.
  
  In the meantime the AMF is fully functional with one exception: it can not process CCB modifications on
  the IMM objects owned by the AMF-OI.
  
  But I repeat: that is not a fatal condition and should not be allowed to compromise HA,
  which an SC restart always does.
  
  /AndersBj
  
  [tickets:#1105] http://sourceforge.net/p/opensaf/tickets/1105 AMFD: New standby crashes if blocked on becoming applier
  
  Status: unassigned
  Milestone: 4.3.3
  Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
  Last Updated: Wed Sep 17, 2014 06:18 PM UTC
  Owner: nobody
  
  This ticket is in essence a continuation of ticket #1078
  
  http://sourceforge.net/p/opensaf/tickets/1078/
  
  In switchover, the new standby fails to attach as AMFD applier. It retries
  this for a limited time (45 seconds or so), but finally gives up and the standby AMFD
  restarts.
  
  In ticket 1078 the blockage was actually caused by a bug because the lingering
  CCB was in that case not interfering with AMF data (data monitored by the
  AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
  for #1078.
  
  But this ticket tracks the case of true interference. The very same symptom
  can be achieved by creating a CCB that modifies an AMF object and then lingers.
  An si-swap done in this setup will result in the new standby rebooting after
  it gives up retrying.
  
  The new active AMFD is doing the very same thing, failing to set itself
  as OI 'saAmfService' because of the interfering CCB. But the crashed standby
  AMFD triggers the restart of that SC, which triggers a sync, which aborts the
  CCB, removing the blockage for the new active AMFD.
  
  Note that this scenario is not totally unrealistic. An operator starts to
  build a CCB, forgets about it, and then performs an si-swap. That will cause
  an SC restart. Not good.
  
  While a good NBI frontend should buffer the ccb and only send it to the system
  when the operator does his/her high level apply, we can not rely on that.
  
  I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
  invoking the saImmOmCcbApply. Then invoked this on one node:
  
  immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
      safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN
  
  The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
  inside immcfg itself and aborting the CCB before the scenario can complete.
  
  Shortly after invoking the above, I ordered an si-swap from another shell/node:
  
  immadm -o 7 safSi=SC-2N,safApp=OpenSAF
  
  The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
  attach as long as there is an active, non-empty CCB that contains operations
  on AMF objects.
  
  The first level of solution, in my opinion, is that both AMFDs should retry
  forever (in a separate thread, assumed to be the case already) to attach as
  implementer/applier. A notification should be sent periodically
  to inform the operator, or whoever is listening, that there is a lingering
  AMF-related CCB that should be terminated (aborted or committed by the user).
  
  Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
  an admin-operation for this purpose. The active AMFD could invoke this admin-op
  to trigger the immsv to clear all non-critical CCBs. It should do this if it
  ends up in the implementer-set TRY_AGAIN loop, preferably after it has waited
  for a while. Adding such an admin-operation to the immsv and implementing
  its use in AMF should probably be seen as two enhancements.
  
  The really thorny issue is that there can be blocked critical CCBs.
  These are CCBs where the immsv is waiting on the result of commit from PBE.
  The probability is low that there is both a critical CCB stuck and that it
  contains AMF object operations, but it can happen. Such a system is in ANY CASE
  stuck in its CCB processing so the AMF should wait indefinitely here.
  Currently the system should cluster restart after some time. Not good.
  The immsv can not clear critical CCBs by itself. The only option is to
  use the admin-op (already implemented) for emergency disablement of PBE.
  
  To summarize: This defect ticket is only concerned with the problem of the AMF
  rebooting its standby when this scenario occurs. This should be changed to
  eternal wait with periodic notifications. The AMF service is functioning but
  can not process configuration changes on its data while in this state.
  That is not a fatal condition and so should not be escalated to SC restart.
  
  The problem of how to clear the interfering CCB can be solved in many ways.
  A short term alternative (a hack solution) is for the AMF to reboot a payload.
  That would also trigger a sync clearing all non-critical CCBs.
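
The behaviour proposed above (retry forever in an isolated thread, with periodic notifications instead of a restart) could be sketched roughly as follows. This is a minimal Python illustration and not AMFD code: `try_attach`, `notify`, `attach_forever`, `start_attach_thread`, and the interval values are hypothetical names and numbers chosen for the example. In AMFD the attempt would correspond to the implementer/applier set call that currently comes back with TRY_AGAIN.

```python
import threading
import time

OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def attach_forever(try_attach, notify, retry_interval=0.5, notify_every=10, stop=None):
    """Retry try_attach() until it returns OK; never give up and never restart.

    notify(attempts) is called after every `notify_every` failed attempts so
    that an operator can learn that a lingering CCB is blocking the attach.
    Returns the number of attempts used, or None if `stop` was set first.
    """
    attempts = 0
    while stop is None or not stop.is_set():
        attempts += 1
        if try_attach() == OK:
            return attempts
        if attempts % notify_every == 0:
            notify(attempts)
        time.sleep(retry_interval)
    return None


def start_attach_thread(try_attach, notify, **kwargs):
    """Run the eternal retry in its own daemon thread so the caller stays responsive."""
    result = {}

    def run():
        result["attempts"] = attach_forever(try_attach, notify, **kwargs)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread, result
```

The caller keeps serving everything that does not depend on the OI/applier role, and only the isolated thread is tied up by the blocked attach.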
  
  - Anders Bjornerstedt - 2014-09-18
    
    HA is a statistical property.
    
    It can only be truly evaluated by recording the availability history of a system.
    But one can predict if an operation will impact HA by analyzing the degree of increased
    vulnerability that the operation causes.
    
    Basically it is (at least) the MTBF of a single SC that becomes relevant for the duration of the SC restart,
    instead of the MTB2F (mean time between double failures, which is by definition longer).
    But realistically the risk increases more than that, because a failover is in itself
    a complex operation and thus carries an increased risk of complications => impact on HA statistics.
    Thus a "solution" requiring SC restart will reduce the MTB2F and so it should be avoided
    when possible. In this case it is definitely possible.
    
    /AndersBj
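
To make the statistical argument a little more concrete, here is a rough sketch under a simplifying assumption that is not part of the comment above: failures of a single SC are exponentially distributed with a known MTBF. The function names and any numbers passed in are hypothetical.

```python
import math


def outage_probability(restart_seconds, mtbf_seconds):
    """Probability that the only remaining SC fails while the other SC restarts.

    Assumes exponentially distributed failures with the given MTBF, so the
    one-safe window of `restart_seconds` is the only time at risk.
    """
    return 1.0 - math.exp(-restart_seconds / mtbf_seconds)


def expected_outages(restarts_per_year, restart_seconds, mtbf_seconds):
    """Expected number of total outages per year caused by avoidable SC restarts."""
    return restarts_per_year * outage_probability(restart_seconds, mtbf_seconds)
```

In this model the risk grows almost linearly with the length of the one-safe window, so every avoidable SC restart adds directly to the expected outage count. The extra risk from the failover procedure itself, mentioned above, is not captured.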
    

Anders Bjornerstedt - 2014-09-18

For the switchover case there is an alternative to "eternal wait" on
setting OI/applier. This is for the active AMFD to reject a switchover
if there is currently an active CCB modifying AMF data.

The AMFD must know if this is the case since it is the OI for that data.
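
A minimal sketch of how the active AMFD could keep track of this, assuming it records every CCB for which it has been asked to handle an operation as OI and forgets the CCB once it is applied or aborted. The class and method names are hypothetical and only loosely modelled on the create/modify/delete and apply/abort callbacks an OI receives; this is not the AMFD implementation.

```python
class AmfCcbTracker:
    """Tracks CCBs that currently hold operations on AMF-owned objects."""

    def __init__(self):
        self._open = {}  # ccb_id -> set of object DNs touched so far

    def on_object_operation(self, ccb_id, object_dn):
        # Called for each create/modify/delete the OI is asked to handle.
        self._open.setdefault(ccb_id, set()).add(object_dn)

    def on_apply_or_abort(self, ccb_id):
        # The CCB has terminated, so it no longer blocks anything.
        self._open.pop(ccb_id, None)

    def blocking_ccbs(self):
        return sorted(self._open)


def request_switchover(tracker):
    """Reject the switchover while any tracked CCB is still open."""
    blocking = tracker.blocking_ccbs()
    if blocking:
        return False, "switchover rejected: active CCB(s) %s modify AMF data" % blocking
    return True, "switchover accepted"
```

Rejecting up front keeps both SCs running; the operator can terminate the CCB and order the switchover again.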


Anders Bjornerstedt - 2014-09-18

For the failover case, the new active AMFD really must wait eternally
on implementer-set, preferably in combination with actions directed
at resolving the issue, such as the proposed admin-op on imm
(enhancement #1107).

The "alternative" of a cluster restart is not an alternative.
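
One way to picture "wait eternally, combined with actions" is a task that makes a single implementer-set attempt each time it is resumed and, after a grace period, asks for the blocking CCBs to be aborted. The sketch below uses a Python generator for the parked task. `try_set`, `request_ccb_abort`, and `grace_attempts` are hypothetical; `request_ccb_abort` merely stands in for the admin-op proposed in #1107, whose real interface is not shown here.

```python
def implementer_set_task(try_set, request_ccb_abort, grace_attempts=3):
    """Generator that performs exactly one implementer-set attempt per resume.

    Yields False after a failed attempt (the caller parks the task and
    resumes it later) and yields True once an attempt succeeds, after which
    the generator ends. After `grace_attempts` failures it calls
    request_ccb_abort() once. It never gives up on its own.
    """
    failures = 0
    requested = False
    while True:
        if try_set():
            yield True
            return
        failures += 1
        if failures >= grace_attempts and not requested:
            request_ccb_abort()
            requested = True
        yield False
```

A main loop would call `next()` on the task from a timer and carry on with other work in between, so no restart is ever triggered by the blocked attach.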


Anders Bjornerstedt - 2014-09-18

Related: #1107
http://sourceforge.net/p/opensaf/tickets/1107
http://sourceforge.net/p/opensaf/tickets/1108
http://sourceforge.net/p/opensaf/tickets/1111

Last edit: Anders Bjornerstedt 2014-09-18


Anders Widell - 2014-10-07

Milestone: 4.3.3 --> 4.4.2

Nagendra Kumar - 2015-03-24

Milestone: 4.4.2 --> future

Nagendra Kumar - 2015-06-17

status: unassigned --> accepted

assigned_to: Nagendra Kumar

Milestone: future --> 4.5.2

Nagendra Kumar - 2015-06-17

Here is what I will go with:
1. AMF will return TRY_AGAIN for the SI-SWAP admin op while any CCB is in progress, and will set the "error string" appropriately.
2. AMF will return TRY_AGAIN in response to a CCB while an SI-SWAP is in progress, and will also set the "error string".
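
The two rules above amount to mutual exclusion between an SI-SWAP and CCBs. A minimal sketch, assuming a single guard object inside the active AMFD; the class, its method names, and the error-string texts are hypothetical, and only the two return-code names are taken from the AIS return codes.

```python
OK = "SA_AIS_OK"
TRY_AGAIN = "SA_AIS_ERR_TRY_AGAIN"


class SwapCcbGuard:
    """Lets either CCBs or an SI-SWAP proceed, never both at the same time."""

    def __init__(self):
        self.open_ccbs = set()
        self.si_swap_in_progress = False

    def ccb_operation(self, ccb_id):
        # Rule 2: a CCB is asked to retry while an SI-SWAP is running.
        if self.si_swap_in_progress:
            return TRY_AGAIN, "SI-SWAP in progress, retry the CCB later"
        self.open_ccbs.add(ccb_id)
        return OK, ""

    def ccb_finished(self, ccb_id):
        # Called when the CCB is applied or aborted.
        self.open_ccbs.discard(ccb_id)

    def si_swap_start(self):
        # Rule 1: the SI-SWAP is asked to retry while any CCB is open.
        if self.open_ccbs:
            return TRY_AGAIN, "CCB in progress, retry the SI-SWAP later"
        self.si_swap_in_progress = True
        return OK, ""

    def si_swap_done(self):
        self.si_swap_in_progress = False
```

Each method returns a `(return code, error string)` pair, so the caller always learns why it was asked to retry.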

Also #1108 and #1111 will be closed.

Thanks,
-Nagu

Last edit: Nagendra Kumar 2015-06-26

- Anders Bjornerstedt - 2015-06-17
  
  Hi
  
  Fix (1) fixes the problem reported in #1111 (#1111 is an enhancement).
  Fix (2) is only a partial fix: it fixes #1105 for the si-swap case. Not sure about the failover case.
  
  Ticket #1108 is also an enhancement that will speed up the progress of any si-swap or failover that has problems
  setting OI (or applier).
  I see enhancement #1108 as still a valid enhancement even after we have this proposed fix for #1105.
  The fix proposed in #1108 is also trivial to implement. Just send the admin-op request asynchronously.
  No need to wait on a response.
  
  /AndersBj
  
  - Anders Bjornerstedt - 2015-06-17
    
    On 06/17/2015 01:28 PM, Anders Bjornerstedt wrote:
    
    Hi
    
    Fix (1) fixes the problem reported in #1111 (#1111 is an enhancement).
    Fix (2) is only a partial fix: it fixes #1105 for the si-swap case. Not sure about the failover case.
    
    I just reproduced problem #1105 for the failover case (not switchover).
    To do so only requires a CCB that lingers, say for 240 seconds before applying,
    and that the PBE (IMMND coord) resides at the standby SC before failover.
    If the PBE (IMMND coord) resides at the active SC before failover, then it has to
    re-attach at the standby at failover, and since the PBE invokes the admin-op
    for aborting non-critical CCBs when re-attaching, the AMF is in that case saved
    by the PBE. But that will be roughly 50% of the failover cases.
    
    If the PBE does not need a restart at failover, because it already resided
    at old-standby-new-active, then the AMFD at old-standby-new-active is not saved
    by the PBE and will reboot, resulting in CLUSTER RELOAD.
    
    So I claim that to really fix #1105, not just for the si-swap interference
    problem but also for the failover case, you really need the fix for #1108.
    There are of course alternatives to a fix of type #1108.
    But why not take that one when we have it, instead of inventing yet
    another way, or delaying indefinitely becoming AMF-OI?
    
    The solution of the AMFD invoking an admin-op on the IMM was earlier
    "rejected" with the motivation that such a solution was "proprietary".
    While "proprietary" is not the correct word for describing a mechanism
    that is public and part of an open-source implementation, I guess the
    complaint was that the solution was OpenSAF specific. But I don't get what
    the problem would be with the internals of OpenSAF being OpenSAF specific.
    
    /AndersBj
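
The failover analysis above can be summarised as a small decision function. This is only an illustration of the reasoning in the comment; the function name, parameter names, and outcome labels are hypothetical.

```python
NO_INTERFERENCE = "AMFD attaches as OI normally"
SAVED_BY_PBE = "PBE re-attaches and aborts non-critical CCBs, AMFD attaches"
CLUSTER_RELOAD = "AMFD gives up and reboots, resulting in cluster reload"


def failover_outcome(pbe_on_old_active, lingering_amf_ccb):
    """Outcome of a failover when a non-critical CCB on AMF data may be lingering.

    pbe_on_old_active: True if the PBE (IMMND coord) resided at the SC that
    was active before the failover and therefore has to re-attach.
    """
    if not lingering_amf_ccb:
        return NO_INTERFERENCE
    if pbe_on_old_active:
        # The re-attaching PBE invokes the admin-op that aborts the CCB.
        return SAVED_BY_PBE
    # The PBE already sits at the new active SC, so nothing aborts the CCB.
    return CLUSTER_RELOAD
```

Only the branch where the PBE already resides at the new active SC ends in a cluster reload, which is the case the comment argues needs the fix for #1108.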
    
    Ticket #1108 is also an enhancement that will speed up the progress of
    any si-swap or failover that has problems
    setting OI (or applier).
    I see enhancement #1108 as still a valid enhancement even after we
    have this proposed fix for #1105.
    The fix proposed in #1108 is also trivial to implement. Just send the
    admin-op request asynchronously.
    No need to wait on a response.
    
    /AndersBj
    
    From: Nagendra Kumar [mailto:nagendra-k@users.sf.net]
    Sent: den 17 juni 2015 12:46
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked
    on becoming applier
    
    Here is what I go along:
    1. Amf will return TRY_AGAIN for the SI-SWAP admin op when any ccb is
    going on. And AMF will also set the "error string" appropriately.
    2. AMF will return TRY_AGAIN in response of CCB, when SI-SWAP is in
    progress and AMF will also set the "error string".
    
    Also #1108 and #1111 will be closed.
    
    Thanks,
    -Nagu
    
    [tickets:#1105]
    http://sourceforge.net/p/opensaf/tickets/1105http://sourceforge.net/p/opensaf/tickets/1105
    AMFD: New standby crashes if blocked on becoming applier
    
    Status: accepted
    Milestone: 4.5.2
    Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
    Last Updated: Wed Jun 17, 2015 09:48 AM UTC
    Owner: Nagendra Kumar
    
    This ticket is in essence a continuation of ticket #1078
    
    http://sourceforge.net/p/opensaf/tickets/1078/
    
    In switchover, the new standby fails to attach as AMFD applier. It retries
    this for a limited time (45 seconds or so), but finally gives up and the
    AMFD standby restarts.
    
    In ticket 1078 the blockage was actually caused by a bug, because the
    lingering CCB was in that case not interfering with AMF data (data
    monitored by the AMFD-OI and the AMFD-applier). That "false" interference
    is fixed by the patch for #1078.
    
    But this ticket tracks the case of true interference. The very same
    symptom can be achieved by creating a CCB that modifies an AMF object and
    then lingers. An si-swap done in this setup will result in the new standby
    rebooting after it gives up retrying.
    
    The new active AMFD is doing the very same thing, failing to set itself
    as OI 'saAmfService' because of the interfering CCB. But the crashed
    standby AMFD triggers the restart of that SC, which triggers a sync, which
    aborts the CCB, removing the blockage for the new active AMFD.
    
    Note that this scenario is not totally unrealistic. An operator starts to
    build a CCB, forgets about it and then performs an si-swap. That will
    cause an SC restart. Not good.
    
    While a good NBI frontend should buffer the CCB and only send it to the
    system when the operator does his/her high level apply, we can not rely
    on that.
    
    I reproduced this scenario by hacking immcfg so that it waits 60 seconds
    before invoking the saImmOmCcbApply. Then invoked this on one node:
    
    immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 \
    safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN
    
    The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
    inside immcfg itself and aborting the CCB before the scenario can
    complete.
    
    Quickly after invoking the above I order an si-swap from another
    shell/node:
    
    immadm -o 7 safSi=SC-2N,safApp=OpenSAF
    
    The basic problem here is that neither the AMFD-OI nor the AMFD-applier
    can attach as long as there is an active, non-empty CCB that contains
    operations on AMF objects.
    
    The first level of solution in my opinion is that both AMFDs should retry
    forever (in a separate thread, assumed to be the case already) to attach as
    implementer/applier. A notification should be sent periodically
    to inform the operator, or whoever is listening, that there is a lingering
    AMF related CCB that should be terminated (aborted or committed by the
    user).
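    
    The retry-forever idea can be sketched roughly as below. This is an illustration in Python, not AMFD code: `try_attach`, `notify` and `sleep` are hypothetical callables supplied by the caller, and the interval values are arbitrary assumptions.
    
    ```python
    import itertools
    
    
    def attach_until_success(try_attach, notify, sleep,
                             retry_interval=1.0, notify_every=30):
        """Retry try_attach() until it succeeds; never give up, never restart.
    
        try_attach: hypothetical callable, True once the OI/applier is set,
                    False while IMM keeps answering TRY_AGAIN.
        notify:     hypothetical callable that raises the periodic notification.
        sleep:      injected so the loop can be exercised without real delays.
        Returns the number of attempts that were needed.
        """
        for attempt in itertools.count(1):
            if try_attach():
                return attempt
            if attempt % notify_every == 0:
                notify(f"lingering AMF-related CCB blocks attach "
                       f"after {attempt} attempts")
            sleep(retry_interval)
    ```
    
    In a real daemon this loop would run in its own thread (or as a parked, recurring single try) so that the "eternal" task stays isolated from the rest of the service.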
    
    Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should
    provide an admin-operation for this purpose. The active AMFD could invoke
    this admin-op to trigger the immsv to clear all non-critical CCBs. It
    should do this if it ends up in the implementer-set TRY_AGAIN loop,
    preferably after it has waited for a while. Adding such an
    admin-operation to the immsv and implementing its use in AMF should
    probably be seen as two enhancements.
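    
    A rough sketch of that escalation, again in Python and under stated assumptions: `clear_noncritical_ccbs` stands in for the proposed admin-operation, which is hypothetical here, and invoking it only once after `escalate_after` failed attempts is a choice made for the sketch, not something the ticket prescribes.
    
    ```python
    def attach_with_escalation(try_attach, clear_noncritical_ccbs, sleep,
                               retry_interval=1.0, escalate_after=10):
        """Retry forever; after waiting a while, ask the immsv to clear CCBs.
    
        try_attach:             hypothetical callable, True once attached.
        clear_noncritical_ccbs: hypothetical stand-in for the proposed admin-op.
        Returns (attempts_needed, whether_the_admin_op_was_invoked).
        """
        attempt = 0
        escalated = False
        while True:
            attempt += 1
            if try_attach():
                return attempt, escalated
            # Still in the TRY_AGAIN loop: escalate once the wait is long enough.
            if not escalated and attempt >= escalate_after:
                clear_noncritical_ccbs()
                escalated = True
            sleep(retry_interval)
    ```
    
    The loop keeps retrying after the escalation, because a blocked critical CCB would not be cleared by such an admin-op.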
    
    The really thorny issue is that there can be blocked critical CCBs.
    These are CCBs where the immsv is waiting on the result of commit from
    PBE. The probability is low that there is both a critical CCB stuck and
    that it contains AMF object operations, but it can happen. Such a system
    is in ANY CASE stuck in its CCB processing, so the AMF should wait
    indefinitely here. Currently the system should cluster restart after some
    time. Not good. The immsv can not clear critical CCBs by itself. The only
    option is to use the admin-op (already implemented) for emergency
    disablement of PBE.
    
    To summarize: This defect ticket is only concerned with the problem of
    the AMF rebooting its standby when this scenario occurs. This should be
    changed to eternal wait with periodic notifications. The AMF service is
    functioning but can not process configuration changes on its data while
    in this state. That is not a fatal condition and so should not be
    escalated to SC restart.
    
    The problem of how to clear the interfering CCB can be solved in many
    ways. A short term alternative (a hack solution) is for the AMF to reboot
    a payload. That would also trigger a sync clearing all non-critical CCBs.
    
    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/opensaf/tickets/1105/
    
    To unsubscribe from further messages, please visit
    https://sourceforge.net/auth/subscriptions/
    

Anders Bjornerstedt - 2015-06-17

summary: AMFD: New standby crashes if blocked on becoming applier --> AMFD: New standby crashes if blocked on becoming applier - both failover and switchover

Nagendra Kumar - 2015-06-22

Non-critical CCBs will be taken care of by ticket #1391.
For critical CCBs, AMF should be OK with waiting a little when PBE delays the response.

So I will go ahead and implement the two points mentioned above as part of #1105, and the others will get closed.

Thanks
-Nagu

Last edit: Nagendra Kumar 2015-06-22

- Anders Bjornerstedt - 2015-06-29
  
  For critical CCBs the wait can be indefinite since the delay can be due to problems on the file system.
  
  The AMF should not block a failover just because it can not attach as OI.
  There is no inherent functional dependence of the AMF failover mechanism on the AMF OI being available.
  Any such dependency is unnecessary and an impediment to service availability.
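  
  A toy model of that decoupling, in Python and not taken from the AMF sources. The `Director` class and the `try_attach` callable are hypothetical; the sketch only shows the role change completing while the OI attach keeps retrying in a separate thread.
  
  ```python
  import threading
  
  
  class Director:
      """Toy director whose failover does not wait for the OI attach."""
  
      def __init__(self, try_attach, retry_interval=0.01):
          self.role = "STANDBY"
          self.oi_attached = threading.Event()
          self._try_attach = try_attach          # hypothetical attach attempt
          self._retry_interval = retry_interval
          self._stop = threading.Event()
  
      def _attach_loop(self):
          # Runs in its own thread: a blocked attach only delays CCB handling.
          while not self._stop.is_set():
              if self._try_attach():
                  self.oi_attached.set()
                  return
              self._stop.wait(self._retry_interval)
  
      def failover(self):
          # Take the active role at once; the attach continues in the background.
          self.role = "ACTIVE"
          worker = threading.Thread(target=self._attach_loop, daemon=True)
          worker.start()
          return worker
  
      def stop(self):
          self._stop.set()
  ```
  
  While `oi_attached` is not set, such a director would serve everything except CCB modifications on its own objects.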
  
  /AndersBj
  
  From: Nagendra Kumar [mailto:nagendra-k@users.sf.net]
  Sent: 22 June 2015 08:54
  To: opensaf-tickets@lists.sourceforge.net
  Subject: [tickets] [opensaf:tickets] #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover
  
  Non-critical CCBs will be taken care of by ticket #1391.
  For critical CCBs, AMF should be OK with waiting a little when PBE delays the response.
  
  So I will go ahead and implement the two points mentioned above as part of #1105, and the others will get closed.
  
  Thanks
  -Nagu
  

Nagendra Kumar - 2015-06-26

status: accepted --> review

Nagendra Kumar - 2015-06-29

Hi Anders,
Thanks for your comments. But IMM needs to take time-bound action in that case. It can't wait long for a response.

Thanks
-Nagu

- Anders Bjornerstedt - 2015-06-29
  
  It is impossible for the IMM to do anything about a critical CCB until the PBE re-attaches, and the PBE can only re-attach
  when the file system is available. So no, the IMM can not force the issue here. It can neither abort nor commit the CCB,
  since this would have a 50/50 chance of diverging from the PBE representation of the CCB. A cluster restart may very
  well happen before the file system comes back; in fact a cluster restart may be caused by the long absence of the
  file system. That is in fact what happens here.
  
  As far as the AMF is concerned, since neither the old active nor the new active (the old standby) can have received any apply
  or abort callback in this case, the AMF should act as if it is still waiting for the commit of the CCB, i.e. as if the
  before-image for the CCB is what is valid, which it is.
  
  /AndersBj
  
  From: Nagendra Kumar [mailto:nagendra-k@users.sf.net]
  Sent: 29 June 2015 09:02
  To: [opensaf:tickets]
  Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover
  
  Hi Anders,
  Thanks for your comments. But IMM needs to take time-bound action in that case. It can't wait long for a response.
  
  Thanks
  -Nagu
  
  - Anders Bjornerstedt - 2015-06-29
    
    Even if the AMF re-reads its entire AMF model from IMM during the PBE absence, it will see the state
    of the AMF model without that CCB being committed. IMM-ram only commits the CCB after the PBE has
    returned and responded on the outcome of the CCB.
    
    The good news is that the IMM is behaving entirely transactionally with CCBs.
    The only bad news is that the AMF currently does not wish to follow the transactional model (relative to imm data) during failover.
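    
    The visibility rule described above can be modelled in a few lines. This is a toy Python model, not the IMM implementation: `ImmRam` and its method names are hypothetical, and the attribute name is only borrowed from the reproduction command as sample data.
    
    ```python
    class ImmRam:
        """Toy model: a critical CCB stays invisible until the PBE answers."""
    
        def __init__(self, objects):
            self._committed = dict(objects)   # the before-image readers see
            self._critical = None             # changes still awaiting the PBE
    
        def apply_ccb(self, changes):
            # The CCB turns critical: handed to the PBE, not yet visible.
            self._critical = dict(changes)
    
        def read(self, name):
            # Readers always get committed state, never pending changes.
            return self._committed[name]
    
        def pbe_outcome(self, committed):
            # Only the PBE's answer decides whether the changes become visible.
            if committed and self._critical is not None:
                self._committed.update(self._critical)
            self._critical = None
    ```
    
    A reader that re-reads the whole model while the PBE is absent therefore keeps seeing the before-image, which is the state the AMF is expected to act on.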
    
    /AndersBj
    
    From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
    Sent: 29 June 2015 09:22
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] Re: #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover
    
    It is impossible for the IMM to do anything about a critical CCB until the PBE re-attaches, and the PBE can only re-attach
    when the file system is available. So no, the IMM can not force the issue here. It can neither abort nor commit the CCB,
    since this would have a 50/50 chance of diverging from the PBE representation of the CCB. A cluster restart may very
    well happen before the file system comes back; in fact a cluster restart may be caused by the long absence of the
    file system. That is in fact what happens here.
    
    As far as the AMF is concerned, since neither the old active nor the new active (the old standby) can have received any apply
    or abort callback in this case, the AMF should act as if it is still waiting for the commit of the CCB, i.e. as if the
    before-image for the CCB is what is valid, which it is.
    
    /AndersBj
    
    From: Nagendra Kumar [mailto:nagendra-k@users.sf.net]
    Sent: 29 June 2015 09:02
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] #1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover
    
    Hi Anders,
    Thanks for your comments. But IMM needs to take time-bound action in that case. It can't wait long for a response.
    
    Thanks
    -Nagu
    
    
    [tickets:#1105]http://sourceforge.net/p/opensaf/tickets/1105 AMFD: New standby crashes if blocked on becoming applier - both failover and switchover
    
    Status: review
    Milestone: 4.5.2
    Created: Wed Sep 17, 2014 06:18 PM UTC by Anders Bjornerstedt
    Last Updated: Mon Jun 29, 2015 07:02 AM UTC
    Owner: Nagendra Kumar
    
    This ticket is in essence a continuation of ticket #1078
    
    http://sourceforge.net/p/opensaf/tickets/1078/
    
    In switchover, the new standby fails to attach as AMFD applier. It retries
    this for a limited time (45 seconds or so), but finally gives up and AMFD standby
    restarts.
    
    In ticket 1078 the blockage was actually caused by a bug because the lingering
    CCB was in that case not interfering with AMF data (data monitored by the
    AMFD-OI and the AMFD-applier). That "false" interference is fixed by the patch
    for #1078.
    
    But this ticket tracks the case of true interference. The very same symptom
    can be achieved by creating a CCB that modifies an AMF object and then lingers.
    An si-swap done in this setup will result in the new standby rebooting after
    it gives up in retrying.
    
    The new active AMFD is doing the very same thing, failing to set itself
    as OI 'safAmfService' because of the interfering CCB. But the crashed standby
    AMFD triggers the restart of that SC, which triggers a sync, which aborts the
    CCB removing the blockage for the new active AMFD.
    
    Note that this scenario is not totally unrealistic. An operator starts to
    build a CCB, forgets about it, and then performs an si-swap. That will cause
    an SC restart. Not good.
    
    While a good NBI frontend should buffer the ccb and only send it to the system
    when the operator does his/her high level apply, we can not rely on that.
    
    I reproduced this scenario by hacking immcfg so that it waits 60 seconds before
    invoking the saImmOmCcbApply. Then invoked this on one node:
    
    immcfg -t 120 -m -a saAmfCtDefDisableRestart=0 safVersion=4.0.0,safCompType=OpenSafCompTypeIMMN
    
    The high immcfg timeout (-t 120) is needed to avoid the OM side timing out
    inside immcfg itself and aborting the CCB before the scenario can complete.
    
    Quickly after invoking the above I order an si-swap from another shell/node:
    
    immadm -o 7 safSi=SC-2N,safApp=OpenSAF
    
    The basic problem here is that neither the AMFD-OI nor the AMFD-applier can
    attach as long as there is an active, non-empty ccb, that contains operations
    on AMF objects.
    
    The first level of solution in my opinion is that both AMFDs should retry
    forever (in a separate thread, which I assume is already the case) to attach as
    implementer/applier. A notification should be sent periodically
    to inform the operator, or whoever is listening, that there is a lingering
    AMF related CCB that should be terminated (aborted or committed by the user).
    
    Rebooting an SC is a very coarse way of clearing CCBs. The Immsv should provide
    an admin-operation for this purpose. The active AMFD could invoke this admop
    to trigger the immsv to clear all non-critical CCBs. It should do this if it
    ends up in the implementer-set TRY_AGAIN loop. Preferably after it has waited
    for a while. Adding such an admin-operation to the immsv and implementing
    its use in AMF should probably be seen as two enhancements.
    
    The really thorny issue is that there can be blocked critical CCBs.
    These are CCBs where the immsv is waiting on the result of commit from PBE.
    The probability is low that there is both a critical CCB stuck and that it
    contains AMF object operations, but it can happen. Such a system is in ANY CASE
    stuck in its CCB processing so the AMF should wait indefinitely here.
    Currently the system should cluster restart after some time. Not good.
    The immsv can not clear critical CCBs by itself. The only option is to
    use the admin-op (already implemented) for emergency disablement of PBE.
    
    To summarize: This defect ticket is only concerned with the problem of the AMF
    rebooting its standby when this scenario occurs. This should be changed to
    eternal wait with periodic notifications. The AMF service is functioning but
    can not process configuration changes on its data while in this state.
    That is not a fatal condition and so should not be escalated to SC restart.
    
    The problem of how to clear the interfering CCB can be solved in many ways.
    A short term alternative (a hack solution) is for the AMF to reboot a payload.
    That would also trigger a sync, clearing all non-critical CCBs.
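
The retry-forever proposal in the description above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not AMFD code: `attach_forever`, `Result`, and all parameters are hypothetical names, `try_attach` stands in for the implementer/applier set call that keeps returning TRY_AGAIN, `notify` stands in for the periodic notification, and `clear_ccbs` stands in for whatever admin-operation would be used to clear non-critical CCBs after a grace period.

```python
import itertools
from enum import Enum


class Result(Enum):
    # Hypothetical stand-ins for the return codes of the attach call.
    OK = "OK"
    TRY_AGAIN = "TRY_AGAIN"


def attach_forever(try_attach, notify, sleep,
                   retry_interval=0.5, notify_every=60.0,
                   clear_after=None, clear_ccbs=None):
    """Retry try_attach() until it returns Result.OK; never give up.

    notify(msg) is called every notify_every seconds of accumulated waiting.
    If clear_ccbs and clear_after are both given, clear_ccbs() is called once
    after clear_after seconds of waiting. Returns the number of attempts.
    """
    waited = 0.0
    next_notify = notify_every
    cleared = False
    for attempt in itertools.count(1):
        if try_attach() is Result.OK:
            return attempt
        # Park the task between single tries instead of restarting the node.
        sleep(retry_interval)
        waited += retry_interval
        if waited >= next_notify:
            notify("attach still blocked after %.0f s; a lingering AMF "
                   "related CCB should be aborted or committed" % waited)
            next_notify += notify_every
        can_clear = clear_ccbs is not None and clear_after is not None
        if can_clear and not cleared and waited >= clear_after:
            # Escalate once: ask for the interfering CCBs to be cleared.
            clear_ccbs()
            cleared = True
```

In a daemon this would run with `time.sleep` as `sleep` and inside its own thread, so that the "eternal" task stays isolated from the rest of the service.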
    

Anders Bjornerstedt - 2015-06-29

The patch submitted for review

https://sourceforge.net/p/opensaf/mailman/message/34243280/

looks OK from a correctness point of view as I see it. But there is an
incompleteness issue as I understand it. The ticket reports a problem for
both si-swap and failover. I don't see that the patch fixes the failover
variant of the problem. The problem is that a failover can not be "rejected"
the way that an SI-swap request can be.
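
The asymmetry between si-swap and failover can be shown with a toy model. This is a hedged sketch, not the submitted patch: the class `Director`, its methods, and the "ACCEPTED"/"REJECTED" strings are all hypothetical, and the real return codes used by AMFD are not shown in this thread.

```python
class Director:
    """Toy model of making CCB operations and si-swap mutually exclusive."""

    def __init__(self):
        self.open_ccbs = set()        # CCBs touching AMF objects, not yet finished
        self.swap_in_progress = False

    def ccb_operation(self, ccb_id):
        # A configuration change is refused while a swap is running.
        if self.swap_in_progress:
            return "REJECTED"
        self.open_ccbs.add(ccb_id)
        return "ACCEPTED"

    def ccb_finished(self, ccb_id):
        # Called when the CCB is applied or aborted.
        self.open_ccbs.discard(ccb_id)

    def si_swap(self):
        # A swap is an operator request, so it can be refused and retried later.
        if self.open_ccbs:
            return "REJECTED"
        self.swap_in_progress = True
        return "ACCEPTED"

    def swap_done(self):
        self.swap_in_progress = False

    def failover(self):
        # A failover is an event, not a request: the peer is already gone,
        # so nothing can be refused. Any open CCBs are still there and can
        # still interfere; return them so the caller can see the problem.
        return sorted(self.open_ccbs)
```

With one open CCB, `si_swap()` is refused until `ccb_finished()` is called, while `failover()` proceeds and simply reports the CCB that is still open. That remaining case is the one the comment above says the patch does not cover.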

Nagendra Kumar - 2015-07-02

Hi Anders,
Thanks for your review. The patch review already mentions that it does not handle the fail-over case.

But the fail-over case could be sorted out if Imm returns the OI call immediately and marks Amfd as implementer. Doing so would not cause a problem for Amf, as it can accept the apply after becoming active and process it.

Making Amfd hang in the Imm OI call hampers Imm service availability, so Imm could share some responsibility for it.

Thanks
-Nagu


Minh Hon Chau - 2015-07-22

I have been testing the patch floated for review on the default branch; it works fine in the si-swap case.
I am also trying to test failover. These are the steps I did:
On SC-2 (standby):
- immcfg -t 150 -m -a saAmfCtDefDisableRestart=0 safVersion=4.0.0,safCompType=OpenSafCompTypeIMMND
- immcfg --ccb-apply (sleeps 60s because of the immcfg hack)
On SC-1 (active), issue reboot

I have seen the CCB aborted on SC-2

Jul 22 15:40:51 SC-2 osafimmnd[428]: NO Received: immadm -o 202 safRdn=immManagement,safApp=safImmService
Jul 22 15:40:52 SC-2 osafimmnd[428]: NO CCB 2 aborted by: immadm -o 202 safRdn=immManagement,safApp=safImmService
Jul 22 15:40:52 SC-2 osafimmnd[428]: WA Timeout while waiting for implementer, aborting ccb:2
Jul 22 15:40:52 SC-2 osafimmnd[428]: NO Ccb 2 ABORTED (immcfg_SC-2_711)
Jul 22 15:40:52 SC-2 osafimmnd[428]: WA >>s_info->to_svc == 0<< reply context destroyed before this reply could be made
Jul 22 15:40:52 SC-2 osafimmnd[428]: WA Failed to send response to agent/client over MDS
Jul 22 15:40:52 SC-2 osafimmnd[428]: NO Implementer connected: 13 (safAmfService) <11, 2020f>
Jul 22 15:40:52 SC-2 osafamfd[480]: NO Node 'SC-1' left the cluster
Jul 22 15:40:52 SC-2 osafamfd[480]: NO FAILOVER StandBy --> Active DONE!

and then amfd-SC1 successfully attaches as applier.

I am not sure whether this means we are fine in solving this interference?


Nagendra Kumar - 2015-07-22

Hi Minh,
The patch solves the issue in the switchover case only, as mentioned in the patch review description. In the failover case it may not work (maybe the CCB is not related to Amf objects, so it may be working fine).

Thanks
-Nagu


Nagendra Kumar - 2015-07-24

status: review --> fixed

Nagendra Kumar - 2015-07-24

changeset: 6680:9761b6ed401c
tag: tip
parent: 6677:8f8719202335
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Fri Jul 24 14:36:44 2015 +0530
summary: amfd: make ccb op and mw si swap mutually exclusive [#1105]

changeset: 6679:e22c9ac87dfb
branch: opensaf-4.6.x
parent: 6675:6b9b2cef6dfa
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Fri Jul 24 14:36:34 2015 +0530
summary: amfd: make ccb op and mw si swap mutually exclusive [#1105]

changeset: 6678:9f1ebabba913
branch: opensaf-4.5.x
parent: 6674:d715a124a2ad
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Fri Jul 24 14:35:20 2015 +0530
summary: amfd: make ccb op and mw si swap mutually exclusive [#1105]

[staging:9f1eba]
[staging:e22c9a]
[staging:9761b6]

Related

Tickets: ~~#1105~~
Commit: [9761b6]
Commit: [9f1eba]
Commit: [e22c9a]
