#639 IMMSV: 2PBE - Automatic internal transition to oneSafe2PBE when an SC goes down.

future
unassigned
nobody
None
enhancement
imm
-
minor
2013-12-03
2013-11-26
No

When 2PBE is used, IMM will only allow CCBs while one of the controllers is down
if bit 4 of the nostdFlags attribute in the OpenSAF IMM service object has been
toggled on. Currently this has to be done explicitly from outside the IMM.

This would be done either manually by the operator, or invoked by the deployment
wrappings on top of OpenSAF.
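
As an illustration of what "toggling bit 4" amounts to on the flag word, here is a
minimal Python sketch. It assumes the bit number in the text is 1-based (so bit 4
corresponds to mask 0x8); that numbering, and all function and constant names
below, are assumptions made for the example and are not OpenSAF identifiers. It
only models the arithmetic, not the admin-operation itself.

```python
# Hypothetical helpers for reasoning about a flags word such as nostdFlags.
# ASSUMPTION: bit numbers are 1-based, i.e. bit 1 is the least significant bit.

ONE_SAFE_2PBE_BIT = 4  # bit number taken from the ticket text


def flag_mask(bit_number: int) -> int:
    """Return the integer mask for a 1-based bit number (bit 1 -> 0x1)."""
    if bit_number < 1:
        raise ValueError("bit_number is 1-based and must be >= 1")
    return 1 << (bit_number - 1)


def is_set(flags: int, bit_number: int) -> bool:
    """True if the given bit is on in the flags word."""
    return bool(flags & flag_mask(bit_number))


def toggle_on(flags: int, bit_number: int) -> int:
    """Return flags with the given bit switched on; other bits are untouched."""
    return flags | flag_mask(bit_number)


def toggle_off(flags: int, bit_number: int) -> int:
    """Return flags with the given bit switched off; other bits are untouched."""
    return flags & ~flag_mask(bit_number)
```

Under that assumption, `toggle_on(current_flags, ONE_SAFE_2PBE_BIT)` leaves any
other flags intact and sets only the oneSafe2PBE bit.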

The current recommendation is not to toggle this bit on immediately when there
is a spontaneous SC restart. If the SC bounces back up again (the normal case),
the window during which CCBs would have to be performed with only one SC is
very short, so very little CCB processing availability is lost by waiting. The
risk of divergence between the PBE files is also minimized.

For cases where an SC is down for long periods, however, the oneSafe2PBE state
clearly must be entered. Long-duration downtime for an SC is either planned or
unplanned.

If it is a planned shutdown of an SC, then it should not be too much to ask that
the oneSafe2PBE bit also be toggled on, either manually as part of an OPI
(Operator Procedure Instruction) or wrapped into the scripts that are typically
invoked in that situation.

If it is an unplanned departure of an SC and that SC does not bounce back
within some time limit, then clearly there is a need for repair of that SC.
An alarm must be generated for that case. The same trigger that must exist
to generate such an alarm (on any system attempting to be highly available)
should also be used to invoke the admin-operation to toggle on oneSafe2PBE.
The reason the alarm must be triggered for this case is that a prolonged state
with only one SC means a prolonged state with a single point of failure, which
in turn will result (statistically) in reduced availability of the cluster.
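
The policy described above can be sketched as a small decision function. This is
only an illustration under stated assumptions: the threshold value, the
`Decision` type, and the `decide` function are hypothetical names invented for
the example, and actually raising the alarm or invoking the admin-operation is
left to whatever external mechanism monitors the SCs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """What the external monitor should do (hypothetical type)."""
    raise_alarm: bool       # SC needs repair; cluster has a single point of failure
    enable_one_safe: bool   # invoke the admin-op that toggles on oneSafe2PBE


def decide(sc_down_seconds: Optional[float],
           repair_threshold_seconds: float,
           one_safe_enabled: bool) -> Decision:
    """Decide on actions for an unplanned SC departure.

    sc_down_seconds is None while both SCs are up, otherwise the time the
    missing SC has been gone. The threshold is a deployment choice.
    """
    if sc_down_seconds is None:
        # Both SCs are up: nothing to do.
        return Decision(raise_alarm=False, enable_one_safe=False)
    if sc_down_seconds < repair_threshold_seconds:
        # Probably a spontaneous restart: wait for the SC to bounce back.
        return Decision(raise_alarm=False, enable_one_safe=False)
    # The SC did not come back in time: the same trigger raises the alarm
    # and, if not already done, opens up for CCBs with one SC.
    return Decision(raise_alarm=True, enable_one_safe=not one_safe_enabled)
```

For example, with a hypothetical threshold of 300 seconds, an SC that has been
gone for 10 seconds yields no action, while one gone for 600 seconds yields both
the alarm and the oneSafe2PBE toggle.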

Thus I am reducing the priority of this enhancement to minor, since I see it
at best as a redundant mechanism and at worst as a misdirected emphasis on
CCB availability at the expense of total cluster availability.

Note also that any reduced system, intended to permanently run with one SC,
should run with regular 1PBE (or 0PBE), never with 2PBE.

Discussion

  • Anders Bjornerstedt

    Clarification: the 2PBE IMM enhancement does provide for allowing CCBs when
    one SC is down, but requires an admin-op to be invoked for this
    (as explained in the documentation for 2PBE).

    The main case where this would be relevant would be any longer-term
    unavailability of an SC, not simply an SC restart.

    In particular, SC restarts as part of an upgrade campaign are not relevant
    since PBE would be disabled during upgrade campaigns.

    A spontaneous SC restart where the SC comes back up within a reasonable
    time should likewise not be a problem. The CCB OM API would get TRY_AGAIN
    during such an SC restart, which most CCBs should tolerate.

    A longer-term unplanned unavailability of an SC should really be caught by
    some mechanism other than the IMM, and an alarm generated. That same mechanism
    can then also invoke the admin-op on the IMM to open up for CCBs.

    Relevant here is the fact that running a 2PBE system with only one SC is
    more dangerous than executing a regular PBE system (based on a shared
    file system) with one SC. The risk for a rewind of state is greater
    with 2PBE should there be a cluster restart.

    So the admin-op for allowing CCBs with only one SC in a 2PBE system should
    ideally only be invoked if there is some crucially important CCB that
    must be executed during this vulnerable period. That also speaks for
    making this into a conscious decision made by the operator.

     
  • Anders Bjornerstedt

    • summary: IMMSV: Support to allow CCB's when a controller is down when 2PBE is configured --> IMMSV: 2PBE - Automatic internal transition to oneSafe2PBE when an SC goes down.
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1 +1,37 @@
    -When 2PBE configuration is used IMM will not allow CCB's when one of the controller is down. This is a request for the IMM to allow CCB's after a certain time even if the controller does not come back (to handle the case where the controller is down for a long time e.g. due to HW error).
    +When 2PBE is used, IMM will only allow CCB's when one of the controller is down
    +if bit 4 of the nostdFlags attribute in the OpenSAF IMM service object has been
    +toggled to on. Curently this has to be done explicitly from outside the IMM.
    +
    +This would be done either manually by the operator, or invoked by the deployment
    +wrappings on top of OpenSAF. 
    +
    +The current recommendation is to not try to toggle on this bit immediately,
    +if/when there is a spontaneous SC restart. If the SC bounces back up again (the
    +normal case) then the time window for performing CCBs with only one SC should
    +be very short. Thus very little CCB processing availability is lost. Plus the
    +risk for divergence on the PBE files is minimized. 
    +
    +But for cases where an SC is down for long periods, then clearly the oneSAfe2PBE
    +state must be entered. Long duration down time for an SC is either planned or
    +unplanned. 
    +
    +If it is a planned shutdown of SC, then it should not be too much to ask that
    +the toggling *on* of the oneSafe2PBE bit is done also. Either manually as part
    +of an OPI (Operator Procedure Instruction), or wrapped into the scripts that
    +are typically invoked in that situation.
    +
    +If it is an unplanned departure of an SC and that SC does not bounce back
    +within some time limit, then clearly there is a need for repair of that SC.
    +An alarm must be generated for that case. The same trigger that must exists
    +to generate such an alarm (on any system attempting to be highly available)
    +should also be used to invoke the admin-operation to toggle *on* oneSafe2PBE.
    +The reason the alarm must be triggered for this case is that a prolonged state
    +with only one SC means a prolonged state with a single point of failure, which
    +in turn will result (statistically) in reduced availability of the cluster. 
    +
    +Thus I am reducing the priority of this enhancement to minor, since I see it
    +at best as a redundant mechanism and at worst as a misdirected emphasis on
    +CCB availability at the expense of total cluster availability.
    +
    +Note also that any reduced system, intended to permanently run with *one* SC,
    +should run with regular 1PBE (or 0PBE), never with 2PBE. 
    
    • Priority: major --> minor
     
