OpenSAF / Tickets / #1448 smf: Make campaigns less fragile by retrying on ERR_NO

Mathi Naickan - 2015-08-25

I think this ticket has to be treated as a defect because:
- This behaviour (IMM returning NO_RESOURCES) and the expectation that the IMM user treats it as try_again seem to have existed at least since the 4.5.x branch.
- Handling this error code as try_again seems to be the only practical way for the user campaign to succeed, and the scenario did succeed (tested OK!) once it was treated as try_again.

Anders Bjornerstedt - 2015-08-25

There are at least three points to make in response to the claim that this has to be a defect.

1) We have not seen this problem earlier. So obviously testing is different this time, i.e. this
is a new way of testing that was not performed when testing the earlier releases.

2) Saying that this fix is "the only way for this campaign to succeed" is not true unless you show
that the problem is not performance related. I am convinced that the root cause is very much
performance related. So the very same campaign most likely succeeds, probably has succeeded
in earlier releases, simply because the platform it was tested on had a more reasonable
load/capacity ratio.

3) I have noticed that there is lately a tendency to stress test OpenSAF more often with a higher
load/capacity ratio, at least here at Ericsson, for various reasons. Probably it is related to
the more volatile capacity of virtualized and/or "cloud" based platforms, in particular when
they are being reconfigured.

What I am basically saying is that it is always possible to increase the load/capacity ratio until you
do see a resource related problem occur in the system. It is a bit unfair to then declare that problem
a defect, particularly when the effect is benign. In this case an SMF campaign gets aborted, but
in a controlled way.

OpenSAF has no load regulation, so OpenSAF is currently vulnerable to getting stuck in resource
problems. OpenSAF does have partial overload protection in the IMM service and this is what
is getting triggered here (max outstanding fevs messages at the local IMMND, a type of flow
control).

On the other hand, if this is really a practical and real problem also for deployments on old OpenSAF
releases being used in new ways in production, i.e. there is a plan to regularly run with
overloaded capacity in production, then one could declare this a defect, even if it is a bit
"unfair".

Anders Bjornerstedt - 2015-08-25

I should also clarify that there is a distinction between (a) getting ERR_NO_RESOURCES as the direct
result of an IMM API call (in the above case a search or accessorGet used by SMF); and (b) determining
that a CCB was aborted due to a resource error and not a validation error (new API enhancement #1449).
In both cases it means that the request was rejected/aborted for resource
reasons. But the handling of retry is different.

If the user (SMF) directly gets ERR_NO_RESOURCES returned on a call then that specific call can be
retried.

But if the user (SMF) determines that a CCB has been aborted (ERR_FAILED_OPERATION) due to
a resource failure (return value false on argument 'isValidationAbort' for the new API
saImmOmCcbGetAbortReason), then a replay of the whole CCB can be attempted. But it makes no
sense here to retry the last CCB-related downcall (ccbApply or ccbValidate or ccbObjectCreate..)
since the CCB has been aborted.

This distinction should be simple because in the resource-aborted CCB case you don't
actually get SA_AIS_ERR_NO_RESOURCES as a return code.

The robustness of SMF campaigns can be improved in both respects once #1449 has been delivered.
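
To make the two cases concrete, here is a minimal Python sketch of the decision a client faces after an IMM call returns. It is not SMF or IMM code: the `AisError` enum, the `next_action` function and the action strings are hypothetical stand-ins, and `is_validation_abort` only models the flag that this comment describes for saImmOmCcbGetAbortReason.

```python
from enum import Enum


class AisError(Enum):
    # Stand-in names for the SA_AIS_* return codes discussed in this thread.
    OK = "SA_AIS_OK"
    NO_RESOURCES = "SA_AIS_ERR_NO_RESOURCES"
    FAILED_OPERATION = "SA_AIS_ERR_FAILED_OPERATION"


def next_action(rc, is_validation_abort=None):
    """Decide what the client does next after a call returned rc.

    is_validation_abort is only meaningful for FAILED_OPERATION; None means
    the abort reason is unknown (hypothetical convention for this sketch).
    """
    if rc is AisError.OK:
        return "continue"
    if rc is AisError.NO_RESOURCES:
        # Case (a): the call itself was rejected, so that one call can be retried.
        return "retry_call"
    if rc is AisError.FAILED_OPERATION and is_validation_abort is False:
        # Case (b): the CCB is already aborted for resource reasons; retrying
        # the last downcall is pointless, only a replay of the whole CCB helps.
        return "replay_ccb"
    # Validation abort, unknown abort reason, or any other error: give up.
    return "fail"
```

Note that case (b) is never reached through a NO_RESOURCES return code, which is what keeps the two branches from being confused.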

Mathi Naickan - 2015-08-25

I think it is more unfair to the end user of recent releases not to pass on the benefit of an optimization or fix for an issue just because it was uncovered/hit late! Especially when the fix does no harm and only helps the campaign succeed. Maybe in the case of this ticket there is more to help the user and nothing to harm the code path! Also, the facts that this is not a newly introduced error code, and that IMM API users have not met the expectation set by IMM to handle this as TRY_AGAIN, call for this to be a defect.

- Anders Bjornerstedt - 2015-08-26
  
  The end user of "recent releases", i.e. previous releases, has not seen this problem.
  At least no report of the problem had been created until a few weeks ago.
  It only occurs in overload situations and has only been seen in recent testing with an overloaded system.
  New ways of testing are always good.
  But testing overload on a system with no load regulation will always find the next bottleneck symptom.
  We can play that game indefinitely.
  Adding defect upon defect.
  Or we can provide some form of load-regulation mechanism for OpenSAF.
  
  It is also ironic that we need to fix this particular overload issue on old releases at the same time as we are
  ripping up existing time release plans and suddenly declaring we are going to one-track development.
  
  Personally I am increasingly frustrated by the deterioration in following the rules of the ticket system.
  Why not just drop the distinction between enhancement and defect?
  No one seems to care (or bother) about this distinction any more.
  
  The main reason for the distinction (I thought) was to provide an increased degree of stability on older
  branches.
  
  New features always mean new risk, at least in the short term, i.e. the first release occurrence of a new feature (enhancement).
  But no one seems to care about that.
  
  /AndersBj
  
  [tickets:#1448]http://sourceforge.net/p/opensaf/tickets/1448/ smf: Make campaigns less fragile by retrying on ERR_NO_RESOURCES
  
  Status: unassigned
  Milestone: future
  Created: Fri Aug 14, 2015 07:09 AM UTC by Anders Bjornerstedt
  Last Updated: Tue Aug 25, 2015 11:14 AM UTC
  Owner: nobody
  
  The SMF service is a heavy user of the IMM service.
  The IMM has an established client pattern for ERR_TRY_AGAIN which allows an application realtime
  control over how long it is prepared to wait for a transient inability of the IMM service to fulfill a request.
  Each response of TRY_AGAIN should in itself be fast, so the application needs a delay in its retry loop.
  
  There is also the very similar error code ERR_NO_RESOURCES. Logically that error code is identical
  to TRY_AGAIN in that the request could not be accepted due to no fault of the client but due to some
  more or less temporary problem in the IMM service. The difference is that NO_RESOURCES has no
  realtime ambitions. Typically this error code is used by the IMM when the IMM cannot fulfill a request
  due to reasons that are outside the control of the IMM service. Also the time from request to a response
  of ERR_NO_RESOURCES may be long.
  
  The SMF service in general has no realtime requirements. The main goal for the SMF service is to
  successfully complete correctly formulated campaigns. This means that the SMF service should be
  programmed to avoid unnecessary fragility related to temporary problems, even if the temporary problem
  could linger for seconds or minutes.
  
  The alternative of aborting the campaign will itself discard potentially large execution times already
  completed. It may sometimes even result in a system restore.
  
  This means that SMF campaigns should have a "retry loop" that handles not just TRY_AGAIN,
  but also ERR_NO_RESOURCES where this return code is relevant (can be returned according to
  the API spec). The error code ERR_BUSY also exists and is for all practical purposes identical
  to ERR_NO_RESOURCES in semantics, both logical and timing.
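
The "retry loop" asked for in the ticket description could look roughly like the Python sketch below. It is an illustration only, not the SMF implementation: `call_with_retry`, the `RETRYABLE` set, the string error names and the default budget and delay values are all assumptions made for this example.

```python
import time

# Stand-in strings for the SAF return codes; not the real C enum values.
RETRYABLE = {"ERR_TRY_AGAIN", "ERR_NO_RESOURCES", "ERR_BUSY"}


def call_with_retry(call, budget_s=60.0, delay_s=0.1,
                    sleep=time.sleep, clock=time.monotonic):
    """Invoke call() until it returns a non-retryable code or the budget is spent.

    budget_s and delay_s are illustrative defaults, not values from OpenSAF.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + budget_s
    while True:
        rc = call()
        if rc not in RETRYABLE or clock() >= deadline:
            # Success, a hard error, or a retryable error that outlived the budget.
            return rc
        sleep(delay_s)
```

The caller keeps control over the total waiting time through `budget_s`; when the budget runs out, the last retryable code is returned unchanged so the campaign can decide whether to abort.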
  

Mathi Naickan - 2015-08-25

Just as a note - previously, I had a discussion with Ingvar and he had agreed to convert this into a defect.
I can provide a patch for this for the OM API calls, except for the CCB APIs (based on the description above).
Should we also give the OI APIs this treatment?

- Anders Bjornerstedt - 2015-08-26
  
  Yes, the principle for handling ERR_NO_RESOURCES should be the same everywhere, across all SAF services.
  Just as the rules for handling TRY_AGAIN should be the same across all OpenSAF services.
  
  Any client-application is free to decide to not handle these errors, i.e. to stop trying if they get them.
  But applications can be made more robust by handling these errors.
  
  There is also ERR_BUSY which for the immsv works exactly the same way as ERR_NO_RESOURCES.
  SAF created too many error codes as I see it.
  There should only be one error code for any particular handling behavior defined as appropriate for the error.
  If two error codes are to be handled exactly the same then one of the error codes should be deprecated.
  
  /AndersBJ
  
  - Anders Bjornerstedt - 2015-08-26
    
    Just to be clear.
    The handling of TRY_AGAIN and the handling of NO_RESOURCES do differ.
    A TRY_AGAIN loop must have a delay for each iteration and iterations must be fast.
    A NO_RESOURCES client loop does not need a delay and each iteration may be indefinitely long.
    
    Nearly all applications should handle TRY_AGAIN because it is normally a temporary problem
    that is over in a few seconds. The immsv can generate TRY_AGAIN for up to 60 seconds for big syncs
    but I would argue that this should be changed so that the immsv shifts from TRY_AGAIN to NO_RESOURCES
    if the sync takes longer than say 20 seconds.
    Most crucially, from the implementation (server) side, each TRY_AGAIN iteration should be fast (milliseconds)
    because this allows the client application to have realtime control over how long it tolerates being stuck.
    
    Application threads with real-time requirements should not bother retrying NO_RESOURCES.
    Each iteration can take a longer time and the error code itself signals that the wait-time is outside the control
    of the service and could be indefinite.
    
    /AndersBj
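
The difference in handling described in this comment can be summarised as a small policy function. This is a hedged Python sketch, not OpenSAF code: `retry_policy`, its `realtime_thread` parameter and the string error names are hypothetical and exist only for illustration.

```python
def retry_policy(rc, realtime_thread):
    """Return (retry, sleep_between_attempts) for a stand-in error code name."""
    if rc == "ERR_TRY_AGAIN":
        # Replies come back fast, so the client must pace itself with a delay.
        return (True, True)
    if rc == "ERR_NO_RESOURCES":
        # Each attempt may itself take a long time, so no extra delay is needed.
        # Threads with real-time requirements should not retry at all.
        return (not realtime_thread, False)
    # Any other code is not covered by the retry rules in this thread.
    return (False, False)
```

A campaign thread without real-time requirements would call `retry_policy(rc, realtime_thread=False)` and keep retrying on both codes, sleeping only in the TRY_AGAIN case.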
    
