#1448 smf: Make campaigns less fragile by retrying on ERR_NO_RESOURCES

Milestone: future
Status: unassigned
Owner: nobody
Labels: None
Type: enhancement
Component: smf
Part: d
Severity: major
Updated: 2015-08-25
Created: 2015-08-14
Private: No

The SMF service is a heavy user of the IMM service.
The IMM has an established client pattern for ERR_TRY_AGAIN, which gives an application real-time
control over how long it is prepared to wait out a transient inability of the IMM service to fulfill a request.
Each TRY_AGAIN response should in itself be fast, so the application needs a delay in its retry loop.

There is also the very similar error code ERR_NO_RESOURCES. Logically that error code is identical
to TRY_AGAIN in that the request could not be accepted through no fault of the client, but because of some
more or less temporary problem in the IMM service. The difference is that NO_RESOURCES has no
real-time ambitions. Typically the IMM uses this error code when it cannot fulfill a request
for reasons that are outside the control of the IMM service. Also, the time from request to a response
of ERR_NO_RESOURCES may be long.

The SMF service in general has no real-time requirements. The main goal for the SMF service is to
successfully complete correctly formulated campaigns. This means that the SMF service should be
programmed to avoid unnecessary fragility related to temporary problems, even if the temporary problem
could linger for seconds or minutes.

The alternative, aborting the campaign, discards work that has already been completed and that may
represent a long execution time. It may sometimes even result in a system restore.

This means that SMF campaigns should have a "retry loop" that handles not just TRY_AGAIN,
but also ERR_NO_RESOURCES where this return code is relevant (can be returned according to
the API spec). The error code ERR_BUSY also exists and is for all practical purposes identical
to ERR_NO_RESOURCES in semantics, both logical and timing.
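
The retry loop described above could look roughly like the following. This is a minimal sketch, not SMF code: the `rc_t` values are stand-ins for the corresponding `SaAisErrorT` codes, and `imm_call_fn`, `is_retryable`, `call_with_retry` and `flaky_call` are hypothetical names introduced only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Stand-in return codes for illustration only; real code would use
 * SaAisErrorT from the SAF AIS headers. */
typedef enum {
    RC_OK,
    RC_TRY_AGAIN,
    RC_NO_RESOURCES,
    RC_BUSY,
    RC_BAD_HANDLE          /* example of an error that is not retried */
} rc_t;

/* Hypothetical shape of "one IMM request", e.g. a wrapper around a
 * search or accessor-get call. */
typedef rc_t (*imm_call_fn)(void *ctx);

/* All three codes mean: not the client's fault, possibly temporary. */
bool is_retryable(rc_t rc)
{
    return rc == RC_TRY_AGAIN || rc == RC_NO_RESOURCES || rc == RC_BUSY;
}

/* Repeat the call while it reports a retryable error, pausing delay_ms
 * between attempts, for at most max_attempts attempts in total.
 * Returns the last return code seen. */
rc_t call_with_retry(imm_call_fn call, void *ctx,
                     unsigned max_attempts, long delay_ms)
{
    assert(call != NULL);
    struct timespec gap = {
        .tv_sec  = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L
    };
    rc_t rc = call(ctx);
    for (unsigned attempt = 1;
         is_retryable(rc) && attempt < max_attempts; attempt++) {
        nanosleep(&gap, NULL);
        rc = call(ctx);
    }
    return rc;
}

/* Simulated request for demonstration: fails with NO_RESOURCES as many
 * times as the counter behind ctx says, then succeeds. */
rc_t flaky_call(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return RC_NO_RESOURCES;
    }
    return RC_OK;
}
```

A caller without real-time requirements, such as a campaign step, would pick a generous `max_attempts`; the bound exists only so that a problem that never clears still ends in a controlled failure.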

Discussion

  • Mathi Naickan

    Mathi Naickan - 2015-08-25

    I think this ticket has to be treated as a defect because
    - this behaviour (IMM returning NO_RESOURCES), and the expectation that the IMM user treats it as TRY_AGAIN, seem to have existed at least since the 4.5.x branch.
    - handling this error code as TRY_AGAIN seems to be the only practical way for the user campaign to succeed, and the scenario did succeed (tested OK!) once the error was treated as TRY_AGAIN.

     
  • Anders Bjornerstedt

    There are at least three points to make in response to the claim that this has to be a defect.

    1) We have not seen this problem earlier. So obviously testing is different this time, i.e. this
    is a new way of testing that was not performed when testing the earlier releases.

    2) Saying that this fix is "the only way for this campaign to succeed" is not true unless you show
    that the problem is not performance related. I am convinced that the root cause is very much
    performance related. So the very same campaign most likely succeeds, probably has succeeded
    in earlier releases, simply because the platform it was tested on had a more reasonable
    load/capacity ratio.

    3) I have noticed that there is lately a tendency to stress test OpenSAF more often with a higher
    load/capacity ratio, at least here at Ericsson, for various reasons. Probably it is related to
    the more volatile capacity of virtualized and/or "cloud" based platforms, in particular when
    they are being reconfigured.

    What I am basically saying is that it is always possible to increase the load/capacity ratio until you
    do see a resource-related problem occur in the system. It is a bit unfair to then declare that problem
    a defect, particularly when the effect is benign. In this case an SMF campaign gets aborted, but
    in a controlled way.

    OpenSAF has no load regulation, so OpenSAF is currently vulnerable to getting stuck in resource
    problems. OpenSAF does have partial overload protection in the IMM service, and this is what
    is getting triggered here (the limit on outstanding FEVS messages at the local IMMND, a type of flow
    control).

    On the other hand, if this is really a practical problem also for deployments on old OpenSAF
    releases being used in new ways in production, i.e. there is a plan to regularly run with
    overloaded capacity in production, then one could declare this a defect, even if it is a bit
    "unfair".

     
  • Anders Bjornerstedt

    I should also clarify that there is a distinction between (a) getting ERR_NO_RESOURCES as the direct
    result of an IMM API call (in the above case a search or accessorGet used by SMF); and (b) determining
    that a CCB was aborted due to a resource error and not a validation error (new API enhancement #1449).
    In both cases it means that the request was rejected/aborted for resource
    reasons. But the handling of retry is different.

    If the user (SMF) directly gets ERR_NO_RESOURCES returned on a call then that specific call can be
    retried.

    But if the user (SMF) determines that a CCB has been aborted (ERR_FAILED_OPERATION) due to
    a resource failure (return value false on argument 'isValidationAbort' for the new API
    saImmOmCcbGetAbortReason), then a replay of the whole CCB can be attempted. But it makes no
    sense here to retry the last CCB-related downcall (ccbApply or ccbValidate or ccbObjectCreate..)
    since the CCB has been aborted.

    This distinction should be simple because in the resource-aborted CCB case you don't
    actually get SA_AIS_ERR_NO_RESOURCES as a return code.

    The robustness of SMF campaigns can be improved in both respects once #1449 has been delivered.
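
The distinction between cases (a) and (b) can be written down as a small decision function. This is only a sketch under assumptions: the `rc_t` values stand in for the `SaAisErrorT` codes, `next_action` and the `ACT_*` names are hypothetical, and the `is_validation_abort` flag is assumed to come from the abort-reason query proposed in #1449, whose exact signature is not given in this ticket.

```c
#include <stdbool.h>

/* Stand-in return codes; real code would use SaAisErrorT. */
typedef enum {
    RC_OK,
    RC_NO_RESOURCES,       /* returned directly by the call: case (a) */
    RC_FAILED_OPERATION,   /* the CCB has been aborted: case (b) */
    RC_OTHER
} rc_t;

typedef enum {
    ACT_NONE,              /* call succeeded, nothing to do */
    ACT_RETRY_CALL,        /* repeat the specific call that failed */
    ACT_REPLAY_CCB,        /* redo the whole CCB from the start */
    ACT_GIVE_UP            /* not a resource problem, do not retry */
} action_t;

/* Decide what to do after an IMM downcall.
 * is_validation_abort is only meaningful for RC_FAILED_OPERATION and is
 * assumed to be obtained from the abort-reason query of #1449. */
action_t next_action(rc_t rc, bool is_validation_abort)
{
    switch (rc) {
    case RC_OK:
        return ACT_NONE;
    case RC_NO_RESOURCES:
        /* Only this request was rejected, so retry just this call. */
        return ACT_RETRY_CALL;
    case RC_FAILED_OPERATION:
        /* The CCB is gone; retrying the last downcall makes no sense.
         * A resource abort may succeed on replay, a validation abort
         * would fail again in the same way. */
        return is_validation_abort ? ACT_GIVE_UP : ACT_REPLAY_CCB;
    default:
        return ACT_GIVE_UP;
    }
}
```

Keeping the decision separate from the loop that executes it makes it easy to see that a resource-aborted CCB never reaches the "retry the call" branch, since it reports ERR_FAILED_OPERATION and not ERR_NO_RESOURCES.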

     
  • Mathi Naickan

    Mathi Naickan - 2015-08-25

    I think it is more unfair to the end users of recent releases not to pass on the benefit of an optimization or fix for an issue just because it was uncovered/hit late, especially when the fix does no harm and only helps the campaign succeed. In the case of this ticket there is much to help the user and nothing to harm the code path. Also, the facts that this is not a newly introduced error code, and that IMM API users have not met the expectation set by IMM to handle it as TRY_AGAIN, call for this to be a defect.

     
    • Anders Bjornerstedt

      The end user of "recent releases" i.e. previous releases has not seen this problem.
      At least no report of the problem has been created until a few weeks ago.
      It only occurs in overload situations and has only been seen in recent testing with an overloaded system.
      New ways of testing are always good.
      But testing overload on a system with no load regulation will always find the next bottleneck symptom.
      We can play that game indefinitely,
      adding defect upon defect.
      Or we can provide some form of load-regulation mechanism for OpenSAF.

      It is also ironic that we need to fix this particular overload issue on old releases at the same time as we are
      ripping up existing timed release plans and suddenly declaring we are going to one-track development.

      Personally I am increasingly frustrated by the deterioration in following the rules of the ticket system.
      Why not just drop the distinction between enhancement and defect?
      No one seems to care (or bother) about this distinction any more.

      The main reason for the distinction (I thought) was to provide an increased degree of stability on older
      branches.

      New features always mean new risk, at least in the short term, i.e. in the first release where a new feature (enhancement) appears.
      But no one seems to care about that.

      /AndersBj

  • Mathi Naickan

    Mathi Naickan - 2015-08-25

    Just as a note: previously, I had a discussion with Ingvar and he agreed to convert this into a defect.
    I can provide a patch for this for the OM API calls, except for the CCB APIs (based on the description above).
    Should we also give the OI APIs this treatment?

     
    • Anders Bjornerstedt

      Yes the principle about handling ERR_NO_RESOURCES should be the same everywhere over all SAF services.
      Just as the rules for handling TRY_AGAIN should be the same over all OpenSAF services.

      Any client-application is free to decide to not handle these errors, i.e. to stop trying if they get them.
      But applications can be made more robust by handling these errors.

      There is also ERR_BUSY which for the immsv works exactly the same way as ERR_NO_RESOURCES.
      SAF created too many error codes as I see it.
      There should only be one error code for any particular handling behavior defined as appropriate for the error.
      If two error codes are to be handled exactly the same then one of the error codes should be deprecated.

      /AndersBJ

      • Anders Bjornerstedt

        Just to be clear:
        the handling of TRY_AGAIN and the handling of NO_RESOURCES do differ.
        A TRY_AGAIN loop must have a delay for each iteration, and iterations must be fast.
        A NO_RESOURCES client loop does not need a delay, and each iteration may be indefinitely long.

        Nearly all applications should handle TRY_AGAIN because it is normally a temporary problem that is
        over in a few seconds. The immsv can generate TRY_AGAIN for up to 60 seconds for big syncs,
        but I would argue that this should be changed so that the immsv shifts from TRY_AGAIN to NO_RESOURCES
        if the sync takes longer than, say, 20 seconds.
        Most crucially, from the implementation (server) side, each TRY_AGAIN iteration should be fast (milliseconds),
        because this gives client applications real-time control over how long they tolerate being stuck.

        Application threads with real-time requirements should not bother retrying NO_RESOURCES.
        Each iteration can take a longer time, and the error code itself signals that the wait time is outside the control
        of the service and could be indefinite.

        /AndersBj
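
The difference between the two loops can be sketched as one wrapper driven by a per-caller policy. Everything named here is an assumption made for illustration (`retry_policy_t`, `call_with_policy`, `scripted_call`, the stand-in `rc_t` values); the iteration bounds are also an addition, there only so the sketch always terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Stand-in return codes; real code would use SaAisErrorT. */
typedef enum { RC_OK, RC_TRY_AGAIN, RC_NO_RESOURCES, RC_OTHER } rc_t;

typedef rc_t (*imm_call_fn)(void *ctx);

typedef struct {
    long     try_again_delay_ms;  /* pause between TRY_AGAIN iterations */
    unsigned max_try_again;       /* bound on TRY_AGAIN retries */
    bool     retry_no_resources;  /* false for threads with real-time needs */
    unsigned max_no_resources;    /* bound on NO_RESOURCES retries */
} retry_policy_t;

rc_t call_with_policy(imm_call_fn call, void *ctx, const retry_policy_t *p)
{
    assert(call != NULL && p != NULL);
    struct timespec gap = {
        .tv_sec  = p->try_again_delay_ms / 1000,
        .tv_nsec = (p->try_again_delay_ms % 1000) * 1000000L
    };
    unsigned n_again = 0, n_nores = 0;

    for (;;) {
        rc_t rc = call(ctx);
        if (rc == RC_TRY_AGAIN && n_again < p->max_try_again) {
            /* The reply was fast, so the client must add the delay. */
            n_again++;
            nanosleep(&gap, NULL);
        } else if (rc == RC_NO_RESOURCES && p->retry_no_resources &&
                   n_nores < p->max_no_resources) {
            /* No extra delay: the call itself may already have taken long. */
            n_nores++;
        } else {
            return rc;  /* success, non-retryable error, or budget used up */
        }
    }
}

/* Simulated request: replays a fixed script of return codes, then RC_OK. */
typedef struct { const rc_t *script; unsigned len, pos; } script_t;

rc_t scripted_call(void *ctx)
{
    script_t *s = ctx;
    return s->pos < s->len ? s->script[s->pos++] : RC_OK;
}
```

A campaign thread would set `retry_no_resources` to true with a large bound, while a thread with real-time requirements would leave it false and rely on the TRY_AGAIN delay and bound to cap its total waiting time.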
