
#2648 smf: smfd crashes after cluster reboot when campaign is in ExecutionCompleted

future
assigned
None
defect
smf
d
4.7.2
major
False
2017-11-07
2017-10-19
Alex Jones
No

smfd crashes in updateImmAttr because the IMM update returns NO_RESOURCES. To reproduce:

  1. enable PBE, and make sure the "disable" flag is set in OpenSafSmfConfig
  2. execute an upgrade campaign, and let it go to "execution completed", but don't commit it
  3. reboot the entire cluster
  4. only allow 1 system controller to come up
  5. smfd will attempt to re-execute the campaign
  6. any writes to IMM (like setting an error because the campaign file can't be found) will fail with NO_RESOURCES and smfd will assert and crash

smfd asserts and crashes because it has not turned off PBE before the campaign is initialized.

Related

Tickets: #2648

Discussion

  • Alex Jones

    Alex Jones - 2017-10-19

    I think moving restorePbe from executeWrapup to commit is the right thing to do to solve this issue. That way when the cluster is rebooted, and smfd starts up again after the reboot, the campaign will be in init state, and nothing will be executed.

    I think, conceptually, it also makes sense to not write everything to PBE until commit time. Then you can use a cluster reboot like a fallback.

     
    • Rafael Odzakow

      Rafael Odzakow - 2017-10-19

      I cannot comment on moving restorePbe to commit for now. But an SMF rollback is what is used to undo the campaign operations. A reboot would only clear changed IMM data if PBE was off. That leaves software and CLI operations, which would cause incompatibilities.

      A rollback has to be planned for in a campaign and does not handle errors. So by default SMF takes a backup before starting a campaign to be able to recover.

       
  • Alex Jones

    Alex Jones - 2017-10-19
    • status: accepted --> review
     
  • Alex Jones

    Alex Jones - 2017-10-20

    I completely agree with using rollback and a backup. We do this. And we have the system set up so that if a cluster reboot takes place before commit, the old version of software is used.

    This ticket really addresses a corner case. We have a customer that did an upgrade, forgot to commit, and then the chassis rebooted. This caused an infinite reboot cycle. The only way we can get out of it is to remove the imm.db files.

     
  • Rafael Odzakow

    Rafael Odzakow - 2017-10-20

    A rollback will not work if an unexpected cluster reboot was done before PBE was enabled. SMF loses its runtime data in that case, so your patch would cause issues for rollback. The intention is to be able to test the system and then decide to proceed with rollback or commit. That means reboots are allowed once the campaign is completed together with rollback.

    I think the issue you are having needs a solution, preferably in SMF, but I'm not sure yet what that would look like.

     
  • Alex Jones

    Alex Jones - 2017-10-20

    I understand the intention. It makes sense.

    One of the other solutions I had considered is to put a check at the beginning of SmfCampaign::initExecution(). If the campaign state is EXECUTION_COMPLETED, then just return. What is the point of re-executing a campaign that already completed?

    Are you OK with that?
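
The guard proposed above can be sketched roughly as follows. This is a minimal, self-contained model and not the actual smfd code: `CampaignSketch`, `CampaignState`, `canRollback()` and `demoRestart()` are hypothetical names introduced for illustration, and the real `SmfCampaign::initExecution()` has a different signature and does considerably more work.

```cpp
#include <cassert>

// Hypothetical stand-in for the campaign states mentioned in the ticket.
enum class CampaignState { Initial, Executing, ExecutionCompleted, Committed };

// Hypothetical, simplified model of a campaign object (not the real SmfCampaign).
class CampaignSketch {
 public:
  explicit CampaignSketch(CampaignState state) : state_(state) {}

  // Returns true if execution was initialized, false if it was skipped.
  bool initExecution() {
    // Proposed guard: a campaign that already reached ExecutionCompleted
    // is left alone, so no IMM writes are attempted on restart.
    if (state_ == CampaignState::ExecutionCompleted) {
      return false;
    }
    state_ = CampaignState::Executing;
    return true;
  }

  // Assumption for this sketch: rollback is offered while the campaign
  // is completed but not yet committed.
  bool canRollback() const {
    return state_ == CampaignState::ExecutionCompleted;
  }

  CampaignState state() const { return state_; }

 private:
  CampaignState state_;
};

// Usage: simulate smfd restarting with a completed, uncommitted campaign.
inline void demoRestart() {
  CampaignSketch campaign(CampaignState::ExecutionCompleted);
  const bool started = campaign.initExecution();
  (void)started;                   // avoid an unused warning under NDEBUG
  assert(!started);                // nothing is re-executed
  assert(campaign.canRollback());  // state is untouched, rollback stays possible
}
```

In this model the early return leaves the state unchanged, so the campaign can still be committed or rolled back after the reboot.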

     
    • Rafael Odzakow

      Rafael Odzakow - 2017-10-25

      That would work. As long as it is still possible to roll back the campaign, it is fine.


      [tickets:#2648] https://sourceforge.net/p/opensaf/tickets/2648/
      smf: smfd crashes after cluster reboot when campaign is in
      ExecutionCompleted

      Status: review
      Milestone: 5.17.10
      Created: Thu Oct 19, 2017 06:45 PM UTC by Alex Jones
      Last Updated: Fri Oct 20, 2017 10:04 AM UTC
      Owner: Alex Jones


  • Anders Widell

    Anders Widell - 2017-11-03
    • Milestone: 5.17.11 --> 5.18.01
     
  • Alex Jones

    Alex Jones - 2017-11-07
    • status: review --> assigned
    • Milestone: 5.18.01 --> future
     
