smfd crashes in updateImmAttr because the IMM update returns NO_RESOURCES. Here is how to reproduce:
- enable PBE, and make sure the "disable" flag is set in OpenSafSmfConfig
- execute an upgrade campaign, and let it go to "execution completed", but don't commit it
- reboot the entire cluster
- only allow 1 system controller to come up
- smfd will attempt to re-execute the campaign
- any writes to IMM (like setting an error because the campaign file can't be found) will fail with NO_RESOURCES and smfd will assert and crash
The assert and crash happen because smfd has not turned PBE off before the campaign is initialized.
I think moving restorePbe from executeWrapup to commit is the right thing to do to solve this issue. That way when the cluster is rebooted, and smfd starts up again after the reboot, the campaign will be in init state, and nothing will be executed.
I think, conceptually, it also makes sense to not write everything to PBE until commit time. Then you can use a cluster reboot like a fallback.
I cannot comment on moving restorePbe to commit for now. But an SMF rollback is what is used to undo the campaign operations. A reboot would only clear changed IMM data if PBE was off. That leaves software and CLI operations, which would cause incompatibilities.
A rollback has to be planned for in a campaign and does not handle errors. So by default SMF takes a backup before starting a campaign to be able to recover.
I completely agree with using rollback and a backup. We do this. And we have the system set up such that if a cluster reboot takes place before commit, the old version of software is used.
This ticket really addresses a corner case. We have a customer that did an upgrade, forgot to commit, and then the chassis rebooted. This caused an infinite reboot cycle. The only way we can get out of it is to remove the imm.db files.
A rollback will not work if an unexpected cluster reboot was done before PBE was enabled. SMF loses its runtime data in that case, so your patch would cause issues for rollback. The intention is to be able to test the system and then decide to proceed with rollback or commit. That means reboots, as well as rollback, are allowed once the campaign is completed.
I think the issue you are having needs a solution preferably in SMF but I'm not sure how that would look yet.
I understand the intention. It makes sense.
One of the other solutions I had considered is to put a check at the beginning of SmfCampaign::initExecution(). If the campaign state is EXECUTION_COMPLETED, then just return. What is the point of re-executing a campaign that already completed?
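Roughly, the check could look like the sketch below. This is only an illustration of the early return, not the actual smfd code: the Campaign class, the CampaignState and InitResult enums, and the initCount() helper are hypothetical stand-ins, and the real SmfCampaign::initExecution() has its own state type and return value.

```cpp
// Hypothetical stand-ins for illustration only; not the OpenSAF definitions.
enum class CampaignState { kInitial, kExecuting, kExecutionCompleted, kCommitted };

enum class InitResult { kStarted, kSkippedAlreadyCompleted };

class Campaign {
 public:
  explicit Campaign(CampaignState state) : state_(state) {}

  // Models the proposed guard at the top of initExecution().
  InitResult initExecution() {
    if (state_ == CampaignState::kExecutionCompleted) {
      // The campaign finished before the reboot. Return without
      // re-executing and without changing the stored state, so the
      // campaign can still be committed or rolled back afterwards.
      return InitResult::kSkippedAlreadyCompleted;
    }
    // Normal path (assumed): initialization proceeds and the campaign
    // moves to the executing state.
    ++init_count_;
    state_ = CampaignState::kExecuting;
    return InitResult::kStarted;
  }

  CampaignState state() const { return state_; }

  // Number of times the normal initialization path actually ran.
  int initCount() const { return init_count_; }

 private:
  CampaignState state_;
  int init_count_ = 0;
};
```

With this guard, calling initExecution() on a campaign in the completed state leaves the state untouched and does no initialization work, while a campaign in the initial state starts as before.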
Are you OK with that?
That would work. As long as it is possible to roll back the campaign, it is fine.
On 10/20/2017 03:18 PM, Alex Jones wrote:
Related
Tickets: #2648