smf: update PR with information about faster upgrade [#2017]
That would work. As long as it is possible to roll back the campaign, it is fine.
On 10/20/2017 03:18 PM, Alex Jones wrote:
> I understand the intention. It makes sense. One of the other solutions I had considered is to put a check at the beginning of SmfCampaign::initExecution(). If the campaign state is EXECUTION_COMPLETED, then just return. What is the point of re-executing a campaign that already completed? Are you OK with that?
[tickets:#2648] https://sourceforge.net/p/opensaf/tickets/2648/ smf:...
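The early-return check proposed in the quoted message could look roughly like the sketch below. It is only an illustration: `CampaignSketch`, `CampaignState` and `initCount()` are hypothetical stand-ins, not the real SMF classes, and rollback handling is not modelled at all.

```cpp
// Hypothetical, simplified stand-ins; these are NOT the real OpenSAF SMF types.
enum class CampaignState { kInitial, kExecuting, kExecutionCompleted };

class CampaignSketch {
 public:
  explicit CampaignSketch(CampaignState state) : state_(state) {}

  // Returns true if execution was initialised, false if the call was skipped
  // because the campaign had already completed.
  bool initExecution() {
    if (state_ == CampaignState::kExecutionCompleted) {
      return false;  // Nothing to re-execute; leave the state untouched.
    }
    state_ = CampaignState::kExecuting;
    ++init_count_;
    return true;
  }

  CampaignState state() const { return state_; }
  int initCount() const { return init_count_; }

 private:
  CampaignState state_;
  int init_count_ = 0;  // Counts how many times initialisation actually ran.
};
```

Because the guard leaves the completed state untouched, a later rollback or commit decision still sees EXECUTION_COMPLETED, which is the condition asked for in the reply.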
A rollback will not work if an unexpected cluster reboot was done before PBE was enabled. SMF loses its runtime data in that case, so your patch would cause issues for rollback. The intention is to be able to test the system and then decide whether to proceed with rollback or commit. That means reboots are allowed once the campaign is completed together with rollback. I think the issue you are having needs a solution, preferably in SMF, but I'm not sure yet how that would look.
I cannot comment on moving restorePbe to commit for now. But an SMF rollback is what is used to undo the campaign operations. A reboot would only clear changed IMM data if PBE was off. That leaves software and CLI operations, which would cause incompatibilities. A rollback has to be planned for in a campaign and does not handle errors. So by default SMF takes a backup before starting a campaign to be able to recover.
smf: refactor smfd folders [#2633]
smf: refactor smfd folders
smf: refactor smfd directory structure
smf: Node by node upgrade
base: double start failed
base: double start failed [#2622]
The issue was found on Ubuntu 14.04, where the subsys folder is not created by default. Move the pid removal so that it is called after pidofproc.
base: double start failed [#2622]
smf: execLevel for balanced upgrade
base: double start failed [#2622]
base: double start failed
[base] double start failed
double start failed
smf: execLevel for balanced upgrades [#2555]
smf: remove cascading delete for runtime objects
smf: try to wait for opensafd status before executing reboot
commit f3ef8eebf44f0eab4dcc65f83fe3119a77ef5067 (HEAD -> develop, origin/develop)
Author: Rafael Odzakow rafael.odzakow@ericsson.com
Date: Mon Sep 25 13:52:03 2017 +0200
smf: try to wait for opensafd status before reboot [#2464]
smf: try to wait for opensafd status before reboot [#2464]
smf: try to wait for opensafd status before executing reboot
Seen again. Protecting with mutexes and try-again loops in the opensafd nid script does not help when a node reboot is triggered, because other services will be shutting down and causing unexpected errors.
smf: try to wait for opensafd status before reboot [#2464]
smf: execLevel for balanced upgrade [#2555]
smf: execLevel for balanced upgrade
smf: coredump and syslog flood after immnd crash
Setting it to minor until it shows up again.
smf: coredump and syslog flood after immnd crash
smf: try to wait for opensafd status before executing reboot
solved in base opensaf commit a051496719a3c862594af17d88b082031dd53b33 (ticket-2459)
nid: order of system log print out is not correct [#2541]
nid: order of system log print out is not correct
smf: no node locking when procedures are empty
For rolling upgrades only. Commit 653edb5d9b217f1a3280b5aed8597fb53ffa5f61
smf: no node locking when procedures are empty [#2521]
smf: no node locking when procedures are empty
smf: no node locking when procedures are empty [#2521]
smf: remove node locking with empty procedures
Fixed in commit 3e1d1091270fa83cb8efe5458d6050b56f41f001
Author: Rafael Odzakow rafael.odzakow@ericsson.com
Date: Fri Jun 30 10:57:36 2017 +0200
smf: 20 seconds timeout in getting node destination is not enough [#2499]
SMF: 20 seconds timeout in getting node destination is not enough
smf: 20 seconds timeout in getting node destination is not enough [#2499]
For a node that is not allowed to join the CLM cluster, will this solution also block IMM (and other services) from starting up?
As far as I could see, this issue is a bug. In other campaign sequences SMF will wait with rebootTimeout before doing any operation after a reboot. In this campaign sequence the first operation after a reboot was to run a CLI command on a payload node. This timed out because the CLI command is not wrapped in a retry using the rebootTimeout of SMF. SMF does not keep track of all nodes after a cluster reboot; therefore the mechanism for handling a cluster reboot is to wrap all possible operations that...
smf: 20 seconds timeout in getting node destination is not enough [#2499]
As far as I could see, this issue is a bug. In other campaign sequences SMF will wait with rebootTimeout before doing any operation after a reboot. In this campaign sequence the first operation after a reboot was to run a CLI command on a payload node. This timed out because the CLI command is not wrapped in a retry using the rebootTimeout of SMF. SMF does not keep track of all nodes after a cluster reboot; therefore the mechanism for handling a cluster reboot is to wrap any operations that are done...
try-again for opensafd stop
commit a051496719a3c862594af17d88b082031dd53b33 (ticket-2459)
base: Try again for opensafd stop [#2459]
Internally opensafd creates a mutex during start/stop to avoid parallel execution. Make the mutex more robust and add a short retry if the mutex is taken.
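The "short retry if the mutex is taken" idea can be sketched as a bounded retry loop. This is a minimal C++ illustration under stated assumptions, not the opensafd script itself: `acquireWithRetrySketch`, its parameters and the `try_acquire` callback are hypothetical names, and the callback stands in for whatever actually takes the lock.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls try_acquire up to max_attempts times, sleeping
// `delay` between failed attempts. Returns true as soon as the lock is taken,
// false if every attempt found it busy.
bool acquireWithRetrySketch(const std::function<bool()>& try_acquire,
                            int max_attempts,
                            std::chrono::milliseconds delay) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (try_acquire()) {
      return true;
    }
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(delay);  // Short pause before the next try.
    }
  }
  return false;
}
```

Keeping the number of attempts small means a concurrent start or stop gets a brief chance to finish, while a genuinely stuck lock is still reported as a failure instead of blocking forever.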
base: Try again for opensafd stop [#2459]
Going for a short vacation; here is the untested patch. Use rebootTimeout to increase the timeout for it.
commit 2ffbd1c5cd3f4193fd631130eef60b17c92892e6 (HEAD -> ticket-2499)
Author: Rafael Odzakow rafael.odzakow@ericsson.com
Date: Tue Jun 20 16:10:12 2017 +0200
smf: 20 seconds timeout in getting node destination is not enough [#2499]
diff --git a/src/smf/smfd/SmfUpgradeStep.cc b/src/smf/smfd/SmfUpgradeStep.cc
index 2ffeab110..a99c7661a 100644
--- a/src/smf/smfd/SmfUpgradeStep.cc
+++ b/src/smf/smfd/SmfUpgradeStep.cc...
It should be enough to wrap getNodeDestination in waitForGetNodeDestination in SmfCliCommandAction::execute(). The other getNodeDestination calls either do not need to wait for nodes or already have custom retry code.
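The wrapping described here amounts to polling the lookup until it succeeds or a reboot timeout expires. The sketch below shows only that pattern; `waitForDestinationSketch`, `lookup` and both duration parameters are hypothetical, and the real SMF functions have their own signatures that are not reproduced here.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical wait wrapper: repeats `lookup` (standing in for a node
// destination query) until it succeeds or `timeout` has elapsed.
// The lookup is always tried at least once.
bool waitForDestinationSketch(const std::function<bool()>& lookup,
                              std::chrono::milliseconds timeout,
                              std::chrono::milliseconds poll_interval) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (lookup()) {
      return true;  // The node answered; the caller can run its command.
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // The node did not come up within the timeout.
    }
    std::this_thread::sleep_for(poll_interval);
  }
}
```

A CLI-command action would call such a wrapper with the configured reboot timeout before sending the command, so that a node still starting after a cluster reboot is waited for rather than reported as unreachable.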
If you have the logs please send them my way.
waitForNodeDestination already uses smfRebootTimeout. Is it still timing out or was getNodeDestination called without the waitFor wrapper?
try-again for opensafd stop
base: Try again for opensafd stop [#2459]
smf: try to wait for opensafd status before executing reboot
smf: try to wait for opensafd status before executing reboot [#2464]
smf: try to wait for opensafd status before executing reboot
base: Improve state report for opensafd [#2459]
improve state report for opensafd
base: Improve state report for opensafd [#2459]
improve state report for opensafd
graceful shutdown of opensafd
smf: coredump and syslog flood after immnd crash
pushed to develop with commit f9149b49420d989b6ffcaf0f3553c5452e7e2302
smf: One step upgrade with cluster reboot does not wait for nodes to start
smf: cli-command does not wait for nodes to start [#1969]
I consider the AMF objects as an interface and some external code outside of OpenSAF might be reading that campaignDN attribute.
base: "hardening" use of lockfile in opensafd
I have seen an issue with the lockfile. Here are some parts from the system log:
21:59:15 SC-1 opensafd: Starting OpenSAF Services(5.2.0 - 8767:c1cc2a915e72:default) (Using TCP)
Reboot command is issued from SC-2:
21:59:16 SC-2 osafsmfd[599]: NO STEP: Reboot node for removal safAmfNode=SC-1,safAmfCluster=myAmfCluster
SC-1 is not finished with the start of opensaf. This line is missing from SC-1: opensafd: OpenSAF services successfully started
21:59:17 SC-1 opensafd: Stopping OpenSAF Services
21:59:17...
It is possible to do it both ways, but I prefer to do this in AMF because it appears that the campaign DN was set on the objects before #2144 and #2145 were introduced. It was set by SMF and most likely the attribute was never used, but I can't say for sure. The safe solution is to keep setting it just as it has been previously. As for turning this on/off during a campaign: if someone external decides to change things in IMM during an upgrade, then we cannot guarantee that the campaign will be successful....
smf: cli-command does not wait for nodes to sta...
Suggestion is to disable setting the maintenance campaign attribute on the AMF object...
Hi, valid question. In the case that we looked at, the component recovered automatically...
smf: One step upgrade with cluster reboot does not wait for nodes to start
smf: when fixing ticket #2145 a NBC problem was introduced
Hi, valid question. In the case that we looked at, the component recovered automatically...
smf: admin owner err_exist on parallel procedur...
smf: admin owner err_exist on parallel procedures
Pushed to default branch: HG changeset patch User Rafael Odzakow rafael.odzakow@ericsson.com...
smf: admin owner err_exist on parallel procedur...
campaign example: <softwareBundle name="safSmfBundle=BundleA"> <removal> <offline...