We currently use a hard-coded timeout (20 seconds) when getting the node destination.
This is sometimes not enough, especially in the cluster reboot procedure. A controller may come up first and continue the campaign without waiting for the rest of the nodes to be up.
This can make getNodeDestination() fail, especially on a large cluster.
In our case, it needs 3 more seconds.
I think this timeout needs to be increased or made configurable.
Reusing an existing attribute for this purpose is also fine, e.g. smfRebootTimeout.
/Tai
waitForNodeDestination already uses smfRebootTimeout. Is it still timing out or was getNodeDestination called without the waitFor wrapper?
Hi Rafael,
It's under SmfCliCommandAction::execute() => getNodeDestination(n, &nodeDest, NULL, -1).
-1 is passed as maxWaitTime, which means the 20-second timeout is used.
For a rolling upgrade procedure this should be OK, since we already wait for the node, but for a cluster reboot procedure nothing similar happens.
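A rough illustration of the behaviour described above, not the actual SMF code. The names kDefaultTimeoutSec, resolveTimeoutSec and nodeFoundInTime are hypothetical, and the sketch assumes that maxWaitTime is given in seconds and that any negative value selects the default:

```cpp
// Hypothetical constant mirroring the 20-second default mentioned in this ticket.
constexpr int kDefaultTimeoutSec = 20;

// Hypothetical helper: a negative maxWaitTime (e.g. -1) falls back to the default.
int resolveTimeoutSec(int maxWaitTimeSec) {
  return maxWaitTimeSec < 0 ? kDefaultTimeoutSec : maxWaitTimeSec;
}

// Hypothetical model: the lookup succeeds only if the node is up
// before the effective timeout expires.
bool nodeFoundInTime(int nodeUpAfterSec, int maxWaitTimeSec) {
  return nodeUpAfterSec <= resolveTimeoutSec(maxWaitTimeSec);
}
```

With these assumptions, a node that needs 23 seconds is missed when -1 is passed, but found when a larger value (for example a reboot timeout of 600 seconds, chosen only for illustration) is passed instead.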
/Tai
If you have the logs please send them my way.
It should be enough to wrap getNodeDestination in waitForNodeDestination in SmfCliCommandAction::execute(). The other getNodeDestination calls either do not need to wait for nodes or have their own retry code.
I'm going on a short vacation, so here is an untested patch. Use rebootTimeout to increase the timeout for it.
commit 2ffbd1c5cd3f4193fd631130eef60b17c92892e6 (HEAD -> ticket-2499)
Author: Rafael Odzakow <rafael.odzakow@ericsson.com>
Date: Tue Jun 20 16:10:12 2017 +0200
diff --git a/src/smf/smfd/SmfUpgradeStep.cc b/src/smf/smfd/SmfUpgradeStep.cc
index 2ffeab110..a99c7661a 100644
--- a/src/smf/smfd/SmfUpgradeStep.cc
+++ b/src/smf/smfd/SmfUpgradeStep.cc
@@ -1966,7 +1966,7 @@ bool SmfUpgradeStep::callActivationCmd() {
TRACE("Get node destination for %s", getSwNode().c_str());
uint32_t rc;
LOG_NO("no node destination found for node %s", getSwNode().c_str());
result = false;
goto done;
Thanks Rafael,
This is what I expected.
/Tai
As far as I can see, this issue is a bug. In other campaign sequences, SMF waits up to rebootTimeout before doing any operation after a reboot. In this campaign sequence, the first operation after the reboot was a CLI command on a payload node. It timed out because the CLI command is not wrapped in a retry that uses SMF's rebootTimeout.
SMF does not keep track of all nodes after a cluster reboot. Therefore, the mechanism for handling a cluster reboot is to wrap every operation that may be executed after the cluster reboot in a retry loop.
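The retry mechanism described above could look roughly like the following. This is a generic sketch, not the SMF implementation: retryUntil is a hypothetical helper, and the timeout and polling interval are parameters the caller would derive from rebootTimeout.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `op` repeatedly until it returns true or
// `timeout` has elapsed, sleeping `interval` between attempts.
// `op` is always tried at least once.
bool retryUntil(const std::function<bool()>& op,
                std::chrono::milliseconds timeout,
                std::chrono::milliseconds interval) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (;;) {
    if (op()) return true;  // operation succeeded, e.g. node destination found
    if (std::chrono::steady_clock::now() >= deadline) return false;  // gave up
    std::this_thread::sleep_for(interval);
  }
}
```

A caller would pass the operation as a lambda, for example one that performs the node destination lookup and returns whether it succeeded, together with a timeout taken from the configured reboot timeout.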
Last edit: Rafael Odzakow 2017-06-29
fixed in
commit 3e1d1091270fa83cb8efe5458d6050b56f41f001
Author: Rafael Odzakow <rafael.odzakow@ericsson.com>
Date: Fri Jun 30 10:57:36 2017 +0200