
#2499 SMF: 20 seconds timeout in getting node destination is not enough

5.17.07
fixed
None
defect
smf
d
major
False
2017-07-27
2017-06-16
Tai Dinh
No

We currently use a hard-coded timeout (20 seconds) when getting the node destination.
This is sometimes not enough, especially in the cluster reboot procedure. The controller may come up first and continue the campaign without waiting for the rest of the nodes to be up.
This can make getNodeDestination() fail, especially in a large cluster.
In our case, it needs 3 more seconds.

I guess this timeout needs to be increased or made configurable.
Reusing an existing attribute for this purpose is also fine, e.g. smfRebootTimeout.

/Tai

Related

Wiki: ChangeLog-5.17.07

Discussion

  • Rafael Odzakow

    Rafael Odzakow - 2017-06-19

    waitForNodeDestination already uses smfRebootTimeout. Is it still timing out or was getNodeDestination called without the waitFor wrapper?

     
  • Tai Dinh

    Tai Dinh - 2017-06-20

    Hi Rafael,

    It's under SmfCliCommandAction::execute() => getNodeDestination(n, &nodeDest, NULL, -1).
    -1 was passed as maxWaitTime which means 20 seconds timeout will be used.
    For a rolling upgrade procedure this should be OK, since we already wait for the node, but for a cluster reboot procedure no similar wait happens.

    /Tai
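
The sentinel behaviour described above can be sketched as follows. This is only an illustration, not the SMF source: the names `kDefaultWait` and `effectiveWait` are hypothetical, and the only facts taken from the ticket are that -1 is passed as maxWaitTime and that this selects the 20-second default.

```cpp
#include <chrono>

// Hypothetical constant: the hard-coded default mentioned in the ticket.
constexpr std::chrono::seconds kDefaultWait{20};

// Hypothetical helper: map the caller's maxWaitTime to the timeout actually
// used. -1 means "no explicit value", so the 20-second default applies.
constexpr std::chrono::seconds effectiveWait(int maxWaitTime) {
  return maxWaitTime == -1 ? kDefaultWait : std::chrono::seconds(maxWaitTime);
}
```

Under this reading, a caller that passes -1 cannot benefit from a longer configured timeout, which matches the failure reported here.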

     
    • Rafael Odzakow

      Rafael Odzakow - 2017-06-20

      If you have the logs please send them my way.

       
    • Rafael Odzakow

      Rafael Odzakow - 2017-06-20

      It should be enough to wrap getNodeDestination in waitForNodeDestination in SmfCliCommandAction::execute(). The other getNodeDestination calls either do not need to wait for nodes or have their own retry code.

       
    • Rafael Odzakow

      Rafael Odzakow - 2017-06-20

      Going for a short vacation, here is the untested patch. Use rebootTimeout to increase the timeout for it.

      commit 2ffbd1c5cd3f4193fd631130eef60b17c92892e6 (HEAD -> ticket-2499)
      Author: Rafael Odzakow rafael.odzakow@ericsson.com
      Date: Tue Jun 20 16:10:12 2017 +0200

      smf: 20 seconds timeout in getting node destination is not enough [#2499]
      

      diff --git a/src/smf/smfd/SmfUpgradeStep.cc b/src/smf/smfd/SmfUpgradeStep.cc
      index 2ffeab110..a99c7661a 100644
      --- a/src/smf/smfd/SmfUpgradeStep.cc
      +++ b/src/smf/smfd/SmfUpgradeStep.cc
      @@ -1966,7 +1966,7 @@ bool SmfUpgradeStep::callActivationCmd() {
      TRACE("Get node destination for %s", getSwNode().c_str());
      uint32_t rc;

      - if (!getNodeDestination(getSwNode(), &nodeDest, NULL, -1)) {
      + if (!waitForNodeDestination(getSwNode(), &nodeDest)) {
        LOG_NO("no node destination found for node %s", getSwNode().c_str());
        result = false;
        goto done;
       
  • Tai Dinh

    Tai Dinh - 2017-06-21

    Thanks Rafael,

    This is what I expected.

    /Tai

     
  • Rafael Odzakow

    Rafael Odzakow - 2017-06-28

    As far as I can see, this issue is a bug. In other campaign sequences SMF waits up to rebootTimeout before doing any operation after a reboot. In this campaign sequence the first operation after the reboot was a CLI command on a payload node. It timed out because the CLI command is not wrapped in a retry that uses SMF's rebootTimeout.

    SMF does not keep track of all nodes after a cluster reboot. The mechanism for handling a cluster reboot is therefore to wrap every operation that may be executed after the reboot in a retry loop.
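
A minimal sketch of such a retry loop is shown below. It is not the actual waitForNodeDestination implementation: `LookupFn` and `waitForLookup` are hypothetical names, the lookup is a stand-in for a single getNodeDestination attempt, and the polling interval is an assumed parameter. The caller would pass the reboot timeout as `timeout`.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for one lookup attempt; returns true once the node
// destination is available. Not the real getNodeDestination() signature.
using LookupFn = std::function<bool()>;

// Retry `lookup` until it succeeds or `timeout` has elapsed.
// `interval` is the pause between attempts (an assumed parameter).
bool waitForLookup(const LookupFn& lookup,
                   std::chrono::milliseconds timeout,
                   std::chrono::milliseconds interval) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (;;) {
    if (lookup()) return true;
    // Give up if the next pause would take us past the deadline.
    if (std::chrono::steady_clock::now() + interval > deadline) return false;
    std::this_thread::sleep_for(interval);
  }
}
```

The loop always makes at least one attempt, so an operation on a node that is already up returns immediately, while a node that is still booting gets the full timeout.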

     

    Last edit: Rafael Odzakow 2017-06-29
  • Rafael Odzakow

    Rafael Odzakow - 2017-06-30
    • status: unassigned --> fixed
    • assigned_to: Rafael Odzakow
     
  • Rafael Odzakow

    Rafael Odzakow - 2017-06-30

    fixed in
    commit 3e1d1091270fa83cb8efe5458d6050b56f41f001
    Author: Rafael Odzakow rafael.odzakow@ericsson.com
    Date: Fri Jun 30 10:57:36 2017 +0200

    smf: 20 seconds timeout in getting node destination is not enough [#2499]
    
     
