#94 Increase sigChldWaitTime

1.4.8
fixed
Dave Heller
None
None
Function
2014-03-27
2014-01-05
Dave Heller
No

The value I have named "sigChldWaitTime" is the max time, in seconds, that stopBroker() will wait to receive a SIGHCLD from an adapter or provider process during SFCB shutdown. I am changing the default value from 1s to 3s.

Shutdown works roughly like this: stopBroker() sends SIGUSR1s to all adapter and provider processes, which serve as a sort of shutdown signal that each process must handle and then shut down. As each process dies it sends a SIGCHLD back to it's parent, sfcBroker (a.k.a Main), which handles the SIGCHLD via it's handleSigChld() handler. It is handleSigChld() that does the "accounting" of which processes are shutdown and which are still running.

As handleSigChld() processes the incoming SIGCHLDs it sends a pthread_cond_signal() back to stopBroker(), which is waiting on a corresponding pthread_cond_timedwait(). The max amount of time that stopBroker() will wait is sigChldWaitTime.

sigChldWaitTime has only an indirect effect on the overall execution time of stopBroker(), i.e. the total shutdown time. This is because, as stopBroker() issues a SIGUSR1 to a process, it does not wait for the SIGHCLD from that specific process, but rather a SIGCHLD from any child process. Overall, the number of SIGHCLDs received must account for all outstanding adapter and provider processes. But the order in which they die is somewhat indeterminant, and stopBroker() will keep moving (keep looping, that is) as long as processes are continuing to die.

That means that sigChldWaitTime, on the high side, has a marked effect only when a adapter or provider process is slow to die, and only when that is the last (or only) adapter or provider process left running the respective group. This is observable when the system is under load (i.e. requests outstanding) at the time of shutdown. In these cases you can see a visible effect of a higher sigChldWaitTime.

On the low side, I have observed what I consider as an undesirable side-effect of a sigChldWaitTime too short. That is: stopBroker() will not allow sufficient time for the SIGCHLD to be received, causing it to loop too fast and send multiple SIGUSR1 to the same pid in rapid succession, in cases where it's really not necessary and the process would have died after the first SIGUSR1 if just given enough time.

Here is an example of what this looks like at shutdown. In this case the provider was under a somewhat heavy load but was otherwise working normally (i.e. not hung). This is fairly easy to produce with SFCB currently, with the sigChldWaitTime value of 1s:

--- Stopping adapters
--- localConnectServer ended
--- All adapters stopped.
--- Stopping providers
--- stopped OSBase_OperatingSystemProvider 8250
--- stopped OSBase_OperatingSystemProvider 8250
--- stopped OSBase_OperatingSystemProvider 8250
--- stopped OSBase_OperatingSystemProvider 8250
--- stopped OSBase_OperatingSystemProvider 8250
--- stopped OSBase_OperatingSystemProvider 8250
--- stopped TestInstanceProvider 8184
--- stopped OSBase_MetricRegisteredProfileProvider 8082
--- stopped IndCIMXMLHandler 8015
--- stopped ServerProvider 7998
--- stopped InternalProvider 7995
--- stopped ProfileProvider 7991
--- stopped InteropProvider 7989
--- stopped ClassProvider 7987
--- Providers stopped

I have seen this behavior in the past and experimented w/ sleeps at various places in stopBroker(), handleSigChld(), stopNextProc(), etc. But I now see that sigChldWaitTime has the greatest impact on the behavior here.

In most cases this has no adverse effect beyond sending a few unnecessary SIGUSR1s. But although it happens rarely, I have observed cases where this "rapid fire" behavior results in a provider crash. I believe this is due to signals being received too quickly by the provider processes. I have attached a separate file that shows some examples.

These particular crashes happen at process cleanup, when the process is in stopProc(). It's possible there are some bugs in stopProc() that account for these crashes. But rather than continue to debug crashes that only occur at shutdown -- the fixes to which may involve timing and thread locking -- I think changing the value of sigChldWaitTime will eliminate, or at least greatly reduce, the incidence of these crashes.

It is interesting that the original value of sigChldWaitTime, as coded in the "1176879 - Need safe shutdown" patch was 5s. It was changed to 1s with sourceforge 3497096, which actually was to address a similar sort of crash. I'm not sure what was the rationale for the change; perhaps the programmer felt the longer sigChldWaitTime was simply unnecessary. Regardless, it is my humble opinion this is not the case, and a slightly longer value is prudent.

I chose the value of 3s since it was more than sufficient to eliminate the rapid-fire behavior (indeed, a value of 2s seemed sufficient) but not long enough to have an noticeable effect on overall shutdown, even under the conditions I described. And 3s seems like a reasonable amount of time to wait for an adapter or provider process to die. If it does not die within that period, the process is likely hung waiting on a req handler process, and at that point will not terminate until the req handler times out.

I considered making this a configurable value but decided against it since this is a lot of detail to expose to the average user, and I'm fairly confident the new value will be good in all environments. If anyone finds otherwise please report this to sblim-devel@lists.sourceforge.net, as I would be interested to know.

Related

News: 2014/03/new-release-sfcb-148

Discussion

  • Dave Heller
    Dave Heller
    2014-01-05

    Commit [12d4de] for v1.4

     

    Related

    Commit: [12d4de]

  • Dave Heller
    Dave Heller
    2014-01-05

    • status: open --> pending
     
  • Dave Heller
    Dave Heller
    2014-02-07

    • Release: backlog --> 1.4.8
     
  • Dave Heller
    Dave Heller
    2014-03-27

    • Status: pending --> fixed