Menu

#91 sfcbd fails to shut down if provider crashed with sigsegv

1.4.8
wont-fix
None
None
Function
2014-03-27
2013-11-15
No

If a provider crashes with a segmentation violation (due to a NULL pointer access), sfcbd fails to shut down properly.

Attached patch fixed this.

1 Attachments

Discussion

  • Dave Heller

    Dave Heller - 2013-11-18

    Courageous of you to enter the quagmire of SFCB shutdown! The line you identify has been commented out for quite some time. In fact I think it dates back to the the original implementation of the feature:

    http://sourceforge.net/tracker/?func=detail&group_id=128809&atid=712784&aid=1176879

    I think the idea is, the accounting that takes place within handleSigChld() is supposed to accurately determine when the providersStopped flag gets set. You have found some situation where this does not happen?

    I am aware of situations where SFCB will hang on shutdown following a provider crash, but they all involve cases where a HTTP request handler process is hung. This happens, in part, because we never timeout on a recvmsg() when communicating with a provider. I am working on a patch to implement a select() timeout in this case, and I'm hopeful that will fix this problem and some other issues where SFCB can hang. But in my test case to produce this sort or shutdown problem, your patch makes no difference (since the real problem there is that a req handler is preventing a http adapter from exiting).

    So, can you share what test case you are using, where setting the flag as you suggest makes a difference? Thx -Dave.

     
  • Klaus Kämpf

    Klaus Kämpf - 2013-11-18

    "quagmire of SFCB shutdown", yeah, lol !

    Here's the instance provider I used for testing:
    https://github.com/kkaempf/cmpi-crash

    It only implements EnumInstanceNames, which calls abort()
    All other intrinsic methods call NULL.

    The abort() case is handled in tix#89 already.

    However, if I ctrl-c sfcb after a NULL pointer access, it won't shut down properly.

    Adding debug prints pointed me to an endless loop calling stopNextProc().
    Setting providersStopped=1 breaks that endless loop for me.

    Having recvmsg() timeouts is of course very welcomed ;-)

     
  • Dave Heller

    Dave Heller - 2013-11-27

    Sorry I did not have a chance to look at this yet. But you are saying, there is no hung req handler process in this case (as there would not be, if no query was performed while SFCB was in this state), but Main is still in a hung state, waiting on the stopBroker thread?

     
  • Klaus Kämpf

    Klaus Kämpf - 2013-11-29

    Yes, exactly: "Adding debug prints pointed me to an endless loop calling stopNextProc()."

     
  • Dave Heller

    Dave Heller - 2014-01-06

    Sorry, I'm still having trouble seeing where this change makes any difference. The only way this providersStopped=1 would be hit here is if stopNextProc() returned 0. That would only happen if all the running provider processes, including classProvider, had been killed. Not just by virtue of sending the signal from stopNextProc(), but by receiving the SIGCHLD in handleSigChld(), where testStartedProc() is called, which is the only place we mark the process as stopped (set the pid to 0) in the provProc array. And if that is true, the providersStopped flag would already be set to 1 in handleSigChld(). Which is why I think this line has always been commented out.

    So far, the only case I have seen where this makes any difference is the Problem 2 from [sfcb-tix:#90]. But that is a special case since it has to due with the default value of providersStopped, not the way the flag is set during execution.

    I have attached a debug patch that checks the number of running provider processes against the value of the providersStopped flag, from within what I call the "main kill loop" in stopBroker(). Rather than just set the flag, we check if any running processes are left according to the provProc array. I added a test function getRunningProcCount() for this purpose. If that function reports 0 we can break even if the providersStopped is not set. That would mean a mismatch between the providersStopped flag and what's reflected in the provProc array. Like I said, I have not yet hit this case, except as I described above.

    If you can show me where you hit this case I'll have a look!

    Adding debug prints pointed me to an endless loop calling stopNextProc().

    Well, the way it's coded currently we'll be in this loop regardless of the return value of stopNextProc(), if providersStopped=0. If stopNextProc() returns > 0 there is a pid still outstanding (a provider still running) and we should not quit. If stopNextProc() returns 0 we should be done and the providersStopped flag should be set as I described, and then we should break (i.e. this shouldn't happen).

     

    Related

    SFCB: #90

  • Dave Heller

    Dave Heller - 2014-01-06

    This patch contains test code for SFCB-89, 91, 96 and 97, plus some debug statements. The different parts are labeled. Hopefully it's not too confusing. This stuff is all related so I found it easier to test it together.

     
  • Dave Heller

    Dave Heller - 2014-01-12

    Update debug patch for post-checkin of 89 and 96. Contains the test code for 91 and 97. Debug and comments only for the others.

     
  • Dave Heller

    Dave Heller - 2014-03-27

    I'm going to go ahead and close this as wont-fix here. It's hard to see how this condition can occur, even with a race condition between the stopBroker() thread and handleSigChld(), or a missed pthread signal between the two. (It's actually easier to see how a race condition could result in stopBroker() terminating early on a premature providersStopped=1 than it is to see how it could hang indefinitely in a case where the flag never gets set). As I mentioned, this line has been always commented out in stopBroker() and I don't want to enable it w/o understanding the condition better. And with other recent adjustments to shutdown, better to see how those work out before changing this as well.

    We can always revisit this if see this sort of hang in the future. Thx -DaveH

     
  • Dave Heller

    Dave Heller - 2014-03-27
    • status: open --> wont-fix
    • assigned_to: Dave Heller
    • Release: backlog --> 1.4.8
     

Log in to post a comment.