Standards Based Linux Instrumentation / SFCB / #95 Do not release pthread lock prior to stopBroker() exit

#95 Do not release pthread lock prior to stopBroker() exit

Milestone: 1.4.8

Status: fixed

Owner: Dave Heller

Labels: None

Priority: None

Category: Function

Updated: 2015-02-22

Created: 2014-01-05

Creator: Dave Heller

Private: No

I believe I have found the root cause of a crash at shutdown I have been observing for quite some time. The crash occurs in the main sfcBroker (a.k.a ProviderManager) process, while the stopBroker() thread is at the very end of its execution, in the final exit() call. The crash occurs in the main ProviderManager thread, when the thread is the middle of one of the mHandler[] calls (as defined in ProviderMgr.c). I believe the crash is due to the exit() being called from the stopBroker() thread while the main thread is at this point.

This is similar to crash from [sfcb-tix:#56], the main difference being that in that case, the main thread is in spRecvReq(), as it normally is when it is idle, rather than being in the mhandler call as it is here. The spRecvReq() crash is more likely when the system is idle at shutdown, while the crash in mHandler is more likely to occur under load.

Here is an example of the crash in the mHandler call. Thread 1 is the stopBroker() thread; it is in the exit() call so there's not much useful info there. Thread 2 is main. It is executing a assocProviderList() call at the time, but I have similar core dumps where it's in classProvider() or instProviderList() instead:

$ gdb -q /root/sblim/sfcb/.libs/lt-sfcbd /var/log/core/dump

Reading symbols from /root/sblim/sfcb/.libs/lt-sfcbd...done.
[New LWP 31807]
[New LWP 31620]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/root/sblim/sfcb/.libs/lt-sfcbd -proc:Main                                    '.
Program terminated with signal 6, Aborted.
#0  0x00000039ed835965 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-29.fc17.x86_64 glibc-2.15-37.fc17.x86_64 keyutils-libs-1.5.5-2.fc17.x86_64 krb5-libs-1.10.2-2.fc17.x86_64 libcom_err-1.42.3-2.fc17.x86_64 libcurl-7.24.0-4.fc17.x86_64 libgcc-4.7.0-5.fc17.x86_64 libidn-1.24-1.fc17.x86_64 libselinux-2.1.10-3.fc17.x86_64 libssh2-1.4.1-2.fc17.x86_64 nspr-4.9.5-1.fc17.x86_64 nss-3.14.2-2.fc17.x86_64 nss-softokn-freebl-3.14.2-3.fc17.x86_64 nss-util-3.14.2-2.fc17.x86_64 openldap-2.4.30-2.fc17.x86_64 openslp-1.2.1-16.fc17.x86_64 openssl-1.0.0k-1.fc17.x86_64 zlib-1.2.5-6.fc17.x86_64
(gdb) bt
#0  0x00000039ed835965 in raise () from /lib64/libc.so.6
#1  0x00000039ed837118 in abort () from /lib64/libc.so.6
#2  0x00007f58876415a8 in __gcov_open () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#3  0x00007f5887641bae in gcov_exit () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#4  0x00000039ed838f9a in __cxa_finalize () from /lib64/libc.so.6
#5  0x00007f58875edc73 in __do_global_dtors_aux () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#6  0x00007f588677bd50 in ?? ()
#7  0x00000039ed00f4e7 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC
(gdb) i th
  Id   Target Id         Frame
  2    Thread 0x7f5887183800 (LWP 31620) 0x00000039ed8e4c7d in write () from /lib64/libc.so.6

* 1    Thread 0x7f588677c700 (LWP 31807) 0x00000039ed835965 in raise () from /lib64/libc.so.6
(gdb) t 2
[Switching to thread 2 (Thread 0x7f5887183800 (LWP 31620))]
#0  0x00000039ed8e4c7d in write () from /lib64/libc.so.6
(gdb) bt
#0  0x00000039ed8e4c7d in write () from /lib64/libc.so.6
#1  0x00000039ed8767e3 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00000039ed8766c2 in new_do_write () from /lib64/libc.so.6
#3  0x00000039ed87734e in _IO_new_file_xsputn () from /lib64/libc.so.6
#4  0x00000039ed86ce4d in fwrite () from /lib64/libc.so.6
#5  0x00007f5887641252 in gcov_write_block () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#6  0x00007f588764160f in __gcov_close () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#7  0x00007f5887641c91 in gcov_exit () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#8  0x00007f588764281f in __gcov_flush () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#9  0x00007f5887642909 in __gcov_fork () from /root/sblim/sfcb/.libs/libsfcBrokerCore.so.0
#10 0x00007f5887630b36 in getProcess (proc=<synthetic pointer>, info=0x9c4970) at providerDrv.c:1139
#11 forkProvider (info=info@entry=0x9c4970, msg=msg@entry=0x0) at providerDrv.c:1338
#12 0x00007f588761bbc2 in assocProviderList (requestor=0x7fffc65cfc9c, req=0x9cc510) at providerMgr.c:696
#13 0x00007f588761aa10 in processProviderMgrRequests () at providerMgr.c:1115
#14 0x00000000004039af in main (argc=2, argv=<optimized out>) at sfcBroker.c:1270

This occurs infrequently, but can happen if requests are outstanding at the time of shutdown. It is much more likely to occur if stopBroker is hung waiting a http adapter to die, as I discovered while coding my (as yet uncommitted) "http request handler timeout" patch. But the crash can occur with SFCB as it is coded currently, and can be reproduced with enough shutdown attempts under moderate load. ("moderate" because we won't see the crash if a req handler is hung at the end, w/o my timeout patch)

The key seems to be the call to pthread_mutex_unlock() toward the end of stopBroker. This is almost the last call in stopBroker, but not quite. I believe what's happening is, there is enough time between the pthread_mutex_unlock() and the exit() to allow the Main thread to enter a new mHandler, and when the exit() call comes, it crashes.

It's fairly easy to demonstrate this, by adding some trace statements right before the final call to pthread_mutex_unlock() in stopBroker:

@@ -274,11 +400,30 @@ stopBroker(void *p)
   uninitProvProcCtl();
   uninitSocketPairs();
   sunsetControl();
   uninitGarbageCollector();


+  fprintf(stderr, "--> sfcBroker.stopBroker(): calling pthread_mutex_unlock()...\n");
   pthread_mutex_unlock(&syncMtx);

   unloadHostnameLib();

   _SFCB_TRACE_STOP();

and surrounding around the pthread_mutex_lock() call in processProviderMgrRequests(), immediately before the mHandler call:

@@ -1078,9 +1091,23 @@ processProviderMgrRequests()
                     ("--- Mgr request for %s-%s (%d) from %d",
                      req->nameSpace.data, req->className.data, req->type,
                      requestor));


+        fprintf(stderr, "--> providerMgr.processProviderMgrRequests(%ld): about to lock syncMtx...\n",syscall(__NR_gettid));
         pthread_mutex_lock(&syncMtx); /* 77022 */
+        fprintf(stderr, "--> providerMgr.processProviderMgrRequests(%ld): acquired lock on syncMtx\n",syscall(__NR_gettid));
         prov_rdy_state = -1;
         pthread_mutex_unlock(&syncMtx);

         sigprocmask(SIG_SETMASK, &mask, &old_mask);
         hdlr = mHandlers[req->type];                 <- enter handler here
         hdlr.handler(&requestor, req);
         sigprocmask(SIG_SETMASK, &old_mask, NULL);

         pthread_mutex_lock(&syncMtx); /* 77022 */
         prov_rdy_state = 1;
         pthread_mutex_unlock(&syncMtx);

        _SFCB_TRACE(1,
                     ("--- Mgr request for %s-%s DONE", req->nameSpace.data,
                      req->className.data));

This (plus some additional trace) produces this output when the crash is observed:

********** KEYBOARD INTERRUPT **********
^C--> providerMgr.processProviderMgrRequests(31620): about to lock syncMtx...
--> providerMgr.processProviderMgrRequests(31620): acquired lock on syncMtx
--- Winding down sfcbd
--- localConnectServer ended
--- Stopping adapters
********** main thread blocked here **********
--> providerMgr.processProviderMgrRequests(31620): about to lock syncMtx...
--> sfcBroker.stopBroker(31807): stopNextAdapter() just sent SIGUSR1 to pid 31636
--- HTTP-Daemon terminating 31636
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31636
--- Adapters stopped
--- Stopping providers
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31665
--> providerDrv.handleSigUsr1(31665): IndCIMXMLHandler provider received SIGUSR1 signal
---  stopped IndCIMXMLHandler 31665
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31648
--> providerDrv.handleSigUsr1(31648): ServerProvider provider received SIGUSR1 signal
---  stopped ServerProvider 31648
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31648
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31645
--> providerDrv.handleSigUsr1(31645): InternalProvider provider received SIGUSR1 signal
---  stopped InternalProvider 31645
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31645
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31641
--> providerDrv.handleSigUsr1(31641): ProfileProvider provider received SIGUSR1 signal
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31641
--> providerDrv.handleSigUsr1(31641): ProfileProvider provider received SIGUSR1 signal
---  stopped ProfileProvider 31641
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31641
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31639
--> providerDrv.handleSigUsr1(31639): InteropProvider provider received SIGUSR1 signal
---  stopped InteropProvider 31639
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31639
--> sfcBroker.stopBroker(31807): stopNextProc() just sent SIGUSR1 to pid 31637
--> providerDrv.handleSigUsr1(31637): ClassProvider provider received SIGUSR1 signal
---  stopped ClassProvider 31637
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31637
--- Providers stopped
==> sfcBroker.stopBroker(31807): 0 adapter procs left, adaptersStopped = 1
==> sfcBroker.stopBroker(31807): 0 provider procs left, providersStopped = 1
==> sfcBroker.stopBroker(31807): finished the main kill loop
--> sfcBroker.stopBroker(): calling closeLogging()...
--> sfcBroker.stopBroker(): closeLogging() called.
********** stopBroker thread releases lock -**********
--> sfcBroker.stopBroker(): calling pthread_mutex_unlock()...
--> sfcBroker.handleSigChld(31620): got SIGCHLD from 31634
--> sfcBroker.handleSigChld(31620): Everything already stopped, now what...?
********** main thread can now continue **********
--> providerMgr.processProviderMgrRequests(31620): acquired lock on syncMtx
Aborted (core dumped)

This clearly shows that the main thread is blocked on the mutex lock early in the shutdown, and remains there until stopBroker() releases the lock near the end. At that point the main thread can continue. If it is in the mHandler call when the stopBroker() thread calls exit(), we will likely crash.

The crash is rather benign, in the sense that SFCB is about to quit anyway, and it's unlikely there's any disk activity at this point (i.e. nothing to corrupt). Nevertheless, fixing it seems easy.

All we need to do is remove the pthread_mutex_unlock() call, which prevents stopBroker() from releasing the mutex lock before exit(). In that case we never see this crash. I can't think of any good reason to release the lock here: there is no good reason to start processing a new request at this point. And I can't see any adverse effect from not releasing the lock -- at least not on a standard x86_64 box.

This locking code was added with 3497096 (a.k.a. 77022). I think the intent was to block stopBroker() until the main thread was locked from entering a new mHandler call. However, I think the same code can be used to prevent the main thread from continuing before the exit(). Put another way, having locked the main thread at the start of shutdown, I see no good reason to release it before shutdown is complete.

Dave Heller - 2014-01-05

Commit [88484a] for v1.4

Related

Commit: [88484a]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Dave Heller - 2014-01-05

status: open --> pending
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Dave Heller - 2014-02-07

Release: backlog --> 1.4.8
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Dave Heller - 2014-03-27

Status: pending --> fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Do not release pthread lock prior to stopBroker() exit

Release

Searches

Help

#95 Do not release pthread lock prior to stopBroker() exit

Related

Discussion

Related