#5057 Crash on exit on Fedora 17

obsolete: 8.5.11
closed-works-for-me
5
2012-06-27
2012-06-21
julien
No

Our TCL-based application crashes from time to time on exit. The crash seems to occur in the TCL thread. Here is the backtrace:

#0 0x00000000019de02e in Tcl_MutexUnlock (mutexPtr=0x0)
at /home/julien/Certitude/main_line/thirdparty/tcl8.5.11/unix/../unix/tclUnixThrd.c:556
#1 0x00000000019bfa06 in UnlockBucket (cachePtr=0x7feecc00ae10, bucket=5)
at /home/julien/Certitude/main_line/thirdparty/tcl8.5.11/unix/../generic/tclThreadAlloc.c:814
#2 0x00000000019bfb94 in PutBlocks (cachePtr=0x7feecc00ae10, bucket=5, numMove=5)
at /home/julien/Certitude/main_line/thirdparty/tcl8.5.11/unix/../generic/tclThreadAlloc.c:863
#3 0x00000000019bec18 in TclFreeAllocCache (arg=0x7feecc00ae10)
at /home/julien/Certitude/main_line/thirdparty/tcl8.5.11/unix/../generic/tclThreadAlloc.c:240
#4 0x00000000019de3bb in TclpFreeAllocCache (ptr=0x7feecc00ae10)
at /home/julien/Certitude/main_line/thirdparty/tcl8.5.11/unix/../unix/tclUnixThrd.c:818
#5 0x000000372cc07b12 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#6 0x000000372cc07d22 in start_thread () from /lib64/libpthread.so.0
#7 0x000000372c8f199d in clone () from /lib64/libc.so.6

This problem only occurs on Fedora 17. Our application has run for years on many other Linux distributions without problems.

Changing the Tcl_MutexUnlock and Tcl_MutexLock functions to check for null pointers seems to solve the issue, but this is probably not the right solution, and something is certainly wrong elsewhere.

So is there a bug in the TCL thread handling? Or is something wrong in our application? Any ideas?

We use TCL 8.5.11, and it is compiled as follows:
./configure --enable-threads --enable-gcc --disable-langinfo --disable-shared

Discussion

  • Looks like TclFinalizeThreadAlloc was called before TclFreeAllocCache; the latter is called to deallocate the thread-specific memory manager instance. I'm guessing that the pthread implementation on Fedora is different to some other Linux distros, and the exact time of doing that destruction callback is an example of this difference. Nasty!
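For readers unfamiliar with the mechanism behind frames #4 and #5 of the backtrace: `__nptl_deallocate_tsd` is where glibc runs thread-specific-data destructors, after the thread's start routine has returned. The following is a minimal sketch in plain pthreads, not Tcl's actual code; `cache_key`, `free_cache`, `worker` and `run_tsd_demo` are hypothetical names used only for illustration. It shows that the destructor runs inside the exiting thread, so anything it relies on (in Tcl's case, the allocator mutexes) must still be valid at that point.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t cache_key;   /* hypothetical per-thread "allocator cache" slot */
static int destructor_calls = 0;  /* written by the worker, read only after join */

/* Runs inside the exiting thread, after its start routine has returned. */
static void free_cache(void *ptr)
{
    destructor_calls++;
    free(ptr);
}

static void *worker(void *unused)
{
    (void)unused;
    int *cache = malloc(sizeof *cache);
    if (cache != NULL) {
        *cache = 42;
        /* A non-NULL value is what makes the destructor fire at thread exit. */
        pthread_setspecific(cache_key, cache);
    }
    return NULL;
}

/* Returns the number of destructor calls observed, or -1 on setup failure. */
int run_tsd_demo(void)
{
    pthread_t tid;

    destructor_calls = 0;
    if (pthread_key_create(&cache_key, free_cache) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        pthread_key_delete(cache_key);
        return -1;
    }
    pthread_join(tid, NULL);      /* the destructor has finished once join returns */
    pthread_key_delete(cache_key);
    return destructor_calls;
}
```

Calling `run_tsd_demo()` from `main` (linking with `-pthread`) should report one destructor call for the one worker thread. The point of the sketch is only the ordering: the cleanup callback runs late, at thread exit, which is why state torn down earlier by another code path can bite here.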

     
  • The finalization code is deeply evil, BTW. (In 8.6, we don't try to finalize things quite so fiercely on [exit], as it was so damn problematic. It leads to more technical memory leaks, but the OS can deal with all that for us.)

     
  • julien
    2012-06-21

    I added some traces in the TclFinalizeThreadAlloc and TclFreeAllocCache functions.
    TclFinalizeThreadAlloc is NOT called before TclFreeAllocCache.

    Is there anything I can do to help solve this (add some debug output, etc.)?

     
  • Is your app script-only, or is there also some C code of yours in the mix ? Your mention of "the Tcl thread" seems to indicate the latter, but I'd like to be sure. If the former, try to build a minimal script. And in any case, as Donal suggests, try 8.6 since we took steps to avoid that can of worms.

     
  • julien
    2012-06-21

    Our application is a huge C++ application which uses Tcl among other libraries. And it's highly multi-threaded.

    I tried with Tcl 8.6: I still have crashes, but much less often than with 8.5.11.
    Here is the backtrace:
    #0 0x0000000001a15c75 in Tcl_MutexUnlock (mutexPtr=0x0) at /home/julien/Certitude/main_line/tcl8.6b2/unix/tclUnixThrd.c:465
    #1 0x00000000019eb336 in UnlockBucket (cachePtr=0x7f622c00ae30, bucket=7)
    at /home/julien/Certitude/main_line/tcl8.6b2/generic/tclThreadAlloc.c:833
    #2 0x00000000019eb4c4 in PutBlocks (cachePtr=0x7f622c00ae30, bucket=7, numMove=3)
    at /home/julien/Certitude/main_line/tcl8.6b2/generic/tclThreadAlloc.c:882
    #3 0x00000000019ea548 in TclFreeAllocCache (arg=0x7f622c00ae30) at /home/julien/Certitude/main_line/tcl8.6b2/generic/tclThreadAlloc.c:264
    #4 0x0000000001a16002 in TclpFreeAllocCache (ptr=0x7f622c00ae30) at /home/julien/Certitude/main_line/tcl8.6b2/unix/tclUnixThrd.c:730
    #5 0x000000372cc07b12 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
    #6 0x000000372cc07d22 in start_thread () from /lib64/libpthread.so.0
    #7 0x000000372c8f199d in clone () from /lib64/libc.so.6

    I have also noticed that the crash seems to occur when another thread allocates or frees some memory:
    (gdb) t 4
    [Switching to thread 4 (Thread 0x7f5e0ca03880 (LWP 2341))]
    #0 0x000000372c87c20a in _int_free () from /lib64/libc.so.6

    Or, in another execution:
    (gdb) t 3
    [Switching to thread 3 (Thread 0x7fddc7894880 (LWP 18314))]
    #0 0x000000372c87cdbb in _int_malloc () from /lib64/libc.so.6
    (gdb) bt
    #0 0x000000372c87cdbb in _int_malloc () from /lib64/libc.so.6
    #1 0x000000372c87fcb3 in malloc () from /lib64/libc.so.6
    #2 0x000000372f45f3cd in operator new(unsigned long) () from /lib64/libstdc++.so.6

     
  • > Our application is a huge C++ application which uses Tcl among other
    > libraries. And it's highly multi-threaded.

    A side note: better architectures exist...

    > I also have noticed that the crash seems to occur when another thread
    > allocates or frees some memory:

    This hints at the possibility of a violation of the apartment model. Please verify that *all* your calls into Tcl are from the thread where you created the interp.
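One way to verify the rule above during debugging is to record the creating thread and check it on every entry point. This is a rough sketch in plain pthreads, assuming nothing about Tcl itself: `OwnedHandle`, `owned_init`, `owned_by_caller` and `owned_checked_from_new_thread` are hypothetical helpers, and `resource` merely stands in for an interpreter pointer.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrapper that remembers which thread created a resource. */
typedef struct {
    pthread_t owner;
    void *resource;   /* stand-in for an interpreter pointer */
} OwnedHandle;

/* Must be called on the thread that creates the resource. */
void owned_init(OwnedHandle *h, void *resource)
{
    h->owner = pthread_self();
    h->resource = resource;
}

/* True only when called from the thread that ran owned_init(). */
bool owned_by_caller(const OwnedHandle *h)
{
    return pthread_equal(h->owner, pthread_self()) != 0;
}

/* Thread body: returns a non-NULL pointer if this thread is the owner. */
static void *probe(void *arg)
{
    return owned_by_caller(arg) ? arg : NULL;
}

/* Runs the ownership check from a freshly created thread. */
bool owned_checked_from_new_thread(OwnedHandle *h)
{
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, probe, h) != 0)
        return false;
    pthread_join(tid, &result);
    return result != NULL;
}
```

In a real application one would presumably put an `assert(owned_by_caller(h))` at the top of every function that calls into the interpreter, including the one that deletes it, so that a cross-thread call fails loudly at the call site instead of surfacing later as a crash on exit.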

     
  • julien
    2012-06-22

    Correction:
    The crash does not seem to depend on whether the other threads allocate or free memory (I also had the crash when the other threads did not allocate or free memory).

    Regarding the possibility of a violation of the apartment model, I guess that if something was wrong here, we would have experienced some crashes on other Linux distributions. But crashes only occur on Fedora 17.

    Basically, the architecture of our application is as follows:
    The main thread (thread 1) creates a new thread (thread 2). In this new thread, the Tcl interpreter is created. It seems that Tcl creates another thread at this moment (thread 3) to execute the NotifierThreadProc function. The Tcl commands are executed in thread 2 (where the Tcl interpreter was created) until the user decides to exit the application. On exit, the Tcl_DeleteInterp and Tcl_Finalize functions are called in thread 2. Then this thread exits and is deleted. The crash occurs in thread 3 during the exit process of the main thread (so after thread 2 was deleted).

    Is there something wrong with our architecture ? Any idea or suggestion ?
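The lifecycle described above can be sketched as follows, with the interpreter replaced by a dummy allocation so the example stays self-contained. Everything here is hypothetical (`Session`, `interp_thread`, `run_session`); the `malloc`/`free` pair only marks where `Tcl_CreateInterp` and `Tcl_DeleteInterp` plus `Tcl_Finalize` would go. The main thread only requests shutdown and joins; it never touches the interpreter stand-in.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool            quit;        /* set by the main thread to request shutdown */
    pthread_t       created_on;  /* thread that created the stand-in interp */
    pthread_t       deleted_on;  /* thread that deleted it */
} Session;

/* "Thread 2": creates, uses and deletes the interpreter stand-in itself. */
static void *interp_thread(void *arg)
{
    Session *s = arg;
    int *fake_interp = malloc(sizeof *fake_interp); /* where Tcl_CreateInterp() would go */
    s->created_on = pthread_self();

    /* Stand-in for the command loop: wait until the user asks to exit. */
    pthread_mutex_lock(&s->lock);
    while (!s->quit)
        pthread_cond_wait(&s->wake, &s->lock);
    pthread_mutex_unlock(&s->lock);

    free(fake_interp); /* where Tcl_DeleteInterp() and Tcl_Finalize() would go */
    s->deleted_on = pthread_self();
    return NULL;
}

/* "Thread 1": returns true if creation and deletion happened on thread 2. */
bool run_session(void)
{
    Session s;
    pthread_t tid;
    bool same = false;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.wake, NULL);
    s.quit = false;

    if (pthread_create(&tid, NULL, interp_thread, &s) == 0) {
        pthread_mutex_lock(&s.lock);
        s.quit = true;                 /* request shutdown, nothing more */
        pthread_cond_signal(&s.wake);
        pthread_mutex_unlock(&s.lock);
        pthread_join(tid, NULL);       /* wait for cleanup to finish */

        same = pthread_equal(s.created_on, s.deleted_on) != 0
            && pthread_equal(s.created_on, tid) != 0;
    }

    pthread_cond_destroy(&s.wake);
    pthread_mutex_destroy(&s.lock);
    return same;
}
```

The design choice worth noting is that shutdown is expressed as a request (`quit`) rather than as a direct teardown call from another thread, and that the main thread joins before it continues its own exit.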

     
  • julien
    2012-06-27

    After some investigation, it appears that there was a nasty bug in our application: in some cases, the Tcl interpreter was not deleted in the thread where it was created. The strange thing is that we never observed crashes before Fedora 17.
    The issue can be closed now.

     
  • julien
    2012-06-27

    • status: open --> closed-works-for-me
     
  • Glad to see your problem solved. I can also answer your question "Is there something wrong with our architecture ? Any idea or suggestion ?":

    Yes. Don't build huge multithreaded monoliths. Modularize, and use the process boundary as often as is compatible with performance. On the next core dump you'll be glad you did ;)