#4429 Tcl_AsyncMark does not work with --enable-threads

obsolete: 8.6b1.1
open
6
2009-11-02
2009-09-29
No

OS: sparc Solaris 10
Tcl_AsynMark cannot be called in a signal handler because it locks the same mutex (notifierMutex) as Tcl_WaitForEvent. If a signal is handled in main thread when it serves Tcl events, then deadlock is very probable. Attached example will produce a stack like this
----------------- lwp# 1 / thread# 1 --------------------
feb40408 lwp_park (0, 0, 0)
ff0385fc Tcl_MutexLock (ff08247c, 0, 0, 1000, 0, 0) + b4
ff039624 Tcl_AlertNotifier (2bda8, 0, 0, 1000, 0, 0) + 34
fefd83a8 Tcl_ThreadAlert (1, 0, 0, 0, 0, 0) + a8
feedf7e0 Tcl_AsyncMark (1a80d8, 0, 0, 0, 0, 0) + 60
fee306a8 GotAlarm (e, 0, ffbfeb10, 1, 21c30, 21c2c) + 30
feb40494 __sighndlr (e, 0, ffbfeb10, fee30678, 0, 1) + c
feb3558c call_user_handler (e, 0, 4, 0, ff0c2000, ffbfeb10) + 3b8
feb4187c _write (4, ff05ce88, 1, 0, 0, 2c0b5) + c
ff03a2f0 Tcl_WaitForEvent (224e8, fffffffd, ff1966e4, ff3ee0f8, ff3f06e0, 0) + 3c0
fefd7f74 Tcl_DoOneEvent (0, 6e7f8, 0, 0, 1, 0) + 1ec
ff17c738 Tk_MainLoop (2c1a8, 31188, 0, 1, 1, 5400) + 48
ff1966e4 Tk_MainEx (2, ffbff2ac, 10e90, 2c1a8, 4, 4) + b1c
00010e60 main (2, ffbff2ac, ffbff2b8, 21000, ff0c0100, ff0c0140) + 50
000109e0 _start (0, 0, 0, 0, 0, 0) + 108
----------------- lwp# 2 / thread# 2 --------------------
feb40408 lwp_park (0, 0, 0)
ff0385fc Tcl_MutexLock (ff08247c, fea7bf1c, fea7be9c, fea7be1c, 0, fea7bea5) + b4
ff03a8e4 NotifierThreadProc (0, fea7c000, 0, 0, ff03a4a8, 1) + 43c
feb40368 _lwp_start (0, 0, 0, 0, 0, 0)
in a few seconds. To reproduce the problem, use regular wish built with --enable-threads and use attached script as input:
wish run.tcl
It will hang shortly. To speed up hanging, one can decrease timeouts. Tk must serve some events to make the application hanging faster, so the script changes text widget in loop. Build script is attached (need to modify paths).

Discussion

  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-09-29

    testcase to hang

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-09-29
    • summary: Tcl_AsynMark does not work with --enable-threads --> Tcl_AsyncMark does not work with --enable-threads
     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-09-29

    Tried on Linux, Red Hat 4 update 4. The same problem.

     
  • Konstantin Khomoutov

    I've hit this while implementing an extension to handle POSIX signals (it's unfinished yet so not released).

    The problem appears to be more complex and I doubt it can be fixed at all for all possible cases of a deadlock,
    simply because pthread's mutex is allowed to deadlock when locked twice from the same thread, and that's
    what it does on Linux 2.6 at least.

    The problem is that the signal handler bumps into the thread's execution stack at unpredictable moments,
    and hence the signal handling can be considered as a "superthread" which don't play by the usual rules for threads.

    Conceptually, to make Tcl_AsyncMark() (and any other function which tries to lock the target thread's LTS) not deadlock,
    we have to make locking function only attempt locking if some flag is not raised (meaning that no lock
    is held). But using a flag means sharing a mutable state, which inherently implies using another mutex, so we're back at the starting point.
    Masking all signals before locking the LTS mutex is also a bad idea: a) too many syscalls b) this doesn't help to prevent mutual deadlocks between two threads.

    So, in my package mentioned above I finally implemented a solution involving a special manager thread with which signal handlers interact and which dispatches events to target threads. The solution is complicated but works. As I intend to release the code under a Tcl-like lisence, I will be happy to share it, if you so wish.

     
  • Alexandre Ferrieux

    I seem to recall (see that 12-year old paper http://www.linuxjournal.com/article/2121 ) that the recommended way of doing pthread+signals is to block all signals in all threads except one, thus effectively guiding signals to one specific thread that you control. Is this still working in modern pthreads ?

     
  • Konstantin Khomoutov

    The problem with this approach is that it implies we control all the threads existing when our setup code is beging run and postulate the policy than there shall never be a thread created which will have singals unmasked, except the signal-processing thread.

    I reckon this is impossible to achieve from an extension implementing signal handling, and can only possibly be done if integrated directly in the core, which is questionnable, as the concept of signals only exists on Unix.

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-09-29

    pthread_trylock may help (in addition with global flag or something). I don't have clear idea though.

     
  • Konstantin Khomoutov

    It seems this function is actually named pthread_mutex_trylock().

    The idea is interesting, but I don't quite get how to implement it:
    In the defautl mode of operation pthread_mutex_trylock() is defined to return immediately with EBUSY if the mutex lock is held, and Tcl core doesn't expect this behaviour from the mutex API it uses (and exposes).
    If we instead make all mutexes "recursive" (in terms of pthreads), this will fix deadlocks occuring in the same thread because of signal handling but will break expectations in all other cases when a mutex locking function is supposed to wait on a mutex already locked by some other thread.

    It need further thinking (and taking into account threading subsystem on Windows) as Tcl mutex API hides platform details from the programmer.

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-09-29

    I don't think that recursive mutex will help. Cannot modify global data safely in a signal handler without be sure that you are alone.
    Agree that the problem cannot be solved without help of another thread that does not receive signals.
    Please also note that there may be several signals come before "end of special handling".
    Draft idea:
    Need a thread that blocks all signals (except, maybe one, usr123) or does not use notifierMutex. Let's call it right_thread.
    AsyncMark:
    - if (trylock(&notifierMutex) == 0)
    - // normal code
    - lock (&another_mutex)
    - queue async_tls_data
    - unlock (&another_mutex)
    - wake up right_thread (using pthread_kill(right_thread,usr123) or pthread_cond_signal)
    right_thread:
    - lock (&another_mutex)
    - take asyn_tls_data from queue
    - save async_tls_data to right interp
    - unlock (&another_mutex)

    There is a non-blocking technique to modify global data. May also be used in any solution.

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-10-28

    Any update? It is not a minor issue. One of the oldest group of functions does not work in the situation it was primarily designed for. And there is no workaround.

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2009-10-28
    • priority: 5 --> 6
     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02

    The extensive use of locking in the Unix notifier seems to be the cause of this issue? I believe that with careful analysis of the locks in the Unix notifier they can be minimized. Of special concern is waiting for (an|any) event while holding any locks because we have no way of knowing precisely when an event will be triggered.

     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02

    Please try the attached patch in your environment.

     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02
    • milestone: 874763 --> obsolete: 8.6b1.1
     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02

    possible fix

     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02

    Slightly modified build script and test case to run in my env.

     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02

    BSD Tcl test case (no Tk)

     
  • Joe Mistachkin

    Joe Mistachkin - 2009-11-02

    It seems that my patch fixes something; however, the original test script still hangs on Linux 2.6. Thanks to flatworm for helping me test this. I would really appreciate it if the original submitter of the bug would run both tests in their environment and report the results here.

     
  • Konstantin Khomoutov

    Note that in order to compile the CVS HEAD with the proposed patch applied under Linux 2.6.26 I had to tweak the Makefile.

    Namely, to allow compiling these recursive mutexes code, _XOPEN_SOURCE must be defined to be 500 or higher, and as this POSIX compatibility layer makes strncasecmp() inaccessible, _BSD_SOURCE must also be defined to return it back again.

    So, a temporary solution is to make the usual "./configure" step, then open the generated Makefile, locate the "AC_FLAGS" variable and add this line below:
    AC_FLAGS += -D_XOPEN_SOURCE=500 -D_BSD_SOURCE

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2010-03-02

    Sorry for long absence.
    1) As I mentioned before, recursive mutex does not protect "protected data" from corruption in signal handler.
    2) I made Tcl testcase much more reliable, it hangs just in seconds. Need to change timeout in C code to 100 us and in Tcl to 1 ms and decrease printing. Attaching run3.tcl to use with tclsh.
    3) After making mutexes recursive and using modified testcase, it hangs in the same place. I collected a number of different stacks. Last one:
    Tcl_WaitForEvent(timePtr = 0x22328)
    Tcl_MutexUnlock(mutexPtr = 0xff38294c)
    __mutex_unlock(0x2b468, 0x1000, 0xfef6cbc0, 0xff072000, 0x2b470, 0x6)
    --- Signal SIGALRM raised ---
    __sighndlr(0xe, 0x0, 0xffbfe4f0, 0xfee20678, 0x0, 0x1)
    GotAlarm(signo = 14)
    Tcl_AsyncMark(async = 0x51738)
    Tcl_ThreadAlert(threadId = 0x1)
    Tcl_AlertNotifier(clientData = 0x2bbe8)
    Tcl_MutexLock(mutexPtr = 0xff38294c)
    mutex_lock_queue(0x0, 0x0, 0x2b468, 0x0, 0x0, 0x1)
    __lwp_park(0x0, 0x0, 0x0, 0x0, 0xfee50000, 0x1)
    4) You search web for "pthread_mutex_lock async signal safety" and you will
    find that you cannot use mutexes in signal handler at all! Here's an example
    of finding:
    ASYNC-SIGNAL SAFETY
    The mutex functions are not async-signal safe. What this means is that
    they should not be called from a signal handler. In particular, calling
    pthread_mutex_lock or pthread_mutex_unlock from a signal handler may
    deadlock the calling thread.
    This all means that direction of the research is wrong. Tcl_AsyncMark must be mutex free or must not be used in signal handler.

     
  • Yevgen Ryazanov

    Yevgen Ryazanov - 2010-03-02

    updated testcase

     

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks