#4525 Unset PATTERN unsafe with trace-originated secondary unsets

obsolete: 8.5.8
closed-fixed
8
2010-02-03
2010-01-25
No

I wrote a simple application to gather Tk events and pass them from one interpreter to another via tequila (http://wiki.tcl.tk/1243), which uses traces and socket communication to keep array values in sync between Tcl interpreters.

My application displays a canvas, to which a <Motion> event is bound. Thus as the cursor is moved over the canvas a steady stream of Tk events is generated; the scripts associated with these events are captured and written to an array, which tequila echoes to an interpreter in a separate process.

When I start this application and move the cursor continually over the canvas, sometime between five and sixty seconds later one of the interpreters segfaults and crashes.

Platform is Ubuntu Hardy, with ActiveTcl 8.5.5 installed. I duplicated the bug using Pat Thoyt's tclkits versions 8.5.8 (http://www.patthoyts.tk/tclkit/linux-ix86/8.5.8/tclkit%2d8.5.8.gz) and 8.6b1.1 (http://www.patthoyts.tk/tclkit/linux-ix86/8.6%2dbeta/tclkit%2d8.6b1.1.gz).

For ease of duplication I created a starkit containing all relevant code called echoevent.kit (attached). To duplicate:

1. in a terminal console, execute: tclkit echoevent.kit server
This will start a tequila server.

2. in a separate terminal console, execute: tclkit echoevent.kit master
This will start a simple Tk "master" application that displays a canvas. Moving the cursor across this "master" canvas will cause the command associated with the <Motion> event to echo repeatedly to the terminal console.

3. in a third terminal console, execute: tclkit echoevent.kit slave
This will cause an identical "slave" Tk application to appear, except event processing for the "slave" canvas is disabled. Moving the cursor across the "master" canvas will cause the event command to echo to both the "master" console and the "slave" console. The tequila server is passing event information from the master interpreter to the slave interpreter.

4. move the cursor continually over the master canvas to generate a constant stream of <Motion> events to be echoed to the consoles. Either the master or the slave process will crash with a segmentation fault within 60 seconds.

report cross-posted at: http://bugs.activestate.com/show_bug.cgi?id=85927

Discussion

1 2 > >> (Page 1 of 2)
  • Stephen Huntley

    Stephen Huntley - 2010-01-25

    starkit containing example code

     
  • Alexandre Ferrieux

    Since this happens on unix, can you enable core dumps, and provide a gdb stack trace of the crash ? I'd like to dichotomize between Metakit and the vanilla Tcl core.

    As a side note, a .kit is not the easiest transport for some maintainers (me at least), because it requires additional effort to crack the hull open, and reproduce with the core-under-development. I hate efforts ;-)

     
  • Steve Landers

    Steve Landers - 2010-01-25

    I've reproduced the problem on OSX 10.6.2 using wish 8.5.7. If I can figure how, I'll attach the traceback file

     
  • Alexandre Ferrieux

    gdb trace under Fedora 12 32bits

     
  • Alexandre Ferrieux

    After Steve kindly brought all the pieces beside my bed, I could get the attached gdb stack trace under Linux (Fedora 12, 32bits). Nothing about Metakit there.

     
  • Alexandre Ferrieux

    Further analysis of the core shows the command

    array unset collab .c,queue,42,*

    being executed at the time. So it seems to hint at an interaction between patterned unsets and traces. I'd appreciate somebody knowing the app at hand to trim that stuff down to a manageable size... TBC

     
  • Alexandre Ferrieux

    Homing in.
    Just adding [puts] around the unset, shows a reentrancy issue.

    UNSETCOLLAB>: .c,queue,19,* (size:23 -- matching:10)
    UNSETCOLLAB>: .c,queue,20,* (size:22 -- matching:10)
    UNSETCOLLAB<
    UNSETCOLLAB>: .c,queue,19,* (size:12 -- matching:9)
    UNSETCOLLAB<

    This in turn can be traced down to Tequila doing [vwait] in its guts. With Tk and a heavy stream of events, that [vwait] can catch a GUI event and call arbitrary code, including some doing an extra [unset] on the same var.

    Knowing this, I'll try to build an atomic example.

     
  • Alexandre Ferrieux

    Still not managed to reproduce, but a good candidate from RTFS:
    At line 3199 of tclVar.c, for [array unset var pattern], a loop iterates over key with a stateful iterator:

    for (varPtr2=VarHashFirstVar(varPtr->value.tablePtr, &search);
    varPtr2!=NULL ; varPtr2=VarHashNextVar(&search)) {

    then, in the body of that loop, when a match is found, the unset is done, possibly calling traces, possibly wreaking the chaining of hash keys.

    Some kind of locking, or a more robust iterator, must be used I think.

     
  • Alexandre Ferrieux

    Additional insight: the way the "stateful iterator" Tcl_HashSearch work, is by a {bucket index, chained intra bucket pointer} pair. So the cases where the aforementioned interaction can cause harm, is when the second-level [unset] removes the very next entry from the same bucket.

    That's why the crash is so rare. Now to designing a pin-sharp case ;-)

     
  • Alexandre Ferrieux

    Sigh.
    Just produced a bucket of depth three: [array set x {a 1 e 1 i 1}].
    (verify with [array stat x])

    Then tried this:
    trace add variable x unset rem
    proc rem args {puts REM:$args; if {![info exists ::again]} {set ::again 1;unset ::x(e)}}

    Then [array unset x *] should remove, in order, i,e,a.
    *But*, since the trace proc removes e while removing i, the outer loop should get into trouble....

    Though it doesn't for some reason.
    Just noticed the tricky game of Undefining a var instead of coldly killing its hash entry. Maybe this explains that, and my hypothesis was naive. Something more subtle must come into play...

     
  • Alexandre Ferrieux

    Ha ! Got it now.

    In truth the secondary unset _does_ kill the hash entry in that case, it's just that the reuse of the freed block doesn't cause immediate harm normally. Hence I had to add the following 'sanitizing' line to FreeVarEntry:

    if (TclIsVarUndefined(varPtr) && !TclIsVarTraced(varPtr)
    && (VarHashRefCount(varPtr) == 1)) {
    + varPtr->value.objPtr=(void *)0xdeadbeef;
    ckfree((char *) varPtr);

    It is simply a matter of resetting a field to a visibly invalid value just before freeing.
    The effect is that, as soon as somebody reuses that freed structure (and typically by indirections off this objPtr field), a SIGSEGV or SIGBUS will burst on 0xdeadbeef.

    And it does: given this sanitizing patch, the following ultrashort example crashes tclsh:

    array set x {e 1 i 1}
    trace add variable x unset rem
    proc rem args {
    puts REM:$args
    if {![info exists ::already]} {
    set ::already 1
    unset ::x(e)
    }
    }
    array unset x *

    Explanation: a nascent array with a small size has only 4 buckets, allowing a quick search of collisions.
    It turns out 'e' and 'i' fall in the same bucket. Since the entries of a same bucket are stored in a LIFO linked list, the [array unset x *] iteration will do 'i', then 'e'. The rest of the code arranges for the 'e' to be removed as a side effect of the unset trace on 'i', hence removing the rug. Boom.

    Renaming the bug according to this confirmation.

     
  • Alexandre Ferrieux

    • summary: IPC via sockets and traces causes segfault --> Unset PATTERN unsafe with trace-originated secondary unsets
     
  • Alexandre Ferrieux

    • priority: 5 --> 8
     
  • Alexandre Ferrieux

    Upping prio. Patch under way.

     
  • Alexandre Ferrieux

    The attached patch fixes the issue by properly protecting the next-in-current-bucket.
    This is done modularly (without touching tclHash.c) thanks to the property that the search object has a NULL nextEntryPtr at each bucket change. So the protection is done by upping the refcount of nextEntryPtr, and decreasing it again one loop cycle later.

    Since the patch includes the "sanitizing" line, a deterministic test case will be provided shortly.

    Please review.

     
  • Alexandre Ferrieux

    • status: open --> open-fixed
     
  • Donal K. Fellows

    That patch doesn't quite work. It has excess refcount problems when there's a failure to unset midway. Evil case!

    I'll attach the patch I've applied to the HEAD, which addresses the issue (it's closely based on yours and includes a test). Note that it doesn't include any code to invalidate unused memory; the --enable-symbols=mem configuration works fine for that already.

     
  • Donal K. Fellows

    Backported to 8.5. I think it's probably not practical to backport to 8.4 since this is thoroughly dependent on miguel's VarReform work.

     
  • Donal K. Fellows

    • labels: 104240 --> 07. Variables
    • assigned_to: dgp --> dkf
    • status: open-fixed --> closed-fixed
     
  • Alexandre Ferrieux

    • status: closed-fixed --> open-fixed
     
  • Alexandre Ferrieux

    Beware, there is a leak in the non-error case now. This is due to only decrementing refcounts without ever calling CleanupVar. Not sure why you removed this call from my initial patch.

    Proof of leak:

    array set x {e 1 i 1}
    trace add variable x unset rem
    proc rem args {
    puts REM:$args
    if {![info exists ::already]} {
    set ::already 1
    unset ::x(e)
    }
    }
    array unset x i*
    array stat x

    =>
    number of buckets with 0 entries: 3
    number of buckets with 1 entries: 1

    array size x
    => 0

    What happens is that the secondary unset is delayed because "e" is in the protected state, but since the overall pattern "i*" doesn't include "e", it doesn't get directly deleted from the main iteration either.

    Having noticed that VarHashRefCount(x)-- doesn't take the actual final step, we must call CleanupVar independently from the match outcome; this is exactly what my patch was doing.

     
  • Alexandre Ferrieux

    Attached the small delta-patch reintroducing CleanupVar and eliminating the leak.

     
  • Donal K. Fellows

    • status: open-fixed --> closed-fixed
     
1 2 > >> (Page 1 of 2)