I wrote a simple application to gather Tk events and pass them from one interpreter to another via tequila (http://wiki.tcl.tk/1243), which uses traces and socket communication to keep array values in sync between Tcl interpreters.
My application displays a canvas, to which a <Motion> event is bound. Thus as the cursor is moved over the canvas a steady stream of Tk events is generated; the scripts associated with these events are captured and written to an array, which tequila echoes to an interpreter in a separate process.
When I start this application and move the cursor continually over the canvas, sometime between five and sixty seconds later one of the interpreters segfaults and crashes.
Platform is Ubuntu Hardy with ActiveTcl 8.5.5 installed. I duplicated the bug using Pat Thoyts's tclkit builds 8.5.8 (http://www.patthoyts.tk/tclkit/linux-ix86/8.5.8/tclkit%2d8.5.8.gz) and 8.6b1.1 (http://www.patthoyts.tk/tclkit/linux-ix86/8.6%2dbeta/tclkit%2d8.6b1.1.gz).
For ease of duplication I created a starkit containing all relevant code called echoevent.kit (attached). To duplicate:
1. in a terminal console, execute: tclkit echoevent.kit server
This will start a tequila server.
2. in a separate terminal console, execute: tclkit echoevent.kit master
This will start a simple Tk "master" application that displays a canvas. Moving the cursor across this "master" canvas will cause the command associated with the <Motion> event to echo repeatedly to the terminal console.
3. in a third terminal console, execute: tclkit echoevent.kit slave
This will cause an identical "slave" Tk application to appear, except event processing for the "slave" canvas is disabled. Moving the cursor across the "master" canvas will cause the event command to echo to both the "master" console and the "slave" console. The tequila server is passing event information from the master interpreter to the slave interpreter.
4. move the cursor continually over the master canvas to generate a constant stream of <Motion> events to be echoed to the consoles. Either the master or the slave process will crash with a segmentation fault within 60 seconds.
report cross-posted at: http://bugs.activestate.com/show_bug.cgi?id=85927
starkit containing example code
Since this happens on Unix, can you enable core dumps and provide a gdb stack trace of the crash? I'd like to dichotomize between Metakit and the vanilla Tcl core.
As a side note, a .kit is not the easiest transport for some maintainers (me at least), because it requires additional effort to crack the hull open, and reproduce with the core-under-development. I hate efforts ;-)
I've reproduced the problem on OS X 10.6.2 using wish 8.5.7. If I can figure out how, I'll attach the traceback file.
gdb trace under Fedora 12 32bits
After Steve kindly brought all the pieces beside my bed, I could get the attached gdb stack trace under Linux (Fedora 12, 32-bit). Nothing about Metakit there.
Further analysis of the core shows the command
array unset collab .c,queue,42,*
being executed at the time. So it seems to hint at an interaction between patterned unsets and traces. I'd appreciate somebody knowing the app at hand to trim that stuff down to a manageable size... TBC
Homing in.
Just adding [puts] around the unset shows a reentrancy issue.
UNSETCOLLAB>: .c,queue,19,* (size:23 -- matching:10)
UNSETCOLLAB>: .c,queue,20,* (size:22 -- matching:10)
UNSETCOLLAB<
UNSETCOLLAB>: .c,queue,19,* (size:12 -- matching:9)
UNSETCOLLAB<
This in turn can be traced down to Tequila doing [vwait] in its guts. With Tk and a heavy stream of events, that [vwait] can catch a GUI event and call arbitrary code, including some doing an extra [unset] on the same var.
Knowing this, I'll try to build an atomic example.
Still haven't managed to reproduce it, but RTFS turned up a good candidate:
At line 3199 of tclVar.c, handling [array unset var pattern], a loop walks the keys with a stateful iterator:
    for (varPtr2 = VarHashFirstVar(varPtr->value.tablePtr, &search);
            varPtr2 != NULL; varPtr2 = VarHashNextVar(&search)) {
Then, in the body of that loop, when a match is found the unset is performed, possibly firing traces, possibly wrecking the chaining of hash entries.
Some kind of locking, or a more robust iterator, must be used, I think.
Additional insight: the "stateful iterator" Tcl_HashSearch works as a {bucket index, intra-bucket chain pointer} pair. So the case where the aforementioned interaction can cause harm is when the second-level [unset] removes the very next entry of the same bucket.
That's why the crash is so rare. Now to designing a pin-sharp case ;-)
Sigh.
Just produced a bucket of depth three: [array set x {a 1 e 1 i 1}].
(verify with [array stat x])
Then tried this:
trace add variable x unset rem
proc rem args {puts REM:$args; if {![info exists ::again]} {set ::again 1;unset ::x(e)}}
Then [array unset x *] should remove, in order, i,e,a.
*But*, since the trace proc removes e while removing i, the outer loop should get into trouble....
Though it doesn't for some reason.
Just noticed the tricky game of undefining a var instead of coldly killing its hash entry. Maybe this explains it, and my hypothesis was naive. Something more subtle must come into play...
Ha ! Got it now.
In truth the secondary unset _does_ kill the hash entry in that case; it's just that reuse of the freed block doesn't normally cause immediate harm. Hence I added the following 'sanitizing' line to FreeVarEntry:
    if (TclIsVarUndefined(varPtr) && !TclIsVarTraced(varPtr)
            && (VarHashRefCount(varPtr) == 1)) {
+       varPtr->value.objPtr = (void *) 0xdeadbeef;
        ckfree((char *) varPtr);
It is simply a matter of resetting a field to a visibly invalid value just before freeing.
The effect is that as soon as something reuses that freed structure (typically by indirecting through this objPtr field), a SIGSEGV or SIGBUS fires on 0xdeadbeef.
And it does: given this sanitizing patch, the following ultrashort example crashes tclsh:
array set x {e 1 i 1}
trace add variable x unset rem
proc rem args {
    puts REM:$args
    if {![info exists ::already]} {
        set ::already 1
        unset ::x(e)
    }
}
array unset x *
Explanation: a freshly created array starts small, with only 4 hash buckets, so colliding keys are easy to find.
It turns out 'e' and 'i' fall into the same bucket. Since the entries of a bucket are kept in a LIFO linked list, the [array unset x *] iteration handles 'i' first, then 'e'. The rest of the code arranges for 'e' to be removed as a side effect of the unset trace firing on 'i', pulling the rug out from under the iterator. Boom.
Renaming the bug according to this confirmation.
Upping prio. Patch under way.
The attached patch fixes the issue by properly protecting the next-in-current-bucket.
This is done modularly (without touching tclHash.c), thanks to the property that the search object has a NULL nextEntryPtr at each bucket change. The protection consists of upping the refcount of nextEntryPtr and decreasing it again one loop cycle later.
Since the patch includes the "sanitizing" line, a deterministic test case will be provided shortly.
Please review.
That patch doesn't quite work: it has excess-refcount problems when an unset fails midway through. Evil case!
I'll attach the patch I've applied to the HEAD, which addresses the issue (it's closely based on yours and includes a test). Note that it doesn't include any code to invalidate unused memory; the --enable-symbols=mem configuration works fine for that already.
Updated patch
Backported to 8.5. I think it's probably not practical to backport to 8.4, since this fix is thoroughly dependent on Miguel's VarReform work.
Beware: there is now a leak in the non-error case. It is due to only decrementing refcounts without ever calling CleanupVar. I'm not sure why you removed that call from my initial patch.
Proof of leak:
array set x {e 1 i 1}
trace add variable x unset rem
proc rem args {
    puts REM:$args
    if {![info exists ::already]} {
        set ::already 1
        unset ::x(e)
    }
}
array unset x i*
array stat x
=>
number of buckets with 0 entries: 3
number of buckets with 1 entries: 1
array size x
=> 0
What happens is that the secondary unset is deferred because "e" is in the protected state; but since the overall pattern "i*" doesn't match "e", it never gets deleted by the main iteration either.
Since VarHashRefCount(x)-- alone never takes the actual final step, CleanupVar must be called independently of the match outcome; this is exactly what my patch was doing.
Attached the small delta-patch reintroducing CleanupVar and eliminating the leak.
Reintroduce CleanupVar