[Tcl-bugs] [ tcl-Bugs-1726873 ] Intermittent crash in threaded Tcl

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Bugs item #1726873, was opened at 2007-05-28 05:34
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1726873&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 48. Threading
Group: None
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: Twylite (twylite)
Assigned to: Zoran Vasiljevic (vasiljevic)
Summary: Intermittent crash in threaded Tcl

Initial Comment:
We have a multithreaded Tcl application running under Windows.  One thread is a TCP server that dispatches calls to a pool of processing threads via thread::send -async(and they reply via thread::send)

We have experienced an intermittent crash bug that occurs in ForgetSyncObject in generic/tclThread.c, causing an access violation exception.  We a problem with the same symptoms on various occasions, and managed to get a debugger involved twice (on different systems, one under 2000 Pro and the other 2003 Server) to confirm that is was the same bug.

By "intermittent" we mean that we can process up to 100 million transactions (that's at least 200m thread::send calls) before seeing the problem ;(  On the up side we're not seeing any resource leaks ;)

static void
ForgetSyncObject(
    char *objPtr, /* Pointer to sync object */
    SyncObjRecord *recPtr) /* Record of sync objects */
{
    int i;

    for (i=0 ; i<recPtr->num ; i++) {
	if (objPtr == recPtr->list[i]) {
	    recPtr->list[i] = NULL;
	    return;
	}
    }
}

With the debugger we see i go massively out of range.  In on case by about 17,000; in another by around 100,000.

Looking at code that modifies recPtr->num tells me that there should be no condition under which this can possibly happen (unless 32-bit operations are not atomic on x86, and last time I checked they were).

That only guess we have at the moment is that parts of the stack (including i) are getting trashed by activity somewhere in another thread.  We have only observed this bug in a multithreaded scenario.  But if we set the value of i to (recPtr->num - 1) through the debugger and let the program run, everything is still fine (so the return value on the stack isn't damaged).

Every time we've seen this there has been a call to ForgetSyncObj from Tcl_ConditionFinalize, which was called from ThreadSend (in threadCmd.c in the Thread extension).

To make matters worse we don't know how to reproduce the problem, but we have seen it under both debug/unchecked and release builds of 8.5a6 (built under MSVC6).

We need to try and track down this bug, and have resources to commit to doing so, but right now we're at a loss on how to proceed.  Any suggestions on possible cause and/or how we may be able to force a reproduction may help.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2007-06-19 00:46

Message:
Logged In: NO 

Another look reveals that Tcl_MutexFinalize may suffer from the same
serious flaw.

----------------------------------------------------------------------

Comment By: Twylite (twylite)
Date: 2007-06-19 00:27

Message:
Logged In: YES 
user_id=91629
Originator: YES

*mutter* Knew I was forgetting something -- like submitting back our patch
;(

I agree with mistachkin's analysis -- we managed to create a small app
with a group of threads just sending messages to each other, and were able
to reproduce this bug in anywhere from 1 to 15 minutes.  It is a race
condition in the Tcl_*Finalize functions resulting from the lack of
locking.  For interest, the window for the bug to occur is about two asm
instructions long ;/

The solution we developed (see attached patch file) does not use the
masterLock, but it is threadsafe.  We have reworked how the SyncObjRecord
lists are represented to get away from the need to free dynamically
allocated memory (this is the root of the problem -- either you put a lock
around all activity involving that memory, or you rework RememberSyncObj so
that it never frees the memory).

We specifically wanted to avoid using the masterLock, otherwise you have
to lock/unlock a mutex (critical section, etc.) in order to release a Tcl
mutex or condition.  That in my experience has nasty implications for
performance (in particular you may end up giving up your timeslice
prematurely on some platforms).

In the context of this patch the mini-patch provided by afredd doesn't
make much sense -- there are no fields in SyncObjRecord that would really
benefit from guarding.

File Added: tcl-thread-syncobj-leak.patch

----------------------------------------------------------------------

Comment By: Joe Mistachkin (mistachkin)
Date: 2007-06-18 23:24

Message:
Logged In: YES 
user_id=113501
Originator: NO

After a very careful examination of the code involved here it appears to
me that Tcl_ConditionFinalize is potentially buggy (i.e. it calls
ForgetSyncObject without any masterLock mutex protection).

Also, I believe the mini-patch provided by 'afredd' in the previous
comment should be applied to the core.

----------------------------------------------------------------------

Comment By: afredd (afredd)
Date: 2007-05-29 09:11

Message:
Logged In: YES 
user_id=1386588
Originator: NO

Interesting problem. Just skimmed thro' tclthread.c and
noticed that RememberSyncObject() does not check the list for
free slots before alloc'ing a new list ie. it can grow unnecesarily
and unbounded. Although you'd have hit a malloc failure
long before the "recPtr->max += 8" wrapped... :-\
Anyway this patch at the top of that function might help:

+    /* Check for a free NULL slot in the list (if it has been allocated)
*/
+    if (recPtr->list != NULL) {
+       for (i = 0; i < recPtr->num; ++i) {
+           if (recPtr->list[i] == NULL) {
+               recPtr->list[i] = objPtr;
+               return;
+           }
+       }
+    }
+

Also if you think that there may be a memory corruption,
you could try something like (untested):

 typedef struct {
+    int guard1;                /* memory corruption guard */
     int num;           /* Number of objects remembered */
+    int guard2;                /* memory corruption guard */
     int max;           /* Max size of the array */
+    int guard3;                /* memory corruption guard */
     char **list;       /* List of pointers */
+    int guard4;                /* memory corruption guard */
 } SyncObjRecord;

-static SyncObjRecord keyRecord = {0, 0, NULL};
-static SyncObjRecord mutexRecord = {0, 0, NULL};
-static SyncObjRecord condRecord = {0, 0, NULL};
+#define GX 0xbbbbbbbb /* the value in the guard fields */
+
+static SyncObjRecord keyRecord   = {GX, 0, GX, 0, GX, NULL, GX};
+static SyncObjRecord mutexRecord = {GX, 0, GX, 0, GX, NULL, GX};
+static SyncObjRecord condRecord  = {GX, 0, GX, 0, GX, NULL, GX};

If the guard values have changed when the crash occurs you
know something's scrawled over memory.

Good luck!

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1726873&group_id=10894

[Tcl-bugs] [ tcl-Bugs-1726873 ] Intermittent crash in threaded Tcl

The Tool Command Language implementation

[Tcl-bugs] [ tcl-Bugs-1726873 ] Intermittent crash in threaded Tcl