From: SourceForge.net <no...@so...> - 2007-06-19 07:46:45
|
Bugs item #1726873, was opened at 2007-05-28 05:34 Message generated for change (Comment added) made by nobody You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1726873&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 48. Threading Group: None Status: Open Resolution: None Priority: 9 Private: No Submitted By: Twylite (twylite) Assigned to: Zoran Vasiljevic (vasiljevic) Summary: Intermittent crash in threaded Tcl Initial Comment: We have a multithreaded Tcl application running under Windows. One thread is a TCP server that dispatches calls to a pool of processing threads via thread::send -async(and they reply via thread::send) We have experienced an intermittent crash bug that occurs in ForgetSyncObject in generic/tclThread.c, causing an access violation exception. We a problem with the same symptoms on various occasions, and managed to get a debugger involved twice (on different systems, one under 2000 Pro and the other 2003 Server) to confirm that is was the same bug. By "intermittent" we mean that we can process up to 100 million transactions (that's at least 200m thread::send calls) before seeing the problem ;( On the up side we're not seeing any resource leaks ;) static void ForgetSyncObject( char *objPtr, /* Pointer to sync object */ SyncObjRecord *recPtr) /* Record of sync objects */ { int i; for (i=0 ; i<recPtr->num ; i++) { if (objPtr == recPtr->list[i]) { recPtr->list[i] = NULL; return; } } } With the debugger we see i go massively out of range. In on case by about 17,000; in another by around 100,000. Looking at code that modifies recPtr->num tells me that there should be no condition under which this can possibly happen (unless 32-bit operations are not atomic on x86, and last time I checked they were). That only guess we have at the moment is that parts of the stack (including i) are getting trashed by activity somewhere in another thread. We have only observed this bug in a multithreaded scenario. But if we set the value of i to (recPtr->num - 1) through the debugger and let the program run, everything is still fine (so the return value on the stack isn't damaged). Every time we've seen this there has been a call to ForgetSyncObj from Tcl_ConditionFinalize, which was called from ThreadSend (in threadCmd.c in the Thread extension). To make matters worse we don't know how to reproduce the problem, but we have seen it under both debug/unchecked and release builds of 8.5a6 (built under MSVC6). We need to try and track down this bug, and have resources to commit to doing so, but right now we're at a loss on how to proceed. Any suggestions on possible cause and/or how we may be able to force a reproduction may help. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2007-06-19 00:46 Message: Logged In: NO Another look reveals that Tcl_MutexFinalize may suffer from the same serious flaw. ---------------------------------------------------------------------- Comment By: Twylite (twylite) Date: 2007-06-19 00:27 Message: Logged In: YES user_id=91629 Originator: YES *mutter* Knew I was forgetting something -- like submitting back our patch ;( I agree with mistachkin's analysis -- we managed to create a small app with a group of threads just sending messages to each other, and were able to reproduce this bug in anywhere from 1 to 15 minutes. It is a race condition in the Tcl_*Finalize functions resulting from the lack of locking. For interest, the window for the bug to occur is about two asm instructions long ;/ The solution we developed (see attached patch file) does not use the masterLock, but it is threadsafe. We have reworked how the SyncObjRecord lists are represented to get away from the need to free dynamically allocated memory (this is the root of the problem -- either you put a lock around all activity involving that memory, or you rework RememberSyncObj so that it never frees the memory). We specifically wanted to avoid using the masterLock, otherwise you have to lock/unlock a mutex (critical section, etc.) in order to release a Tcl mutex or condition. That in my experience has nasty implications for performance (in particular you may end up giving up your timeslice prematurely on some platforms). In the context of this patch the mini-patch provided by afredd doesn't make much sense -- there are no fields in SyncObjRecord that would really benefit from guarding. File Added: tcl-thread-syncobj-leak.patch ---------------------------------------------------------------------- Comment By: Joe Mistachkin (mistachkin) Date: 2007-06-18 23:24 Message: Logged In: YES user_id=113501 Originator: NO After a very careful examination of the code involved here it appears to me that Tcl_ConditionFinalize is potentially buggy (i.e. it calls ForgetSyncObject without any masterLock mutex protection). Also, I believe the mini-patch provided by 'afredd' in the previous comment should be applied to the core. ---------------------------------------------------------------------- Comment By: afredd (afredd) Date: 2007-05-29 09:11 Message: Logged In: YES user_id=1386588 Originator: NO Interesting problem. Just skimmed thro' tclthread.c and noticed that RememberSyncObject() does not check the list for free slots before alloc'ing a new list ie. it can grow unnecesarily and unbounded. Although you'd have hit a malloc failure long before the "recPtr->max += 8" wrapped... :-\ Anyway this patch at the top of that function might help: + /* Check for a free NULL slot in the list (if it has been allocated) */ + if (recPtr->list != NULL) { + for (i = 0; i < recPtr->num; ++i) { + if (recPtr->list[i] == NULL) { + recPtr->list[i] = objPtr; + return; + } + } + } + Also if you think that there may be a memory corruption, you could try something like (untested): typedef struct { + int guard1; /* memory corruption guard */ int num; /* Number of objects remembered */ + int guard2; /* memory corruption guard */ int max; /* Max size of the array */ + int guard3; /* memory corruption guard */ char **list; /* List of pointers */ + int guard4; /* memory corruption guard */ } SyncObjRecord; -static SyncObjRecord keyRecord = {0, 0, NULL}; -static SyncObjRecord mutexRecord = {0, 0, NULL}; -static SyncObjRecord condRecord = {0, 0, NULL}; +#define GX 0xbbbbbbbb /* the value in the guard fields */ + +static SyncObjRecord keyRecord = {GX, 0, GX, 0, GX, NULL, GX}; +static SyncObjRecord mutexRecord = {GX, 0, GX, 0, GX, NULL, GX}; +static SyncObjRecord condRecord = {GX, 0, GX, 0, GX, NULL, GX}; If the guard values have changed when the crash occurs you know something's scrawled over memory. Good luck! ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1726873&group_id=10894 |