Menu

#3641 Heap corruption in interp.test

obsolete: 8.5a6
closed-fixed
9
2007-04-07
2007-02-13
No

Platform: Win2k
Configuration: win/configure --enable-symbols=all

When I run 'make test', tcltest crashes in interp.test.
When I try to move the walls in with test selection,
I can get as far as:

make -s TESTFLAGS='-file interp.test -match interp-2.* -verbose btefp' test

and still see the same crash - which is 'free' detecting heap corruption when freeing the root callframe pointer of an interp inside interp-2.13.

Running interp-2.13 by itself doesn't crash. But running interp-2.10 by itself (rather than in the complete test suite' gives the bogus error

==== interp-2.10 basic interpreter creation FAILED
==== Contents of test case:

interp create {a x1}
interp create {a x2}
interp create {a x3} -safe

---- Test generated error; Return code was: 1
---- Return code should have been one of: 0 2
---- errorInfo: could not find interpreter "a"
while executing
"interp create {a x1}"
("uplevel" body line 2)
invoked from within
"uplevel 1 $script"
---- errorCode: NONE
==== interp-2.10 FAILED

Something appears to have gone badly wrong in memory allocation. Not having Purify or BoundsChecker to hand, I'm a bit stymied.

Discussion

  • Joe Mistachkin

    Joe Mistachkin - 2007-02-13

    Logged In: YES
    user_id=113501
    Originator: NO

    The 'could not find interpreter "a"' error you are seeing is a side-effect of not running the interp-2.1 test, which creates that interpreter. Further invetigation seems to pinpoint the problem to the following statement in Tcl_MakeSafe (tclInterp.c):

    Tcl_UnsetVar(interp, "env", TCL_GLOBAL_ONLY);

    Also, this issue repros on Windows XP SP2.

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902
    Originator: NO

    I can make this crash for me using:
    make test TESTFLAGS="-file interp.test -verbose t -match interp-2.[1-2]"

    Commenting out the call to TclSetupEnv in Tcl_CreateInterp (in tclBasic.c) also makes the tests proceed to completion when running in a configuration where they would otherwise crash. Not that this is a good solution though!

    Be aware that this is a Heisenbug (or very close to one); adding in a 'puts FOO' between tests interp-2.1 and 2.2 makes the problem go away!

    I've not yet been able to duplicate the problem on Linux.

     
  • Don Porter

    Don Porter - 2007-02-14

    Logged In: YES
    user_id=80530
    Originator: NO

    Does the same crash appear using
    the 8.5a5 release sources? When
    did it first appear?

     
  • Twylite

    Twylite - 2007-03-15

    Logged In: YES
    user_id=91629
    Originator: NO

    Just experienced this bug myself :(
    I'm building with Visual Studio 2005 Express, but I've also reproduced it with Visual Studio 6 Pro. Operating system is Windows 2000 Professional.

    Build:
    Sources from CVS nightly 2007/03/08, and straight from CVS today (2007/03/15).
    nmake /nologo /f makefile.vc release OPTS=symbols,threads
    nmake /nologo /f makefile.vc test OPTS=symbols,threads

    Test:
    While running interp.test, tcltest.exe crashes (Access Violation error).

    Analysis:
    The problem is NOT in 2.13. It is in the foreach + interp delete loop directly after 2.13.

    The crash occurs in Tcl_PopCallFrame() in tclnamesp.c, line 473: nsPtr->activationCount--;
    At this point nsPtr has the value 0x009ae728 (more on that later).

    The smallest code snippet that can reproduce this is:
    ---
    interp create
    interp create
    interp create
    interp create
    interp slaves
    # --> interp2 interp3 interp0 interp1
    interp delete interp2
    interp delete interp3
    ---

    Now, deleting the interps in anything but that order does not produce the bug. I've tried every number of interps up to and including 4, and every order of deleting interps -- that's a LOT of combinations :(

    Following the call chain down from the "interp delete interp3":
    Tcl_InterpObjCmd calls
    Tcl_DeleteCommandFromToken which invokes (*cmdPtr->deleteProc)(cmdPtr->deleteData) which is
    SlaveObjCmdDeleteProc which calls
    Tcl_DeleteInterp which calls
    Tcl_EventuallyFree which invokes (*freeProc)((char *)clientData) which is
    DeleteInterpProc which calls
    TclTeardownNamespace ...
    ... which overwrites memory at 0x009ae728 (and other locations)

    Digging deeper still ...
    TclTeardownNamespace calls
    TclDeleteNamespaceVars which invokes
    ckfree((char *) varPtr) ...
    ... which appears to corrupt 3 bytes at 0x009ae728

    The varPtr is no-where near 0x009ae728, nor (so far as I can tell) is any information contained in the varPtr struct.

    There are slightly differences in the corruption every time I look a little deeper down the stack -- I assume this may have something to do with the debugger having information in the stack frame.

    In any event, from 1 to 4 bytes of memory at 0x009ae728 get corrupted every time, to seemingly arbitrary values.

    Now I'm afraid I'm not familiar with the Tcl core, so the various structs here are meaningless to me. Also my findings don't agree with mistachkin's conclusion about Tcl_UnsetVar called from Tcl_MakeSafe.

    But I'm going to go on a limb here and hazzard a guess:
    This bug only occurs with at least 4 interps, and only when you delete the 1st in the list returned by [interp slaves] followed by the second. That list is not ordered; rather it looks to me like the interp names are kept in some sort of tree that is flattened by [interp slaves]. Is this possibly some sort of bug with deleting nodes from a tree, and more specifically when deleting variable names from the namespace's variable table?

     
  • Twylite

    Twylite - 2007-03-15

    Logged In: YES
    user_id=91629
    Originator: NO

    Followup:
    The script in my last comment works against a DEBUG build (OPTS=symbols,threads) using VC 2005 Express.
    It doesn't work against a VC 6 build without symbols (OPTS=threads), but the following script does:

    ---
    for {set i 0} {$i < 12} {incr i} { interp create }
    foreach i [interp slaves] { interp delete $i }
    ---

    NOTE: these scripts were run with tcltest.exe. The one below appears to break tcl85t.exe though.

    I have occasionally got it to crash with other numbers of interps, but then I have been playing with interp creation and deletion before that point. This script works consistently if run directly after startup.

    Heisenbug indeed :(

     
  • Twylite

    Twylite - 2007-03-16

    Logged In: YES
    user_id=91629
    Originator: NO

    More feedback

    (1) BoundsChecker doesn't find any memory problems. interp.test still crashes ... but no memory problems.

    (2) The script the crashed consistently yesterday ... well it refuses to crash today. And I haven't touched the binary :(
    interp.test still crashes though.

    (3) The CallFrame is being overwritten by ... malloc().
    I ran tcltest.exe with interp.test, and noted the memory location of the corrupt CallFrame (containing the bad nsPtr value that is dereferenced causing the application to crash).
    Then I ran tcltest.exe with interp.test again in exactly the same manner, and put a watch on the memory location. Leading up to the crash, the memory allocator happily allocates over the CallFrame as if it is free memory.

    Conclusion: It _is_ free memory. Neither BoundsChecker nor the MSVCRT debug allocator see any problem with writing over the memory ... so I can only conclude that the CallFrame has been explicitly (and errneously) freed as some point.
    ... of course I can't tell you what that point is ;(

    (4) But I can tell you where to look ;)

    A bunch of CVS carefully selected updates and a batch file tell me that -D 2006/10/23 runs fine, and -D 2006/10/24 crashes.

    The changelog for the latter date says the following:

    ---
    2006-10-23 Miguel Sofer <msofer@users.sf.net>

    * generic/tcl.h: Modified the Tcl call stack so
    * generic/tclBasic.c: there is always a valid CallFrame, even
    * generic/tclCmdIL.c: at level 0 [Patch 1577278]. Most of the
    * generic/tclInt.h: changes involve removing tests for
    * generic/tclNamesp.c: iPtr->(var)framePtr==NULL. There is now a
    * generic/tclObj.c: CallFrame pushed at interp creation
    * generic/tclProc.c: with a pointer to it stored in
    * generic/tclTrace.c: iPtr->rootFramePtr. A second unused
    * generic/tclVar.c: field in Interp is hijacked to enable
    further functionality, currently unused (but with several FRQs
    depending on it).

    ***POTENTIAL INCOMPATIBILITY***
    Any user that includes tclInt.h and needs to determine if it is
    running at level 0 should change (iPtr->varFramePtr==NULL) to
    (iPtr->varFramePtr==iPtr->rootFramePtr).
    ---

    (5) It looks like this was the first change introduced after 8.5a5 :( It has been present for nearly 5 months ... is no-one using bleeding-edge builds on Windows?!

     
  • Twylite

    Twylite - 2007-03-22

    Logged In: YES
    user_id=91629
    Originator: NO

    Hi,
    Sorry to bug, but this is quite a high priority crash bug that could affect any use of interps, almost certainly affects more than just Windows (although you only seem to see symtoms on Windows), and no progress seems to be happening (or at least no feedback).

    Can we get this reassigned to Miguel on the strength of what I found last week? He would seem to be the right person to look at the problem.

     
  • Donal K. Fellows

    • labels: 105661 --> 47. Bytecode Compiler
    • priority: 8 --> 9
    • assigned_to: hobbs --> msofer
     
  • Don Porter

    Don Porter - 2007-03-23

    Logged In: YES
    user_id=80530
    Originator: NO

    Moving line 1220 of tclBasic.c to
    after line 1226 appears to be at
    least part of the correct fix.

     
  • miguel sofer

    miguel sofer - 2007-03-23
    • status: open --> pending-fixed
     
  • miguel sofer

    miguel sofer - 2007-03-23

    Logged In: YES
    user_id=148712
    Originator: NO

    I think this is now fixed in HEAD as described in dgp's comment - the problem was that we were freeing the global namespace before popping the root frame pointer.

    Setting status to 'Pending' waiting for twylite to test: if this is correct, the CallFrame was still valid at the time of the segfault, but its nsPtr field was pointing to freed memory. This contradicts twylite's findings

     
  • Twylite

    Twylite - 2007-03-26

    Logged In: YES
    user_id=91629
    Originator: NO

    I will build and test later today.

    Regarding the validity of the CallFrame, this is what I see (from Tcl_PopCallFrame in tclnamesp.c):

    nsPtr = framePtr->nsPtr;
    nsPtr->activationCount--;

    The decrement of the activationCount crashes with exception "0xC0000005: Access violation reading location 0x011c62d8." At that point framePtr is 0x01176b10, and nsPtr is 0x011c6280. All other fields in framePtr are 0 according to the debugger.

    From what I saw walking through with the debugger (don't have time to repeat that at the moment, it was lengthy) the value 0x011c6280 is not the original value of nsPtr, and it gets in there during a malloc() operation which is related to a different "interp delete" statement (the one that preceeds the statement that leads to the exception).

     
  • Twylite

    Twylite - 2007-03-26

    Logged In: YES
    user_id=91629
    Originator: NO

    I have just got the latest sources from CVS (latest CHANGELOG entry 2007/03/24, Miguel's changes on 03/23), and built on Windows 2000 with VC8 (OPTS=symbols,threads). All tests pass without problems.

    Based on dgp's description of the change and what my debugger tells me about the CallFrame, I'm not convinced that this problem has been fixed. The symtoms have certainly been removed, but I am concerned that we're merely trashing a different part of the heap.

    If someone can confirm that the tests pass on other majors platforms, then I suggest that we drop the priority of the issue, but I'd like to leave it open for a bit for further investigation into why the CallFrame appeared to be getting trashed. If you disagree with this approach please let me know.

     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).

     
  • SourceForge Robot

    • status: pending-fixed --> closed-fixed