Menu

#4449 TclCleanupLiteralTable segmentation fault

obsolete: 8.6b1.1
closed-fixed
9
2014-08-15
2009-10-28
No

NEW _________________________________________________________________________________________

Running the tcllib testsuite I find that the tests of the ncgi package crash in the very same same place, with an identical stacktrace, when run by Tcl 8.6 head. Based on the contents of the structures I am very sure that this is the same problem as shown by Tcl's testsuite, described below.

Mail 1 _______________________________________________
While evaluating a patch to avoid calling channel outputproc with 0-length writes I came across a crash in the Tcl 8.6 head testsuite, which I have now confirmed to happen in the unpatched core.

Context:
Platform is linux-ix86 (32bit).
The core was built using --enable-symbols=all
(and installed in some side directory)

The testsuite was run using

make test TESTFLAGS='-verbose bpste -singleproc 1' \ 2>&1 | tee LOG

The -singleproc 1 is important.

Note: On my platform the --enable-symbols=all means that loading and compiling the 2MB clock.test file took around 15 minutes, during which no output happens. The test is __not__ hung. Just be patient.

The test suite crashes after httpdold has completed, see log excerpt:

---- httpold-5.1 start
++++ httpold-5.1 PASSED
---- httpold-5.2 start
++++ httpold-5.2 PASSED
---- httpold-5.3 start
++++ httpold-5.3 PASSED
---- httpold-6.1 start
++++ httpold-6.1 PASSED
httpold.test: Total 31 Passed 31 Skipped 0 Failed 0
make: *** [test-tcl] Segmentation fault (core dumped)

The stacktrace in the core shows that the crash happens in

#0 0x400e4ecd in TclCleanupLiteralTable (interp=0x9ea73e8, tablePtr=0x9ea74b4) at /home/andreask/workbench/ZeroWrite/src/tcl86base/generic/tclLiteral.c:131
#1 0x400409cc in DeleteInterpProc (interp=0x9ea73e8) at /home/andreask/workbench/ZeroWrite/src/tcl86base/generic/tclBasic.c:1392
#2 0x4010313b in Tcl_EventuallyFree (clientData=0x9ea73e8, freeProc=0x40040829 <DeleteInterpProc>)
at /home/andreask/workbench/ZeroWrite/src/tcl86base/generic/tclPreserve.c:299
#3 0x40040821 in Tcl_DeleteInterp (interp=0x9ea73e8) at /home/andreask/workbench/ZeroWrite/src/tcl86base/generic/tclBasic.c:1284
#4 0x400c7e2c in SlaveObjCmdDeleteProc (clientData=0x9ea73e8) at /home/andreask/workbench/ZeroWrite/src/tcl86base/generic/tclInterp.c:2581
#5 0x40042488 in Tcl_DeleteCommandFromToken (interp=0x8064d78, cmd=0x820f830) at /home/andreask/workbench/ZeroWrite/src/tcl86base/generic/tclBasic.c:2984
#6 0x400c5096 in Tcl_InterpObjCmd (clientData=0x0, interp=0x8064d78, objc=3, objv=0x80669a8)

The bug is in TclCleanupLiteralTable itself I believe. According to the comments

* The problem is that freeing a literal's internal representation can
* delete other literals to which it refers, making nextPtr invalid.
* So each time we free an internal rep, we start its bucket over
* again.
*/

Looking at the code, however the "we start its bucket over again" does __not__ happen. There is the didOne flag, which is set, but then the inner loop is not aborted. It is actually run a second time for the same value of entryPtr, and due to objPtr->type now set to NULL it runs the skip branch of the if, and then continues on to traverse the bucket chain from the current point. No reset at all.

In our crash it is (according to my investigations) the current entryPtr which actually gets recursively deleted ((*) = removed from the chain), and then the second iteration on this same entryPtr crashes trying to get the typePtr for an objPtr which is freed and located in a freed LiteralEntry structure (Everything is 0x61616161).

A possible red-herring, the type of the objPtr in question was 'substcode'.

My first attempt at fixing this by adding a 'break;' just after 'didOne = 1;', to actually reset the loop and restart it from the beginning caused 17 tests to fail. It is actually unclear if that happens because
(a) now some literals are actually released, or, conversely,
(b) if some literals are not released anymore. Or
(c) if these tests would even fail without my patch, but are just prevented from being reached by the crash (they are all in interp.test, after http*.test).

So, according to annotate the last changes in this area were by dgp (rev 1.20) and dkf (rev 1.25, 1.34) and I hope you can shed some more light on this. The whole area was surrounded by kbk (rev 1.19), so I added you as well to this mail.

Ah, yes. This crash doesn't happen for 8.4 nor 8.5 branches.

Next steps here ... See if it crashes for regular build as well (no enable-symbols, (x)), and if yes, try to bisect it.

(*) freeIntRepProc is called, and TclReleaseLiteral is called recursively on the same objPtr, refCount goes zero, which also removes the LiteralEntry from the chain.

(x) Hoping so because then clock.test should not take such effing long time.

Mail 2 _____________________________________________________

> Next steps here ... See if it crashes for regular build as well (no enable-symbols, (x)),

Yes. It does.

> and if yes, try to bisect it.

Bisecting tells us that the crash was introduced between
2009-09-11 and 2009-09-12.

From the Changelog.

--- tcl-20090911/ChangeLog 2009-09-10 14:20:00.000000000 -0700
+++ tcl-20090912/ChangeLog 2009-09-11 13:13:27.000000000 -0700
@@ -1,3 +1,18 @@
+2009-09-11 Don Porter <dgp@users.sourceforge.net>
+
+ * generic/tclBasic.c: Completed the NR-enabling of [subst].
+ * generic/tclCmdMZ.c: [Bug 2314561].
+ * generic/tclCompCmds.c:
+ * generic/tclCompile.c:
+ * generic/tclInt.h:
+ * tests/coroutine.test:
+ * tests/parse.test:
+
+2009-09-11 Donal K. Fellows <dkf@users.sf.net>
+
+ * tests/http.test: Added in cleaning up of http tokens for each test
+ to reduce amount of global-variable pollution.
+

This is either a problem with FreeSubstCodeInternalRep, or this function simply exposes the fact that a bug in TclCleanupLiteral was not as fixed as previously thought/believed. Looking at FSCIR it looks good, and I believe it is the latter.

What happens, I guess, is: CleanupLiteral table frees the substcode intrep, which frees the Bytecode (TclCleanupByteCode), which refers back to the whole literal string, and TclReleaseLiteral's it, which removes it from the bucket chain, and now the next iteration works with the freed literal entry, bang.
Except per the comments it should have restarted the loop from the beginning.

Discussion

  • Don Porter

    Don Porter - 2009-10-28

    only one of the files "http.test" or "http11.test"
    needs to be run in sequence with "httpold.test"
    to trigger the segfault.

     
  • Don Porter

    Don Porter - 2009-10-28

    adding the "break" fixes the reported bug.

    This permits the test run to continue, exposing
    other -singleproc 1 bugs in interp.test. Confirm
    that with a test:

    make test-tcl TESTFLAGS="-notfile h\*.test -singleproc 1"

     
  • Don Porter

    Don Porter - 2009-10-28
    • labels: 105679 --> 47. Bytecode Compiler
    • assigned_to: hobbs --> andreas_kupries
    • milestone: --> obsolete: 8.6b1.1
    • priority: 8 --> 9
     
  • Don Porter

    Don Porter - 2009-10-28

    When committing the fix, it likely needs
    backporting as well.

     
  • Don Porter

    Don Porter - 2009-10-28

    The interp.test failures are because earlier
    test files leave a variable ::a behind. Those
    test files guilty of such poor hygiene are:

    binary.test
    clock.test
    ....
    <more to come>

     
  • Don Porter

    Don Porter - 2009-10-28

    ...
    dict.test
    dstring.test
    eval.test
    exec.test
    execute.test
    expr-old.test
    ...

     
  • Don Porter

    Don Porter - 2009-10-28

    ...
    for-old.test
    for.test
    history.test
    if-old.test
    if.test
    info.test

    It is the info.test leaving behind
    an array variable ::a that causes
    the breakage.

     
  • Andreas Kupries

    Andreas Kupries - 2009-10-28

    Ok. Based on this info (*) it seems I can simply fix the literal bug by adding 'break', and backporting something equivalent into 8.5. And the interp issue can then be done by cleaning up things in the testsuite itself, that is apparently an independent issue.

    (*) And a big thank you to dgp for doing the digging through the testsuite.

     
  • Don Porter

    Don Porter - 2009-10-28

    fix committed to 8-5-branch and HEAD.

     
  • Don Porter

    Don Porter - 2009-10-28
    • assigned_to: andreas_kupries --> dgp
    • status: open --> closed-fixed