#4337 Tcl exit can misbehave on Windows

obsolete: 8.5.6

I originally wrote this back in July of 2008, but I did not submit it because my workaround is working fine for me. However, there is a current discussion on the newsgroup which appears to be related.

I've been struggling with this one literally for years, and I finally figured out what's happening. The apparent solutions seem to have warning flags all over them, though.

I have a server process running on a Windows box that spawns tclsh processes which do some work and then exit. Most of the time, all is well. However, sometimes the server gets back the wrong exit code from the subprocess. Other times, the subprocess never ends at all, even though the script has finished. When I've snooped those orphan processes with Process Explorer, I've seen a thread sitting there just spinning in what looks like some NT library call. Out of tens of thousands of tclsh processes, you might not see it happen, and I'd never been able to get it to happen on purpose so I could trap it and see what was going on.

Nowadays, my Tcl scripts are using TclBlend to run some Java code, and about a year ago I discovered that if I used the Java exit (e.g., from Tcl, java::call System exit $code), then the problems did not happen anymore. I had assumed that something in the Tcl cleanup on exit was the cause of my problem, and the Java call was skipping it. I never noticed any ill effects from doing this, so I left it that way. (Note that the exit problems predate the use of TclBlend in the application, so I am pretty sure that library at least is not the cause of the problems!)

This week, though, I was using a TCL_MEM_DEBUG build to try to track down the source of a "TclStackFree: incorrect freePtr. Call out of sequence?" message. In doing some basic sanity tests with that build before I tried to use it with the application, I discovered that no matter what I did, that build always returned an exit code of 0 to the parent process. However, if I configured with the --disable-shared option, I always got the right exit status. I compared both of those builds in the debugger, and I found that Tcl was completing its cleanup fully in both cases and passing the correct exit code to the real "exit" call, but in the shared build the code was getting lost before the process really ended.
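A sanity check of this kind can be sketched as a small parent-side harness that runs a command many times and counts how often the exit code seen by the parent differs from the expected one. This is only an illustration, not the test that was actually used: the helper names `run_and_get_exit_code` and `count_exit_mismatches` are hypothetical, and the sketch goes through the C library's `system()`, so a shell process sits between the harness and the child.

```c
#include <stdio.h>
#include <stdlib.h>
#ifndef _WIN32
#include <sys/wait.h>   /* WIFEXITED, WEXITSTATUS */
#endif

/* Hypothetical helper: run `command` through the system shell and return
 * the exit code the parent observes, or -1 if the command could not be
 * started or did not exit normally. */
static int run_and_get_exit_code(const char *command)
{
    int status = system(command);
    if (status == -1) {
        return -1;
    }
#ifdef _WIN32
    /* On Windows, system() returns the command interpreter's result. */
    return status;
#else
    /* On POSIX, system() returns a wait status that must be decoded. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
#endif
}

/* Hypothetical helper: run `command` `runs` times and count how many
 * times the observed exit code differs from `expected`. */
static int count_exit_mismatches(const char *command, int expected, int runs)
{
    int mismatches = 0;
    for (int i = 0; i < runs; i++) {
        int got = run_and_get_exit_code(command);
        if (got != expected) {
            fprintf(stderr, "run %d: expected %d, got %d\n", i, expected, got);
            mismatches++;
        }
    }
    return mismatches;
}
```

For example, `count_exit_mismatches("tclsh exit7.tcl", 7, 10000)` would exercise a script (the file name is an assumption) whose only command is `exit 7`. A child that never ends would block the loop, so catching the hang case would need a timeout on top of this.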

This led me to learn more than I wanted to know about how Windows terminates processes. The long and the short of it is that unless you call TerminateProcess, Windows tries to be nice and let things clean up properly, and that can result in the exit code being changed or the process never ending at all. Microsoft strongly discourages the use of TerminateProcess because it doesn't give loaded DLLs a chance to execute their cleanup code, but they also say that even if you exit nicely, that DLL cleanup code gets no guarantee that any of its data structures are in a consistent state.

I found an old bug report at Sun's Java site stating that they use TerminateProcess on Windows because it's the only way to guarantee that the process ends, so that explains why calling the Java exit command seemed to help me. So that's the question. Is it safe to replace exit(code) with TerminateProcess(GetCurrentProcess(), code) in the Windows version of Tcl? I've done this in my copy, and it's working fine with all my tests.


  • Alexandre Ferrieux

    Can you try to reproduce with unmodified 8.6 HEAD? The fix listed in bug 2001201 changes a lot in the exit sequence.

  • Alexander James Pasadyn

    I probably won't be able to give you a concrete answer for a while. The problem was a pretty low-frequency occurrence to begin with. Over tens of thousands of tclsh runs we might not see it, and when we did, there did not seem to be any correlation with what the particular script did. On a continuous build box, for example, we used to see the problem only every couple of weeks, but since I applied my workaround (a couple of years ago) we have not seen it at all. We never had a test case that could produce the problem on demand in a finite amount of time. To find out whether the problem is fixed, the simplest thing to do would be to replace Tcl on a box like that and let it run for a couple of months. I won't get permission to do that any time soon, though. Maybe they would go for it in the future if they decide to update to Tcl 8.6, but that would likely not be before Tcl 8.6 is in more widespread use. If someone wants to offer up a couple of Windows boxes, though...

  • Alexandre Ferrieux

    Given these considerations, would you mind my closing the bug, keeping in mind that:
    - either you reproduce it someday and reopen it at that time, to proceed with the investigation
    - or it's been fixed by the exit reform, and we can all forget about it?