I originally wrote this back in July of 2008, but I did not submit it because my workaround was working fine for me. However, there is a current discussion on the newsgroup that appears to be related.
I've been struggling with this one literally for years, and I finally figured out what's happening. The apparent solutions seem to have warning flags all over them, though.
I have a server process running on a Windows box that spawns tclsh processes which do some work and then exit. Most of the time, all is well. However, sometimes the server gets back the wrong exit code from the subprocess. Other times, the subprocess never ends at all, even though the script has finished. When I've snooped on those orphan processes with Process Explorer, I've seen a thread sitting there just spinning in what looks like some NT library call. The failure is rare: out of tens of thousands of tclsh processes you might not see it happen once, and I had never been able to make it happen on purpose so I could trap it and see what was going on.
Nowadays, my Tcl scripts use TclBlend to run some Java code, and about a year ago I discovered that if I used the Java exit (e.g., from Tcl, java::call System exit $code), the problems did not happen anymore. I had assumed that something in the Tcl cleanup on exit was the cause of my problem, and that the Java call was skipping it. I never noticed any ill effects from doing this, so I left it that way. (Note that the exit problems predate the use of TclBlend in the application, so I am pretty sure that library, at least, is not the cause of the problems!)
This week, though, I was using a TCL_MEM_DEBUG build to try to track down the source of a "TclStackFree: incorrect freePtr. Call out of sequence?" message. In doing some basic sanity tests with that build before I tried to use it with the application, I discovered that no matter what I did, that build always returned an exit code of 0 to the parent process. However, if I configured with the --disable-shared option, I always got the right exit status. I compared both of those builds in the debugger and found that Tcl was completing its cleanup fully in both cases and passing the correct exit code to the real "exit" call, but in the shared build the code was getting lost before the process really ended.
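A sanity test like the one described above can be automated with a small harness that runs a command many times and counts how often the observed exit code differs from the expected one. This is only a sketch: the function names are made up for illustration, and it goes through the C runtime's system() (and therefore the command interpreter), which is an assumption and may not match how the server actually spawns its subprocesses.

```c
#include <stdio.h>
#include <stdlib.h>
#ifndef _WIN32
#include <sys/wait.h>   /* WIFEXITED, WEXITSTATUS */
#endif

/* Hypothetical helper: run `command` once and return its exit code,
 * or -1 if it could not be started or did not exit normally. */
int run_and_get_exit_code(const char *command)
{
    int status = system(command);
    if (status == -1) {
        return -1;
    }
#ifdef _WIN32
    /* On Windows, system() returns the value handed back by the
     * command interpreter, i.e. the command's exit code. */
    return status;
#else
    /* On POSIX, system() returns a wait status that must be decoded. */
    if (!WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
#endif
}

/* Hypothetical helper: run `command` `runs` times and count how many
 * times the exit code was not `expected`. */
int count_exit_code_mismatches(const char *command, int expected, int runs)
{
    int mismatches = 0;
    int i;
    for (i = 0; i < runs; i++) {
        if (run_and_get_exit_code(command) != expected) {
            mismatches++;
        }
    }
    return mismatches;
}
```

For example, count_exit_code_mismatches("tclsh exit3.tcl", 3, 10000) would report how many of 10000 runs lost the exit code, where exit3.tcl is a hypothetical one-line script containing "exit 3".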
This led me to go learn more than I wanted to know about Windows and how it attempts to terminate processes. The long and the short of it is that unless you call TerminateProcess, Windows tries to be nice and let things clean up properly, and that can result in the exit code being changed or the process not ending at all. Microsoft strongly discourages the use of TerminateProcess because it doesn't give loaded DLLs a chance to execute their cleanup code, but they also say that even if you exit nicely, that DLL cleanup code gets no guarantee that any of its data structures are in a consistent state.
I found an old bug report at Sun's Java site stating that they use TerminateProcess on Windows because it's the only way to guarantee that the process ends, so that explains why calling the Java exit command seemed to help me.
So here is my question: is it safe to replace exit(code) with TerminateProcess(GetCurrentProcess(), code) in the Windows version of Tcl? I've done this in my copy, and it's working fine with all my tests.
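For concreteness, the substitution being asked about might look roughly like the sketch below. The name hard_exit is hypothetical and is not a function in the Tcl sources; where exactly such a call would go in Tcl's exit path is not shown here. The fflush(NULL) call is an addition of this sketch, on the reasoning that TerminateProcess skips the C runtime's normal shutdown, so buffered stdio output would otherwise be lost. On non-Windows builds the helper simply falls back to exit().

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not part of Tcl): end the current process with
 * `code` as its exit status, bypassing DLL detach processing on Windows. */
void hard_exit(int code)
{
    /* TerminateProcess does not run atexit handlers or flush stdio,
     * so flush all open output streams first. */
    fflush(NULL);
#ifdef _WIN32
    /* Ask Windows to end this process immediately with `code`.
     * Loaded DLLs get no chance to run their cleanup code. */
    TerminateProcess(GetCurrentProcess(), (UINT)code);
    /* If the call failed and returned, fall through to the normal exit. */
#endif
    exit(code);
}
```

The trade-off is the one described above: the exit code can no longer be altered during DLL cleanup, but any cleanup a loaded DLL relies on at process exit never runs.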