Menu

#3987 exec with pipes and fileevents leaks ~2Mb everytime

final: 8.3.5
open
8
2008-05-06
2008-05-06
Mark Gammon
No

I'm running Tcl8.3.5 on Windows XP/SP2 & VC8.

I think I've seen this reported as a bug before, many years back, but looking at the latest 8.5.2 tclWinPipe.c code it looks like its still there.

I need to spawn the same child process 100s of times.
After each child process dies a few seconds pass and then it is spawned again. When I do this I see the Virtual Size of the tcl interpreter process slowly growing each time by about 8Mb. With the 2Gb limit for 32bit execs this allows me to make about 250 launches before everything stops.

I've cut it down to a simple tcl script that demonstrates a 2Mb leak each time. It would be 4Mb if I attach a pipe to the stderr channel too.

...................................................
proc callback {chan fid} {
read $fid
}

pipe in_r in_w
pipe out_r out_w

fconfigure $out_r -blocking 0 -buffering none

set pid [exec ls <@$in_r >@$out_w &]

fileevent $out_r readable "callback stdout $out_r"

# wait for "ls" to die
sleep 1
# MUST force the event loop to be processed
update

close $in_r
close $in_w
close $out_r
close $out_w
...................................................

run the above script repeatedly in the same interpreter session and watch the Virtual Size grow in ProcessExplorer.

I've dug down into the code in tclWinPipe.c and I'm pretty sure its happening because of the TerminateThread calls used to kill the pipe read thread.

It has to resort to a TerminateThread because the ReadFile call in the read thread is blocked. What I don't quite understand is why it stays blocked even after the child process it is reading from is long dead.

I've seen suggestions elsewhere that to get round the blocking problem of ReadFile it should be used in asynchronous mode.

This bug is really crippling my app. I'd really like to see a fix for it soon.

Discussion

  • Mark Gammon

    Mark Gammon - 2008-05-06
    • priority: 5 --> 8
     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Just to be sure.. You're using TclX, right ?
    (That's funny, I was precisely considering integrating TclX's standalone pipes into the core (TIP304))
    Will try to reproduce; please give TclX version used.

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    Mark, are you actually waiting waiting for EOF in the readable callback? [close] should happen there, not in the top after churning the event loop. No wonder you're hitting that situation.

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    proc callback {x out_r} {
    set line[gets $out_r]
    ...
    if {[eof $out_r]} {
    close $out_r
    set ::done 1
    }
    }

    Do it that way and use [vwait $done] instead of [update] in the top

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    David, while I agree with you that [update] is dangerous here in that potentially it will only superficially skim the event queue, I don't understand the 2M leak yet. Closing from within a callback or not doesn't change much from the notifier's PoV...

    Moreover, in this specific case, the EOF will never happen since the other side of the $out pipe stays open in the parent.
    To fix that I'd suggest closing $out_w and $in_r immediately after the [exec &].

    Continuing investigations...

     
  • Mark Gammon

    Mark Gammon - 2008-05-07

    Logged In: YES
    user_id=2080734
    Originator: YES

    David, my callback wasn't testing for EOF as a sign of the child's death and closing the pipe then. When I add that and change to a vwait on the death then the script hangs. The callback read chars from the child but never sees a EOF. From the debugger it looks like the Tcl event loop is spinning round and checking for events on the pipes, the pipe says its readable but the read thread is hung in a call to ReadFile.

    I found this article from MS about how to do this child spawning with pipes in C++:

    http://support.microsoft.com/kb/190351

    Looking at this I think it explains why ReadFile is hanging. It says the parent process MUST close the unused child ends of the pipes after it has spawned the child. If not then the child can't close the pipe properly when it dies, and hence the pipe stays up and ReadFile gets hung on it. I added this to my script and bingo!, it all works perfectly now. Here's the current version of my script:

    ...........................................................
    proc callback {out_r} {
    global isdead
    echo "read [string length [read $out_r]] chars" ; flush stdout
    # MUST test for EOF condition and close pipe NOW
    if {[eof $out_r]} {
    echo "child has died!" ; flush stdout
    close $out_r
    set isdead 1
    }
    }

    pipe in_r in_w
    pipe out_r out_w

    fconfigure $out_r -blocking 0 -buffering none

    set pid [exec ls <@$in_r >@$out_w &]

    # MUST close unused channels to child else child can't close pipe when it dies causing read to hang
    close $in_r
    close $out_w

    fileevent $out_r readable "callback $out_r"

    # wait for "ls" to die
    vwait isdead

    close $in_w

    ...........................................................

    I've instrumented tclWinPipe to tell me how the pipe read/write threads are stopped. Previously is was using the leaky TerminateThread call on the child reader threads, but now the threads are stopping properly in response to the event signal. Great!

    There is one thing in the MS article that I'm not doing. For some reason it duplicates the child read and write pipes and closes the original channels it got from CreatePipe. Not sure why its doing this. I tried reproducing this with the TclX "dup" command and after hacking it to allow me to duplicate a pipe (odd restriction?) I got my script doing exactly the same as the MS C++ code. However I found that my script worked just fine WITHOUT doing this extra duplication, so I've left it out.

    Alexandre: thanks for your help. Just seen your last comment. Looks like you are on to the same solution, i.e. closing the unused pipes after the exec. I think I've confirmed that this does cure the problem. BTW, we are using TclX8.2.0.

    I don't think I'm out of the woods yet though as when I replace the child process "ls.exe" with my real child process, then I'm still seeing the child reader threads killed by TerminateThread. My real child process does more than "ls" in that it sits and reads from stdin and sends responses down it's stdout and stderr. I'll see if I can pin down why this isn't behaving as cleanly as the simple "ls" example.

    Thanks a lot to you both for your excellent help!

     
  • Mark Gammon

    Mark Gammon - 2008-05-07

    Logged In: YES
    user_id=2080734
    Originator: YES

    I took a closer look at how I'd implemented the above fix in my full app. and I must of missed something crucial as I've re-implemented it again and now it works perfectly. This looks like a good work-around to the TerminateThread leak so it maybe the tclWinPipe code is OK, it just needs more careful tcl scripting.
    Thanks.

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Mark,

    While I'm very happy for you since you can proceed, there's one thing I don't quite grasp here.
    You mentioned instrumenting tclWinPipe, which to my knowledge implements the core Tcl "command pipelines", that is, the channel type of [open |cmd r+]. This type uses complicated machinery including ancillary threads and Windows Events to implement fileevents, because up to Win9x/Me it was the only way.

    Now your offending script instead uses TclX's [pipe] command, which yields two channels of type "file", which in Windows encapsulates only the vanilla file handle of Win32. As it turns out, those vanilla handles became waitable with WinNT (Thanks to Jeff and David for digging out these historical facts), so now TclX's [pipe] are fileevent-ready, which was not the case with pre-NT Windows (of course unix has been OK all the time ;-).

    Given this dichotomy, I don't understand where tclWinPipe, its extra threads, and their brutal termination, come into play... Any idea ?

     
  • Mark Gammon

    Mark Gammon - 2008-05-07

    Logged In: YES
    user_id=2080734
    Originator: YES

    I think this is how it works with TclX pipes.
    The "pipe" command calls the Windows CreatePipe to create the pipe and get the two handles to it.
    It then passes these to Tcl_MakeFileChannel to get them wrapped up as a normal tcl channel.
    This in turn calls the Windows GetFileType which tells it that the handles are actually pipes
    and not plain file handles. After this they get treated like pipes created by the "open" command,
    and get processed by the tclWinPipe code.
    Not 100% sure about the above but that's what I think it is doing.

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Looking at the current source (TclX 8.4.0), I do see that GetFileType, but only by a specific set of TclX commands (approximate names): dup, bind, fstat, fsize, ftruncate, select, flock, funlock, cloexec.

    So, at least in the source I have access to, I don't see TclX retrofitting its raw pipes into core Tcl's instrumented ones. (As additional evidence, the instrumenting threads are only created from within TclpCreateCommandChannel, which is really not able to wrap a simple preexisting handle).

    This is getting interesting ! Investigation continued...

    is called within the ChannelToHandle function, which does recognize pipes to produce a type identifier TCLX_WIN_PIPE, but this value is largely ignored by the only caller of interest here, TclXOSGetSelectFnum (

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    I have too much to say, not sure if I'll say it all in this one post.

    First, why does [close] exhibit the problem at all with pipes? Easy, ReadFile() is not interruptible. The purpose of the reader thread and the ReadFile() call inside it is to act as the alert mechanism. But if pipe handles are waitable objects, then what's the purpose of that? Again, easy.. under Win9x, pipe handles are _not_ waitable objects and that mess is needed to make the alert mechanism. But the question really is, why was it forced to close before it was done? That direction is not supported and can't easily be without kill.

    Second, what's wrong with [open], it does the same thing as [exec ... &] with the pipe creation?

    Brain lapse.. I lost my train of thought.

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    Also, of course the unused duped side should be closed. That's what is supposed to be done. Examples of doing it right are shown in the core as well as that silly nmakehlp.c in the /win directory.

    Third, given that win9x support can be finally dropped and that logic mess with the reader thread with the non-interuptable ReadFile() can go away, anonymous pipes can use all the named pipe functions and behaviors and are full waitable objects.

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    Mark, you have scripted what the core already does with [open]. If there isn't a special purpose I'd recommend plain [open "|ls" r+].

     
  • Mark Gammon

    Mark Gammon - 2008-05-07

    Logged In: YES
    user_id=2080734
    Originator: YES

    David,
    It's been a long time since we coded it up to use "exec" and pipes, but I think the reason why we couldn't use "open" was that we needed to read from both stdout AND stderr separately. Am I right in thinking that the r+ arg to "open" returns a bi-directional channel that when written goes to stdin of the child and when read it gets stdout from the child. There is no way to get access to the stderr output from the child?

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    Gotcha. Yes, [open] mixes somewhat. At the end of the process, you can get stderr though, but readables are on stdout. You just can't intermix the reception of the two while the child is running. There's an example on the wiki: http://wiki.tcl.tk/3846 , see [gotpipereadable] for getting stderr.

    There's also another technique using BLT's bgexec command.

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Mark, please don't go to bgexec right now, we're just on the verge of integrating [pipe] (well, ti will be [chan pipe] I guess) in the core ;-)

    As to my prior interrogations, I have found the missing piece in the puzzle: Yes, Tclx creates raw pipes, but No, it doesn't leave them unadorned to the core ! Indeed, what I have just realized, is that Tcl_MakeFileChannel does introspect the type, and in the pipe case calls TclpCreateCommandChannel with an empty pid list ! This explains everything, including the resort to the bad TerminateThread when the reader thread is blocked in ReadFile etc...

    David, do you agree with the following conclusion:
    - TclX's [pipe] fully interoperates with the (now obsolete) machinery in tclWinPipe.c
    - Hence it can be ported to the core (TIP304) with no change and no worries
    - It does have the old TerminateThread issue.
    - This could be solved in post-NT era by direct wait on handles
    - But this last move is orthogonal to TIP 304 and can wait a bit.

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    (Don't know if this will reach you, SF is awfully unresponsive for me)
    I've sent this also to tcl-core.
    ---
    I think I can now produce a global analysis of the pipe-leak issue.

    (1) This is a core issue, Tclx is a red herring here, it is only one
    of many ways of exhibiting the problem;

    (2) A core-only way is:

    # this leaks 2M per run in Windows
    # assuming "block.tcl" contains just "after 5000"
    set pr [open "|tclsh block.tcl" r];after 500;fconfigure $pr
    -blocking 0;read $pr 1;close $pr

    (3) The crux of the problem is the (now possiby obsolete)
    PipeReaderThread being blocked on a ReadFile at [close]-time, implying
    the use of the thermonuclear TerminateThread, which is known to have a
    4k kernel leak in Windows.

    (4) But in today's Windows (XPSP2), it leaks 2M, not 4k !!!

    I don't know at which point in time the x500 amplification of the leak occured.
    Notice that the situation producing it is rather unusual (it assumes a
    long-running child which we close prematurely from the output side,
    and only on Windows where those tools (like vmstat) are not common).

    Now there are IMO two parallel paths to follow simultaneously:

    (P1) investigate to understand the large leak

    (P2) (longer term) modernize the core to use waitable handles
    (assuming they are really !), thus avoiding the need of ancillary
    threads.

    I would love to hear comments on this. I feel a bit shaky on Windows
    to undertake (P2); and possibly a person who is literate in Windows
    patch history migh home in on (P1) much fasterthan I would...

    Thanks in advance,

    -Alex

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Gosh, just noticed the MS documentation says that the leak is the initial stack of the thread !
    This might explain the variability of the size, since default stack size is largely in the hand of the OS.
    A quick workaround would be to minimize the stack size of PipeReaderThread, who doesn't need 2M...

    I also tried DisconnectNamedPipe(), but the poor guy blocks until all pending IO complete, so back to square one :-(

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    Yes, the stack gets blasted, and it's gone for good, never to reused 'till a reboot

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    > I also tried DisconnectNamedPipe(), but the poor guy blocks
    > until all pending IO complete, so back to square one :-(

    CancelIO() ? That might do it. IIRR, it only cancels I/O that was made in the thread it was called from. :/

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Will be trying the first of a two-step strategy shortly:

    (1) Since pipes are now waitable in all living Windows, replace the blocking ReadFile in PipeReaderThread with a WaitForMultipleObjects of both the pipe and the event object. This way, the thread will spend all its life in a state where it can be gracefully asked to leave by signalling the event, hence removing the need for TerminateThread.

    (2) Overhaul the pipe event source so that the pipe's handle is directly monitored by the notifier's central WaitForMultipleObjects.
    (For this one I would love to have David's help ;-)

     
  • David Gravereaux

    Logged In: YES
    user_id=7549
    Originator: NO

    re #2, you got it.

    Years ago I committed a patch that added the wait flag to the main Tcl_WaitForEvent's WaitForMultiple.. (as well a check on the proper return code) so that thread could be used to service asynchronous procedure calls during the wait. Time for us to collect on that and make use of it ;)

    I don't much free-time these days now that I'm a dad, but I'll try my best to assist with the heavy lifting.

    In the meantime, it might be fun (or not) to read through my source for the IOCPSOCK extension <http://sf.net/projects/iocpsock/> that makes use of overlapped-I/O in a channel driver. The big meat is in the iocpsock_lolevel.c file. The notification method isn't event objects, though. That's the trick.. Event based will take some work.