From: SourceForge.net <no...@so...> - 2006-03-15 18:57:37
Bugs item #1437595 was opened at 2006-02-23 10:19
Message generated for change (Comment added) made by davygrvy
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1437595&group_id=10894

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: 27. Channel Types
Group: current: 8.4.12
Status: Open
Resolution: Fixed
Priority: 5
Submitted By: Zoran Vasiljevic (vasiljevic)
Assigned to: Zoran Vasiljevic (vasiljevic)
Summary: WinSock cores/hangs on thread or app exit

Initial Comment:
There is a sequencing problem in the way the Windows socket/pipe subsystem is torn down. The generic finalization code invokes thread exit handlers first, then finalizes the IO subsystem. Unfortunately, the finalization of the IO subsystem may access data which has already been garbage-collected by the thread exit handler. In such cases, the application will either hang or core.

The attached patch works around this by moving the functionality of the thread-exit handler into a special TclpFinalizeSockets() call. This call is now part of the generic TclFinalizeIOSubsystem. This change is a no-op for Unix. It affects the Windows (win/) and old Mac OS 9 (mac/) code bases.

----------------------------------------------------------------------

>Comment By: David Gravereaux (davygrvy)
Date: 2006-03-15 10:57

Message:
Logged In: YES
user_id=7549

I hadn't thought about a hard abort before, but it wouldn't be wrong in this case over the punt. The same rule can apply to the preceding check for a thread handle, and even the following window handle check, too. We can't allow TclpHasSockets() to cause a hard abort, though. So it might be a bit of a dance to get right.
----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 14:25

Message:
Logged In: YES
user_id=95086

This does not sound very deterministic to me. I would never trust such a system. Why 20 secs? Why not 10 or 100 or 1000? I do not believe this is the right way to go. I guess if the thread needs that long (20 secs in this case is almost forever) to start up, then something is wrong with the computer and there is no reason to continue. If thread A wants/needs to start thread B, then it must wait forever if needed, because if it really waits forever then you have some other problem.

I see it this way: somebody wants to use socket communication. For that, Windows needs a helper thread. If this thread is not started, then we need to Tcl_Panic! See the example of the notifier thread: if that thread does not start, would you also wait 20 secs and declare thread communication useless, or would you panic the system? Fortunately the wise designer of that part decided to Tcl_Panic, which is the only correct way to go. Why relax that behaviour for sockets?

Please understand: the simplest thing for me is to go and put that timeout back there. But I'm still not convinced that would be the right solution.

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-14 13:50

Message:
Logged In: YES
user_id=7549

The issue is not CreateWindow blocking, but the time it takes the thread to start. If your system is in such a state that it takes more than 20 seconds to get the notification window up, you'd better punt. That's my point. The timeout was very valid. By taking it out, you're saying the state of the system doesn't matter and we should wait until next Thursday for the thread to start and create the notification window. All it is is a "head check" and nothing more.
----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 13:22

Message:
Logged In: YES
user_id=95086

The first thing the SocketThread does is CreateWindow (line 2326), then SetEvent(tsdPtr->readyEvent) (line 2333). The thread creating it sits in WaitForSingleObject(tsdPtr->readyEvent, INFINITE); and waits for the readyEvent to be signaled. There is no problem there, unless the CreateWindow() blocks forever. Then the SetEvent is never executed and the caller blocks indefinitely.

Hence my question once more: can CreateWindow block forever? Yes or no? If yes: when and why? I can put the timeout back there, but I do not see any valid reason to do that. Do you have a test case that brings the system into such a state?

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-14 13:03

Message:
Logged In: YES
user_id=7549

What guarantee do you have that the thread can create the new window and hit the winproc's WM_CREATE? An INFINITE timeout is rather presumptuous. Please find it in your heart to put back the conservative "head check" timeout.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 12:29

Message:
Logged In: YES
user_id=95086

David,

You are speaking of some email where you explained the problem. Where did you send it? I never got any such email! All I got was your response on this thread in the bug tracker. Hence I had to ask what's wrong. You would do the same, wouldn't you? I believe we have some comm problem here...

I'm quoting:

  It waits for the async notification window to be in a ready state
  for use by WSAAsyncSelect. If it really is taking over 20 seconds
  to create a window in a thread, something else is wrong.

WHAT else is wrong? Is this system (Windows) so nondeterministic? I asked once.
I'm asking once again:

  tsdPtr->hwnd = CreateWindow("TclSocket", "TclSocket", WS_TILED,
      0, 0, 0, 0, NULL, NULL, windowClass.hInstance, arg);

Is this call (CreateWindow) defined to return at some point? Can it block forever? If yes, for what reason, i.e. under which circumstances? If yes, is the system usable any more after that? Do you have a situation where you can simply and easily reproduce it? I have read the function specs in MSDN and there is nothing about that call blocking indefinitely.

I decided to block there instead of going over it. I would like to know WHY the app blocks there! It is so simple to put a timeout there and abandon, but this is not buying you anything and will most probably hang somewhere else. If CreateWindow is known to block indefinitely, then why has anybody created such a poor implementation in the first place? If it is so, then it has to be rewritten. I never solve such problems by brute force (which the timeout at THAT place definitely is), and I have good reasons to follow that mantra.

So, once again: can CreateWindow block indefinitely? If yes: when and why?

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-14 11:19

Message:
Logged In: YES
user_id=7549

I already explained why in email. The problem is not swept under the carpet. By removing the timeout and replacing it with infinite, you have created a situation where a deadlock can happen. Let's go over this again.. actually, why don't I just paste my email here instead..

Zoran,

It isn't a timer, it's a wait with a timeout. It waits for the async notification window to be in a ready state for use by WSAAsyncSelect. If it really is taking over 20 seconds to create a window in a thread, something else is wrong.

The state left after the timeout isn't undefined. All publics do the SocketsEnabled() check to see that the system is usable, else punt.
I don't see how using an infinite timeout makes it wait less for the window to become ready. If you have patches for improvements, go for it, especially regarding how Tcl does a tear-down. Yes, I agree that the I/O system tear-down is a bit backassward.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 00:27

Message:
Logged In: YES
user_id=95086

I would love to hear why and where the deadlock is. If you recall, there was a thread on the Tcl core list about that, where I asked if there is/could be a situation where the socket thread cannot start because it cannot create that dummy window for event passing. Well, nobody commented on that, and the Windows manuals were also not helpful. I decided to let it block there for a reason. If it blocks, then there is some other problem, and it is not OK to sweep that under the carpet.

You talk about deadlocks. What deadlocks? Can you explain when and how the deadlock can happen? I have no problem putting that silly timeout there, but from this perspective and with this knowledge I think it is just bad design to leave it there. I'm open to learn, though.

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-13 17:19

Message:
Logged In: YES
user_id=7549

r1.52 has introduced a deadlock problem for when the notification cannot start. The 20-second timeout that used to be on line 502 needs to be put back. What happens if tsdPtr->readyEvent is never signaled? It blocks forever. Don't let it block forever; do something reasonable to avoid a deadlock.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-10 09:41

Message:
Logged In: YES
user_id=95086

Ok, it is now in both the 8.4 and head branches. I will still have to add tests exposing this bug so it won't hit us in the future, hence the bug ticket is still left open.
----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-10 05:40

Message:
Logged In: YES
user_id=95086

Ah, Andreas, regarding your hint about lazily handling the socket subsystem init/teardown, the answer is unfortunately NO rather than yes. Why? Because this thing will just needlessly init/teardown the subsystem exactly 1000 times, which is 999 times too many:

  time {close [socket $host $port]} 1000

if there are no other sockets opened in the process. I believe we should keep the explicit teardown. I also believe we should introduce an explicit init as well, rather than checking the state of the subsystem all the time via InitSockets(). This new call can be part of the Tcl lib bootstrap during initialization of the IO subsystem. This way it will be symmetric and easy to understand. What do you think?

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2006-02-28 14:43

Message:
Logged In: YES
user_id=75003

Yes, that (looping and repetition) is how I would do it as well for a case which fails non-deterministically in general, but with a relatively high probability. 20% is high IMHO. So trying a hundred times is good enough for me.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-02-28 14:30

Message:
Logged In: YES
user_id=95086

I wish I had a simple test case... Normally, one would create a thread (using the testthread command from the test suite), open the socket channel in that thread and then exit the thread. But... this will NOT always break. Sometimes nothing will happen, sometimes the thread will never exit and everything will stall, sometimes the app will core. In our app it was about 2 of 10, i.e. 2 hangs out of 10 attempts. Well, I could try to stress the behaviour by creating/destroying threads and channels in a row some 10's or 100's of times...
----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2006-02-28 14:16

Message:
Logged In: YES
user_id=75003

> OK. Easy. But you will have to follow me...

I will try.

> If you open a socket, don't close it and exit the thread
> (or the application), this will eventually come to
>
>    Tcl_FinalizeThread()
>
> Tcl_FinalizeThread will pull thread-exit handlers
> one by one. One of them will be the handler to tear down
> the socket subsystem.
> This thread-exit handler will tear down the socket thread and
> all the sync primitives for the synchronization. After that
> no calls into this subsystem should be performed.

Ah, now I see what problem I had before. I was looking at the TclpFinalizeSockets for the Mac :( D'oh.

> But....
> After that, TclFinalizeIOSubsystem() is called.
> TclFinalizeIOSubsystem walks the list of opened channels,
> finds one still open and tries to close it (line 254 in
> generic/tclIO.c).
> Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
> Tcl_CutChannel will, as the last step of the operation, call
> ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
> Understand? Now, the ThreadActionProc will resolve to
> TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
> using tsdPtr->socketListLock, will attempt to SendMessage to a
> window owned by the SocketThread which has already exited,
> and so on and so on...

Ok, yes, I see the problem with that. So the new ThreadAction stuff I did caused this, due to not fully understood finalize ordering. My apologies. Although ... no, we had no thread actions before, causing crashes when the socket moved around between threads. My integration fixes this, and opened the problem with the finalize ordering :( Thanks for finding and fixing this.

> Is this now a little bit more clear?

Yep.

> Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
> If this is in place, then this will be better and more logical. The first
> channel/socket inits, the last close tears down the subsystem.
> This is how I'd do the thing if I had designed it.

Note: ThreadActions are a very new thing; they were not available when the driver was initially written.

> But I was not, so I
> had to at least rectify the obviously broken current implementation.

I guess we can try to rewrite the driver to make use of the thread actions in this way, now that the problem is fixed.

Note: Do you have test cases which demonstrate the crash with the old implementation? That would be good to have, to prevent any future implementation from running into the same trap.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-02-28 13:19

Message:
Logged In: YES
user_id=95086

OK. Easy. But you will have to follow me...

If you open a socket, don't close it and exit the thread (or the application), this will eventually come to

   Tcl_FinalizeThread()

Tcl_FinalizeThread will pull thread-exit handlers one by one. One of them will be the handler to tear down the socket subsystem. This thread-exit handler will tear down the socket thread and all the sync primitives for the synchronization. After that no calls into this subsystem should be performed.

But.... After that, TclFinalizeIOSubsystem() is called. TclFinalizeIOSubsystem walks the list of opened channels, finds one still open and tries to close it (line 254 in generic/tclIO.c). Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel. Tcl_CutChannel will, as the last step of the operation, call ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE. Understand? Now, the ThreadActionProc will resolve to TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start using tsdPtr->socketListLock, will attempt to SendMessage to a window owned by the SocketThread which has already exited, and so on and so on...
Is this now a little bit more clear?

Concerning lazy init/teardown of the socket subsystem, the answer is: yes. If this is in place, then this will be better and more logical. The first channel/socket inits, the last close tears down the subsystem. This is how I'd do the thing if I had designed it. But I was not, so I had to at least rectify the obviously broken current implementation.

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2006-02-28 12:57

Message:
Logged In: YES
user_id=75003

Hm. I can't see in the patch which of the data released by the thread-exit handler is accessed later during finalization. The event source? If yes, how?

Question: Is that something which in 8.5 can be handled by the TIP #218 Thread Actions? (The last channel removed from a thread causes tear-down of the subsystem, like the first channel added may auto-initialize it?)

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-02-28 12:30

Message:
Logged In: YES
user_id=95086

The changes in tclEvent.c are just cosmetic. I wanted to check this in, but AK suggested filing the bug report so others can eventually comment. Yes, similar changes are also needed for 8.5. I wanted to do both in one run.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2006-02-28 10:34

Message:
Logged In: YES
user_id=80530

Here's the patch in unified diff format for easier review. Can you confirm that all the changes to tclEvent.c are cosmetic only? Does this need further review, or will you be committing it? Are similar changes required for Tcl 8.5a4?

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1437595&group_id=10894