From: SourceForge.net <no...@so...> - 2006-03-15 18:57:37
Bugs item #1437595 was opened at 2006-02-23 10:19
Message generated for change (Comment added) made by davygrvy
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1437595&group_id=10894

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: 27. Channel Types
Group: current: 8.4.12
Status: Open
Resolution: Fixed
Priority: 5
Submitted By: Zoran Vasiljevic (vasiljevic)
Assigned to: Zoran Vasiljevic (vasiljevic)
Summary: WinSock cores/hangs on thread or app exit

Initial Comment:
There is a sequencing problem in the way the Windows socket/pipe subsystem is torn down. The generic finalization code invokes thread exit handlers first, then finalizes the IO subsystem. Unfortunately, the finalization of the IO subsystem may access data which has already been garbage-collected by the thread exit handler. In such cases, the application will either hang or core.

The attached patch works around this by moving the functionality of the thread-exit handler into a special TclpFinalizeSockets() call. This call is now part of the generic TclFinalizeIOSubsystem. This change is a no-op for Unix. It affects the Windows (win/) and old Mac OS 9 (mac/) code bases.

----------------------------------------------------------------------

>Comment By: David Gravereaux (davygrvy)
Date: 2006-03-15 10:57

Message:
Logged In: YES
user_id=7549

I hadn't thought about a hard abort before, but it wouldn't be wrong in this case over the punt. The same rule can apply to the preceding check for a thread handle, and even the following window handle check, too. We can't allow TclpHasSockets() to cause a hard abort, though. So it might be a bit of a dance to get right.
----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 14:25

Message:
Logged In: YES
user_id=95086

This does not sound very deterministic to me. I would never trust such a system. Why 20 secs? Why not 10 or 100 or 1000? I do not believe this is the right way to go. I guess if the thread needs that long (20 secs in this case is almost forever) to start up, then something is wrong with the computer and there is no reason to continue. If thread A wants/needs to start thread B, then it must wait forever if needed, because if it really waits forever then you have some other problem.

I see it this way: somebody wants to use socket communication. For that, Windows needs a helper thread. If this thread is not started, then we need to Tcl_Panic! See the example of the notifier thread: if that thread does not start, would you also wait 20 secs and declare thread communication useless, or would you panic the system? Fortunately the wise designer of that part decided to Tcl_Panic, which is the only correct way to go. Why relax that behaviour for sockets?

Please understand: the simplest thing for me is to go and put that timeout back there. But I'm still not convinced that would be the right solution.

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-14 13:50

Message:
Logged In: YES
user_id=7549

The issue is not CreateWindow blocking, but the time it takes the thread to start. If your system is in such a state that it takes more than 20 seconds to get the notification window up, you'd better punt. That's my point. The timeout was very valid. By taking it out, you're saying the state of the system doesn't matter and we should wait until next Thursday for the thread to start and create the notification window. All it is is a "head check" and nothing more.
----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 13:22

Message:
Logged In: YES
user_id=95086

The first thing the SocketThread does is CreateWindow (line 2326), then SetEvent(tsdPtr->readyEvent) (line 2333). The thread creating it sits in WaitForSingleObject(tsdPtr->readyEvent, INFINITE); and waits for the readyEvent to be signaled. There is no problem there, unless the CreateWindow() blocks forever. Then the SetEvent is never executed and the caller blocks indefinitely.

Hence my question once more: can CreateWindow block forever? Yes or no? If yes: when and why? I can put the timeout back there, but I do not see any valid reason to do that. Do you have a test case that brings the system into such a state?

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-14 13:03

Message:
Logged In: YES
user_id=7549

What guarantee do you have that the thread can create the new window and hit the winproc's WM_CREATE? An INFINITE timeout is rather presumptuous. Please find it in your heart to put back the conservative "head check" timeout.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 12:29

Message:
Logged In: YES
user_id=95086

David,

You are speaking of some email where you explained the problem. Where did you send it? I never got any such email! All I got was your response on this thread in the bug tracker. Hence I had to ask what's wrong. You would do the same, wouldn't you? I believe we have some comm problem here...

I'm quoting:

  It waits for the async notification window to be in a ready state
  for use by WSAAsyncSelect. If it really is taking over 20 seconds
  to create a window in a thread, something else is wrong.

WHAT else is wrong? Is this system (Windows) so nondeterministic? I asked once.
I'm asking once again:

  tsdPtr->hwnd = CreateWindow("TclSocket", "TclSocket", WS_TILED,
      0, 0, 0, 0, NULL, NULL, windowClass.hInstance, arg);

Is this call (CreateWindow) defined to return at some point? Can it block forever? If yes, for what reason, i.e. under which circumstances? If yes, is the system usable any more after that? Do you have a situation where you can simply and easily reproduce it? I have read the function specs in MSDN and there is nothing about that call blocking indefinitely.

I decided to block there instead of going over it. I would like to know WHY the app blocks there! It is so simple to put a timeout there and abandon, but this is not buying you anything and will most probably hang somewhere else. If CreateWindow is known to block indefinitely, then why has anybody created such a poor implementation in the first place? If it is so, then it has to be rewritten. I never solve such problems by brute force (which the timeout at THAT place definitely is), and I have good reasons to follow that mantra.

So, once again: can CreateWindow block indefinitely? If yes: when and why?

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-14 11:19

Message:
Logged In: YES
user_id=7549

I already explained why in email. The problem is not swept under the carpet. By removing the timeout and replacing it with infinite, you have created a situation where a deadlock can happen. Let's go over this again.. actually, why don't I just paste my email here instead..

Zoran,

It isn't a timer, it's a wait with a timeout. It waits for the async notification window to be in a ready state for use by WSAAsyncSelect. If it really is taking over 20 seconds to create a window in a thread, something else is wrong.

The state left after the timeout isn't undefined. All publics do the SocketsEnabled() check to see that the system is usable, else punt.
I don't see how using an infinite timeout makes it wait less for the window to become ready. If you have patches for improvements, go for it, especially regarding how Tcl does a tear-down. Yes, I agree that the I/O system tear-down is a bit backassward.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-14 00:27

Message:
Logged In: YES
user_id=95086

I would love to hear why and where the deadlock is. If you recall, there was a thread on the Tcl core list about that, where I asked if there is/could be a situation where the socket thread cannot start because it cannot create that dummy window for event passing. Well, nobody commented on that, and the Windows manuals were also not helpful. I decided to let it block there for a reason. If it blocks, then there is some other problem, and it is not OK to sweep that under the carpet.

You talk about deadlocks. What deadlocks? Can you explain when and how the deadlock can happen? I have no problem putting that silly timeout there, but from this perspective and with this knowledge I think it is just bad design to leave it there. I'm open to learn, though.

----------------------------------------------------------------------

Comment By: David Gravereaux (davygrvy)
Date: 2006-03-13 17:19

Message:
Logged In: YES
user_id=7549

r1.52 has introduced a deadlock problem for when the notification cannot start. The 20-second timeout that used to be on line 502 needs to be put back. What happens if tsdPtr->readyEvent is never signaled? It blocks forever. Don't let it block forever; do something reasonable to avoid a deadlock.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-10 09:41

Message:
Logged In: YES
user_id=95086

Ok, it is now in both the 8.4 and head branches. I will still have to add tests exposing this bug so it won't hit us in the future, hence the bug ticket is still left open.
----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-03-10 05:40

Message:
Logged In: YES
user_id=95086

Ah, Andreas, regarding your hint about lazily handling the socket subsystem init/teardown, the answer is unfortunately NO rather than yes. Why? Because this thing will just needlessly init/teardown the subsystem exactly 1000 times, which is 999 times too many:

  time {close [socket $host $port]} 1000

if there are no other sockets opened in the process. I believe we should keep the explicit teardown. I also believe we should introduce an explicit init as well, rather than checking the state of the subsystem all the time via InitSockets(). This new call can be part of the Tcl lib bootstrap during initialization of the IO subsystem. This way it will be symmetric and easy to understand. What do you think?

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2006-02-28 14:43

Message:
Logged In: YES
user_id=75003

Yes, that (looping and repetition) is how I would do it as well for a case which fails non-deterministically in general, but with a relatively high probability. 20% is high IMHO. So trying a hundred times is good enough for me.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-02-28 14:30

Message:
Logged In: YES
user_id=95086

I wish I had a simple test case... Normally, one would create a thread (using the testthread command from the test suite), open the socket channel in that thread and then exit the thread. But... this will NOT always break. Sometimes nothing will happen, sometimes the thread will never exit and everything will stall, sometimes the app will core. In our app it was about 2 of 10, i.e. 2 hangs out of 10 attempts. Well, I could try to stress the behaviour by creating/destroying threads and channels in a row some 10's or 100's of times...
----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2006-02-28 14:16

Message:
Logged In: YES
user_id=75003

> OK. Easy. But you will have to follow me...

I will try.

> If you open a socket, don't close it and exit the thread
> (or the application), this will eventually come to
>
>    Tcl_FinalizeThread()
>
> Tcl_FinalizeThread will pull thread-exit handlers
> one by one. One of them will be the handler to tear down
> the socket subsystem.
> This thread-exit handler will tear down the socket thread and
> all the sync primitives for the synchronization. After that
> no calls into this subsystem should be performed.

Ah, now I see what problem I had before. I was looking at the TclpFinalizeSockets for the Mac :( D'oh.

> But....
> After that, TclFinalizeIOSubsystem() is called.
> TclFinalizeIOSubsystem walks the list of opened channels,
> finds one still open and tries to close it (line 254 in
> generic/tclIO.c).
> Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
> Tcl_CutChannel will, as the last step of the operation, call
> ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
> Understand? Now, the ThreadActionProc will resolve to
> TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
> using tsdPtr->socketListLock, will attempt to SendMessage to a
> window owned by the SocketThread which has already exited,
> and so on and so on...

Ok, yes, I see the problem with that. So the new ThreadAction stuff I did caused this, due to not fully understood finalize ordering. My apologies. Although ... no, we had no thread actions before, causing crashes when the socket moved around between threads. My integration fixes this, and opened the problem with the finalize ordering :( Thanks for finding and fixing this.

> Is this now a little bit more clear?

Yep.

> Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
> If this is in place, then this will be better and more logical. The first
> channel/socket inits, the last close tears down the subsystem.
> This is how I'd do the thing if I had designed it.

Note: ThreadActions are a very new thing; they were not available when the driver was initially written.

> But I was not, so I
> had to at least rectify the obviously broken current implementation.

I guess we can try to rewrite the driver to make use of the thread actions in this way, now that the problem is fixed.

Note: Do you have test cases which demonstrate the crash with the old implementation? That would be good to have, to prevent any future implementation from running into the same trap.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-02-28 13:19

Message:
Logged In: YES
user_id=95086

OK. Easy. But you will have to follow me...

If you open a socket, don't close it and exit the thread (or the application), this will eventually come to

   Tcl_FinalizeThread()

Tcl_FinalizeThread will pull thread-exit handlers one by one. One of them will be the handler to tear down the socket subsystem. This thread-exit handler will tear down the socket thread and all the sync primitives for the synchronization. After that no calls into this subsystem should be performed.

But.... After that, TclFinalizeIOSubsystem() is called. TclFinalizeIOSubsystem walks the list of opened channels, finds one still open and tries to close it (line 254 in generic/tclIO.c). Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel. Tcl_CutChannel will, as the last step of the operation, call ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE. Understand? Now, the ThreadActionProc will resolve to TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start using tsdPtr->socketListLock, will attempt to SendMessage to a window owned by the SocketThread which has already exited, and so on and so on...
Is this now a little bit more clear?

Concerning lazy init/teardown of the socket subsystem, the answer is: yes. If this is in place, then this will be better and more logical. The first channel/socket inits, the last close tears down the subsystem. This is how I'd do the thing if I had designed it. But I was not, so I had to at least rectify the obviously broken current implementation.

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2006-02-28 12:57

Message:
Logged In: YES
user_id=75003

Hm. I can't see in the patch which of the data released by the thread-exit handler is accessed later during finalization. The event source? If yes, how?

Question: Is that something which in 8.5 can be handled by the TIP #218 Thread Actions? (The last channel removed from a thread causes tear-down of the subsystem, like the first channel added may auto-initialize it?)

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2006-02-28 12:30

Message:
Logged In: YES
user_id=95086

The changes in tclEvent.c are just cosmetic. I wanted to check this in, but AK suggested filing the bug report so others can eventually comment. Yes, similar changes are also needed for 8.5. I wanted to do both in one run.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2006-02-28 10:34

Message:
Logged In: YES
user_id=80530

Here's the patch in unified diff format for easier review. Can you confirm that all the changes to tclEvent.c are cosmetic only? Does this need further review, or will you be committing it? Are similar changes required for Tcl 8.5a4?

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1437595&group_id=10894