From: SourceForge.net <no...@so...> - 2009-04-27 20:24:33
|
Bugs item #1028264, was opened at 2004-09-14 19:30 Message generated for change (Settings changed) made by dgp You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1028264&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 27. Channel Types Group: obsolete: 8.4.7 Status: Open Resolution: Fixed Priority: 9 Private: No Submitted By: Don Porter (dgp) >Assigned to: Alexandre Ferrieux (ferrieux) Summary: socket: failed write prevents read Initial Comment: Seeing very troubling socket failures only on Windows. (Testing with ActiveTcl 8.4.7.0) First here's a script that implements an echo server on port 1111, it takes a command line argument which adds delays in milliseconds to simulate the server having to attend to other tasks between servicing the socket when idle. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-28 02:24 Message: *This* bug report is fixed by the second patch, the one about WSACleanup and late exit handlers. The bulk of the patch is in tclEvent.c, the tclWinSock.c part is just a one-line change. Please tell me what is unclear about it in the ChangeLog, given that this bug's comments are pretty verbose about these notions. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2009-01-27 22:36 Message: Yeah, I can read the ChangeLog. If what was written there were enough for me to understand, I would not have asked the question. Writing it again does not help me. Let me try again. Which patch will make the failing scripts in this bug report stop failing? Do I need both to achieve that? Or will just one do it? Which one? ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-27 18:03 Message: The ChangeLog entries document which fixes what. Of course you can consider the two patches as orthogonal: one is about not generating RSTs on exit, the other about not raising an error on receiving them from the peer. But I think both the current bug and 2446662 deserve fixing in all versions. So I suggest to backport both. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2009-01-27 13:06 Message: There are two commits to tclWinSock.c on 2009-01-26. Do I need to backport both to core-8-4-branch to fix this bug? Or just one? Which? ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2009-01-26 22:35 Message: claiming in order to test and verify backports. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-26 19:32 Message: Patch committed. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-13 16:51 Message: Updating the patch so that the new functions handling late handlers get MODULE_SCOPE status instead of stubs, as discussed on the chat with dkf. File Added: fin2.patch ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-06 19:00 Message: The attached patch introduces a new queue of "late exit handlers", with the associated public creation/deletion API. This API is then used in tclWinSock.c in lieu of the "early exit handler" for Winsock finalization. File Added: fin.patch ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-05 12:30 Message: Further analysis shows that WSACleanup() could be omitted altogether. But before resorting to such extremes, I'd like to understand comments saying that Tcl_FinalizeThread makes the IO subsystem unavailable: it is just the channels of this thread+interp, or the whole IO subsystem ? If the latter, how are things supposed to work with the Thread extension spawning several threads+interps, each with several opened channels at [exit]-time ? In an idea world, we'd have 3 categories of exit handlers called in this order: (a) "early" process-wide stuff (b) thread-specific handlers, for each thread in turn (pthread_atexit) (c) "late" process-wide stuff Currently the WSACleanup occurs in (a) and (b) (!), I'd first shift it to (c). The only problem being that the (c) class currently is static (no CreateXXHandler API). Do you agree on this picture ? ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-01-03 16:27 Message: Thanks for diagnosing my split-brain condition Don :-) The other bug, 2446662, appeared wildly different in the beginning, but apart from that I have no good excuses... Now let's separate the issues this way: here we'll fix the finalization problem, and the RST generation. On 2446662 we'll take care of the race condition when receiving an RST. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2008-12-31 10:27 Message: ferrieux appears to be tracking down bugs similar to this one. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-16 18:49 Message: OK, here is a first report on my analysis. The problem is that SocketExitHandler is registered among the process-wide handlers, which are called early in Tcl_Finalize, while the closeProc is called in FinalizeIOSubsystem which is part of the later per-thread cleanup. Now what's peculiar is that FinalizeIOSubsystem also calls "TclpFinalizeSockets" which is the channeldriver-guts part of the cleanup in SocketExitHandler... I'm not sure (euphemism) I understand this state of affairs (especially the redundant call of TclpFinalizeSockets on the main thread), however here is how I'd fix it: - remove entirely the process-wide handler concerning sockets - instead enrich TclpFinalizeSockets with a client-counting scheme keeping track of how many threads "with socket interest" remain - when this count drops to zero, do the rest of what is currently in SocketExitHandler: UnregisterClass and WSACleanup. Does this feel correct ? Any hint of a definition of "with socket interest" ? -Alex ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-09 05:34 Message: For the first action, here is what's happening: The TcpCloseProc is called during interp finalization, *but* it contains the test: /* * Check that WinSock is initialized; do not call it if not, to prevent * system crashes. This can happen at exit time if the exit handler for * WinSock ran before other exit handlers that want to use sockets. */ if (SocketsEnabled()) ... so, I assume we're being hit by this situation: SocketExitHandler() was called too early, before individual sockets could be closed by the generic TclFinalizeIOSubsystem(). The result is a WSACleanup() in mid-life of all open sockets, hence the RST. Now to look into the order of finalizations to see how to circumvent this ;-) ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-08 18:06 Message: OK got it, guys :-) The key here is whether the client does an explicit [close] of the socket before exiting. If it does, the whole OS socket buffer is actually sent over the wire, *and* a FIN TCP packet ends the stream. If it doesn't, the stream is truncated (depending on the speed of things), *and* an RST TCP packet goes out on the truncation. This RST in turn wreaks havoc in the receiver. So a first action should be to make sure Tcl finalizers do the same thing as [close]. A second action would be to decide whether it is legal for the receiveing OS to discard input on an RST. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-08 17:49 Message: OK just tested with a real network interface (two PCs over wi-fi with a router). Result: - behavior is the same (!) - WireShark shows that the client indeed does a 'short write' (just the first of the 10 lines) - This doesn't make it normal for the server to discard that single line. So.. do we have 2 bugs instead of 1 ? ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-08 17:10 Message: Hmm, just noticed that in the nonzero-initial-pause case (ie when we start reading long after the client has closed), even the first [gets] on the server was empty, and EOF. That makes the bug even nastier. Single-stepping into tclWinSock.c/TcpInputProc shows that even the first recv() is empty, meaning the loss of queued input happens in the OS. Will have to retry with two PCs, because it could be a property of Windows 'loopback' interface (which is not a true network interface, ask Wireshark). ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-08 16:17 Message: Typo in previous comment: client connects and sends _10_ lines. File Added: srv.tcl ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-08 16:16 Message: File Added: cli.tcl ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-08 16:15 Message: Attaching much simpler test case: all synchronous, no fileevents. Repro on Win32. Start server: tclsh srv.tcl INITIAL-PAUSE-IN-MILLISECONDS Start client: tclsh cli.tcl Principle: client connects, sends 1 lines quickly, exits (closes). Server accepts connection, pauses the given number of ms, gets 1 line, then does a puts yielding a ECONNRESET, then reads the tail and displays it. Result: - With initial pause is large enough (100ms is okay), the tail is reliably empty - With no initial pause, the tail varies from 0 to 7 lines at random Notice that both are pathological. I like the first because it is easier to debug (no race condition): looks like the input is read and buffered inside Tcl's generic buffered input layer, but somehow discarded because of the ECONNRESET. Investigation going on ;-) ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2007-12-06 14:33 Message: Logged In: YES user_id=80530 Originator: YES won't block 8.5.0 for this. It really sucks this is still broken, but it's not any more broken than 8.4.* have been. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2007-05-15 16:33 Message: Logged In: NO still buggy getting ready to release 8.4.15. Can't we do anything about this? ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2006-03-23 14:16 Message: Logged In: YES user_id=80530 still see same behavior in 8.4.13 -- reliable data transport on Solaris. Data loss on (ancient) Linux/Alpha. Did not retest on Windows. ditto for 8.5a4. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2006-03-10 01:42 Message: Logged In: YES user_id=80530 time to check this again. ---------------------------------------------------------------------- Comment By: Andreas Kupries (andreas_kupries) Date: 2005-10-05 19:00 Message: Logged In: YES user_id=75003 Still present in the 8.4 head ? 8.5 as well ? ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2004-09-14 20:31 Message: Logged In: YES user_id=80530 On Linux/Alpha I tested several older Tcl versions and saw the same problem. Not new, just newly noticed. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2004-09-14 20:13 Message: Logged In: YES user_id=80530 Reliably on Windows, and sometimes on Linux/Alpha, I see this undesirable result from server.tcl instead: $ tclsh server.tcl 1000 [ 1]==>This is message 1 Response: Message 1 received: This is message 1 [ 2]==>This is message 2 Response: Message 2 received: This is message 2 [ 3]==>This is message 3 Response: Message 3 received: This is message 3 OUTPUT ERROR on socket sock812: error writing "sock812": connection reset by peer [ 4]==>This is message 4 [ 5]==>This is message 5 INPUT EOF detected on socket sock812; closing socket. Note most importantly that messages 6 through 10 are lost! Data loss is bad, mm-kay? Note also the different error message. Anything that can be done to make socket communications more reliable? Also, is this a new problem? I don't recall noticing it before. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2004-09-14 20:08 Message: Logged In: YES user_id=80530 Now let the server have delays longer than the client, so that a queue of messages can be waiting to be read by the server. Say a value of 1000 for server.tcl and 200 for client.tcl. Then I see what I like on Solaris. >From client.tcl: $ tclsh client.tcl 200 Writing message 1 Writing message 2 Writing message 3 Writing message 4 Writing message 5 Writing message 6 Reply from server: Message 1 received: This is message 1 Writing message 7 Writing message 8 Writing message 9 Writing message 10 Socket closed >From server.tcl: $ tclsh server.tcl 1000 [ 1]==>This is message 1 Response: Message 1 received: This is message 1 [ 2]==>This is message 2 Response: Message 2 received: This is message 2 [ 3]==>This is message 3 Response: Message 3 received: This is message 3 [ 4]==>This is message 4 Response: Message 4 received: This is message 4 OUTPUT ERROR on socket sock4: error writing "sock4": broken pipe [ 5]==>This is message 5 [ 6]==>This is message 6 [ 7]==>This is message 7 [ 8]==>This is message 8 [ 9]==>This is message 9 [10]==>This is message 10 INPUT EOF detected on socket sock4; closing socket. Note that server.tcl was able to read all 10 messages from the client. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2004-09-14 20:05 Message: Logged In: YES user_id=80530 And here is a script that implements a client of that server. It connects to the server and sends and receives messages from it. It also takes a command line argument to add delays between messages it sends, representing time the client program is busy tending to other tasks. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1028264&group_id=10894 |