From: SourceForge.net <no...@so...> - 2005-10-13 21:02:58
|
Bugs item #1323992, was opened at 2005-10-11 17:58 Message generated for change (Comment added) made by vasiljevic You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1323992&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 27. Channel Types Group: current: 8.4.11 Status: Open Resolution: None Priority: 9 Submitted By: Don Porter (dgp) Assigned to: Zoran Vasiljevic (vasiljevic) Summary: socket.test failures Initial Comment: I'm seeing tests socket-9.2 and socket-11.13 fail consistently, but only on one platform. I suspect the real problem is on the system and not in Tcl, but I don't know how to verify that. These tests are failing because the [vwait] is timing out. The tests are not well constructed to communicate that scenario to the user. I had to hack the tests to verify that's what is happening. ---------------------------------------------------------------------- >Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-10-13 23:02 Message: Logged In: YES user_id=95086 Yes. It is 11.13 test which is failing. I have added this into the both 8.4 and 8.5 branches. As I see Andreas has some ideas and I will have to first understand them. I will leave this for tomorrow (its pretty late here now). ---------------------------------------------------------------------- Comment By: Jeffrey Hobbs (hobbs) Date: 2005-10-13 22:10 Message: Logged In: YES user_id=72656 The patch said 11.3 in the C comments - it's 11.13 (oops). ---------------------------------------------------------------------- Comment By: Jeffrey Hobbs (hobbs) Date: 2005-10-13 22:08 Message: Logged In: YES user_id=72656 Attached is my simple but descriptive interim patch for this. I'll leave it to Zoran to commit. ---------------------------------------------------------------------- Comment By: Andreas Kupries (andreas_kupries) Date: 2005-10-13 21:06 Message: Logged In: YES user_id=75003 Modulo variable declarations and type casts ... static void EventProcessingOff (chan) { chan = Tcl_GetTopChannel (chan); watch = Tcl_ChannelWatchProc (chan) inst = Tcl_GetChannelInstanceData (chan); watch (inst, 0); } ---------------------------------------------------------------------- Comment By: Andreas Kupries (andreas_kupries) Date: 2005-10-13 21:03 Message: Logged In: YES user_id=75003 I agree with the ifdef THREADS as a short-term solution. From the digging Don has done it seems to me that during the _regular_ closing of channel Tcl_ClearChannelHandlers (Tcl_CCH) is called at a very early point in time, when it still may generate events for flushing in the background. And it is expected that these events are processed, and full shutdown of the channel happens only after that has happened. So our belief that we could co-opt Tcl_CCH as the point where event-processing is switched off in preparation for the transfering of a channel was wrong. We need it, but because of the above it must only remove the parts generated events, and not the parts processing those still in the pipeline. This part of the transfer has to go somewhere else. We might try to co-opt a different function for this, or we may need a new public API. Zoran, we will have to go back to the mails we exchanged, dig out the current code sequences used by the channel transfer and see if one of the other functions used can be co-opted instead. If not, then a new API will be needed to perform the action. And if we come to that I will also entertain the notion of moving the whole transfer process and structures into the core, as a Tcl_TransferChannel (chan, destinationthread) or some such. Because we would need a TIP anyway in that case. The threads package would then contain only the Tcl command around that API. ... I belay that. We do not need new APIs. We can use the existing APIs to get the instance data of a channel, its watchproc, and then do the call "watchproc (instance, 0)" using the piecves we got. No need for additional APIs. Tcl_ChannelWatchProc, Tcl_GetChannelInstanceData, and Tcl_GetTopChannel. The latter to make sure that our watchproc call goes through the whole stack, if there is one. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-10-13 20:41 Message: Logged In: YES user_id=95086 Yes. As the matter of fact, why not! At least until I figure out how to close this hole properly. I will commit ifdefs into both 8.4 and 8.5 and start a deep thinking session again.... Damn, all this is getting pretty complicated. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2005-10-13 20:35 Message: Logged In: YES user_id=80530 Would making the 2004-10-04 commit conditional on #ifdef TCL_THREADS make a greater number of configurations work correctly? ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-10-13 20:28 Message: Logged In: YES user_id=95086 Understand. Thanks for going that deep! The problem is/are obviously the events in the queue which do not get processed (deliberately) at the time one calls Tcl_CleanChannelHandlers. If we do not disable those pending events in Tcl_CCH, then we risk hitting a detached channel later on which inevitably leads to core. If we however do disable them, as my change is doing, then we lose some events pertaining to that channel. So it is a lose-lose situation. I have not found a way to somehow "drain" the channel at the time it is about to be spliced out of the thread/interp. This would be the thing needed here. Perhaps I should do some more thinking... ---------------------------------------------------------------------- Comment By: Jeffrey Hobbs (hobbs) Date: 2005-10-13 20:05 Message: Logged In: YES user_id=72656 I have confirmed that socket-9.2 and 11.3 fail on a SuSE-9.0 system (kernel 2.4.21-297-default). Both fail with: ---- Result was: 65536 ---- Result should have been (exact matching): 65566 Verified no problems on Solaris 2.8, Solaris-x86 10, HP-UX 11, AIX 5.1, Linux 2.6 or OS X 10.3. 10.4 (Tiger) does have this interesting repeatable failure: ---- Result was: a:one b: c: ---- Result should have been (exact matching): a:one b: c:two ==== socket-2.11 FAILED Specifically, we seem to be losing a flush on close on 2.4. If I explicitly add a 'flush $s' in the writedata procs, the errors go away (that's for the 9.2 and 11.3 tests - no idea about 2.11). ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2005-10-13 16:32 Message: Logged In: YES user_id=80530 Debian Linux, "Sarge" release is based on Linux 2.4.27, so that's a system to test. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2005-10-13 16:27 Message: Logged In: YES user_id=80530 Some more testing suggests this failure only shows up on (outdated versions of?) the Linux 2.4 kernel. Linux 2.6 users report no test failures. Would be useful for someone to test with the latest 2.4 kernel (kernel.org says that is 2.4.31) in case this is demonstrating a kernel bug that has been fixed. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-10-13 10:51 Message: Logged In: YES user_id=95086 Hm... out of curiosity, I checked this again under Solaris and guess what: it passes all tests OK. This leads me to believe that there is another hidden problem either with this test or with Linux per-se. I can't really point my finger at the exact place yet. If you like I can backoff this change in both 8.4 and 8.5 or invest time to understand and eventually solve this generally. In the former case I'd need to keep my own patched copy of the code because w/o the change, it is unstable and keeps coring my app on all platforms. What should I do? ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-10-13 09:57 Message: Logged In: YES user_id=95086 Of course it is possible to do that. Then, you will have ocassional cores when transferring channels between threads. The fact that this one is exposed *only* on Linux makes me very suspicious about this test. What my change does is prevent processing of events in the thread event queue designated for the currently closing channel. In single threaded env and in the cases the channel transfer caps are not used, this is no issue. In other cases you get difficult to trace and subtle memory corruptions leading to cores at various other non-related places. I can backoff this one easily by commenting out just one single line in tclIO.c but this is potentially just hiding one more deeper problem (on Linux?) which is sitting there waiting to explode. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2005-10-13 03:07 Message: Logged In: YES user_id=80530 Is it reasonable to revert the change (at least on the "stable" core-8-4-branch) until this is sorted out? I'm uneasy about "stable" releases (like an ActiveTcl release which comes from CVS snapshots) possibly having sockets that do not work on Linux. What bug did the 2005-10-04 commit fix? And is it more critical than the failures detected by these tests? ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-10-11 18:56 Message: Logged In: YES user_id=95086 Confirmed on Linux. Indeed, I do have yet to understand why. The same (tests) passed well on Mac OSX. The part in question is in tclIO.c: /* * Must set the interest mask now to 0, otherwise infinite loops * will occur if Tcl_DoOneEvent is called before the channel is * finally deleted in FlushChannel. This can happen if the channel * has a background flush active. * Also, delete all registered file handlers for this channel * (and for the current thread). This prevents executing of pending * file-events still sitting in the event queue of the current thread. * We deliberately do not call UpdateInterest() because this could * re-schedule new events if the channel still needs to be flushed. */ statePtr->interestMask = 0; (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0); Effectively this disables handling of events in the event queue for the current thread which are still pending when the channel is getting spliced out of the thread for later transfer to some other interp/thread. If this is not done, the test will pass but your app will core. To check, just comment-out the above call to watchProc and try again. I guess I will have to see what this test is actually trying to test. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2005-10-11 18:28 Message: Logged In: YES user_id=80530 Revision 1.61.2.12 of tclIO.c introduced these test failures. Same changes in Revision 1.95 introduce the same failures on the HEAD branch. both are commits from 2005-10-04 by vasiljevic ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2005-10-11 18:10 Message: Logged In: YES user_id=80530 um, no, these same tests on the same system pass just fine using the Tcl 8.4.11 release. They fail on core-8-4-branch tip. Something has happened to break these tests, even though I see the breakage on only one platform. I see the failure on a Linux/x86 system. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1323992&group_id=10894 |