[Tcl-bugs] [ tcl-Bugs-1323992 ] socket.test failures

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Bugs item #1323992, was opened at 2005-10-11 17:58
Message generated for change (Comment added) made by vasiljevic
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1323992&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 27. Channel Types
Group: current: 8.4.11
Status: Open
Resolution: None
Priority: 9
Submitted By: Don Porter (dgp)
Assigned to: Zoran Vasiljevic (vasiljevic)
Summary: socket.test failures

Initial Comment:

I'm seeing tests socket-9.2
and socket-11.13 fail consistently,
but only on one platform.

I suspect the real problem is
on the system and not in Tcl,
but I don't know how to verify
that.

These tests are failing because
the [vwait] is timing out.  The
tests are not well constructed
to communicate that scenario
to the user.  I had to hack the
tests to verify that's what is
happening.

----------------------------------------------------------------------

>Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-10-13 23:02

Message:
Logged In: YES 
user_id=95086

Yes. It is 11.13 test which is failing. I have added this 
into the both 8.4 and 8.5 branches.

As I see Andreas has some ideas and I will have to 
first understand them. I will leave this for tomorrow
(its pretty late here now).

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2005-10-13 22:10

Message:
Logged In: YES 
user_id=72656

The patch said 11.3 in the C comments - it's 11.13 (oops).

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2005-10-13 22:08

Message:
Logged In: YES 
user_id=72656

Attached is my simple but descriptive interim patch for
this.  I'll leave it to Zoran to commit.

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2005-10-13 21:06

Message:
Logged In: YES 
user_id=75003

Modulo variable declarations and type casts ...

static void EventProcessingOff (chan) {
    chan  = Tcl_GetTopChannel (chan);
    watch = Tcl_ChannelWatchProc (chan)
    inst  = Tcl_GetChannelInstanceData (chan);
    watch (inst, 0);
}

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2005-10-13 21:03

Message:
Logged In: YES 
user_id=75003

I agree with the ifdef THREADS as a short-term solution.

From the digging Don has done it seems to me that during the
_regular_ closing of channel Tcl_ClearChannelHandlers
(Tcl_CCH) is called at a very early point in time, when it
still may generate events for flushing in the background.
And it is expected that these events are processed, and full
shutdown of the channel happens only after that has happened.

So our belief that we could co-opt Tcl_CCH as the point
where event-processing is switched off in preparation for
the transfering of a channel was wrong. We need it, but
because of the above it must only remove the parts generated
events, and not the parts processing those still in the
pipeline. This part of the transfer has to go somewhere
else. We might try to co-opt a different function for this,
or we may need a new public API.

Zoran, we will have to go back to the mails we exchanged,
dig out the current code sequences used by the channel
transfer and see if one of  the other functions used can be
co-opted instead.

If  not, then a new API will be needed to perform the
action. And if we come to that I will also entertain the
notion of moving the whole transfer process and structures
into the core, as a Tcl_TransferChannel (chan,
destinationthread) or some such. Because we would need a TIP
anyway in that case. The threads package would then contain
only the Tcl command around that API.

... I belay that. We do not need new APIs. We can use the
existing APIs to get the instance data of a channel, its
watchproc, and then  do the call "watchproc (instance, 0)"
using the piecves we got. No need for additional APIs.
Tcl_ChannelWatchProc, Tcl_GetChannelInstanceData, and
Tcl_GetTopChannel. The latter to make sure that our
watchproc call goes through the whole stack, if there is one.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-10-13 20:41

Message:
Logged In: YES 
user_id=95086

Yes. As the matter of fact, why not! 
At least until I figure out how to close this hole properly.
I will commit ifdefs into both 8.4 and 8.5 and start a deep
thinking session again.... Damn, all this is getting pretty
complicated.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-13 20:35

Message:
Logged In: YES 
user_id=80530

Would making the 2004-10-04 commit
conditional on #ifdef TCL_THREADS
make a greater number of configurations
work correctly?

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-10-13 20:28

Message:
Logged In: YES 
user_id=95086

Understand. Thanks for going that deep! 

The problem is/are obviously the events in the queue which
do not get processed (deliberately) at the time one
calls Tcl_CleanChannelHandlers. 
If we do not disable those pending events in Tcl_CCH, 
then we risk hitting a detached channel later on which
inevitably leads to core. 
If we however do disable them, as my change is doing,
then we lose some events pertaining to that channel. 
So it is a lose-lose situation.

I have not found a way to somehow "drain" the channel
at the time it is about to be spliced out of the thread/interp.
This would be the thing needed here. Perhaps I should 
do some more thinking...

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2005-10-13 20:05

Message:
Logged In: YES 
user_id=72656

I have confirmed that socket-9.2 and 11.3 fail on a SuSE-9.0
system (kernel 2.4.21-297-default).  Both fail with:

---- Result was:
65536
---- Result should have been (exact matching):
65566

Verified no problems on Solaris 2.8, Solaris-x86 10, HP-UX
11, AIX 5.1, Linux 2.6 or OS X 10.3.  10.4 (Tiger) does have
this interesting repeatable failure:

---- Result was:
a:one b: c:
---- Result should have been (exact matching):
a:one b: c:two
==== socket-2.11 FAILED

Specifically, we seem to be losing a flush on close on 2.4.
 If I explicitly add a 'flush $s' in the writedata procs,
the errors go away (that's for the 9.2 and 11.3 tests - no
idea about 2.11).

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-13 16:32

Message:
Logged In: YES 
user_id=80530

Debian Linux, "Sarge" release
is based on Linux 2.4.27, so
that's a system to test.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-13 16:27

Message:
Logged In: YES 
user_id=80530

Some more testing suggests
this failure only shows up
on (outdated versions of?) the
Linux 2.4 kernel.  Linux 2.6
users report no test failures.

Would be useful for someone
to test with the latest 2.4 kernel
(kernel.org says that is 2.4.31)
in case this is demonstrating
a kernel bug that has been fixed.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-10-13 10:51

Message:
Logged In: YES 
user_id=95086

Hm... out of curiosity, I checked this again under Solaris and 
guess what: it passes all tests OK. 
This leads me to believe that there is another hidden problem 
either with this test or with Linux per-se. I can't really point 
my finger at the exact place yet. 
If you like I can backoff this change in both 8.4 and 8.5
or 
invest time to understand and eventually solve this generally.

In the former case I'd need to keep my own patched copy
of the code because w/o the change, it is unstable and 
keeps coring my app on all platforms. 

What should I do? 

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-10-13 09:57

Message:
Logged In: YES 
user_id=95086

Of course it is possible to do that. Then, you will have ocassional
cores when transferring channels between threads.
The fact that this one is exposed *only* on Linux makes me very
suspicious about this test.
What my change does is prevent processing of events in the thread
event queue designated for the currently closing channel. In single
threaded env and in the cases the channel transfer caps are not used,
this is no issue. In other cases you get difficult to trace and subtle 
memory corruptions leading to cores at various other non-related places.

I can backoff this one easily by commenting out just one single line
in tclIO.c but this is potentially just hiding one more deeper problem 
(on Linux?) which is sitting there waiting to explode.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-13 03:07

Message:
Logged In: YES 
user_id=80530

Is it reasonable to revert the change
(at least on the "stable" core-8-4-branch)
until this is sorted out?

I'm uneasy about "stable" releases
(like an ActiveTcl release which
comes from CVS snapshots) possibly
having sockets that do not work on
Linux.

What bug did the 2005-10-04 commit
fix?  And is it more critical than
the failures detected by these tests?

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-10-11 18:56

Message:
Logged In: YES 
user_id=95086

Confirmed on Linux. Indeed, I do have yet to understand why.
The same (tests) passed well on Mac OSX. 

The part in question is in tclIO.c:

    /*
     * Must set the interest mask now to 0, otherwise infinite loops
     * will occur if Tcl_DoOneEvent is called before the channel is
     * finally deleted in FlushChannel. This can happen if the channel
     * has a background flush active.
     * Also, delete all registered file handlers for this channel
     * (and for the current thread). This prevents executing of pending
     * file-events still sitting in the event queue of the current thread.
     * We deliberately do not call UpdateInterest() because this could
     * re-schedule new events if the channel still needs to be flushed.
     */

    statePtr->interestMask = 0;
    (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0);

Effectively this disables handling of events in the event queue
for the current thread which are still pending when the channel
is getting spliced out of the thread for later transfer to some other
interp/thread. If this is not done, the test will pass but your app
will core. To check, just comment-out the above call to watchProc
and try again. 

I guess I will have to see what this test is actually trying to test.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-11 18:28

Message:
Logged In: YES 
user_id=80530

Revision 1.61.2.12 of tclIO.c
introduced these test failures.

Same changes in Revision 1.95
introduce the same failures on
the HEAD branch.

both are commits from 2005-10-04
by vasiljevic

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-11 18:10

Message:
Logged In: YES 
user_id=80530

um, no, these same tests on
the same system pass just
fine using the Tcl 8.4.11 release.

They fail on core-8-4-branch tip.

Something has happened to
break these tests, even though
I see the breakage on only one
platform.

I see the failure on a Linux/x86
system.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1323992&group_id=10894

[Tcl-bugs] [ tcl-Bugs-1323992 ] socket.test failures

The Tool Command Language implementation

[Tcl-bugs] [ tcl-Bugs-1323992 ] socket.test failures