#3262 socket.test failures

Milestone: obsolete: 8.4.11
Status: closed-rejected
Priority: 9
Updated: 2005-10-14
Created: 2005-10-11
Creator: Don Porter
Private: No

I'm seeing tests socket-9.2
and socket-11.13 fail consistently,
but only on one platform.

I suspect the real problem is
on the system and not in Tcl,
but I don't know how to verify
that.

These tests are failing because
the [vwait] is timing out. The
tests are not well constructed
to communicate that scenario
to the user. I had to hack the
tests to verify that's what is
happening.

Discussion

  • Don Porter
    2005-10-11

    • priority: 5 --> 7
     
  • Don Porter
    2005-10-11

    Logged In: YES
    user_id=80530

    um, no, these same tests on
    the same system pass just
    fine using the Tcl 8.4.11 release.

    They fail on core-8-4-branch tip.

    Something has happened to
    break these tests, even though
    I see the breakage on only one
    platform.

    I see the failure on a Linux/x86
    system.

     
  • Don Porter
    2005-10-11

    • priority: 7 --> 9
    • assigned_to: andreas_kupries --> vasiljevic
     
  • Don Porter
    2005-10-11

    Revision 1.61.2.12 of tclIO.c
    introduced these test failures.

    Same changes in Revision 1.95
    introduce the same failures on
    the HEAD branch.

    Both are commits from 2005-10-04
    by vasiljevic.

     
  • Zoran Vasiljevic (user_id=95086)

    Confirmed on Linux. Indeed, I have yet to understand why.
    The same tests pass fine on Mac OS X.

    The part in question is in tclIO.c:

    /*
    * Must set the interest mask now to 0, otherwise infinite loops
    * will occur if Tcl_DoOneEvent is called before the channel is
    * finally deleted in FlushChannel. This can happen if the channel
    * has a background flush active.
    * Also, delete all registered file handlers for this channel
    * (and for the current thread). This prevents executing of pending
    * file-events still sitting in the event queue of the current thread.
    * We deliberately do not call UpdateInterest() because this could
    * re-schedule new events if the channel still needs to be flushed.
    */

    statePtr->interestMask = 0;
    (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0);

    Effectively this disables handling of events in the event queue
    for the current thread which are still pending when the channel
    is getting spliced out of the thread for later transfer to some other
    interp/thread. If this is not done, the test will pass but your app
    will core. To check, just comment-out the above call to watchProc
    and try again.

    I guess I will have to see what this test is actually trying to test.

     
  • Don Porter
    Don Porter
    2005-10-13

    Is it reasonable to revert the change
    (at least on the "stable" core-8-4-branch)
    until this is sorted out?

    I'm uneasy about "stable" releases
    (like an ActiveTcl release which
    comes from CVS snapshots) possibly
    having sockets that do not work on
    Linux.

    What bug did the 2005-10-04 commit
    fix? And is it more critical than
    the failures detected by these tests?

     
  • Zoran Vasiljevic (user_id=95086)

    Of course it is possible to do that. Then you will get occasional
    core dumps when transferring channels between threads.
    The fact that this one is exposed *only* on Linux makes me very
    suspicious about this test.
    What my change does is prevent processing of events in the thread
    event queue designated for the currently closing channel. In a
    single-threaded environment, and in cases where the channel
    transfer capabilities are not used, this is not an issue. In other
    cases you get subtle, difficult-to-trace memory corruptions leading
    to core dumps at various other unrelated places.

    I can back off this change easily by commenting out just one single
    line in tclIO.c, but that is potentially just hiding a deeper problem
    (on Linux?) which is sitting there waiting to explode.

     
  • Zoran Vasiljevic (user_id=95086)

    Hm... out of curiosity, I checked this again under Solaris and
    guess what: it passes all tests OK.
    This leads me to believe that there is another hidden problem,
    either with this test or with Linux per se. I can't really put
    my finger on the exact place yet.
    If you like, I can back off this change in both 8.4 and 8.5,
    or invest the time to understand and eventually solve this generally.

    In the former case I'd need to keep my own patched copy
    of the code, because without the change it is unstable and
    keeps coring my app on all platforms.

    What should I do?

     
  • Don Porter
    2005-10-13

    Some more testing suggests
    this failure only shows up
    on (outdated versions of?) the
    Linux 2.4 kernel. Linux 2.6
    users report no test failures.

    It would be useful for someone
    to test with the latest 2.4 kernel
    (kernel.org says that is 2.4.31)
    in case this is demonstrating
    a kernel bug that has been fixed.

     
  • Don Porter
    2005-10-13

    Debian Linux, "Sarge" release
    is based on Linux 2.4.27, so
    that's a system to test.

     
  • Jeffrey Hobbs
    2005-10-13

    I have confirmed that socket-9.2 and 11.3 fail on a SuSE-9.0
    system (kernel 2.4.21-297-default). Both fail with:

    ---- Result was:
    65536
    ---- Result should have been (exact matching):
    65566

    Verified no problems on Solaris 2.8, Solaris-x86 10, HP-UX
    11, AIX 5.1, Linux 2.6 or OS X 10.3. 10.4 (Tiger) does have
    this interesting repeatable failure:

    ---- Result was:
    a:one b: c:
    ---- Result should have been (exact matching):
    a:one b: c:two
    ==== socket-2.11 FAILED

    Specifically, we seem to be losing a flush on close on 2.4.
    If I explicitly add a 'flush $s' in the writedata procs,
    the errors go away (that's for the 9.2 and 11.3 tests - no
    idea about 2.11).

     
  • Zoran Vasiljevic (user_id=95086)

    Understood. Thanks for digging that deep!

    The problem is obviously the events in the queue which
    do not get processed (deliberately) at the time one
    calls Tcl_ClearChannelHandlers.
    If we do not disable those pending events in Tcl_CCH,
    then we risk hitting a detached channel later on, which
    inevitably leads to a core dump.
    If we do disable them, as my change does,
    then we lose some events pertaining to that channel.
    So it is a lose-lose situation.

    I have not found a way to somehow "drain" the channel
    at the time it is about to be spliced out of the thread/interp.
    This would be the thing needed here. Perhaps I should
    do some more thinking...

     
  • Don Porter
    2005-10-13

    Would making the 2005-10-04 commit
    conditional on #ifdef TCL_THREADS
    make a greater number of configurations
    work correctly?

     
  • Zoran Vasiljevic (user_id=95086)

    Yes. As a matter of fact, why not!
    At least until I figure out how to close this hole properly.
    I will commit the ifdefs into both 8.4 and 8.5 and start a deep
    thinking session again... Damn, all this is getting pretty
    complicated.

     
  • Andreas Kupries (user_id=75003)

    I agree with the ifdef THREADS as a short-term solution.

    From the digging Don has done, it seems to me that during the
    _regular_ closing of a channel, Tcl_ClearChannelHandlers
    (Tcl_CCH) is called at a very early point in time, when it
    may still generate events for flushing in the background.
    And it is expected that these events are processed, and full
    shutdown of the channel happens only after that.

    So our belief that we could co-opt Tcl_CCH as the point
    where event processing is switched off in preparation for
    the transfer of a channel was wrong. We need it, but
    because of the above it must only remove the parts generating
    events, and not the parts processing those still in the
    pipeline. This part of the transfer has to go somewhere
    else. We might try to co-opt a different function for this,
    or we may need a new public API.

    Zoran, we will have to go back to the mails we exchanged,
    dig out the current code sequences used by the channel
    transfer and see if one of the other functions used can be
    co-opted instead.

    If not, then a new API will be needed to perform the
    action. And if we come to that I will also entertain the
    notion of moving the whole transfer process and structures
    into the core, as a Tcl_TransferChannel (chan,
    destinationthread) or some such. Because we would need a TIP
    anyway in that case. The threads package would then contain
    only the Tcl command around that API.

    ... I belay that. We do not need new APIs. We can use the
    existing APIs to get the instance data of a channel and its
    watchProc, and then do the call "watchProc(instance, 0)"
    using the pieces we got. No need for additional APIs:
    Tcl_ChannelWatchProc, Tcl_GetChannelInstanceData, and
    Tcl_GetTopChannel. The latter makes sure that our
    watchProc call goes through the whole stack, if there is one.

     
  • Logged In: YES
    user_id=75003

    With the variable declarations and type casts filled in, roughly
    (note that Tcl_ChannelWatchProc takes the channel's type,
    obtained via Tcl_GetChannelType):

    static void
    EventProcessingOff(Tcl_Channel chan)
    {
        Tcl_DriverWatchProc *watchProc;
        ClientData instanceData;

        chan = Tcl_GetTopChannel(chan);
        watchProc = Tcl_ChannelWatchProc(Tcl_GetChannelType(chan));
        instanceData = Tcl_GetChannelInstanceData(chan);
        (*watchProc)(instanceData, 0);
    }

     
  • Jeffrey Hobbs
    2005-10-13

    Attached is my simple but descriptive interim patch for
    this. I'll leave it to Zoran to commit.

     
  • Jeffrey Hobbs
    2005-10-13

     
    Attachments
  • Jeffrey Hobbs
    2005-10-13

    The patch said 11.3 in the C comments - it's 11.13 (oops).

     
  • Zoran Vasiljevic (user_id=95086)

    Yes, it is the 11.13 test which is failing. I have added this
    to both the 8.4 and 8.5 branches.

    As I see it, Andreas has some ideas and I will have to
    understand them first. I will leave this for tomorrow
    (it's pretty late here now).

     
  • Don Porter
    2005-10-13

    Just for the record, the commit
    to HEAD corrects the test failures
    in the default build, but they are
    still present in an --enable-threads
    build.

     
  • Zoran Vasiljevic (user_id=95086)

    Andreas,

    What is the difference between:

    static void
    EventProcessingOff(Tcl_Channel chan)
    {
        Tcl_DriverWatchProc *watchProc;
        ClientData instanceData;

        chan = Tcl_GetTopChannel(chan);
        watchProc = Tcl_ChannelWatchProc(Tcl_GetChannelType(chan));
        instanceData = Tcl_GetChannelInstanceData(chan);
        (*watchProc)(instanceData, 0);
    }

    and what I've done in tclIO.c:

    (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0);

    ?

    I do not see any difference. Perhaps you meant that this should
    be done elsewhere, and/or before or after some other step
    (I don't know), but to me both of the above are practically
    equivalent.

     
  • Logged In: YES
    user_id=75003

    There is no difference in the functionality of the two pieces
    of code. However, "EventProcessingOff" (EPO) is something we can
    implement and use from _outside_ of the core, as it uses only
    public APIs -- i.e. in packages.

    My proposal is to completely remove the watchProc call from
    Tcl_CCH and instead implement EPO in the Thread package, and
    call it in the appropriate place of the transfer sequence.

    That way we can leave the Tcl IO core unmodified and still
    get channel transfers which do not dump core.

     
  • Zoran Vasiljevic (user_id=95086)

    Ah! So this is what you mean... I see...

    Well, to be honest, this would do the trick all right, but it would
    need to be heavily documented in the Tcl_CCH API, i.e.:

    Tcl_ClearChannelHandlers removes all channel handlers and
    event scripts associated with the specified channel, thus
    shutting down all event processing for this channel.
    It will, however, NOT disable events already waiting to fire
    in the event queue of this thread. If you are calling this
    function before Tcl_CutChannel in order to cut the channel from
    the current thread, you will need to make sure those events are
    either processed or discarded. Otherwise you risk late
    references to an already-cut channel, which will result
    in memory corruption.

    Something like the above should be present in the Tcl_CCH
    manpage. Hm... not that nice for an API. But I understand this
    is as far as we can get without turning everything upside down.

    OK. I can live with that.

     
  • Zoran Vasiljevic (user_id=95086)

    Backed off the parts responsible for the failures of the 10.13
    Tcl test, in favour of external event handling
    as suggested by AK. Committed to both the 8.4
    and 8.5 branches.

     