#3373 WinSock cores/hangs on thread or app exit

Milestone: obsolete: 8.4.12
Status: open-fixed
Priority: 5
Updated: 2006-03-10
Created: 2006-02-23
Private: No

There is a sequencing problem in the way the
Windows socket/pipe subsystem is torn down. The generic
code which handles finalization invokes thread
exit handlers first, then finalizes the IO subsystem.
Unfortunately, the finalization of the IO subsystem
may access data which has already been garbage-collected by a
thread exit handler. In such cases, the application
will either hang or core.

The attached patch works around this by moving the
functionality of the thread-exit handler into a
special TclpFinalizeSockets() call. This call is now
part of the generic TclFinalizeIOSubsystem().

This change is a no-op for Unix. It affects the Windows
(win/) and old Mac OS 9 (mac/) code bases.
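
For orientation, a minimal sketch of the ordering the patch
establishes (illustrative only; the real TclFinalizeIOSubsystem in
generic/tclIO.c does considerably more):

    /* Channels get closed while the platform socket state is still
     * alive; the teardown that used to run from a thread-exit
     * handler now runs last, from the generic finalizer. */

    extern void TclpFinalizeSockets(void);      /* no-op on Unix */

    void
    TclFinalizeIOSubsystem(void)
    {
        /*
         * 1. Walk the list of open channels and Tcl_Close() each one.
         *    On Windows this can fire TCL_CHANNEL_THREAD_REMOVE thread
         *    actions, which need a live socket subsystem.
         */

        /*
         * 2. Only now tear down the platform socket state.
         */
        TclpFinalizeSockets();
    }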

Discussion

  • Zoran Vasiljevic

    cvs diff against core-8-4-branch

     
    Attachments
  • Don Porter

    Don Porter - 2006-02-28

    Here's the patch in unified diff format for easier review.

    Can you confirm that all the changes to tclEvent.c are
    cosmetic only?

    Does this need further review, or will you be committing it?

    Are similar changes required for Tcl 8.5a4?

     
  • Don Porter

    Don Porter - 2006-02-28
     
    Attachments
  • Zoran Vasiljevic

    The changes in tclEvent.c are just cosmetic.

    I wanted to check this in, but AK suggested filing the bug report
    so others can comment.

    Yes, similar changes are also needed for 8.5. I wanted
    to do both in one run.

     
  • Andreas Kupries

    Andreas Kupries - 2006-02-28

    Hm. I can't see in the patch which of the data released by the
    thread-exit handler is accessed later during finalization.
    The event source? If yes, how?

    Question: Is that something which in 8.5 can be handled by
    the TIP #218 Thread Actions? (The last channel removed from a
    thread causes tear-down of the subsystem, just as the first
    channel added may auto-initialize it?)

     
  • Zoran Vasiljevic

    OK. Easy. But you will have to follow me...

    If you open a socket, don't close it and exit the thread
    (or the application), this will eventually come to

    Tcl_FinalizeThread()

    The Tcl_FinalizeThread will pull thread-exit handlers
    one by one. One of them will be the handler to tear down
    the socket subsystem.
    This thread-exit handler will tear down the socket thread and
    all the synchronization primitives. After that,
    no calls into this subsystem should be performed.
    But....
    After that, the TclFinalizeIOSubsystem() is called.
    The TclFinalizeIOSubsystem walks the list of open channels,
    finds one still open and tries to close it (line 254 in generic/tclIO.c).
    The Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
    The Tcl_CutChannel will, as the last step of the operation, call
    ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
    Understand? Now, the ThreadActionProc will result in
    TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
    using tsdPtr->socketListLock, will attempt to SendMessage to a
    window owned by the SocketThread, which has already exited,
    and so on and so on...

    Is this now a little bit more clear?

    Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
    If this is in place, then this will be better and more logical. The first
    channel/socket inits, the last closing tears down the subsystem.
    This is how I'd have done it if I had designed it. But I didn't, so I
    had to at least rectify the obviously broken current implementation.
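
    The failure pattern, reduced to a stand-alone model (illustrative
    plain C, not the Tcl sources): an exit handler frees state that a
    later finalization step still dereferences.

        #include <stdio.h>
        #include <stdlib.h>

        struct subsystem { int alive; };
        static struct subsystem *sockets;

        static void threadExitHandler(void)
        {
            free(sockets);              /* socket subsystem torn down */
            sockets = NULL;
        }

        static void finalizeIOSubsystem(void)
        {
            /* Closing a leftover channel lands here, in a thread
             * action that still expects the subsystem to be alive. */
            if (sockets == NULL) {
                fprintf(stderr, "use after teardown: hang or core\n");
            }
        }

        int main(void)
        {
            sockets = malloc(sizeof(*sockets));
            sockets->alive = 1;
            threadExitHandler();        /* old order: exit handlers first */
            finalizeIOSubsystem();      /* then IO finalization: boom */
            return 0;
        }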

     
  • Andreas Kupries

    Andreas Kupries - 2006-02-28

    > OK. Easy. But you will have to follow me...

    I will try.

    > If you open a socket, don't close it and exit the thread
    > (or the application), this will eventually come to
    >
    > Tcl_FinalizeThread()
    >
    > The Tcl_FinalizeThread will pull thread-exit handlers
    > one by one. One of them will be the handler to tear down
    > the socket subsystem.

    > This thread-exit handler will tear down the socket thread and
    > all the synchronization primitives. After that,
    > no calls into this subsystem should be performed.

    Ah, now I see what problem I had before. I was looking at
    the TclpFinalizeSockets for the Mac :( D'oh.

    > But....
    > After that, the TclFinalizeIOSubsystem() is called.
    > The TclFinalizeIOSubsystem walks the list of open channels,
    > finds one still open and tries to close it (line 254 in
    > generic/tclIO.c).
    > The Tcl_Close will invoke CloseChannel and this will call
    > Tcl_CutChannel.
    > The Tcl_CutChannel will, as the last step of the operation, call
    > ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
    > Understand? Now, the ThreadActionProc will result in
    > TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
    > using tsdPtr->socketListLock, will attempt to SendMessage to a
    > window owned by the SocketThread, which has already exited,
    > and so on and so on...

    Ok, yes, I see the problem with that. So the new ThreadAction
    stuff I did caused this, due to not fully understood
    finalize ordering. My apologies. Although ... no, we had no
    thread actions before, causing crashes when a socket moved
    around between threads. My integration fixed that and
    opened up the problem with the finalize ordering :( Thanks for
    finding and fixing this.

    > Is this now a little bit more clear?

    Yep.

    > Concerning lazy init/teardown of the socket subsystem, the
    > answer is: yes.
    > If this is in place, then this will be better and more logical.
    > The first channel/socket inits, the last closing tears down
    > the subsystem.
    > This is how I'd have done it if I had designed it.

    Note: ThreadActions are a very new thing; they were not
    available when the driver was written initially.

    > But I didn't, so I had to at least rectify the obviously
    > broken current implementation.

    I guess we can try to rewrite the driver to use the
    thread actions in this way, now that the problem is fixed.

    Note: Do you have test cases which demonstrate the crash
    with the old implementation? That would be good to have, to
    prevent any future implementation from running into the same
    trap.

     
  • Zoran Vasiljevic

    I wish I had a simple test case...
    Normally, one would create a thread (using the testthread command
    from the test suite), open a socket channel in that thread and then
    exit the thread. But... this will NOT always break. Sometimes
    nothing will happen, sometimes the thread will never exit and
    everything will stall, and sometimes the app will core. In our app
    it was about 2 in 10, i.e. 2 hangs out of 10 attempts.
    Well, I could try to stress the behaviour by creating/destroying
    threads and channels in a row some tens or hundreds of times...
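
    A sketch of what such a stress loop could look like against the
    public Tcl C thread API (illustrative; it assumes a threads-enabled
    build, and localhost:12345 is a placeholder for any listening
    socket):

        #include <tcl.h>

        /* Each worker opens a socket and exits its thread without
         * closing it, exercising the thread-exit finalization path. */
        static Tcl_ThreadCreateType
        Worker(ClientData clientData)
        {
            Tcl_Interp *interp = Tcl_CreateInterp();

            Tcl_Eval(interp, "socket localhost 12345");  /* never closed */
            Tcl_DeleteInterp(interp);
            Tcl_FinalizeThread();      /* runs the thread-exit handlers */
            TCL_THREAD_CREATE_RETURN;
        }

        int
        main(int argc, char **argv)
        {
            int i, result;

            Tcl_FindExecutable(argv[0]);
            for (i = 0; i < 100; i++) {     /* at ~2 in 10, a hundred
                                             * rounds should hit it */
                Tcl_ThreadId tid;

                if (Tcl_CreateThread(&tid, Worker, NULL,
                        TCL_THREAD_STACK_DEFAULT,
                        TCL_THREAD_JOINABLE) != TCL_OK) {
                    return 1;
                }
                Tcl_JoinThread(tid, &result);
            }
            return 0;
        }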

     
  • Andreas Kupries

    Andreas Kupries - 2006-02-28

    Yes, that (looping and repetition) is how I would do it as
    well for a case which fails non-deterministically in
    general, but with a relatively high probability. 20% is high
    IMHO. So trying a hundred times is good enough for me.

     
  • Zoran Vasiljevic

    Ah, Andreas, regarding your hint about lazy handling of
    the socket subsystem init/teardown: the answer is
    unfortunately NO, rather than yes.

    Why?

    Because this thing would just needlessly init/teardown
    the subsystem exactly 1000 times, which is 999 times
    too many:

    time {close [socket $host $port]} 1000

    if there are no other sockets open in
    the process.

    I believe we should keep the explicit teardown.

    I also believe we should introduce an explicit
    init as well, rather than checking the state
    of the subsystem all the time via InitSockets().
    This new call can be part of the Tcl lib bootstrap
    during initialization of the IO subsystem. This
    way it will be symmetric and easy to understand.

    What do you think?

     
  • Zoran Vasiljevic

    • status: open --> open-fixed
     
  • Zoran Vasiljevic

    Ok, it is now in both the 8.4 and head branches.
    I will still have to add tests exposing
    this bug so it won't hit us in the future, hence
    the bug ticket is still left open.

     
  • David Gravereaux

    r1.52 has introduced a deadlock problem for when the
    notification window cannot start. The 20-second timeout that used
    to be on line 502 needs to be put back.

    What happens if tsdPtr->readyEvent is never signaled? It
    blocks forever. Don't let it block forever; do something
    reasonable to avoid a deadlock.

     
  • Zoran Vasiljevic

    I would love to hear why and where the deadlock is.

    If you recall, there was a thread on the Tcl core list about
    that, where I asked if there is (or could be) a situation where
    the socket thread cannot start because it cannot create that
    dummy window for event passing. Well, nobody commented
    on that, and the Windows manuals were also not helpful.
    I decided to let it block there for a reason. If it blocks, then there
    is some other problem, and it is not OK to sweep it under the carpet.
    You speak of deadlocks. What deadlocks? Can you explain when
    and how the deadlock can happen?

    I have no problem with putting that silly timeout there, but from
    this perspective and with this knowledge I think it is just bad
    design to leave it there. I'm open to learn, though.

     
  • David Gravereaux

    I already explained why in email. The problem is not swept
    under the carpet. By removing the timeout and replacing it
    with INFINITE, you have created a situation where a deadlock
    can happen.

    Let's go over this again... actually, why don't I just paste
    my email here instead:

    Zoran,

    It isn't a timer, it's a wait with a timeout. It waits for
    the async notification window to be in a ready state for use
    by WSAAsyncSelect. If it really is taking over 20 seconds to
    create a window in a thread, something else is wrong. The
    state left after the timeout isn't undefined: all publics
    do the SocketsEnabled() check to see that the system is
    usable, else punt. I don't see how using an infinite
    timeout makes it wait less for the window to become ready.

    If you have patches for improvements, go for it, especially
    regarding how Tcl does its tear-down. Yes, I agree that the
    I/O system tear-down is a bit backassward.

     
  • Zoran Vasiljevic

    David,

    You are speaking of some email where you explained
    the problem. Where did you send it? I never got any such email!

    All I got was your response in this thread in the bug tracker.
    Hence I had to ask what's wrong. You would do the same,
    wouldn't you? I believe we have a communication problem here...

    I'm quoting:
    "It waits for the async notification window to be in a ready state
    for use by WSAAsyncSelect. If it really is taking over 20 seconds
    to create a window in a thread, something else is wrong."

    WHAT else is wrong? Is this system (Windows) so non-deterministic?
    I asked once. I'm asking once again:

    tsdPtr->hwnd = CreateWindow("TclSocket", "TclSocket",
            WS_TILED, 0, 0, 0, 0, NULL, NULL, windowClass.hInstance, arg);

    Is this call (CreateWindow) defined to return at some point?
    Can it block forever?
    If yes, for what reason, i.e. under which circumstances?
    If yes, is the system usable any more after that?
    Do you have a situation where you can simply
    and easily reproduce it?

    I have read the function specs in MSDN and there is
    nothing about that call blocking indefinitely.

    I decided to block there instead of glossing over it. I would
    like to know WHY the app blocks there! It is so simple
    to put a timeout there and give up, but this is not buying
    you anything and will most probably hang somewhere else.

    If CreateWindow is known to block indefinitely, then
    why has anybody created such a poor implementation
    in the first place? If that is so, then it has to be rewritten.
    I never solve such problems by brute force (which the
    timeout at THAT place definitely is). And I have good reasons
    to follow that mantra.

    So, once again: can CreateWindow block indefinitely?
    If yes: when and why?

     
  • David Gravereaux

    What guarantee do you have that the thread can create the
    new window and hit the winproc's WM_CREATE? An INFINITE
    timeout is rather presumptuous.

    Please find it in your heart to put back the conservative
    "head check" timeout.

     
  • Zoran Vasiljevic

    The first thing the SocketThread does is
    CreateWindow (line 2326),
    then
    SetEvent(tsdPtr->readyEvent) (line 2333).

    The thread creating it sits in
    WaitForSingleObject(tsdPtr->readyEvent, INFINITE);
    and waits for the readyEvent to be signaled.

    There is no problem there, unless CreateWindow()
    blocks forever. Then the SetEvent is never executed
    and the caller blocks indefinitely. Hence my question
    once more: can CreateWindow block forever?
    Yes or no? If yes: when and why?

    I can put the timeout back there, but I do not see any valid
    reason to do that. Do you have a test case that brings
    the system into such a state?
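
    A stand-alone model of that handshake (illustrative names, not the
    actual tclWinSock.c code); the wait at the bottom is exactly where
    the 20000-ms-versus-INFINITE disagreement lives:

        #include <windows.h>
        #include <stdio.h>

        static HANDLE readyEvent;

        static DWORD WINAPI SocketThread(LPVOID arg)
        {
            /* CreateWindow("TclSocket", ...) would happen here. */
            SetEvent(readyEvent);       /* window is ready */
            return 0;                   /* real code: message loop */
        }

        int main(void)
        {
            HANDLE thread;
            DWORD rc;

            readyEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
            thread = CreateThread(NULL, 0, SocketThread, NULL, 0, NULL);

            rc = WaitForSingleObject(readyEvent, 20000 /* or INFINITE */);
            if (rc != WAIT_OBJECT_0) {
                /* David's position: punt and mark sockets unusable.
                 * Zoran's position: this should never happen, so panic. */
                fprintf(stderr, "socket helper thread failed to start\n");
                return 1;
            }
            WaitForSingleObject(thread, INFINITE);
            return 0;
        }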

     
  • David Gravereaux

    The issue is not CreateWindow blocking, but the time it
    takes the thread to start. If your system is in such a
    state that it takes more than 20 seconds to get the
    notification window up, you'd better punt. That's my point.
    The timeout was very valid. By taking it out, you're saying
    the state of the system doesn't matter and we should wait
    until next Thursday for the thread to start and create the
    notification window.

    All it is is a "head check" and nothing more.

     
  • Zoran Vasiljevic

    This does not sound very deterministic to me. I would never
    trust such a system. Why 20 seconds? Why not 10 or 100 or 1000?

    I do not believe this is the right way to go.

    I guess if the thread needs that long (20 seconds in this case is
    almost forever) to start up, then something is wrong with the
    computer and there is no reason to continue.
    If a thread A wants/needs to start a thread B, then it must
    wait forever if needed, because if it really does wait forever
    then you have some other problem. I see it this way:
    somebody wants to use socket communication. For that, Windows
    needs a helper thread. If this thread cannot be started, then we
    need to Tcl_Panic! See the example of the notifier thread: if that
    thread does not start, would you also wait 20 seconds and declare
    thread communication useless, or would you panic the system?
    Fortunately, the wise designer of that part decided to Tcl_Panic,
    which is the only correct way to go.
    Why relax that behaviour for sockets?

    Please understand: the simplest thing for me would be to go and put
    that timeout back. But I'm still not convinced that would be the
    right solution.

     
  • David Gravereaux

    I hadn't thought about a hard abort before, but it wouldn't
    be wrong in this case, compared to the punt. The same rule can
    apply to the preceding check for a thread handle, and even the
    following window handle check, too.

    We can't allow TclpHasSockets() to cause a hard abort,
    though. So it might be a bit of a dance to get right.

     
  • Zoran Vasiljevic

    Correct. Once the DLL is loaded OK and all needed functions were found,
    we *should* have a working socket subsystem.
    If at that point ANY of the things fails to initialize, then this is
    a sign to stop. It is better to stop than to go halfway on, IMO.

    The TclpHasSockets() should not break and should really just test
    winsock.hModule != NULL and return TCL_OK in that case.

    I guess the only question now is how to get all this done right.
    The tricky part is InitSockets() and TclpHasSockets().
    These two must somehow divorce. I'm not yet sure how, but
    I will give it a thought today/tomorrow.

    I would Tcl_Panic in all cases where something goes wrong in InitSockets(),
    if and only if all functions were loaded OK from the DLL.
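
    A sketch of that policy (illustrative; the real InitSockets() lives
    in win/tclWinSock.c, and StartSocketHelperThread() is a hypothetical
    stand-in for the later initialization steps):

        #include <windows.h>
        #include <tcl.h>

        static struct { HMODULE hModule; } winsock;   /* module state as
                                                       * in tclWinSock.c */

        static int
        StartSocketHelperThread(void)   /* hypothetical stand-in */
        {
            return 1;
        }

        static void
        InitSockets(void)
        {
            if (winsock.hModule == NULL) {
                return;     /* DLL missing: sockets simply unavailable */
            }
            if (!StartSocketHelperThread()) {
                Tcl_Panic("InitSockets: cannot start socket helper thread");
            }
        }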

     
  • Zoran Vasiljevic

    Here is how I would solve it:

    I would introduce a new platform-specific call: TclpInitializeSockets().
    This should be symmetric to TclpFinalizeSockets() and should be called
    from within TclInitSubsystems() in the generic layer.
    This call would do exactly what InitSockets() does in
    tclWinSock.c. Actually, I'd just rename InitSockets() ->
    TclpInitializeSockets() and have it called from the generic part
    explicitly.

    OTOH, TclpHasSockets() will just examine winsock.hModule != NULL,
    hence it is non-blocking. It will not attempt to lazy-load the entire
    socket subsystem as it does now. So that's one lock source less, which
    is a nice side effect.

    This way the entire socket subsystem is initialized *once*, at thread
    creation. If any call into it breaks at some place AFTER (initially)
    loading the socket DLL, then we punt (Tcl_Panic).

    What do you think?
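
    A sketch of the proposed non-blocking TclpHasSockets() (illustrative;
    winsock.hModule stands in for the handle tclWinSock.c keeps after
    loading the WinSock DLL, and TclpInitializeSockets() is assumed to
    have run once from TclInitSubsystems()):

        #include <windows.h>
        #include <tcl.h>

        static struct { HMODULE hModule; } winsock;

        int
        TclpHasSockets(Tcl_Interp *interp)
        {
            /* Non-blocking: no lazy loading and no locks; just report
             * whether startup initialization succeeded. */
            if (winsock.hModule != NULL) {
                return TCL_OK;
            }
            if (interp != NULL) {
                Tcl_AppendResult(interp,
                        "sockets are not available on this system",
                        (char *) NULL);
            }
            return TCL_ERROR;
        }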

     
  • Donal K. Fellows

    Sounds like a good plan to me (speaking from the sidelines).

     
  • David Gravereaux

    Sounds great.

     
