#2823 Cocoa event loop won't let external events through

obsolete: 8.6b1.1
open-wont-fix
7
2012-10-02
2010-07-12
No

I've got a program where I've chopped up a long running computation into smaller pieces, using the

after 0 [after idle {next_iteration}]

idiom for scheduling the next chunk of work while allowing the GUI to stay responsive; I believe this is the recommended way to do this.

Unfortunately my program, which has worked well on my old PPC macs, does not seem to work on my new Intel mac: although the GUI is being updated in response to changes in e.g. label -textvariable values, I'm no longer getting any response to pressing buttons when the computation is running. Since this includes buttons I'd press to control the program, it makes it pretty useless!

Instructions to see the problem:

1. Start wish from a terminal.
2. [source] the attached beachball-bug.tcl.
-> This will create a window "Control panel" with three buttons "Halt", "Pause", and "Run", and two labels (each displaying the number 0).
3. Press "Run".
-> The left label will start to count upwards, slowing down as it goes. The right label will mostly say 1 but sometimes 0.
4. Press "Pause"
-> What should happen is that the counting stops and the right label changes to 2; an [after info] command (that may be typed at the prompt in the terminal) should confirm nothing is happening.
->< However, what rather happens is that the "Pause" button becomes highlighted and *stays highlighted*, while the counting continues unaffected. The mouse pointer will soon turn into the spinning beach ball above wish windows, confirming that wish is not processing any events. At this stage, you may type one (1) command at the terminal and have wish evaluate it; if you make this command "after cancel [after info]" then you can still stop the counting, but otherwise there's no way of stopping it short of killing the wish process.

I see this problem for:
* Tk 8.6b1.2 that I've compiled myself from sources checked out from CVS.
* Tk 8.5.7 from /System/Library/Frameworks/Tk.framework/..., that I suppose Apple compiled.

I don't see this problem for:
* Tk 8.4.19 from /System/Library/Frameworks/Tk.framework/..., that I again suppose Apple compiled.
* Tk 8.5.6 that I compiled on my old mac; this is a PPC executable, but it runs (I suppose) using Rosetta.

The problematic wishes all seem to be 64-bit executables (tcl_platform(wordSize) is 8) whereas the working ones are 32-bit (tcl_platform(wordSize) is 4). I suppose this could mean the faulty/working distinction is rather one of Cocoa/Carbon, but I don't know if there is some other way to check if the 8.4.19 is Carbon or Cocoa.

Discussion

  • Lars Hellström

    Lars Hellström - 2010-07-12

    Demo program (simplification of real app)

     
  • Donal K. Fellows

    • assigned_to: hobbs --> das
     
  • Daniel A. Steffen

    The tk/macosx/README tells you how to differentiate between Carbon and Cocoa.
    However this is most likely an issue with the tcl notifier, and so not a tk bug strictly speaking (although it will only manifest in tk cocoa, which uses the embedded mode of the tcl notifier, unlike tk carbon).

    Can you confirm that this happens with the latest tcl from CVS HEAD? there were a bunch of fixes dealing with recursive event loop invocations not that long ago.
    In any case, to debug this tclMacOSXNotify.c is what would need to be looked at/stepped through. Running sample/taking a stackshot or attaching with gdb when things are spinning would also be useful

    I don't anticipate being able to look into this myself anytime soon however.

     
  • Lars Hellström

    Lars Hellström - 2010-07-13

    Confirming all the working wishes are Carbon (and the buggy Cocoa). Is there a way of forcing build with Carbon? OTOH, I suppose that would mean the executable becomes 32-bit, and wouldn't be able to [load] extensions built 64-bit. Sigh.

    A cvs update of the tcl and tk sources changes nothing that seems related to events, save a few test files. Haven't tried building yet, but I'm probably seeing this for a version later than the changes you mentioned.

     
  • Lars Hellström

    Lars Hellström - 2010-07-16

    A few additional observations:

    1. The following script is sufficient to show how wish goes spinning beachball when one tries to press "Stop":

    pack [label .l -textvariable ::count] -side top
    pack [button .b -text "Start" -command {step}] -side left
    pack [button .b2 -text "Stop" -command {after cancel [after info]}] -side right
    set ::count 0
    proc step {} {
    incr ::count
    after 0 {after idle step}
    }

    The symptoms are a bit different however: in this case, the counting stops (at least visibly).

    2. If the [after 0] is changed to [after 1] then everything works as expected.

    (I still haven't tried debugging it with gdb, but it'd probably be a good exercise.)

     
  • Lars Hellström

    Lars Hellström - 2010-10-28

    Have finally gotten around to running this in a debugger.

    When Wish is in its unresponsive state, it appears to be sitting in the while loop at line 1417 of tclMacOSXNotify.c:

    while (Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

    The full call stack is typically something like

    #0 0x7fff823fb2fa in mach_msg_trap
    #1 0x7fff823fb96d in mach_msg
    #2 0x7fff802493c2 in __CFRunLoopRun
    #3 0x7fff8024884f in CFRunLoopRunSpecific
    #4 0x10038c799 in Tcl_Sleep at tclMacOSXNotify.c:1545
    #5 0x100350d73 in AfterDelay at tclTimer.c:1042
    #6 0x100350446 in Tcl_AfterObjCmd at tclTimer.c:843
    #7 0x100249c3c in NRRunObjProc at tclBasic.c:4371
    #8 0x1002499d1 in TclNRRunCallbacks at tclBasic.c:4318
    #9 0x10024c24b in TclEvalObjEx at tclBasic.c:5893
    #10 0x10024c1ef in Tcl_EvalObjEx at tclBasic.c:5874
    #11 0x100351019 in AfterProc at tclTimer.c:1173
    #12 0x1003501a6 in TclServiceIdle at tclTimer.c:745
    #13 0x100326bad in Tcl_ServiceAll at tclNotify.c:1087
    #14 0x10038c4bb in UpdateWaitingListAndServiceEvents at tclMacOSXNotify.c:1417
    #15 0x7fff8026d077 in __CFRunLoopDoObservers
    #16 0x7fff802490cf in __CFRunLoopRun
    #17 0x7fff8024884f in CFRunLoopRunSpecific
    #18 0x7fff8356691a in RunCurrentEventLoopInMode
    #19 0x7fff8356671f in ReceiveNextEventCommon
    #20 0x7fff835665d8 in BlockUntilNextEventMatchingListInMode
    #21 0x7fff844ec29e in _DPSNextEvent
    #22 0x7fff844ebbed in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
    #23 0x1001309f8 in -[TKApplication(TKNotify) nextEventMatchingMask:untilDate:inMode:dequeue:] at tkMacOSXNotify.c:63
    #24 0x7fff846cc7c1 in -[NSCell trackMouse:inRect:ofView:untilMouseUp:]
    #25 0x7fff846fd536 in -[NSButtonCell trackMouse:inRect:ofView:untilMouseUp:]
    #26 0x7fff846cb4b5 in -[NSControl mouseDown:]
    #27 0x7fff845e5763 in -[NSWindow sendEvent:]
    #28 0x7fff8451aee2 in -[NSApplication sendEvent:]
    #29 0x100130b4e in -[TKApplication(TKNotify) sendEvent:] at tkMacOSXNotify.c:85
    #30 0x10013112c in TkMacOSXEventsCheckProc at tkMacOSXNotify.c:307
    #31 0x100326a28 in Tcl_DoOneEvent at tclNotify.c:964
    #32 0x100026a30 in Tk_MainLoop at tkEvent.c:2134
    #33 0x10003add1 in Tk_MainEx at tkMain.c:363
    #34 0x10000545a in main at tkAppInit.c:70

    The AfterDelay and stuff is because beachball-bug.tcl contains an [after 5] (instead of some actual processing) to slow down the counter. The while loop mentioned above is part of #14 (UpdateWaitingListAndServiceEvents), whereas the nearest function from tkMacOSXNotify is #30 (TkMacOSXEventsCheckProc). I suspect this confirms Daniel's suspicions below.

     
  • Lars Hellström

    Lars Hellström - 2010-11-02

    Contued working on this. I request comments from someone who understands how a Tcl notifier is normally supposed to work.

    If I understand the Tcl Notifier manpage correctly, a custom modifier embedded into an external event loop (which tclMaxOSXNotify.c appears to be) is supposed to call Tcl_ServiceAll to make sure Tcl events are serviced, *but only once* on each invokation. This makes sense, since Tcl_ServiceAll already contains a loop over the event queue.

    However, when I try this (reduce the while loop to just a Tcl_ServiceAll call), parts of the Tcl test suite seem to become very slow, and http11.test gets stuck entirely.

    Reading up on CFRunLoop*, I think I understand why. UpdateWaitingListAndServiceEvents is set up as an observer callback, and the kCFRunLoopBeforeWaiting case in which it calls Tcl_ServiceAll is "I'm about to go to sleep now", so this is the CF equivalent of an [after idle] callback. The effect of servicing Tcl events here is that they only get serviced after some other event has occurred, and the CF runloop is preparing to go idle again. Hence a process must receive some external event between each invokation of UpdateWaitingListAndServiceEvents, and if it never does (as might be the case for some thread of http11.test) then it is stuck.

    That doesn't mean the while loop is the right thing to do, however; I rather suspect it could be covering up a second bug -- namely that Tcl events are only serviced at idle time -- but I haven't studied the code in enough detail to say that for sure. There could be other situations than this kCFRunLoopBeforeWaiting callback in which Tcl event are serviced, and in that case it might be OK.

    Also, upon randomly browsing documentation, it has occurred to me that a minimal fix to the problem at hand could be to set up a version 0 CFRunLoopSource, and replace the while loop

    while ( Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

    with something like

    if ( Tcl_ServiceAll() && tsdPtr->waitTime == 0) {
    CFRunLoopSourceSignal( tsdPtr->wantCallbackRLSource );
    }

    to make sure the runloop thinks "oh, something happened", and go idle again a bit later. But I *really* want a second pair of eyes to take a look at the problem as a whole before attempting that.

     
  • Daniel A. Steffen

    Lars, thanks for looking into this!

    The TkCocoa notifier is a hybrid of an embedding and an embedded notifier, see
    tkMacOSXNotifier.c for details. The outermost event loop is driven by tcl via
    Tcl_WaitForEvent() and has a tk event source that processes NSEvents (i.e.
    embedding). That processing may however involve nested CFRunLoops (possibly
    running in different runloop modes) e.g. during menu tracking/window
    resizing/when a modal dialog is presented etc. At such time, the relationship
    is reversed (i.e. embedded), the runloop is driven externally and our runloop
    observer processes tcl events via Tcl_ServiceAll().

    Note that as far as tk goes, the while loop at issue here is only ever used
    in the second case (i.e. a nested runloop invocation). For typical tcl use as
    in the testsuite, that while loop should never be used at all (the notifier
    must have been switched to embedded-enabled mode via Tcl_ServiceModeHook() for
    it to be used).
    OTOH this is the normal mode of operation when tcl is embedded in a
    Cocoa app that drives its own event loop that is unaware of tcl.

    As far as getting stuck in this while loop, Tcl_ServiceAll() is supposed to
    return true as long as it has processed an event, so this might be due to
    new tcl events continuously arriving.
    From the backtrace you posted, it looks like an after event called from
    TclServiceIdle() is involved that ends up calling a nested blocking [after] ?

    Do note that running the tcl testsuite is insufficient to guard against
    regressions in this area as it does not cover the notifier in embedded mode.
    the tk testsuite for TkCoco as well as TkX11 along with manual testing of the
    tk widget demo is required (in particular, test that the animated demos
    correctly continue animating during menu tracking/window resizing, invocation
    of a menu item via menu or key binding, and while a modal dialog & a modal
    sheet & a file open dialog/sheet are displayed (all of these involve slightly
    different uses of nested runloops that have caused problems in the past).

     
  • Lars Hellström

    Lars Hellström - 2010-11-02

    Replying to some specific points of das's:
    > The TkCocoa notifier is a hybrid of an embedding and an
    > embedded notifier,

    Very subtle terminology here... I need to tread carefully.

    > see tkMacOSXNotifier.c for details. The
    > outermost event loop is driven by tcl via Tcl_WaitForEvent()
    > and has a tk event source that processes NSEvents (i.e.
    > embedding). That processing may however involve nested
    > CFRunLoops (possibly running in different runloop modes)
    > e.g. during menu tracking/window resizing/when a modal
    > dialog is presented etc. At such time, the relationship is
    > reversed (i.e. embedded), the runloop is driven externally
    > and our runloop observer processes tcl events via
    > Tcl_ServiceAll().

    > Note that as far as tk goes, the while loop at issue here is
    > only ever used in the second case (i.e. a nested runloop
    > invocation).

    Yes.

    > For typical tcl use as in the testsuite, that
    > while loop should never be used at all (the notifier must
    > have been switched to embedded-enabled mode via
    > Tcl_ServiceModeHook() for it to be used). OTOH this is the
    > normal mode of operation when tcl is embedded in a Cocoa app
    > that drives its own event loop that is unaware of tcl.

    > As far as getting stuck in this while loop, Tcl_ServiceAll()
    > is supposed to return true as long as it has processed an
    > event, so this might be due to new tcl events continuously
    > arriving.

    It is. As stated at the top of the original bug report, I'm using the standard [after 0 {after idle next_iteration}] idiom to keep the processes working while allowing the GUI to be responsive. Tcl_ServiceAll will first do the [after 0] script, which adds an event to the idle queue. Then it processes the idle queue, so next_iteration is evaluated, but since that does an [after 0 {after idle next_iteration}], there will now be a new event in the non-idle queue, and tsdPtr->waitTime is 0. But Tk is supposed to also let GUI events be serviced before going back to this timer event! That won't happen if UpdateWaitingListAndServiceEvents never returns.

    > From the backtrace you posted, it looks like an after event
    > called from TclServiceIdle() is involved that ends up
    > calling a nested blocking [after] ?

    The blocking [after] is probably not relevant. As I've stated before, it is just a substitute for the useful computation that the loop would be doing, to stop it from running unpleasantly fast.

    > Do note that running the tcl testsuite is insufficient to
    > guard against regressions in this area as it does not cover
    > the notifier in embedded mode.

    Yes, that is what I expected. But changing that single line in tclMacOSXNotify.c *changed* the behaviour of the *Tcl* testsuite, so I didn't dare to proceed to Tk. It seems this code is used even without Tk.

     
  • Daniel A. Steffen

    it seems repeated Tcl_ServiceAll() isn't providing quite the semantics needed here, the idle queue should really be processed only once per runloop observer invocation (but all other pending tcl events should be drained)
    ISTR trying to achieve this via Tcl_DoOneEvent() invocations initially and being unsuccessful (git history may provide details), we may need extra support in tclNotify.c for this.

     
  • Kevin Walzer

    Kevin Walzer - 2011-07-06

    This bug appears to manifest itself whenever a bgerror is thrown when accessing a Tk-Cocoa menu, see bug 3285355. I would consider this to be a release-blocking bug, although the bgerror issue itself is trivial to work around by re-defining the bgerror procedure.

    Simply commenting out this line:

    while ( Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

    in tclMacOSXNotify.c solves the bgerror issue, it runs fine. It also appears to solve the "beachball" issue in the test script submitted by Lars.

    The only side effect I can see from this change is that the Tcl test suite in "make test" seems to run more slowly. Is there any other testing I should do before committing this change as a short-term solution, pending a more complete investigation/implementation by Lars?

     
  • Kevin Walzer

    Kevin Walzer - 2011-07-06
    • assigned_to: das --> wordtech
    • priority: 5 --> 9
     
  • Lars Hellström

    Lars Hellström - 2011-07-06

    If it's as easy as wordtech says, then splendid! Unfortunately, I'm pretty sure that deleting that while was one of the first things I tried all those months ago -- the loop anyway didn't seem to be more than a premature overoptimisation (avoid dropping out into an "outer" event loop when you know you'll be back here soon anyway) -- but it didn't work for me: as mentioned in my 2010-11-02 comment, it caused some parts of the tcl test suite to be very slow, and at least one test got stuck completely. So be very careful! I suspect things like network speed may affect the order in which things happen, and may affect whether tests like those hang or not.

    If you're rather dropping the Tcl_ServiceAll call as well, then I don't see how you can be servicing idle events.

    I fear the probability of me coming up with a fix for this bug is at present rather low, however. (Other stuff coming in between.)

     
  • Kevin Walzer

    Kevin Walzer - 2011-07-07
    • priority: 9 --> 7
    • status: open --> open-later
     
  • Kevin Walzer

    Kevin Walzer - 2011-07-07

    I've looked at DAS's comment below:

    "Do note that running the tcl testsuite is insufficient to guard against
    regressions in this area as it does not cover the notifier in embedded
    mode.
    the tk testsuite for TkCoco as well as TkX11 along with manual testing of
    the
    tk widget demo is required (in particular, test that the animated demos
    correctly continue animating during menu tracking/window resizing,
    invocation
    of a menu item via menu or key binding, and while a modal dialog & a
    modal
    sheet & a file open dialog/sheet are displayed (all of these involve
    slightly
    different uses of nested runloops that have caused problems in the past)."

    Even though this change fixes issues with bgerror, this proposed change does cause issues with animation during window resize as demonstrated in the Tk widget demo. As a result, I am reluctant to proceed with it as it may have unexpected side effects/breakage.

    Lowering the priority to 7 and changing the resolution to "later" because this bug seems beyond my skill set to fix at this time. Others are welcome to delve into it.

     
  • Kevin Walzer

    Kevin Walzer - 2011-08-05

    Further review suggests that event loop/notifier integration between Tk and Cocoa may require some re-design of this by das: I have not been able to isolate a specific place where a bug can be addressed.

     
  • Kevin Walzer

    Kevin Walzer - 2012-10-02

    Adding an "update" command to the core procedure in the sample script sidesteps the problem entirely.

    I realize that this script was put together to assemble the core issue, that is, incoming Tcl events overloading the Cocoa/Tk event loop and causing it to lock up, but the fact is that careful management of the event loop with occasional workarounds like "update" or "after" can avoid many of the issues that are commonly reported.

    I've spent a lot of time reading both das's code and the relevant docs (Tcl and CoreFoundation), and given the complexity of the task, I've concluded that the current implementation is good enough, albeit far from optimal. das had previously committed a significant update to the notifier, see https://sourceforge.net/mailarchive/message.php?msg_id=23324050, that addressed truly showstopper issues with its performance. Now it is slow and pone to occasional lockups, but in most cases these can be worked around. I seldom see any issues with the event loop in my own applications because I work around the most common issues; it's been a long time since I've had a user complaint about a lockup because of event loop issues.

    It seems clear that the Tk/Cocoa event loop is slower, more fragile, and more complex than the Carbon implementation; however, the ultimate responsibility for this lies with Apple's decision to deprecate Carbon and force Cocoa integration. Cocoa has many advantages, but it is not a natural fit for Tk the way that Carbon, which operated at the same low level of abstraction as Tk's C API, was.

    Therefore, I've decided to close this bug, and will spend a bit of time deciding how to document the limitations of Tk-Cocoa's event loop and best practices for working around them.

     
  • Kevin Walzer

    Kevin Walzer - 2012-10-02
    • status: open-later --> closed-wont-fix
     
  • Kevin Walzer

    Kevin Walzer - 2012-10-02
    • status: closed-wont-fix --> open-wont-fix
     
  • Kevin Walzer

    Kevin Walzer - 2012-10-02

    Re-opening for comments based on Adrian Roberts' comments on Tcl-Mac, but previous remarks still stand.

     
  • Lars Hellström

    Lars Hellström - 2012-10-02

    Regarding Kevin's comment earlier today, that Tcl-Mac posting is http://code.activestate.com/lists/tcl-mac/1553/

    On the matter of work-arounds: What I've been using for over a year now is to change "after 0 ..." to "after 1 ...". I seem to recall that the inner event loop treats the case that there is a timer due right now separately, which might explain why this would make a difference.

    In principle, I suppose the Mac version of Tk could be changed to silently do "after 1 ..." whenever it's asked to do "after 0 ...", which would make that work-around invisible on the script level. But before deploying something like that, one should probably check whether it also is a work-around when there is more than one task rescheduling itself in this way.

    Would it make sense to update the "group" setting to 8.6b3?

     

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

JavaScript is required for this form.





No, thanks