Menu

#1104 Rapid creation/deletion of windows creates "orphan" windows (could be bug #922)

v1.4.0
open-accepted
None
5
2016-07-31
2014-05-13
Kevin Corry
No

Hi,

I've been working with a team that has developed a Java application that runs in fluxbox. During testing they've occasionally noticed "orphaned" windows, which occur when the application has closed and deleted a window, but the fluxbox frame and window decorations for that window don't get cleaned up. The only way they've found to remove these orphans is to exit and restart fluxbox (xkill does not seem to remove the orphans).

They've put together a short Java testcase application that demonstrates the behavior, which I'll attach below. This program just rapidly creates and deletes a sequence of small windows. The bug occurs the majority of the time, although sometimes it will take running it for a few iterations to trigger the bug. I've tested this and reproduced the bug on several fluxbox versions from 1.1.0 to 1.3.5, including the latest git tree. I've also configured the latest build with --enable-debug and I've been collecting fluxbox logs for each test run. I will attach some of the log files and screenshots that demonstrate the bug.

The same test case does not seem to cause any orphaned windows when run with the twm or xfce window managers, or without any window manager (simply starting X with only an xterm), so right now we believe the bug is in fluxbox, or possibly due to some interaction between fluxbox and the X server.

From looking through the logs, the one thing that seems to stand out is that the windows that get cleaned up correctly all have "DestroyNotify" events in the log, and the windows that don't get cleaned up seem to be missing these events. (Each window that gets created by the testcase should have a unique "DialogTest###" title.)

I'm happy to run any additional tests that you might need to try to track down the source of this issue. Please let me know if there are additional debug options I should build with, or if there are patches or log statements you'd like me to add. If you can point me at the parts of the code that handle the window cleanup and give me an idea of what code sequences should be occurring, I can try to add some of my own logging as well.

Thanks in advance for any help you can provide!

Discussion

1 2 > >> (Page 1 of 2)
  • Kevin Corry

    Kevin Corry - 2014-05-13

    This is the Java test case mentioned above.

    After extracting the tarball, you can run the test with:
    java -jar dialogtest.jar 100

    The number at the end of the command controls the number of windows created during the test. The higher the value, the greater the chance of triggering the bug. Running with 100 triggers it the majority of the time on my test system.

    The source for the test case is in the dialogtest.java file, and can be rebuilt with the included Makefile. The .class and .jar files in the tarball were built and run with IBM Java 1.6.0.

     
  • Kevin Corry

    Kevin Corry - 2014-05-13

    This is a screenshot of the desktop and orphaned windows following the Java test case. You can see the titles of the corresponding windows in the taskbar at the bottom of the screen.

     
  • Kevin Corry

    Kevin Corry - 2014-05-13

    This is the fluxbox log corresponding to the test from the above screenshot.

    As I mentioned earlier, the windows that were correctly created and deleted during the test (e.g. DialogTest28) each have a "DestroyNotifyEvent" entry in the log. The orphaned windows (e.g. DialogTest27) do not have this "DestroyNotifyEvent" entry in the log. I can't say for sure that this is pertinent, but it's one thing that stood out to me.

    I'll also point out that there may be some windows that seem to be missing entirely from the log (e.g. DialogTest26). I believe this is due to the random lifetime given to each window in the Java test case, so some windows may be created and deleted too fast to actually register in fluxbox.

     
  • Mathias Gumz

    Mathias Gumz - 2014-05-14
    • status: open --> open-accepted
    • assigned_to: Mathias Gumz
    • Group: v1.3.5 --> v1.3.6
     
  • Mathias Gumz

    Mathias Gumz - 2014-05-14

    this is one of the most prepared bugreports i ever saw on fluxbox :)

     
    • Kevin Corry

      Kevin Corry - 2014-05-14

      You're welcome. :) I deal with a lot of software bug reports, so I try to be as thorough as I can when I have to open one myself.

       
  • Mathias Gumz

    Mathias Gumz - 2014-05-14

    please provide info about the used xserver, os, gpu, xdriver etc.

     
  • Kevin Corry

    Kevin Corry - 2014-05-14

    Most of the test systems are running some version of RHEL 6.x. (2.6.32-based kernel with lots of Redhat updates) with Xorg server 1.7.7, but I've also seen the bug with Xorg server 1.13.0. It looks like Xorg 1.14.x and 1.15.x are also available upstream - I can try to download and test with those as well. Let me know if you'd like the Xorg logs, or if there's additional logging or debugging I should enable in the X server.

    As for hardware, many of the test systems use Intel graphics chips with the i915 kernel driver and intel Xorg driver (version 2.14.0 or 2.20.2). I've also seen this on KVM guests that use the Cirrus Xorg driver (version 1.3.2 or 1.5.2). I'll ask the development team if there are other hardware platforms where they've seen this and update the bug report again.

     
  • Mathias Gumz

    Mathias Gumz - 2014-05-14

    just to be clear: that test-case creates "dialog"-windows, not new apps which each app has a window but these windows are essentially a "transient" to the dialog-test-panel (http://tronche.com/gui/x/icccm/sec-4.html).

     
  • Mathias Gumz

    Mathias Gumz - 2014-05-14

    yeah, they are transients, just checked.

     
  • Kevin Corry

    Kevin Corry - 2014-05-14

    Hi Mathias,
    Are you just thinking out-loud? Or does that imply that there's something incorrect about the way the test-case code is operating?

    Unfortunately, I didn't write the test-case (and I'm not all that familiar with Java myself). But I've been told that the test-case is relatively similar to how our development team's Java application operates. If you think there are any issues with the test-case (e.g. if it's violating any rules or conventions related to the X server or window manager), please let me know.

     
  • Mathias Gumz

    Mathias Gumz - 2014-05-14

    yah, just thinking loud.

    i saw the isse when working from within vmware, i tried to retrigger it when working native via nvidia-drivers: no success. i suspect that "avoid handling bad events" scheme is somewhat broken:

    http://git.fluxbox.org/fluxbox.git/tree/src/fluxbox.cc#n151
    http://git.fluxbox.org/fluxbox.git/tree/src/fluxbox.cc#n537

    and i suspect that thats why deleting / freeing the (fluxbox-)window associated with the window-id which is a "bad window" already.

     
  • Mathias Gumz

    Mathias Gumz - 2014-05-19

    ok, try the attached patch and report back any issues / success :)

    it was written against my git-branch here, but it should apply to 1.3.5 as well.

     
  • Kevin Corry

    Kevin Corry - 2014-05-20

    Hi Mathias,
    Unfortunately, I'm still seeing the orphaned windows with the new patch. :( I ran a number of tests yesterday, but I'm still hitting the bug pretty consistently. I added some of my own logging to the code to see if I could shed some light on the issue, although I'm not sure I've discovered anything new. I need to get a better understanding of all the data structures involved and the general flow of the window creation and deletion code. I'll attach my patch and the latest log file. Let me know if there's anything else you'd like me to try.

     
  • Kevin Corry

    Kevin Corry - 2014-05-20

    Patch to add some extra logging statements. This is against the git tree, and on top of the "bad_window_cleanup.diff" patch.

     
  • Kevin Corry

    Kevin Corry - 2014-05-20

    Screenshot after latest test. This shows five orphan windows: DialogTest23, DialogTest37, DialogTest43, DialogTest63, and DialogTest85.

     
  • Kevin Corry

    Kevin Corry - 2014-05-20

    Fluxbox log file from latest test.

     
  • Mathias Gumz

    Mathias Gumz - 2014-05-20

    the problem is that x11 seems to not notify us about all gone windows. whyever. some of the bad_window - events are obviously based upon, well, bad windows. and fluxbox keeps a list of "managed" windows. when one of these windows becomes bad, it should be removed. my patch does this by not hoping for some future destroynotify event but by instantly removing the window from "lists".

    what i find a bit strange is that you still have windows (visible) which have no associated client at all: " XEvent->type == DestroyNotify: No WinClient found for e->xdestroywindow.window=10485764" ... allthough this could be explained by events still in the queue. strange.

     
  • Kevin Corry

    Kevin Corry - 2014-06-16

    Hi Mathias,

    Sorry, I got sidetracked for a few weeks - other bugs got bumped up in priority. But another member of our team took a look at this fluxbox issue again last week and came up with a hack (on top of the patch you posted above) that seems to get rid of the orphaned windows in our testing. In her tests, she noticed that the event loop would occassionally dequeue an event associated with a bad window number before that window was put on the bad window list. She said this was only happening for ClientMessage type windows, so this hack saves the most recent ClientMessage event in a temporary location so it can check later on if it's bad. If it's later determined to be bad, it pushes the "current" event back on the queue and processes the bad ClientMessage event instead.

    This obviously isn't terribly robust or scalable, but I'm attaching it in the hopes that you (or someone more familiar with the fluxbox code) might be able to use this idea to come up with a better fix.

    Let me know what you think or if you have any questions.

    Thanks again for your help on this bug!

     
  • Mathias Gumz

    Mathias Gumz - 2015-01-21

    not forgotten, but some time-constraints here as well (this bug here seems like a bigger problem with deeper impact and thus requires more time). just to give you some "alive feedback".

     
  • Mathias Gumz

    Mathias Gumz - 2015-01-21
    • Group: v1.3.6 --> future release
     
  • Buzzy Brown

    Buzzy Brown - 2015-01-23

    I was working on this problem with Kevin and believe I found the real cause of the problem.

    The problem is in fluxbox's handling of the MapRequest message. The MapRequest is sent to FB when the client (java) attempts to map in a window. Fluxbox then creates the WinClient and FluxboxWindow objects to manage it and reparents the client window. Eventually it calls XConfigureWindow and XMapWindow to display both the the wrapper and client windows. The problem happens when the client window is removed (destroyed) between the time the XServer the MapRequest to FB and the time the XServer receives (and successfully processes) the MapRequest sent by FB (in my tests the XServer also sends a ReparentNotify so the window could be smaller).

    In any case, at this point FB has already created the WinClient/FluxboxWindow objects and will expect to receive either an UnmapNotify (or in later versions of FB a DestroyNotify) in order to destroy them. Since X never actually mapped the window, it will never send the unmap. I'm not sure why we're not getting a destroy message (since we do indeed get a CreateNotify for the window).

    In any case, FYI, there are a few reasons the current FB bad_window code doesn't fix the problem.

    • The code only passes down destroy messages....in our version of FB, the destroy just hides the window (in newer versions it cleans up the WinClient/etc).

    • The code can only handle at most one known bad window id at a time. If the client window we mapped has child windows, we could get multiple BadWindow errors and if the last one isn't for the parent client window the code won't take effect.

    • The code checks the e.xany.window field for the window id. For several events (unmap and destroy in particular) that field could be set to window the request came from and not the window itself....not sure if this is really a problem since we'd just end up forwarding more messages that way.

    As far as the bad_window_cleanup.diff patch goes, I think the main reason the patch didn't work is because it depended on FB receiving at least one more message apparently FROM the bad client window AFTER receiving the BadWindow error for it.

    The workaroundtest.patch patch helped a little but I was still able to recreate the problem. However, the patch also had the problem that that bad_window_list grew without bound - we got BadWindow errors for children of the main client bad window. Since the code never created a WinClient for the children, we'd never find them in our list and thus never clear them from the bad window list.

    I think one way to fix this would be to change the code to make a synchronous call like XGetWindowAttributes after the (first) XMapWindow request we make for a new client window. If the client is no longer there, then we abort the creation. However, there are a lot of ways a window can come into existence and not all of them would necessarily cause it to be mapped right away (slit, tabbed windows, windows on other desktops, etc). So, I think the best way to handle it now without a lot of major changes is to use a modification of the already suggested patch that watches for BadWindow errors.

    I'll attach a patch that's working for me. It still uses a set to contain a list of bad windows, but drains the list each time it receives a message. For each bad window, it sends a UnmapNotify message directly to handleUnmapNotify(). In the current version of fluxbox, this seems to have the same effect as sending a DestroyNotify to Fluxbox::handleEvent (which the original patch did) - it destroys the WinClient and FluxboxWindow (if needed). I chose to use the unmap instead because seems to be a superset.

     
  • Mathias Gumz

    Mathias Gumz - 2015-01-24

    yah. what i really love about this bugreport is that someone (other than myself) has thought really really long about the darn interactions and implications of what and when something happens.

    i agree with most of the things you wrote, especially the cause. actually, it does not matter what the cause for the lack of events in regards of "xserver destroyed that window" actually is. we are talking to a server over a network-like protocol and some things might get lost.

    (the time it takes FB to create the decoration/carrier windows (plural) and map them might easily be queued in later by the xserver. the sequence would then look like: client-mapping-request -> [FB creates it's stuff] -> client-destroy -> fb-mapping-request. and after that we already have lost the race)

    the only problem right now i see with your fine patch is that it creates an artificial unmap-event and tries to use it. the reason for this is that Fluxbox::handleUnmapNotify() first tries to detect the BScreen the event happend. it uses Fluxbox::searchScreen() for
    this and this was "fine" (to a certain degree) some days ago, but it is not anymore:

    http://git.fluxbox.org/fluxbox.git/commit/?id=c0e5d1c7a3867a82e0418e921afabf8e14909429

    in order to properly detect the BScreen the window actually lived in, FB now asks the window for it's root window. and this is a problem when the window itself is not there anymore:

    Fix broken _NET_REQUEST_FRAME_EXTENTS support
    There was a subtle flaw in the way fluxbox detects to which BScreen a
    given Window belongs: We have to compare the RootWindow of the given
    Window against the RootWindow of each BScreen.
    
    That underlying flaw made _NET_REQUEST_FRAME_EXTENTS fail: the code path
    needs a valid BScreen for the given window, otherwise we return early.
    

    calling Fluxbox::searchWindow() and then just deleting the created WinClient for that bad window should be enough. in Fluxbox::handleUnmapNotify() there is too much code that
    assumes the window it operates on is actually valid (which it isn't).

    since you are deep in the topic: do you want to tackle the issue in that direction?

     
  • Mathias Gumz

    Mathias Gumz - 2015-01-24
    • Group: future release --> v1.3.7
     
  • Buzzy Brown

    Buzzy Brown - 2015-01-26

    Thanks! I'll change my patch...I'm all for keeping the code as simple as possible. I was initially afraid of missing some type of cleanup (especially when the window was one of multiple in the FluxboxWindow's m_client_list), but now I see the ~WinClient() handles that.

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.