Fluxbox / Bugs / #1104 Rapid creation/deletion of windows creates "orphan" windows (could be bug #922)

Kevin Corry - 2014-05-13

This is the Java test case mentioned above.

After extracting the tarball, you can run the test with:
java -jar dialogtest.jar 100

The number at the end of the command controls the number of windows created during the test. The higher the value, the greater the chance of triggering the bug. Running with 100 triggers it the majority of the time on my test system.

The source for the test case is in the dialogtest.java file, and can be rebuilt with the included Makefile. The .class and .jar files in the tarball were built and run with IBM Java 1.6.0.

dialogtest.tar.gz

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-13

This is a screenshot of the desktop and orphaned windows following the Java test case. You can see the titles of the corresponding windows in the taskbar at the bottom of the screen.

fluxbox-2014-05-13.3.png

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-13

This is the fluxbox log corresponding to the test from the above screenshot.

As I mentioned earlier, the windows that were correctly created and deleted during the test (e.g. DialogTest28) each have a "DestroyNotifyEvent" entry in the log. The orphaned windows (e.g. DialogTest27) do not have this "DestroyNotifyEvent" entry in the log. I can't say for sure that this is pertinent, but it's one thing that stood out to me.

I'll also point out that there may be some windows that seem to be missing entirely from the log (e.g. DialogTest26). I believe this is due to the random lifetime given to each window in the Java test case, so some windows may be created and deleted too fast to actually register in fluxbox.

fluxbox-2014-05-13-w0100-t0001.out

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-14

status: open --> open-accepted

assigned_to: Mathias Gumz

Group: v1.3.5 --> v1.3.6
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-14

this is one of the most prepared bugreports i ever saw on fluxbox :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Kevin Corry - 2014-05-14
  
  You're welcome. :) I deal with a lot of software bug reports, so I try to be as thorough as I can when I have to open one myself.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-14

please provide info about the used xserver, os, gpu, xdriver etc.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-14

Most of the test systems are running some version of RHEL 6.x. (2.6.32-based kernel with lots of Redhat updates) with Xorg server 1.7.7, but I've also seen the bug with Xorg server 1.13.0. It looks like Xorg 1.14.x and 1.15.x are also available upstream - I can try to download and test with those as well. Let me know if you'd like the Xorg logs, or if there's additional logging or debugging I should enable in the X server.

As for hardware, many of the test systems use Intel graphics chips with the i915 kernel driver and intel Xorg driver (version 2.14.0 or 2.20.2). I've also seen this on KVM guests that use the Cirrus Xorg driver (version 1.3.2 or 1.5.2). I'll ask the development team if there are other hardware platforms where they've seen this and update the bug report again.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-14

just to be clear: that test-case creates "dialog"-windows, not new apps which each app has a window but these windows are essentially a "transient" to the dialog-test-panel (http://tronche.com/gui/x/icccm/sec-4.html).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-14

yeah, they are transients, just checked.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-14

Hi Mathias,
Are you just thinking out-loud? Or does that imply that there's something incorrect about the way the test-case code is operating?

Unfortunately, I didn't write the test-case (and I'm not all that familiar with Java myself). But I've been told that the test-case is relatively similar to how our development team's Java application operates. If you think there are any issues with the test-case (e.g. if it's violating any rules or conventions related to the X server or window manager), please let me know.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-14

yah, just thinking loud.

i saw the isse when working from within vmware, i tried to retrigger it when working native via nvidia-drivers: no success. i suspect that "avoid handling bad events" scheme is somewhat broken:

http://git.fluxbox.org/fluxbox.git/tree/src/fluxbox.cc#n151
http://git.fluxbox.org/fluxbox.git/tree/src/fluxbox.cc#n537

and i suspect that thats why deleting / freeing the (fluxbox-)window associated with the window-id which is a "bad window" already.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-19

ok, try the attached patch and report back any issues / success :)

it was written against my git-branch here, but it should apply to 1.3.5 as well.

bad_window_cleanup.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-20

Hi Mathias,
Unfortunately, I'm still seeing the orphaned windows with the new patch. :( I ran a number of tests yesterday, but I'm still hitting the bug pretty consistently. I added some of my own logging to the code to see if I could shed some light on the issue, although I'm not sure I've discovered anything new. I need to get a better understanding of all the data structures involved and the general flow of the window creation and deletion code. I'll attach my patch and the latest log file. Let me know if there's anything else you'd like me to try.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-20

Patch to add some extra logging statements. This is against the git tree, and on top of the "bad_window_cleanup.diff" patch.

fluxbox.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-20

Screenshot after latest test. This shows five orphan windows: DialogTest23, DialogTest37, DialogTest43, DialogTest63, and DialogTest85.

fluxbox-2014-05-20.1.png

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-05-20

Fluxbox log file from latest test.

fluxbox-2014-05-20-w0100-t0001.out

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2014-05-20

the problem is that x11 seems to not notify us about all gone windows. whyever. some of the bad_window - events are obviously based upon, well, bad windows. and fluxbox keeps a list of "managed" windows. when one of these windows becomes bad, it should be removed. my patch does this by not hoping for some future destroynotify event but by instantly removing the window from "lists".

what i find a bit strange is that you still have windows (visible) which have no associated client at all: " XEvent->type == DestroyNotify: No WinClient found for e->xdestroywindow.window=10485764" ... allthough this could be explained by events still in the queue. strange.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kevin Corry - 2014-06-16

Hi Mathias,

Sorry, I got sidetracked for a few weeks - other bugs got bumped up in priority. But another member of our team took a look at this fluxbox issue again last week and came up with a hack (on top of the patch you posted above) that seems to get rid of the orphaned windows in our testing. In her tests, she noticed that the event loop would occassionally dequeue an event associated with a bad window number before that window was put on the bad window list. She said this was only happening for ClientMessage type windows, so this hack saves the most recent ClientMessage event in a temporary location so it can check later on if it's bad. If it's later determined to be bad, it pushes the "current" event back on the queue and processes the bad ClientMessage event instead.

This obviously isn't terribly robust or scalable, but I'm attaching it in the hopes that you (or someone more familiar with the fluxbox code) might be able to use this idea to come up with a better fix.

Let me know what you think or if you have any questions.

Thanks again for your help on this bug!

workaroundtest.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2015-01-21

not forgotten, but some time-constraints here as well (this bug here seems like a bigger problem with deeper impact and thus requires more time). just to give you some "alive feedback".

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2015-01-21

Group: v1.3.6 --> future release
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Buzzy Brown - 2015-01-23

I was working on this problem with Kevin and believe I found the real cause of the problem.

The problem is in fluxbox's handling of the MapRequest message. The MapRequest is sent to FB when the client (java) attempts to map in a window. Fluxbox then creates the WinClient and FluxboxWindow objects to manage it and reparents the client window. Eventually it calls XConfigureWindow and XMapWindow to display both the the wrapper and client windows. The problem happens when the client window is removed (destroyed) between the time the XServer the MapRequest to FB and the time the XServer receives (and successfully processes) the MapRequest sent by FB (in my tests the XServer also sends a ReparentNotify so the window could be smaller).

In any case, at this point FB has already created the WinClient/FluxboxWindow objects and will expect to receive either an UnmapNotify (or in later versions of FB a DestroyNotify) in order to destroy them. Since X never actually mapped the window, it will never send the unmap. I'm not sure why we're not getting a destroy message (since we do indeed get a CreateNotify for the window).

In any case, FYI, there are a few reasons the current FB bad_window code doesn't fix the problem.

The code only passes down destroy messages....in our version of FB, the destroy just hides the window (in newer versions it cleans up the WinClient/etc).

The code can only handle at most one known bad window id at a time. If the client window we mapped has child windows, we could get multiple BadWindow errors and if the last one isn't for the parent client window the code won't take effect.

The code checks the e.xany.window field for the window id. For several events (unmap and destroy in particular) that field could be set to window the request came from and not the window itself....not sure if this is really a problem since we'd just end up forwarding more messages that way.

As far as the bad_window_cleanup.diff patch goes, I think the main reason the patch didn't work is because it depended on FB receiving at least one more message apparently FROM the bad client window AFTER receiving the BadWindow error for it.

The workaroundtest.patch patch helped a little but I was still able to recreate the problem. However, the patch also had the problem that that bad_window_list grew without bound - we got BadWindow errors for children of the main client bad window. Since the code never created a WinClient for the children, we'd never find them in our list and thus never clear them from the bad window list.

I think one way to fix this would be to change the code to make a synchronous call like XGetWindowAttributes after the (first) XMapWindow request we make for a new client window. If the client is no longer there, then we abort the creation. However, there are a lot of ways a window can come into existence and not all of them would necessarily cause it to be mapped right away (slit, tabbed windows, windows on other desktops, etc). So, I think the best way to handle it now without a lot of major changes is to use a modification of the already suggested patch that watches for BadWindow errors.

I'll attach a patch that's working for me. It still uses a set to contain a list of bad windows, but drains the list each time it receives a message. For each bad window, it sends a UnmapNotify message directly to handleUnmapNotify(). In the current version of fluxbox, this seems to have the same effect as sending a DestroyNotify to Fluxbox::handleEvent (which the original patch did) - it destroys the WinClient and FluxboxWindow (if needed). I chose to use the unmap instead because seems to be a superset.

handle_destroyed.patch
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2015-01-24

yah. what i really love about this bugreport is that someone (other than myself) has thought really really long about the darn interactions and implications of what and when something happens.

i agree with most of the things you wrote, especially the cause. actually, it does not matter what the cause for the lack of events in regards of "xserver destroyed that window" actually is. we are talking to a server over a network-like protocol and some things might get lost.

(the time it takes FB to create the decoration/carrier windows (plural) and map them might easily be queued in later by the xserver. the sequence would then look like: client-mapping-request -> [FB creates it's stuff] -> client-destroy -> fb-mapping-request. and after that we already have lost the race)

the only problem right now i see with your fine patch is that it creates an artificial unmap-event and tries to use it. the reason for this is that Fluxbox::handleUnmapNotify() first tries to detect the BScreen the event happend. it uses Fluxbox::searchScreen() for
this and this was "fine" (to a certain degree) some days ago, but it is not anymore:

http://git.fluxbox.org/fluxbox.git/commit/?id=c0e5d1c7a3867a82e0418e921afabf8e14909429

in order to properly detect the BScreen the window actually lived in, FB now asks the window for it's root window. and this is a problem when the window itself is not there anymore:

Fix broken _NET_REQUEST_FRAME_EXTENTS support There was a subtle flaw in the way fluxbox detects to which BScreen a given Window belongs: We have to compare the RootWindow of the given Window against the RootWindow of each BScreen. That underlying flaw made _NET_REQUEST_FRAME_EXTENTS fail: the code path needs a valid BScreen for the given window, otherwise we return early.

calling Fluxbox::searchWindow() and then just deleting the created WinClient for that bad window should be enough. in Fluxbox::handleUnmapNotify() there is too much code that
assumes the window it operates on is actually valid (which it isn't).

since you are deep in the topic: do you want to tackle the issue in that direction?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathias Gumz - 2015-01-24

Group: future release --> v1.3.7
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Buzzy Brown - 2015-01-26

Thanks! I'll change my patch...I'm all for keeping the code as simple as possible. I was initially afraid of missing some type of cleanup (especially when the window was one of multiple in the FluxboxWindow's m_client_list), but now I see the ~WinClient() handles that.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Rapid creation/deletion of windows creates "orphan" windows (could be bug #922)

Group

Searches

Help

#1104 Rapid creation/deletion of windows creates "orphan" windows (could be bug #922)

Discussion