Thread: Re: [java-gnome-hackers] Ugly deadlock with Dialog.run()
Brought to you by:
afcowie
From: Vreixo F. <met...@ya...> - 2010-08-10 13:46:23
|
Hi, > On Mon, 2010-08-09 at 21:42 +0200, Vreixo Formoso wrote: > > I would like to discuss with you an ugly behavior related with the > way > > java-gnome automatically handles the Gtk thread-aware feature. I > don't > > really think this is a bug, but I would like to talk about it. > > Yeah, I raised this on 13 Jan, Ups, sorry! I didn't notice it. > We could > > 1. Take our own lock (which we could then manually unlock in special > cases) > > 2. Just mess with a manual MonitorExit/MonitorEnter > > 3. Seriously mess with the entire main loop process: > > [...] > > All of these are invasive. #2 is probably easiest, but there's a > bigger > problem: what happens if you're not nested 2 deep, but nested 3 deep? > Then it won't help. ok, just a first sight opinion, I will review this in depth tomorrow: First of all, I agree that 3) will be a pain in the ass. I don't think, however, 2) would be a better solution: as you have noticed, you can be nested 3 deep, or even 0 deep if you are not calling run() within a signal handler (I wonder, however, if both cases could naturally arise in a real application). On the other hand, 1) could be a good solution and it seems not that hard. We might just use java.util.concurrent.locks.ReentrantLock. I think it is just a matter of changing the automatically generated synchronized blocks in translation layer (plus any override that might exist), and the C side bindings_java_threads_lock()/unlock() functions. Then, we add an override for Dialog.run(), java-side, where, by means of ReentrantLock.getHoldCount() and unlock() we release all locks but one, then we call gtk_dialog_run(), and when it returns we finally acquire as many locks as we originally hold. I have not been working on this low-level stuff since some years, so maybe I'm forgetting something... what do you think? Anyway, I'm not really sure this bug deserves so much work at this time... Cheers, Vreixo > > > Note that in my opinion is it ok as it is now, you can always forget > > about run() > > I don't have any hard opinions about it. It's a bug, and we need to > either fix it if we can. I'd like to keep run(), though. It's nice to > use inside signal handlers. Or it would be if it worked :) > > AfC > Sydney |
From: Vreixo F. <met...@ya...> - 2010-08-11 14:47:58
|
Hi, Andrew: > Maybe that alone will help? Someone with a threaded program that > demonstrates the bug will need to test that... > > Attached is a bundle, so if you want to merge & try it, go ahead :) I'm getting the same error as Guillaume. The problem is the stop condition in the while. You need to use: while (!GtkMain.mainIterationDo(true)) { ^ if (quit) { break; } } (And it will thrown the exception anyway if Gtk.mainQuit() is called, I guess). You also need a this.present() so the Dialog is actually shown. Ok once I fixed this I tested it, but it doesn't work either. The reason is the same as in the original problem. When native gtk_main_iteration_do() is called, we hold a 2-level nested lock: the one of the signal handler, plus the one in GtkMain.mainIterationDo(). I really think the easier approach is to replace the synchronize blocks with another kind of Lock, as I have described in my previous mail. Cheers Vreixo > > > Or maybe > > > > 4. Write our own run() > > > I had a preliminary try at it; see attached. > > The idiom is pretty common; little inner class handler with a field so > we can capture a value and return it. > > By itself, this doesn't fix our lock problem. But I thought I'd show > it > to you in case it caused you to think of anywhere interesting to go > from > here. > > Obviously the fact that we are iterating the real [outer] main loop > rather than creating a(n implicit) new inner one is what is different > compared to the real gtk_dialog_run(). > > Maybe that alone will help? Someone with a threaded program that > demonstrates the bug will need to test that... > > Attached is a bundle, so if you want to merge & try it, go ahead :) > > AfC > Sydney |
From: Andrew C. <an...@op...> - 2010-08-11 23:44:23
|
On Wed, 2010-08-11 at 15:02 +0200, Guillaume Mazoyer wrote: > > Exception in thread "main" java.lang.IllegalArgumentException: You > asked > for ordinal 0 which is not known for the requested Constant type > org.gnome.gtk.ResponseType > Yeah, oops. I've [now] seen that a few times too. On Wed, 2010-08-11 at 16:42 +0200, Vreixo Formoso wrote: > > Attached is a bundle, so if you want to merge & try it, go ahead :) > > I'm getting the same error as Guillaume. Ok > The problem is the stop > condition in the while. The usual way to do that is while (Gtk.eventsPending()) { Gtk.mainIterationDo(false); } where "true" makes it blocking. I've never tried it with true. But anyway. > You also need a this.present() so the Dialog is actually shown. Yeah, I realized that when I was out running yesterday afternoon and thinking about this... :) > Ok once I fixed this I tested it, but it doesn't work either. Oh well > The reason > is the same as in the original problem. That's what I figured. > When native > gtk_main_iteration_do() is called, we hold a 2-level nested lock: the > one of the signal handler, plus the one in GtkMain.mainIterationDo(). > > I really think the easier approach is... Ok, before we go back to that, one more thought: The reason I roughed up that patch was to give me a starting point to suggest something else. I don't know if it helps, but: What would happen if fired off a tiny little worker thread inside our run()? I mention this because I used to do that a lot in my 2.x days — [what are now called] Button.Clicked handlers almost always started their own worker thread to do other things. So I'm wondering if it'd help here... Inside run() 1. fire off a worker thread, 2. have that worker thread be what calls maybe in some combination with Thread's join()...? Anyway. > to replace the synchronize blocks > with another kind of Lock, as I have described in my previous mail. Yeah, that's what we had from January as the likely necessary change. There are about 57 manually written synchronized () blocks in our code. So catching them all will take a bit of work. More pertinently, that would be a LOT of effort to essentially make *1* method work. I don't want to say "we don't have coverage of Dialog's run()" but... ++ I wanted to have another look at the deadlock. Here it is: "Thread-0" prio=10 tid=0x0000000001643800 nid=0x7416 waiting for monitor entry [0x00007ff241449000] java.lang.Thread.State: BLOCKED (on object monitor) at org.gnome.gtk.GtkProgressBar.pulse(GtkProgressBar.java:70) - waiting to lock <0x00007ff282b63bf0> (a org.gnome.gdk.Gdk$Lock) at org.gnome.gtk.ProgressBar.pulse(ProgressBar.java:137) at DialogDeadlock$Worker.run(DialogDeadlock.java:124) at java.lang.Thread.run(Thread.java:636) "main" prio=10 tid=0x000000000146f000 nid=0x740b runnable [0x00007ff29d044000] java.lang.Thread.State: RUNNABLE at org.gnome.gtk.GtkDialog.gtk_dialog_run(Native Method) at org.gnome.gtk.GtkDialog.run(GtkDialog.java:228) - locked <0x00007ff282b63bf0> (a org.gnome.gdk.Gdk$Lock) at org.gnome.gtk.Dialog.run(Dialog.java:244) at DialogDeadlock$1.onClicked(DialogDeadlock.java:65) at org.gnome.gtk.GtkButton.receiveClicked(GtkButton.java:392) at org.gnome.gtk.GtkMain.gtk_main(Native Method) - locked <0x00007ff282b63bf0> (a org.gnome.gdk.Gdk$Lock) at org.gnome.gtk.GtkMain.main(GtkMain.java:82) at org.gnome.gtk.Gtk.main(Gtk.java:119) at DialogDeadlock.main(DialogDeadlock.java:91) Hm. [actually I added the second "locked" line, because I know it's there; Java's thread dump doesn't show it because it was taken with MonitorEnter() not synchronized()] Now that I look at this, I'm not sure that manually letting the entire lock go is the right thing. Back to basics: The whole point of GTK's thread awareness is the requirement that only 1 thread by accessing GTK at a time. Fine. What that means is "one thread can make a call (that has a cascading effect, emits signals, etc etc, but that there's only one processing chain until that cascade is done". That way, between cascades, GTK is in a safe state. The complication is the main loop. gtk_main() actually releases the lock when it's idle. They don't really tell you that, but this mimics the behaviour of Object's wait() which also relinquishes its lock, then later tries to reacquire it. So my thought is that that is the only known safe point: when the main loop thinks it is ok to relinquish the lock, then something else can grab it and do GTK calls. Now you'll remember that we have access to the gdk_threads_enter() gdk_threads_leave() functions. We override them. But maybe, just maybe, that means that if we *also* do our own main loop iteration, then we can control the safe point ourselves, rather than it being magic. The point is "not do I have the lock" but "is GTK at a safe point where another thread can take over?" Interesting. ++ This is all just speculation. But if you go and fire off a nested main loop (which is what gtk_dialog_run() does, which is what calling gtk_main() a second time does) and you are *inside* a signal handler already, then ... what? Hm. AfC Sydney |
From: Vreixo F. <met...@ya...> - 2010-08-12 01:01:20
|
Hi! Andrew: > What would happen if fired off a tiny little worker thread inside our > run()? > > [...] > So I'm wondering if it'd help here... Inside run() > > 1. fire off a worker thread, > 2. have that worker thread be what calls > > maybe in some combination with Thread's join()...? I guess the new thread would block when attempting to enter the synchronized block in the translation layer, as the lock is held by the original thread in the signal handler. As that thread is waiting for the new (in the join()) I guess that will cause a deadlock. > There are about 57 manually written synchronized () blocks in our > code. > So catching them all will take a bit of work. With some luck, a regular expression hacking and eclipse replace feature will do the job for us... I could try. > More pertinently, that would be a LOT of effort to essentially make > *1* > method work. I don't want to say "we don't have coverage of Dialog's > run()" but... Yeah, I agree with you... This might better be job for the future, there are more interesting things to do. In the mean time, we could just warn about this ugly feature of our run. Another approach is to add a second run(), maybe called runWithoutBlock() to be called inside signal handlers, but it does not fix the 3-level lock neither. Anyway, it is currently not so bad, anybody can use the Response signal if it needs multi-threading working with run()... Note I have been writing this kind of apps for years and never found that problem until this week. Cheers Vreixo |
From: Andrew C. <an...@op...> - 2010-10-08 19:11:38
|
On Thu, 2010-08-12 at 03:00 +0200, Vreixo Formoso wrote: > Anyway, it is currently not so bad, anybody can use the Response signal > if it needs multi-threading working with run()... Note I have been > writing this kind of apps for years and never found that problem until > this week. So I think we should put a big fat warning on Dialog's run() during this release. This issue is something we need to address, but we don't have a patch yet and I think it'll take some serious engineering to address well. Until then, we should encourage people to connect to Dialog.Reponse, I think. I'll write a patch to his effect if people agree. AfC New York |
From: Guillaume M. <res...@gm...> - 2010-10-09 07:15:02
|
> I'll write a patch to his effect if people agree. Agreed. I can also write the patch since I'm the one that discovered the bug in the first place :) -- Guillaume Mazoyer - http://www.respawner.fr/ |
From: Andrew C. <an...@op...> - 2010-10-09 22:21:47
|
On Sat, 2010-10-09 at 09:14 +0200, Guillaume Mazoyer wrote: > > I'll write a patch to his effect if people agree. > Agreed. I can also write the patch ... Be my guest :) AfC New York |
From: Guillaume M. <res...@gm...> - 2010-10-10 11:29:39
|
> Be my guest :) Well, how should I do that then? Only by writing some Javadoc to explain the problem and giving the workaround? I want to do it quickly, then you will be able to release 4.0.17 quickly. -- Guillaume Mazoyer - http://www.respawner.fr/ |
From: Andrew C. <an...@op...> - 2010-10-11 05:17:03
|
On Sun, 2010-10-10 at 13:29 +0200, Guillaume Mazoyer wrote: > > Be my guest :) > Well, how should I do that then? > Only by writing some Javadoc to explain the problem and giving the > workaround? Yeah, that'd be more than fine. We won't mark it @deprecated because it isn't, but we can mark that it is (due to a bug) unsuitable for use in threaded programs. AfC New York |
From: Guillaume M. <res...@gm...> - 2010-10-11 17:26:27
Attachments:
dialog-run-bug.patch
|
Here is the bundle to patch the Dialog's run() Javadoc. I put the warning right after the first paragraph. Please review it before pushing it into mainline. You know that my English can be bad sometime ;) -- Guillaume Mazoyer - http://www.respawner.fr/ |
From: Andrew C. <an...@op...> - 2010-08-11 00:44:53
|
On Tue, 2010-08-10 at 15:46 +0200, Vreixo Formoso wrote: > > 1. Take our own lock (which we could then manually unlock in special > > cases) > > > > 2. Just mess with a manual MonitorExit/MonitorEnter > > > > 3. Seriously mess with the entire main loop process: > > Or maybe 4. Write our own run() Just a thought I had talking with Guillaume earlier today: we could maybe just write a run() method in Dialog that does the workaround. ie, rather than making the developer connect to Dialog.Response, do so inside run() ourselves. As a convenience. The trick would be getting that method to block. The only way I could see to do that would be to nest a call to Gtk.main()... which might bring us back to the same problem. AfC Sydney |
From: Andrew C. <an...@op...> - 2010-08-11 05:54:55
Attachments:
dialog-run-outer.patch
|
On Wed, 2010-08-11 at 10:44 +1000, Andrew Cowie wrote: > Or maybe > > 4. Write our own run() I had a preliminary try at it; see attached. The idiom is pretty common; little inner class handler with a field so we can capture a value and return it. By itself, this doesn't fix our lock problem. But I thought I'd show it to you in case it caused you to think of anywhere interesting to go from here. Obviously the fact that we are iterating the real [outer] main loop rather than creating a(n implicit) new inner one is what is different compared to the real gtk_dialog_run(). Maybe that alone will help? Someone with a threaded program that demonstrates the bug will need to test that... Attached is a bundle, so if you want to merge & try it, go ahead :) AfC Sydney |
From: Guillaume M. <res...@gm...> - 2010-08-11 13:02:39
Attachments:
dialog-run-deadlock-prototype.patch
|
> Someone with a threaded program that > demonstrates the bug will need to test that... I sure can do that. I also wrote a prototype to prove a deadlock. You can merge the bundle and test it too. I did test the workaround and it seems that it does not work. Here is the first line of the stacktrace. Exception in thread "main" java.lang.IllegalArgumentException: You asked for ordinal 0 which is not known for the requested Constant type org.gnome.gtk.ResponseType -- Guillaume Mazoyer - http://www.respawner.fr/ |