Thread: [Dbus-cxx-devel] Some 'sent' messages don't get sent
Status: Beta
Brought to you by:
rvinyard
From: Patrick A. <pa...@ph...> - 2009-08-28 09:20:52
Attachments:
AmCtl.log.gz
|
Hi all: I'm having a bit of a problem with messages occasionally getting 'stuck' and being unsent somehow. I've turned on libdbus debugging and dbus-cxx debugging, and I think I see why it's happening, but I can't see where the bug is. One bug I *did* find that I really, really thought would fix it, but didn't, was that Connection::on_remove_watch_callback was signalling "signal_add_watch" rather than "signal_remove_watch", and similarly Connection::on_watch_toggled_callback was signaling "signal_add_watch" instead of "signal_watch_toggled" (connection.cpp, lines 845 and 851) I see this problem when one process calls a method in another process, and the reply message gets "lost" - it never shows up in a D-Bus monitor, and I'm pretty sure it's never getting sent: A "correct" reply looks like this. I put a debug statement in Connection::send to see when 'send' is called, so that's what starts this debugging output. 3081477008: Connection::send 16976: LOCK: dbus_connection_send 16976: Message 0x88d7df8 (2 no path no interface no member 'i') for :1.36 added to outgoing queue 0x88d79d0, 1 pending to send 16976: Message 0x88d7df8 serial is 5 16976: _dbus_connection_do_iteration_unlocked start 16976: UNLOCK: _dbus_connection_acquire_io_path 16976: _dbus_connection_acquire_io_path locking io_path_mutex 16976: _dbus_connection_acquire_io_path start connection->io_path_acquired = 0 timeout = 0 16976: _dbus_connection_acquire_io_path end connection->io_path_acquired = 1 we_acquired = 1 16976: _dbus_connection_acquire_io_path unlocking io_path_mutex 16976: LOCK: _dbus_connection_acquire_io_path 16976: Transport iteration flags 0x1 timeout -1 connected = 1 16976: iteration flags = write timeout = -1 read_watch = 0x88d7680 write_watch = 0x88d7658 fd = 3 16976: do_writing(), have_messages = 1, fd = 3 16976: wrote 52 bytes of 52 16976: Message 0x88d7df8 (2 no path no interface no member 'i') removed from outgoing queue 0x88d79d0, 0 left to send 16976: check_write_watch(): needed = 0 on connection 0x88d79d0 watch 0x88d7658 fd = 3 outgoing messages exist 0 16976: UNLOCK: protected_change_watch 16976: LOCK: protected_change_watch 16976: ... leaving do_iteration() 16976: _dbus_transport_do_iteration end 16976: _dbus_connection_release_io_path locking io_path_mutex 16976: _dbus_connection_release_io_path start connection->io_path_acquired = 1 16976: _dbus_connection_release_io_path unlocking io_path_mutex 16976: _dbus_connection_do_iteration_unlocked end 16976: _dbus_connection_send_preallocated_and_unlock middle 16976: dispatch status = complete is_connected = 1 16976: UNLOCK: _dbus_connection_update_dispatch_status_and_unlock 3081477008: Object::handle_message: message was handled And here's one from later in that session, when the message was "lost" and the reply timed out: 3081477008: Connection::send 16976: LOCK: dbus_connection_send 16976: Message 0x88d9bf0 (2 no path no interface no member 's') for :1.37 added to outgoing queue 0x88d79d0, 1 pending to send 16976: Message 0x88d9bf0 serial is 8 16976: _dbus_connection_do_iteration_unlocked start 16976: UNLOCK: _dbus_connection_acquire_io_path 16976: _dbus_connection_acquire_io_path locking io_path_mutex 16976: _dbus_connection_acquire_io_path start connection->io_path_acquired = 1 timeout = 0 16976: _dbus_connection_acquire_io_path waiting 0 for IO path to be acquirable 16976: _dbus_connection_acquire_io_path end connection->io_path_acquired = 1 we_acquired = 0 16976: _dbus_connection_acquire_io_path unlocking io_path_mutex 16976: LOCK: _dbus_connection_acquire_io_path 16976: _dbus_connection_do_iteration_unlocked end 3081477008: Dispatcher::on_wakeup_main 16976: _dbus_connection_send_preallocated_and_unlock middle 16976: dispatch status = complete is_connected = 1 16976: UNLOCK: _dbus_connection_update_dispatch_status_and_unlock 3081477008: Object::handle_message: message was handled >From what I've been able to figure out, whenever dbus_connection_send can't write to the socket immediately, it's supposed to queue the message, and if the queue went from 0->1 it's supposed to toggle the write watch to tell the main thread to select()/poll() on the write fd. As far as I can tell, that's not happening for some reason. I never see on_watch_toggled_callback called, and the write watch is never enabled (it's added to the watch list unenabled). The next time that Connection::send is called, it had 2 pending to send, then 3 pending to send, etc., so clearly the write watch isn't being handled. At that point the programs which are connecting to it usually die due to a timeout expiring, so it's not a temporary thing. I worked around the problem by having Connection::send call Connection::flush, but that's not exactly the ideal solution. Works for now, though... I'm attaching the full debug log - it's pretty verbose, and there's a lot going on (there's output from the main program as well, and also a few additional debugging outputs I added). (Note that this is libdbus 1.2.16., and dbus-cxx 0.4.1) Patrick |
From: Patrick A. <pa...@ph...> - 2009-08-28 20:39:51
|
Hi again: Sorry for the flurry of messages (OK, two is not a flurry), but I did a bit more debugging staring around at code. I think the problem's a race condition between the watch thread and the dispatch thread (and I really hope this isn't just some bug in my version of D-Bus, but I didn't see anything like it in the bugzilla for D-Bus). >From what I can see, the problem is that: 1) a request comes in, which wakes the watch thread, and calls dbus_handle_watch. 2) This calls "_dbus_connection_acquire_io_path" in _dbus_connection_handle_watch, which locks the I/O path. 3) While the I/O path is held, _dbus_connection_handle_watch calls _dbus_transport_handle_watch, which reads the message, and calls _dbus_connection_queue_received_message_link. 4) _dbus_connection_queue_received_message_link calls _dbus_connection_wakeup_mainloop, which wakes the dispatch thread. NOTE: I have no idea why _dbus_connection_queue_received_message_link does this. I don't believe it should - it has no reason to wake up the main loop. The watch handler, when it's done, should wake the main loop by changing the dispatch status. 5) The dispatch thread wakes up, grabs the data, dispatches it, calls the function. That function returns a value, which causes Connection::send to be called, which calls dbus_connection_send. 6) dbus_connection_send calls _dbus_connection_acquire_io_path, which fails (since it's still being held in the watch thread!) and the message quietly gets queued without anyone knowing that it needs to be. To me, this looks like a bug in D-Bus: if _dbus_connection_acquire_io_path fails during a write, it should enable the write watch if it's disabled: this might seem silly, as select() will immediately say "writeable", but the write watch shouldn't care *why* a write got queued, only that it did and it needs to be handled when the application gets a chance. A (bad) workaround in dbus-cxx would be a mutex preventing messages from being dispatched while in the read handler in the watch thread. Patrick |
From: Rick L V. Jr <rvi...@cs...> - 2009-08-30 03:43:03
|
Two definitely aren't a flurry. I should have some time to look further into it early next week, but it looks like you've mostly tracked it down. If dbus-cxx needs temporary patches to get around bugs that sounds fine to me. --- Rick Patrick Allison wrote: > Hi again: > > Sorry for the flurry of messages (OK, two is not a flurry), but I did a > bit more debugging staring around at code. I think the problem's a race > condition between the watch thread and the dispatch thread (and I really > hope this isn't just some bug in my version of D-Bus, but I didn't see > anything like it in the bugzilla for D-Bus). > >>From what I can see, the problem is that: > 1) a request comes in, which wakes the watch thread, and calls > dbus_handle_watch. > > 2) This calls "_dbus_connection_acquire_io_path" in > _dbus_connection_handle_watch, which locks the I/O path. > > 3) While the I/O path is held, _dbus_connection_handle_watch calls > _dbus_transport_handle_watch, which reads the message, and calls > _dbus_connection_queue_received_message_link. > > 4) _dbus_connection_queue_received_message_link calls > _dbus_connection_wakeup_mainloop, which wakes the dispatch thread. NOTE: > I have no idea why _dbus_connection_queue_received_message_link does this. > I don't believe it should - it has no reason to wake up the main loop. The > watch handler, when it's done, should wake the main loop by changing the > dispatch status. > > 5) The dispatch thread wakes up, grabs the data, dispatches it, calls the > function. That function returns a value, which causes Connection::send to > be called, which calls dbus_connection_send. > > 6) dbus_connection_send calls _dbus_connection_acquire_io_path, which > fails (since it's still being held in the watch thread!) and the message > quietly gets queued without anyone knowing that it needs to be. > > To me, this looks like a bug in D-Bus: if _dbus_connection_acquire_io_path > fails during a write, it should enable the write watch if it's disabled: > this might seem silly, as select() will immediately say "writeable", but > the write watch shouldn't care *why* a write got queued, only that it did > and it needs to be handled when the application gets a chance. > > A (bad) workaround in dbus-cxx would be a mutex preventing messages from > being dispatched while in the read handler in the watch thread. > > Patrick > > ------------------------------------------------------------------------------ > Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day > trial. Simplify your report design, integration and deployment - and focus on > what you do best, core application coding. Discover what's new with > Crystal Reports now. http://p.sf.net/sfu/bobj-july > _______________________________________________ > Dbus-cxx-devel mailing list > Dbu...@li... > https://lists.sourceforge.net/lists/listinfo/dbus-cxx-devel |
From: Rick L. V. Jr. <rvi...@cs...> - 2009-09-01 17:11:30
|
Patrick Allison wrote: > A (bad) workaround in dbus-cxx would be a mutex preventing messages from > being dispatched while in the read handler in the watch thread. > I've forwarded your emails to the dbus development list to hopefully start a discussion as to whether there is a bug in dbus or dbus-cxx' use of dbus. In the meantime I've commited what I hope will fix the problem. It's difficult for me to test since my systems don't seem to experience the same bug. In short, the fix (hopefully only temporary until we figure out which library the bug is in) simply initiates processing in the dispatch thread if any read or write watch activity occurred. That should allow the dispatch thread and watch thread to operate at full capacity without any dependent locking and still make sure that any silently queued messages are still dealt with. The fixed version is currently in subversion: http://sourceforge.net/projects/dbus-cxx/develop Let me know if this fixes the problem and I'll push out a general release. Thanks Patrick for tracking this one down, ---- Rick |
From: Patrick A. <pa...@ph...> - 2009-09-01 18:03:57
|
Rick: I also added a bug to the dbus bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=23584 with a bit more information, including a logfile showing the problem happening. It also may be possible to create a testcase that forces this problem to occur by adding a mutex or something inside on_wakeup_main that prevents that function from exiting until the dispatch loop executes once. I'll check your fix as soon as I can to see if it works. Patrick On Tue, 1 Sep 2009, Rick L. Vinyard, Jr. wrote: > Patrick Allison wrote: > >> A (bad) workaround in dbus-cxx would be a mutex preventing messages from >> being dispatched while in the read handler in the watch thread. >> > > I've forwarded your emails to the dbus development list to hopefully start > a discussion as to whether there is a bug in dbus or dbus-cxx' use of > dbus. > > In the meantime I've commited what I hope will fix the problem. It's > difficult for me to test since my systems don't seem to experience the > same bug. > > In short, the fix (hopefully only temporary until we figure out which > library the bug is in) simply initiates processing in the dispatch thread > if any read or write watch activity occurred. > > That should allow the dispatch thread and watch thread to operate at full > capacity without any dependent locking and still make sure that any > silently queued messages are still dealt with. > > The fixed version is currently in subversion: > http://sourceforge.net/projects/dbus-cxx/develop > > Let me know if this fixes the problem and I'll push out a general release. > > Thanks Patrick for tracking this one down, > > ---- > > Rick > > |