#60 Concurrency problem with Network::isConnected causes crash

closed-fixed
None
5
2009-07-06
2009-07-03
No

This is a follow-up to bug 2801261 where I reported random crashes of unknown origin. I think I've now isolated a reproducible case.

Setup is as follows.

* Two ports are created and connected, let them be W for writing and R for reading (W --> R).
* Thread A is constantly writing to W
* Thread B is constantly blocking at R for reading.

This works normally. However, if in the writing thread A we use this kind of code:

if (Network::isConnected (W, R)) { <do the writing here> }

then a segmentation fault or hang quickly follows, with the same kind of backtraces found in the earlier bug. Removing the call to isConnected makes the crash disappear.

I attach a testcase. However, this testcase uses mixed Ada/C++ code.

Most of the bugcase is wrapper code to call YARP from Ada, but I don't think that it has any bearing on the case. I think inspecting the file src/bug01.adb should give the gist of the code pattern. The *.cpp files contain the glue Ada/C calls to YARP.

I didn't prepare a pure C++ example because this was the quickest route for me, but if this information is not enough to isolate the bug I could try to devote some time to it.

To compile it you need GNAT GPL 2009 from http://libre.adacore.com. Then issue a

$ gprbuild

and execute with

$ obj/bug01

Discussion

  • Testcase for triggering the bug.

     
  • Ok, this is starting to make sense. If I remember, you were seeing crashes in "admin" interface code; that interface is indeed used by the isConnected() method. Thanks for the test case, this narrows things down.

    While I'm working on fixing this, for you it might be acceptable to just avoid this problem by calling yarp::os::Port::getOutputCount() to check if there is an outgoing connection. This is a much faster method.
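A minimal sketch of that workaround, assuming only that the port type exposes getOutputCount() and write() as yarp::os::Port does. The helper writeIfConnected and the FakePort stand-in are hypothetical names used so the example compiles without YARP. Note that this checks for any outgoing connection from W, not specifically the W --> R connection.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: write only when the port reports at least one
// outgoing connection. This is a purely local check, unlike
// Network::isConnected, which goes through the port's admin interface.
template <typename PortT, typename MsgT>
bool writeIfConnected(PortT& port, MsgT& msg) {
    if (port.getOutputCount() <= 0) {
        return false;  // nobody is listening, skip the write
    }
    return port.write(msg);
}

// Hypothetical stand-in for a port, so the sketch is self-contained.
struct FakePort {
    int outputs = 0;                 // number of outgoing connections
    std::vector<std::string> sent;   // messages "written" so far

    int getOutputCount() const { return outputs; }

    bool write(const std::string& msg) {
        sent.push_back(msg);
        return true;
    }
};
```

In the writing thread, the call `if (Network::isConnected (W, R)) { ... }` would then become something like `writeIfConnected(W, message)`, with W being the real port object.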

     
    • assigned_to: nobody --> eshuy
     
  • I can't replicate the error with your test case, or with a pure C++ translation of it. But this is not unusual with a race condition. I'll review the YARP code involved.

     
  • One thing I've noticed is that it still happens when running within gdb, but with valgrind I haven't yet seen a single instance. So perhaps the faster the computer, the more likely the crash, since valgrind slows things down quite a bit. This is a fairly quick box, an Intel quad core @ 2.83 GHz.

    By the way, I've seen it on Debian lenny (stock ACE), and on Ubuntu 9.04 with ACE4YARP, stock ACE 2.6.3 and 2.7.

    Perhaps removing any output to console can raise the likelihood?

     
  • I tried removing comments, compiling with optimization, etc., but still don't see the failure. However, while reviewing the code, I found a non-thread-safe access to the current list of connections associated with a port in the interface that isConnected() uses. I committed a fix in CVS (patch to src/libYARP_OS/src/PortCore.cpp attached). This was a terrible goof, I really appreciate your persistence in tracking it down. There's a series of acknowledgments to you in the ChangeLog.

    Please let me know if the fix eliminates the failures you are seeing.
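To illustrate the kind of problem described above, here is a minimal sketch in standard C++. It is not the actual PortCore.cpp patch. It assumes the issue is a connection list that one thread queries while another thread modifies it, and it shows the usual remedy of guarding every access with the same mutex. ConnectionTable, add, drop, contains and run_demo are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical model of a port's list of outgoing connections.
class ConnectionTable {
public:
    void add(const std::string& dest) {
        std::lock_guard<std::mutex> guard(mutex_);
        dests_.push_back(dest);
    }

    void drop(const std::string& dest) {
        std::lock_guard<std::mutex> guard(mutex_);
        dests_.erase(std::remove(dests_.begin(), dests_.end(), dest),
                     dests_.end());
    }

    // Query path, analogous to what an isConnected-style check needs.
    // Without the lock, this read could race with add()/drop() and
    // walk a vector that is being reallocated.
    bool contains(const std::string& dest) const {
        std::lock_guard<std::mutex> guard(mutex_);
        return std::find(dests_.begin(), dests_.end(), dest) != dests_.end();
    }

private:
    mutable std::mutex mutex_;        // guards dests_
    std::vector<std::string> dests_;
};

// One thread keeps changing the list while the caller keeps querying it.
// Returns how many queries found the permanent "/R" connection.
int run_demo(int iterations) {
    ConnectionTable table;
    table.add("/R");

    std::thread churn([&table, iterations] {
        for (int i = 0; i < iterations; ++i) {
            table.add("/tmp");
            table.drop("/tmp");
        }
    });

    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        if (table.contains("/R")) {
            ++hits;
        }
    }
    churn.join();
    return hits;
}
```

With the lock in place, every query sees a consistent list, so run_demo(n) returns n. Removing the lock_guard from contains() would reintroduce a data race of the kind that only shows up intermittently, which matches how hard this bug was to replicate.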

     
    • status: open --> closed-fixed
     
  • Hi Paul,

    I've tried the CVS head and indeed the crash has disappeared. Previously the bugcase rarely ran for more than a thousand iterations; now it goes up to hundreds of thousands without a crash, so I'm confident the fix is sound.

    So I'm closing it, thanks!