
#53 Possible concurrency error in Network::connect

v1.0 (example)
closed-out-of-date
None
5
2017-11-02
2009-02-25
No

I believe I may have run into a concurrency error in the Network::connect method. If a program that simply sends a Bottle through a port makes many calls to Network::connect, it eventually deadlocks. I've tested this on two different computers with YARP's head CVS revision and with both ACE 5.6.7 and 5.6.8 (released a couple of days ago). The results were similar on both machines.

These symptoms were accompanied by errors in YARP's regression tests. Meanwhile, Paul Fitzpatrick gave me a statically linked version of harness_os, which passed all tests. However, statically linked versions of the example code I wrote to demonstrate the problem still failed to work correctly on my machine.

I attach example code that demonstrates the problem. Just decompress and compile it, then run the "echo" program in one shell and the "producer" in another. "producer" sends Bottles to "echo", which sends them back. Before each write to the port, Network::connect is called to establish the connection between the opened ports, so on every call except the first it is asked to connect ports that are already connected. I know this is not the optimal way of managing connections between ports, but a deadlock is not the behavior one would expect. I would expect the code to work, probably with greater overhead per Bottle because of the repeated attempts to connect already connected ports.

Discussion

  • Marco Barbosa

    Marco Barbosa - 2009-02-25

    Example code that demonstrates the problem.

     
  • Marco Barbosa

    Marco Barbosa - 2009-02-26

    This problem was detected with the following system setups:

    openSUSE 11.0 i686 - kernel 2.6.24.20-0.1-pae
    openSUSE 11.1 x86_64 - kernel 2.6.27.7-9-default

     
  • Paul Fitzpatrick

    Hi Marco,
    Thanks for submitting this. I'm checking this out. I'm running into a bunch of problems with recent versions of ACE, related to very basic thread and semaphore usage. A basic test of ACE that we have lying around that does not use YARP is failing:
    http://eris.liralab.it/wiki/Getting_Started#Check_your_Installation
    It could be there's a new detail to compilation, I'm checking. However, this may be completely unrelated to your problem in the end.
    Cheers,
    Paul

     
  • Paul Fitzpatrick

    • assigned_to: nobody --> eshuy
     
  • Paul Fitzpatrick

    Ok, I can confirm there is a new issue with ACE compilation flags. I tried compiling ACE myself on Ubuntu, and checked to see what flags it compiles itself with. Then I cloned those flags over to YARP by adding the following line after the PROJECT line in $YARP_ROOT/CMakeLists.txt:

    ADD_DEFINITIONS(-DACE_HAS_LINUX_NPTL -D_REENTRANT -DACE_HAS_AIO_CALLS -D_GNU_SOURCE -DACE_HAS_EXCEPTIONS -D__ACE_INLINE__)

    This was sufficient to make YARP go from passing no regression tests to passing all regression tests.

    The problem now is how to figure out this set of defines automatically, because people compile ACE in all sorts of ways. For packaged versions of ACE, ideally pkg-config should know - but it doesn't seem to on Ubuntu, it just reports:
    Cflags: -I${includedir}

    Shame they don't just put this stuff in a header file...
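
One way to sanity-check that the defines listed above actually reach a given translation unit is to report them from code. The following is only a minimal sketch: the function name `aceRelatedDefines` is hypothetical and not part of YARP or ACE, and it only tells you what this file was compiled with, not what the installed ACE library was built with.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not a YARP or ACE function): returns one
// (macro name, is-defined) pair for each define from the ADD_DEFINITIONS
// line, as seen by this translation unit at compile time.
std::vector<std::pair<std::string, bool>> aceRelatedDefines() {
    std::vector<std::pair<std::string, bool>> out;
#ifdef ACE_HAS_LINUX_NPTL
    out.emplace_back("ACE_HAS_LINUX_NPTL", true);
#else
    out.emplace_back("ACE_HAS_LINUX_NPTL", false);
#endif
#ifdef _REENTRANT
    out.emplace_back("_REENTRANT", true);
#else
    out.emplace_back("_REENTRANT", false);
#endif
#ifdef ACE_HAS_AIO_CALLS
    out.emplace_back("ACE_HAS_AIO_CALLS", true);
#else
    out.emplace_back("ACE_HAS_AIO_CALLS", false);
#endif
#ifdef _GNU_SOURCE
    out.emplace_back("_GNU_SOURCE", true);
#else
    out.emplace_back("_GNU_SOURCE", false);
#endif
#ifdef ACE_HAS_EXCEPTIONS
    out.emplace_back("ACE_HAS_EXCEPTIONS", true);
#else
    out.emplace_back("ACE_HAS_EXCEPTIONS", false);
#endif
#ifdef __ACE_INLINE__
    out.emplace_back("__ACE_INLINE__", true);
#else
    out.emplace_back("__ACE_INLINE__", false);
#endif
    return out;
}
```

Printing the returned pairs from a small test program built by the same CMake project shows whether the ADD_DEFINITIONS line took effect for that target.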

     
  • Paul Fitzpatrick

    Compiling YARP against ACE 5.6.3 as packaged on Ubuntu gives fairly random regression failures. The common link can be found by running the regression tests in verbose mode:
    $YARP_DIR/bin/harness_os verbose regression

    Lots of messages of this form show up:
    yarp(b530cb90): semaphore wait failed - could be gdb attaching

    This is a semaphore failure that used to be triggered by some uses of gdb, but in this case seems to have some other cause. It seems that YARP is basically running without semaphore protection, so there will be random failures depending on timing.

    No solution as yet (other than using ACE4YARP, http://eris.liralab.it/wiki/ACE4YARP).
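
To illustrate why failing waits lead to timing-dependent failures, here is a minimal sketch that uses POSIX `sem_t` directly rather than YARP or ACE. The names `acquire` and `countWithSemaphore` are hypothetical. The shared counter only ends up with the exact expected value because every wait really blocks; if the wait returned early and the caller carried on regardless, increments could be lost at random.

```cpp
#include <cassert>
#include <cerrno>
#include <semaphore.h>
#include <thread>
#include <vector>

// Hypothetical helper: blocks until the semaphore is acquired.
// Retries when a signal interrupts the wait (errno == EINTR) and
// returns false on any other error, so the caller can refuse to
// enter the critical section unprotected.
bool acquire(sem_t& sem) {
    int r;
    do {
        r = sem_wait(&sem);
    } while (r == -1 && errno == EINTR);
    return r == 0;
}

// Increments a shared counter from several threads, using a binary
// semaphore as a mutex. Returns the final count, or -1 if the
// semaphore could not be initialised.
long countWithSemaphore(int threads, int iterations) {
    sem_t sem;
    if (sem_init(&sem, 0, 1) != 0) {
        return -1;
    }
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                if (!acquire(sem)) {
                    return;  // never touch the counter without the lock
                }
                ++counter;
                sem_post(&sem);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    sem_destroy(&sem);
    return counter;
}
```

With working semaphores, `countWithSemaphore(4, 10000)` returns 40000 on every run. On older toolchains the program may need to be built with `-pthread`.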

     
  • Paul Fitzpatrick

    Semaphores do seem to be implicated in this problem. Under valgrind with ACE4YARP, there are no problems, whereas on the problem systems there's a peppering of reports like this:

    ==30603== Conditional jump or move depends on uninitialised value(s)
    ==30603== at 0x443CF1D: sem_post@@GLIBC_2.1 (in /lib/i686/cmov/libpthread-2.7.so)
    ==30603== by 0x80F66E0: ACE_Semaphore::release() (Semaphore.inl:59)
    ==30603== by 0x80F66F7: yarp::os::impl::SemaphoreImpl::post() (SemaphoreImpl.h:66)
    ==30603== by 0x81518AE: yarp::os::impl::ThreadImpl::changeCount(int) (ThreadImpl.cpp:247)
    ==30603== by 0x8151A0F: yarp::os::impl::ThreadImpl::start() (ThreadImpl.cpp:197)
    ==30603== by 0x8150673: yarp::os::Thread::start() (Thread.cpp:127)
    ==30603== by 0x80D41EC: ThreadTest::testMin() (ThreadTest.cpp:364)
    ==30603== by 0x80D5F67: ThreadTest::runTests() (ThreadTest.cpp:394)
    ==30603== by 0x815300D: yarp::os::impl::UnitTest::run(int, char**) (UnitTest.cpp:126)
    ==30603== by 0x8152517: yarp::os::impl::UnitTest::runSubTests(int, char**) (UnitTest.cpp:95)
    ==30603== by 0x8153066: yarp::os::impl::UnitTest::run(int, char**) (UnitTest.cpp:130)
    ==30603== by 0x80D7C5D: main (harness.cpp:63)
    ==30603== Uninitialised value was created by a heap allocation
    ==30603== at 0x40237EE: operator new(unsigned int, std::nothrow_t const&) (vg_replace_malloc.c:244)
    ==30603== by 0x4148448: ACE_Semaphore::ACE_Semaphore(unsigned int, int, char const*, void*, int) (in /usr/lib/libACE-5.6.3.so)
    ==30603== by 0x80B2F5F: yarp::os::impl::SemaphoreImpl::SemaphoreImpl(int) (SemaphoreImpl.h:32)
    ==30603== by 0x8151783: getThreadMutex() (ThreadImpl.cpp:26)
    ==30603== by 0x8152028: yarp::os::impl::ThreadImpl::ThreadImpl() (ThreadImpl.cpp:82)
    ==30603== by 0x8150C58: ThreadCallbackAdapter::ThreadCallbackAdapter(yarp::os::Thread&) (Thread.cpp:22)
    ==30603== by 0x8150AFF: yarp::os::Thread::Thread() (Thread.cpp:80)
    ==30603== by 0x80D3C1C: ThreadTest::Thread0::Thread0() (ThreadTest.cpp:48)
    ==30603== by 0x80D41D6: ThreadTest::testMin() (ThreadTest.cpp:363)
    ==30603== by 0x80D5F67: ThreadTest::runTests() (ThreadTest.cpp:394)
    ==30603== by 0x815300D: yarp::os::impl::UnitTest::run(int, char**) (UnitTest.cpp:126)
    ==30603== by 0x8152517: yarp::os::impl::UnitTest::runSubTests(int, char**) (U

     
  • Paul Fitzpatrick

    The Semaphore problem can be solved by simply bypassing ACE and using the Linux semaphore implementation directly. This bypass is committed to CVS, so the CVS version of YARP now passes regression tests on Ubuntu 8.10 with the stock ACE package. We're still looking to see what the underlying problem is.
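
For readers wondering what such a bypass might look like, the following is a rough sketch of a semaphore built directly on the Linux/POSIX `sem_t` calls. It is not the code that was committed to CVS: the class name `PosixSemaphore` and the `tryWait` method are hypothetical, and only `wait`/`post` echo method names visible in the traces above.

```cpp
#include <cassert>
#include <cerrno>
#include <semaphore.h>
#include <stdexcept>

// Hypothetical wrapper (not the actual YARP bypass): a counting
// semaphore implemented on top of an unnamed POSIX semaphore.
class PosixSemaphore {
public:
    explicit PosixSemaphore(unsigned int initial = 1) {
        // Second argument 0: shared between threads of this process only.
        if (sem_init(&sem_, 0, initial) != 0) {
            throw std::runtime_error("sem_init failed");
        }
    }
    ~PosixSemaphore() { sem_destroy(&sem_); }

    // A sem_t must not be copied, so the wrapper is non-copyable.
    PosixSemaphore(const PosixSemaphore&) = delete;
    PosixSemaphore& operator=(const PosixSemaphore&) = delete;

    // Blocks until the count can be decremented. A wait interrupted by
    // a signal (EINTR) is retried; any other failure is reported.
    void wait() {
        while (sem_wait(&sem_) == -1) {
            if (errno != EINTR) {
                throw std::runtime_error("sem_wait failed");
            }
        }
    }

    // Non-blocking attempt: true if the count was decremented,
    // false if it was zero (or the call failed).
    bool tryWait() { return sem_trywait(&sem_) == 0; }

    // Increments the count, waking one waiter if any.
    void post() {
        if (sem_post(&sem_) != 0) {
            throw std::runtime_error("sem_post failed");
        }
    }

private:
    sem_t sem_;
};
```

The sketch assumes Linux, where unnamed semaphores created with `sem_init` are supported. Checking every return value is deliberate, since the failures discussed in this ticket come from waits that fail without stopping the caller.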

     
  • Paul Fitzpatrick

    After further discussion, the ACE Semaphore bypass was turned off in CVS, to keep things broken until the underlying problem is fully understood.

     
  • Daniele E. Domenichelli

    • status: open --> closed-out-of-date
    • Group: --> v1.0 (example)