
#819 5 seconds timeout when starting a device server in a subprocess

closed-wont-fix
nobody
None
C++ API
5
2017-01-09
2016-10-10
No

While writing unit tests for pytango, I noticed an intriguing timeout issue in the C++ API: starting a device server in a subprocess takes a long time if the parent process has already made some unrelated tango calls. More precisely, the Util object instantiation takes 5 seconds.

Here's the code to reproduce this issue:

import os
import tango
from tango.server import Device, DeviceMeta

class Test(Device):
    __metaclass__ = DeviceMeta

# Unrelated tango call
db = tango.Database()
# Run the server in a subprocess (using fork)
os.wait() if os.fork() else Test.run_server()

I'm fairly confident the equivalent C++ code will produce the same issue.

Discussion

  • Bourtembourg Reynald

    Dear Vincent,

    thanks for the bug report.
    Actually, it seems to be a bad idea to do a fork in a multithreaded program as stated on this link:
    http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them.
    If you are using fork, you should rather start your device server using one of the functions from the exec family. This is what the Starter device server is doing for instance. (Starter is actually doing a double fork + execxxx)
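A minimal sketch of the "fork then exec" idea, assuming Python 3 on a POSIX system. `subprocess.run` creates the child and immediately replaces it with a fresh interpreter, so the child starts without any state copied from the parent's threads. The helper name `run_in_fresh_interpreter` and the inline child command are hypothetical stand-ins for a real server entry point:

```python
import subprocess
import sys


def run_in_fresh_interpreter(code, *args):
    """Run Python source `code` in a new interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", code, *args],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return completed.stdout


# Stand-in for a real server script: echo the arguments the child received.
child_code = "import sys; print(' '.join(sys.argv[1:]))"
output = run_in_fresh_interpreter(child_code, "Test", "test")
```

This only works when there is a script or module to execute, which is the constraint that matters when device classes are built dynamically inside test functions.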
    In the example you provided, the creation of the Database object will initialize CORBA (as a client) and connect to the database CORBA object. This will create some CORBA threads. omniORB will create a special thread called the Scavenger thread which will scan the CORBA threads for idle connections (every 5 s by default, defined by ORBscanGranularity parameter) and kill the idle connection threads automatically after a while (180s or 120s by default).
    When the device server is started, Tango tries to destroy the ORB which was previously created, because it was created for a CORBA client only. We need an ORB for a CORBA server in this case.
    The 5 seconds timeout you are observing happens when Tango tries to destroy the previously created ORB. omniORB then waits for each previously created thread to stop, with a timeout corresponding to the ORBscanGranularity parameter (5 seconds by default). But because of what is described in the link I provided before, especially because of some critical sections/mutexes, the child process cannot stop the threads which were created by the parent process, so it has to wait until the end of this timeout.
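The underlying fork behaviour can be observed without Tango or omniORB. The following sketch, assuming Python 3 on a POSIX system, starts a background thread, forks, and asks the child how many threads it has. Only the thread that called fork() exists in the child, which is why waiting there for the parent's threads to stop cannot succeed. The function name is hypothetical, and recent Python versions may emit a DeprecationWarning when forking a multi-threaded process:

```python
import os
import threading


def thread_count_in_forked_child():
    """Start a background thread, fork, and return the child's thread count."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the worker thread was not duplicated by fork().
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: read the child's answer, then stop the worker thread.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_count = int(reader.read())
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return child_count


child_threads = thread_count_in_forked_child()
```

The parent had at least two threads at the time of the fork, while the child reports a single one.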
    One work-around to remove this 5 seconds waiting time is to set the ORBscanGranularity environment variable to 0. This disables the omniORB Scavenger thread and makes the ORB destruction fast, but be aware that idle connection threads will no longer be removed if you do that. This might be acceptable in your use case, since this is for unit tests, but please be careful when using it.
    Please be aware that passing -ORBscanGranularity 0 as argin parameter when starting the device server will not change the behaviour since this parameter will be taken into account only after the previous ORB has been destroyed. You would still get the timeout in this case. Using the ORBscanGranularity environment variable should work to make this 5 seconds waiting time disappear.
    What you could do as well is setting ORBscanGranularity to 0 and pass the -ORBscanGranularity argin parameter set to 5 at device server creation. (I am not familiar with PyTango but I guess there should be a way to pass this kind of parameter at device server creation.) This way, the device server will still run with the Scavenger thread and idle connections threads will still be automatically removed for your device server.
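A minimal Python sketch of this combination, under stated assumptions: the environment variable must be set before the first Tango client call creates the ORB, and the argument list format (server name, instance name, then options) is assumed rather than confirmed. The helper name `prepare_scan_granularity` is hypothetical:

```python
import os


def prepare_scan_granularity(server, instance, server_granularity=5):
    """Set the environment variable to 0 and build the server argument list."""
    # Intended to be read when the first (client) ORB is initialised.
    os.environ["ORBscanGranularity"] = "0"
    # Option intended to restore scanning for the ORB created by the server.
    return [server, instance, "-ORBscanGranularity", str(server_granularity)]


args = prepare_scan_granularity("Test", "test")
# With PyTango available, the remaining steps would be (not executed here):
#   db = tango.Database()          # unrelated client call in the parent
#   os.wait() if os.fork() else Test.run_server(args)
```

The order matters: calling the helper after `tango.Database()` would leave the first ORB with the default 5 second granularity.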

    Hoping this helps a bit.
    Kind regards

    Reynald (with the help of Manu for trying to understand this issue).

     
  • Vincent Michel

    Vincent Michel - 2016-10-27

    Thanks Reynald for the thorough answer, it's much clearer now.

    If you are using fork, you should rather start your device server using one of the functions from the exec family. This is what the Starter device server is doing for instance. (Starter is actually doing a double fork + execxxx)

    In this case, we can't really do that. In fact, the device classes are created in test functions (sometimes dynamically) so there is no specific server to run using exec.

    Please be aware that passing -ORBscanGranularity 0 as argin parameter when starting the device server will not change the behaviour since this parameter will be taken into account only after the previous ORB has been destroyed. You would still get the timeout in this case. Using the ORBscanGranularity environment variable should work to make this 5 seconds waiting time disappear.

    Indeed, I just checked with this code:

    args = 'Test test -ORBscanGranularity 0'.split(' ')
    os.wait() if os.fork() else Test.run_server(args)
    

    And the 5 seconds timeout is still here.

    Using the ORBscanGranularity environment variable should work to make this 5 seconds waiting time disappear.
    What you could do as well is setting ORBscanGranularity to 0 and pass the -ORBscanGranularity argin parameter set to 5 at device server creation. (I am not familiar with PyTango but I guess there should be a way to pass this kind of parameter at device server creation.) This way, the device server will still run with the Scavenger thread and idle connections threads will still be automatically removed for your device server.

    This solution actually works fine!

    import os
    import tango
    from tango.server import Device, DeviceMeta
    
    class Test(Device):
        __metaclass__ = DeviceMeta
    
    # Set granularity to 0
    os.environ['ORBscanGranularity'] = '0'
    # Unrelated tango call
    db = tango.Database()
    # Run the server in a subprocess (using fork)
    os.wait() if os.fork() else Test.run_server()
    

    Thanks a lot for your help.

    I can add a bit of information about my experimentation with unit-testing.

    The thread approach:

    • Pro: lets us access the internals of the device instance at runtime using mocks
    • Con: stopping a server and starting a new one causes a segfault (known pytango issue)

    The subprocess approach:

    • Pro: lets us restart servers without crashing
    • Con: ORBscanGranularity needs to be set to 0 (might cause other problems?)

    I then use a library called pytest-xdist, which lets me combine those approaches to get the best of both worlds. This library has a --boxed option that automatically runs the collected tests in isolated subprocesses. Since that happens before any kind of CORBA operation, the problem we've been discussing does not apply. The test function can then either run one server in a thread, or several servers in subprocesses, depending on the use case.
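The isolation idea can be illustrated with a small sketch, assuming Python 3 on a POSIX system. This is not how pytest-xdist implements --boxed; `run_boxed` and the two sample tests are hypothetical. The point is that the parent forks before doing any work itself, so each child starts from a process that has never created an ORB or extra threads:

```python
import os


def run_boxed(test_func):
    """Run test_func in a forked child; return True if it did not raise."""
    pid = os.fork()
    if pid == 0:
        # Child: anything the test creates lives and dies in this process.
        try:
            test_func()
        except BaseException:
            os._exit(1)
        os._exit(0)
    # Parent: only collects the child's exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def passing_test():
    assert 1 + 1 == 2


def failing_test():
    raise RuntimeError("simulated failure")


results = [run_boxed(passing_test), run_boxed(failing_test)]
```

A failure or crash in one child is reported through its exit status and does not affect the next test.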

    Thanks,
    /Vincent

     
  • Bourtembourg Reynald

    • status: open --> closed-wont-fix
     
