#787 Event errors persisting after server restart

closed
nobody
None
C++ API
5
2017-01-06
2016-04-05
No

We are experiencing issues with event subscriptions that only seem to manifest when subscribing to (change) events from large numbers of devices (100s) across many servers (10s). When the servers are restarted, there is some probability that the client's subscriptions for all devices on a given server end up corrupted.

The symptoms are that the "corrupt" subscriptions still appear to receive change events as usual, but they also receive errors like this every 10 seconds or so:

Error: tango://nb-johfor-0:10000/r3-312u5/wat/fsw-02/state
Tango error stack
Severity = ERROR
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()

This is the same error that is correctly reported when the servers are really down; it just never goes away for some devices.

I have reproduced this running locally on my machine, with some 1000 dummy Python devices across 20 servers, with polling on the State attribute, and with a minimal client written in C++ that listens to the State attribute for all the devices. After killing the servers and starting them again, while keeping the client running, around 3-4 random servers (and the corresponding 100s of devices) exhibit the above problem.
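For anyone who wants to watch for the symptom from a client, a rough Python sketch could look like the one below. It is not the C++ client used in the reproduction. The `EventTally` class and the `subscribe_all` helper are hypothetical names made up for illustration, and it assumes a PyTango version that is importable as `tango` (older releases use `PyTango`). The tally flags attributes that receive both normal change events and `API_EventTimeout` errors in the same observation window.

```python
from collections import defaultdict

# Reason string copied from the error stack shown above.
TIMEOUT_REASON = "API_EventTimeout"


class EventTally:
    """Hypothetical helper: counts good events and timeout errors per attribute."""

    def __init__(self):
        self.good = defaultdict(int)
        self.timeouts = defaultdict(int)

    def record(self, attr_name, error_reasons=()):
        """Record one event; error_reasons is empty for a normal change event."""
        if TIMEOUT_REASON in error_reasons:
            self.timeouts[attr_name] += 1
        elif not error_reasons:
            self.good[attr_name] += 1

    def push_event(self, event):
        """Callback entry point; PyTango accepts any object with push_event."""
        reasons = [e.reason for e in event.errors] if event.err else []
        self.record(event.attr_name, reasons)

    def reset(self):
        """Start a new observation window, e.g. once the servers are back up."""
        self.good.clear()
        self.timeouts.clear()

    def suspects(self):
        """Attributes with both good events and timeouts in the current window."""
        return sorted(a for a in self.timeouts if self.good.get(a, 0) > 0)


def subscribe_all(device_names, tally):
    """Subscribe to State change events for every device (needs PyTango)."""
    import tango  # assumption: PyTango importable as 'tango'

    proxies = []
    for name in device_names:
        proxy = tango.DeviceProxy(name)
        proxy.subscribe_event("State", tango.EventType.CHANGE_EVENT, tally)
        proxies.append(proxy)
    return proxies  # keep references so the subscriptions stay alive
```

The intended use is to call `reset()` once the restarted servers are running again, wait for a few keepalive periods (the errors arrive roughly every 10 seconds), and then inspect `suspects()`. Without the reset, timeouts from the genuine outage would be counted together with events received before the servers were killed.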

It is an Ubuntu machine, and I have tested with TANGO 8.1.2 (distribution packages) and 9.2.2 (built from tarball), both times with ZMQ 4.0.5. We also see it in our CentOS 7 production environment and with PyTango clients.

Some thoughts: this seems to be related to the ZMQ "keepalive" thread somehow. Perhaps it is not being informed about the new subscription and therefore keeps reporting the error. The randomness in the behavior also suggests a race condition. I'm assuming this is a client issue; I have not really looked into whether other kinds of devices make a difference.

Discussion

  • Emmanuel Taurel

    Emmanuel Taurel - 2016-04-07

    Hi Johan,

    Is it possible for you to minimise your Python device server code and your C++ client code (while still reproducing the problem) and send them to us? We could use them to reproduce and (hopefully) fix the problem.

    Thanks in advance

    Emmanuel

     
  • Emmanuel Taurel

    Emmanuel Taurel - 2016-04-12

    Hello,

    Bug fix now committed in the repo. A patch file for Tango 9.2.2 will be available soon.

    Cheers

    Emmanuel

     
  • Andreas Persson

    Andreas Persson - 2016-04-14

    Hi Emmanuel,
    Great news! Will you also make a patch for Tango-8.1.2? I have tried to port the changes to the 8.1.2 source distribution (on top of patches 1-4). It seems to solve this problem but I can't tell if it breaks something else. Can you have a look at the attached patch and let us know if it is correct?

    Thanks,
    Andreas

     
  • Emmanuel Taurel

    Emmanuel Taurel - 2016-04-14

    Hi Andreas,

    I don't think we will make a patch for Tango 8. The patch attached to your post seems correct to me.
    Note it also solves bug 788 but I don't think you will consider this as a problem!

    Cheers

    Emmanuel

     
  • Andreas Persson

    Andreas Persson - 2016-04-14

    Ok, thanks for checking.

    Cheers,
    Andreas

     
  • Bourtembourg Reynald

    • Status: open --> closed
     
