We are experiencing issues with event subscriptions that only seem to manifest when subscribing to change events from large numbers of devices (hundreds) across many servers (tens). When restarting the servers, there is a certain probability that the client's subscriptions for all devices on a given server become corrupted.
The symptoms are that the "corrupt" subscriptions still appear to receive change events as usual, but they also receive errors like this every 10 seconds or so:
Error: tango://nb-johfor-0:10000/r3-312u5/wat/fsw-02/state
Tango error stack
Severity = ERROR
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()
This is the same error that is correctly reported when the servers really are down; for some devices it just never goes away.
I have reproduced this running locally on my machine, with around 1000 dummy Python devices across 20 servers, with polling enabled on the State attribute, and with a minimal client written in C++ that listens to the State attribute of all the devices. After killing the servers and starting them again, while keeping the client running, around 3-4 random servers (and the corresponding hundreds of devices) exhibit the above problem.
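For reference, a minimal client along these lines could be sketched in PyTango (the original reproduction used a C++ client; the device names, device count, and the StateEventCallback class here are illustrative placeholders, and running it requires PyTango and a running Tango facility):

```python
class StateEventCallback:
    """Counts change events and error events (e.g. API_EventTimeout) per device.

    The symptom described above would show up as a device accumulating
    both normal events and errors at the same time.
    """

    def __init__(self):
        self.events = 0
        self.errors = 0

    def push_event(self, event):
        # event.err is True for error events such as API_EventTimeout
        if event.err:
            self.errors += 1
        else:
            self.events += 1


if __name__ == "__main__":
    import time
    import tango  # PyTango

    callbacks = {}
    # Subscribe to the State attribute of every dummy device; the
    # "test/dummy/N" naming is an assumption for illustration only.
    for i in range(1, 1001):
        name = "test/dummy/%d" % i
        proxy = tango.DeviceProxy(name)
        cb = StateEventCallback()
        callbacks[name] = cb
        proxy.subscribe_event("State", tango.EventType.CHANGE_EVENT, cb)

    # Keep the client running while the servers are restarted; "corrupt"
    # subscriptions keep receiving events but also accumulate errors.
    while True:
        time.sleep(10)
        stuck = [n for n, cb in callbacks.items() if cb.errors and cb.events]
        print("devices receiving both events and errors:", len(stuck))
```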
It is an Ubuntu machine, and I have tested with TANGO 8.1.2 (distribution packages) and 9.2.2 (built from tarball), both times with ZMQ 4.0.5. We also see it in our CentOS 7 production environment and with PyTango clients.
Some thoughts: this seems to be related to the ZMQ "keepalive" thread somehow; perhaps it is not being informed about the new subscription and therefore keeps reporting the error. The randomness of the behaviour also suggests some race condition. Finally, I'm assuming this is a client issue; I have not really looked into whether other kinds of devices make a difference.
Hi Johan,
Is it possible for you to minimise your Python device server code and your C++ client code (while still reproducing the problem) and send them to us? We could use them to reproduce and (hopefully) fix the problem.
Thanks in advance
Emmanuel
Hello,
Bug fix now committed in the repo. A patch file for Tango 9.2.2 will be available soon.
Cheers
Emmanuel
Hi Emmanuel,
Great news! Will you also make a patch for Tango 8.1.2? I have tried to port the changes to the 8.1.2 source distribution (on top of patches 1-4). It seems to solve this problem, but I can't tell whether it breaks something else. Can you have a look at the attached patch and let us know if it is correct?
Thanks,
Andreas
Hi Andreas,
I don't think we will make a patch for Tango 8. The patch attached to your post seems correct to me.
Note it also solves bug 788 but I don't think you will consider this as a problem!
Cheers
Emmanuel
Ok, thanks for checking.
Cheers,
Andreas