Hi,
I work at Instrumental Technologies in Slovenia (known mostly for the Libera BPM instruments). We have developed a generic Tango Adapter for our instruments, which acts as a protocol wrapper, mapping our internal network protocol (mci) to the Tango protocol. Our product is based on the Tango 8.1.2 framework.
We have noticed that the application occasionally crashes when the ZmqEventSupplier::push_event() method is called intensively from multiple threads. I have investigated the problem and fixed it in our local copy, but it would be nice if the fix were integrated into the official Tango repository. Note that, judging by the code in the latest release tag (release_9_0_7), the problem is still there.
The problem is an incorrect locking mechanism within the push_event() method. The ZmqEventSupplier class is a singleton, responsible for sending zmq messages from the server to the clients. It allows several threads to submit their data messages by calling push_event(); their requests are serialized by the internal push_mutex. Besides the message data, which is specified by the caller, push_event() sends several internal messages. The problem shows up with endian_mess. This message holds the endianness of the host system and is always sent to the clients together with the message data. The catch is that each time a zmq message is sent to the client, it is marked as invalid. In order to send endian_mess again the next time, push_event() keeps a copy of the endian data in the endian_mess_2 variable, which is never sent and is thus always valid. After endian_mess has been sent through the zmq channel, it is copied back from endian_mess_2 so that it is valid again for the next send.
The problem is that copying the data from endian_mess_2 to endian_mess is not always protected by push_mutex. Thus it can happen that, before endian_mess becomes valid again, another thread gets scheduled and sends the invalid endian_mess zmq message. This triggers an assertion in the zmq library, which terminates the application.
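To make the bad interleaving concrete, here is a minimal single-threaded sketch that replays it step by step. It does not use the real Tango or zmq code: Message, send() and the Supplier struct are hypothetical stand-ins, and the only names taken from the description above are endian_mess and endian_mess_2.

```cpp
// Hypothetical stand-in for a zmq message: sending consumes it.
struct Message {
    bool valid = true;
};

// Mimics a zmq send: the message is invalid afterwards.
// Returns false if the message was already invalid, which is the
// situation where the real zmq library would hit its assertion.
bool send(Message& m) {
    if (!m.valid) {
        return false;
    }
    m.valid = false;
    return true;
}

// Hypothetical, heavily reduced model of the supplier state.
struct Supplier {
    Message endian_mess;    // sent with every event
    Message endian_mess_2;  // never sent, so always valid

    bool send_endian() { return send(endian_mess); }
    void restore_endian() { endian_mess = endian_mess_2; }
};

// Thread A sends, then thread B sends before A has restored the message.
// Returns whether B's send succeeded.
bool second_send_ok_without_restore() {
    Supplier s;
    s.send_endian();            // thread A: send, lock released too early
    bool ok = s.send_endian();  // thread B: sends the consumed message
    s.restore_endian();         // thread A: restore comes too late
    return ok;
}

// Same two sends, but the restore happens before B gets its turn.
bool second_send_ok_with_restore() {
    Supplier s;
    s.send_endian();            // thread A: send
    s.restore_endian();         // thread A: restore, still "under the lock"
    return s.send_endian();     // thread B: message is valid again
}
```

In this model the second send fails only when the restore step is allowed to slip behind another sender, which is the window described above.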
I have fixed the problem by releasing push_mutex only after endian_mess has become valid again, as shown in the attached diff file. Please review the change; if it is accepted and you give me write access, I can commit it myself.
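The idea of the fix can be sketched as follows. This is not the attached diff, only an illustration under assumptions: the Supplier class, the Message type and the run_stress() helper are hypothetical, and std::mutex stands in for whatever mutex type the real code uses. The point is simply that the send and the restore from endian_mess_2 sit inside the same critical section.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a zmq message: sending consumes it.
struct Message {
    bool valid = true;
};

// Hypothetical, reduced model of the supplier with the corrected locking.
class Supplier {
public:
    void push_event() {
        std::lock_guard<std::mutex> guard(push_mutex);
        if (!endian_mess.valid) {
            ++invalid_sends;          // the real zmq library would assert here
        }
        endian_mess.valid = false;    // "send": the message is consumed
        endian_mess = endian_mess_2;  // restore while the lock is still held
    }                                 // push_mutex is released only here

    int invalid_send_count() {
        std::lock_guard<std::mutex> guard(push_mutex);
        return invalid_sends;
    }

private:
    std::mutex push_mutex;
    Message endian_mess;    // sent with every event
    Message endian_mess_2;  // never sent, so always valid
    int invalid_sends = 0;
};

// Calls push_event() from several threads and reports how many times
// an already-consumed endian message would have been sent.
int run_stress(int thread_count, int events_per_thread) {
    Supplier supplier;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&supplier, events_per_thread] {
            for (int i = 0; i < events_per_thread; ++i) {
                supplier.push_event();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return supplier.invalid_send_count();
}
```

With the restore inside the guarded region, no thread can observe endian_mess in its consumed state, so run_stress() should report zero invalid sends regardless of the thread count.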
br, damijan
Hi Damijan,
Thanks very much for the detailed analysis of the problem and for the patch. The fix has been included in SVN and will obviously be part of the next release (or next patch).
Emmanuel