There is a serious deadlock issue when using TCP with OpenSIPS (1.8.0-tls). I found a paper that reaches the same conclusion (although it discusses OpenSER circa 2008): http://www.cs.rice.edu/CS/Architecture/docs/ram-ispass08.pdf
I'll quote the relevant part of Section 6:
This can lead to deadlock in the following situation. When a
worker process requests a connection from
the supervisor process, it then blocks waiting to receive that
file descriptor. If, at the same time, the supervisor process
blocks waiting to send a new connection to the same worker
(since the buffer at the receiver is full), the two processes
will deadlock. Once the supervisor process deadlocks, no
other worker can make progress either, as they will quickly
need their own connections from the supervisor process.
Similarly, no new connections will be accepted. This clearly
illustrates that in an event-driven server, one must be careful
to only read from sockets when the event mechanism says
there is something to read and only write to sockets when
the event mechanism says there is space to write.
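
To make the quoted rule concrete, here is a minimal sketch in Python (not OpenSIPS code, which is C). It assumes a Unix socket pair stands in for the channel between the main TCP process and one worker; the names supervisor_end, worker_end, fill_until_full and demo are hypothetical. It fills the channel until the kernel buffer is full, shows that the event mechanism then reports the socket as not writable (so a blocking send at that point would hang), and shows that it becomes writable again once the other side drains it.

```python
import selectors
import socket


def fill_until_full(sock):
    """Send on a non-blocking socket until the kernel buffer is full.

    Returns the number of bytes queued before send() would have blocked.
    """
    sock.setblocking(False)
    chunk = b"x" * 4096
    queued = 0
    while True:
        try:
            queued += sock.send(chunk)
        except BlockingIOError:
            # A blocking socket would hang here instead of raising.
            return queued


def demo():
    # Hypothetical stand-ins for the supervisor <-> worker channel.
    supervisor_end, worker_end = socket.socketpair()
    sel = selectors.DefaultSelector()
    try:
        queued = fill_until_full(supervisor_end)

        # Ask the event mechanism whether there is space to write.
        sel.register(supervisor_end, selectors.EVENT_WRITE)
        writable_when_full = bool(sel.select(timeout=0))

        # The worker drains everything that was queued for it.
        worker_end.setblocking(False)
        drained = 0
        while drained < queued:
            try:
                drained += len(worker_end.recv(65536))
            except BlockingIOError:
                break

        writable_after_drain = bool(sel.select(timeout=0))
        return queued, writable_when_full, drained, writable_after_drain
    finally:
        sel.close()
        supervisor_end.close()
        worker_end.close()


if __name__ == "__main__":
    print(demo())
```

In the deadlock described above, the worker never reaches the drain step because it is itself blocked waiting for a reply from the supervisor, so the supervisor's blocking send never completes.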
I can reliably reproduce this deadlock with any number of TCP children. Interestingly, it seems to happen sooner with a larger number of children. Under constant load, once the main TCP process deadlocks, all the children deadlock as well.
It seems to be rate related. Using SIPp to drive TCP traffic to an OpenSIPS server, I do not encounter the deadlock at 50 registers/second. However, if I increase the traffic load, a deadlock occurs within 30 seconds. My theory is that if the TCP children can't process a message and reply faster than messages are coming in (in this case, in under 20 ms per message), then the deadlock will occur.
For completeness, the GDB backtrace output of the deadlocked processes when running two TCP children is attached.