#578 TCP Deadlock

1.8.x
closed-fixed
core (110)
9
2013-01-28
2012-11-09
David Sanders
No

There is a serious deadlock issue when using TCP with OpenSIPS (1.8.0-tls). I found this paper which has the same conclusion (but is discussing OpenSER circa 2008): http://www.cs.rice.edu/CS/Architecture/docs/ram-ispass08.pdf

I'll quote the relevant part of Section 6:

This can lead to deadlock in the following situation. When a
worker process requests a connection from
the supervisor process, it then blocks waiting to receive that
file descriptor. If, at the same time, the supervisor process
blocks waiting to send a new connection to the same worker
(since the buffer at the receiver is full), the two processes
will deadlock. Once the supervisor process deadlocks, no
other worker can make progress either, as they will quickly
need their own connections from the supervisor process.
Similarly, no new connections will be accepted. This clearly
illustrates that in an event-driven server, one must be careful
to only read from sockets when the event mechanism says
there is something to read and only write to sockets when
the event mechanism says there is space to write.
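
The rule in the last sentence of the quote can be illustrated with a small sketch. It is not OpenSIPS or OpenSER code: `try_send` and `fill_until_full` are hypothetical helper names, and a stream `socketpair` stands in for the supervisor/worker channel purely for portability. The sender asks `select` whether the socket is writable and backs off instead of blocking when it is not.

```python
import select
import socket


def try_send(sock, payload):
    """Send payload only if select() reports the socket as writable.

    Returns True if the payload was handed to the kernel, False if the
    send was skipped because it might have blocked.
    """
    # Zero timeout: poll the current state, never wait.
    _, writable, _ = select.select([], [sock], [], 0)
    if not writable:
        return False
    sock.send(payload)
    return True


def fill_until_full(sock, chunk=b"x" * 1024, limit=100_000):
    """Keep sending until the socket stops being writable; return the count."""
    sent = 0
    while sent < limit and try_send(sock, chunk):
        sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical stand-ins: "supervisor" writes, "worker" never reads.
    supervisor, worker = socket.socketpair()
    # Non-blocking, so a mistake shows up as an error rather than a hang.
    supervisor.setblocking(False)
    queued = fill_until_full(supervisor)
    print(f"queued {queued} chunks, then stopped instead of blocking")
```

A blocking `send` in the same situation would simply stall once the peer's buffer filled up, which is the supervisor half of the deadlock described in the quote.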

I can reliably reproduce this deadlock with any number of TCP children. Interestingly it seems to happen faster with a larger number of children. Under constant load, once the main TCP process deadlocks, all the children will as well.

It seems to be rate related. Using SIPp to drive TCP traffic to an OpenSIPS server, 50 registers/second does not trigger the deadlock. However, if I increase the traffic load, a deadlock occurs within 30 seconds. My theory is that the deadlock occurs when the TCP children can't process a message and reply faster than messages are coming in (in this case, faster than 20 ms).
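
The 20 ms figure is just the inter-arrival time at 50 messages per second. As a rough sketch of the theory above (a simplified single-queue model, not a measurement; `interarrival_ms` and `backlog_after` are hypothetical names), the backlog stays at zero while the per-message processing time keeps up with the arrival rate and grows steadily once it does not:

```python
def interarrival_ms(rate_per_second):
    """Average time between message arrivals, in milliseconds."""
    return 1000.0 / rate_per_second


def backlog_after(rate_per_second, service_ms, seconds):
    """Messages left unprocessed after `seconds` of constant load.

    Assumes a single queue drained at one message per `service_ms`.
    """
    arrived = rate_per_second * seconds
    processed = (1000.0 / service_ms) * seconds
    return max(0.0, arrived - processed)


if __name__ == "__main__":
    print(interarrival_ms(50))         # time budget per message at 50/s
    print(backlog_after(50, 20, 30))   # processing keeps up with arrivals
    print(backlog_after(100, 20, 30))  # arrivals outpace processing
```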

For completeness, the GDB backtrace output of the deadlocked processes when running two TCP children is attached.

Discussion

  • Hi David,

    I attached a patch that should solve the deadlock. It detects when TCP main is about to block while sending an accept/read command to the TCP workers and, if so, drops the TCP connection. There is more or less nothing else you can do about it.

    Please test and let me know if you still have the blocking.

    BTW, do you know a way to set net.unix.max_dgram_qlen from code, per socket? I haven't found anything :(

    Thanks and regards,
    Bogdan

     
    • status: open --> open-fixed
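
The behaviour described for the patch in the comment above (notice that the send to a worker would block, and drop the connection instead) can be sketched roughly as follows. This is not the actual OpenSIPS patch: `dispatch_or_drop` is a hypothetical name, the command bytes are placeholders, and a Unix datagram `socketpair` is assumed for the channel between TCP main and a worker.

```python
import errno
import socket

# errno values that mean "the send could not proceed without waiting"
WOULD_BLOCK = (errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS)


def dispatch_or_drop(worker_sock, conn, command):
    """Hand `command` for `conn` to a worker without ever blocking.

    Returns True if the command was queued. If the worker's socket is
    full, closes `conn` (drops the connection) and returns False.
    """
    try:
        # MSG_DONTWAIT makes this single call non-blocking.
        worker_sock.send(command, socket.MSG_DONTWAIT)
        return True
    except OSError as exc:
        if exc.errno not in WOULD_BLOCK:
            raise
        conn.close()
        return False


if __name__ == "__main__":
    # Hypothetical stand-ins: the "worker" end is never read from.
    main_end, worker_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    for i in range(100_000):
        conn, peer = socket.socketpair()
        queued = dispatch_or_drop(main_end, conn, b"read")
        conn.close()
        peer.close()
        if not queued:
            print(f"worker queue full after {i} commands; connection dropped")
            break
```

The trade-off is the one stated in the comment: a client occasionally loses its connection under overload, but TCP main itself never stalls.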
     
  • David Sanders
    2013-01-24

    Hey Bogdan,

    I'll test out the patch ASAP and get back to you.

    Regarding the net.unix.max_dgram_qlen, I don't think it's possible to set from code, unfortunately. At least from what I've read about it. Really unfortunate in this case.

    - David
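
    For reference, the system-wide value of this sysctl can at least be inspected from code on Linux, where it is exposed under /proc/sys. The sketch below only reads it; `read_max_dgram_qlen` is a hypothetical helper name, and the function returns None where the file does not exist (for example, on a non-Linux system).

```python
from pathlib import Path

# Linux exposes the sysctl net.unix.max_dgram_qlen at this path.
DEFAULT_PATH = "/proc/sys/net/unix/max_dgram_qlen"


def read_max_dgram_qlen(path=DEFAULT_PATH):
    """Return the system-wide max_dgram_qlen as an int, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("net.unix.max_dgram_qlen =", read_max_dgram_qlen())
```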

     
  • Backport done

     
    • status: open-fixed --> closed-fixed
     