#578 TCP Deadlock

Milestone: 1.8.x
Status: closed-fixed
Labels: core (110)
Priority: 9
Updated: 2013-01-28
Created: 2012-11-09
Creator: David Sanders
Private: No

There is a serious deadlock issue when using TCP with OpenSIPS (1.8.0-tls). I found this paper which has the same conclusion (but is discussing OpenSER circa 2008): http://www.cs.rice.edu/CS/Architecture/docs/ram-ispass08.pdf

I'll quote the relevant part of Section 6:

This can lead to deadlock in the following situation. When a
worker process requests a connection from
the supervisor process, it then blocks waiting to receive that
file descriptor. If, at the same time, the supervisor process
blocks waiting to send a new connection to the same worker
(since the buffer at the receiver is full), the two processes
will deadlock. Once the supervisor process deadlocks, no
other worker can make progress either, as they will quickly
need their own connections from the supervisor process.
Similarly, no new connections will be accepted. This clearly
illustrates that in an event-driven server, one must be careful
to only read from sockets when the event mechanism says
there is something to read and only write to sockets when
the event mechanism says there is space to write.
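The rule in the last sentence of the quote can be illustrated with a minimal Python sketch. This is not OpenSIPS or OpenSER code: the socket pair and the names `supervisor_end`, `worker_end`, `can_write`, and `fill_until_full` are hypothetical stand-ins for the supervisor-to-worker channel. It only shows how a sender can ask the event mechanism whether there is space before writing.

```python
import select
import socket

# Hypothetical stand-ins for the supervisor -> worker channel.
# A plain AF_UNIX socket pair is an assumption made for illustration.
supervisor_end, worker_end = socket.socketpair()
supervisor_end.setblocking(False)


def can_write(sock, timeout=0.0):
    """Return True if select() reports space to write on sock."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return bool(writable)


def fill_until_full(sock, payload=b"x" * 64, limit=100_000):
    """Send until select() says the socket is no longer writable.

    Returns how many messages were queued. A blocking send() issued
    after this point could hang until the peer drains its buffer.
    """
    sent = 0
    while sent < limit and can_write(sock):
        try:
            sock.send(payload)
        except BlockingIOError:
            break  # the buffer filled up between select() and send()
        sent += 1
    return sent


# Nobody reads from worker_end, which mimics a worker that is itself
# blocked waiting for something else.
queued = fill_until_full(supervisor_end)
print(f"queued {queued} messages before the channel filled up")
print("writable now:", can_write(supervisor_end))
```

How many messages fit before the channel fills depends on the platform's socket buffer sizes, so the count will differ between systems.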

I can reliably reproduce this deadlock with any number of TCP children. Interestingly it seems to happen faster with a larger number of children. Under constant load, once the main TCP process deadlocks, all the children will as well.

It seems to be rate related. Using SIPp to drive TCP traffic to an OpenSIPS server, 50 registers/second does not trigger the deadlock. However, if I increase the traffic load, a deadlock occurs within 30 seconds. My theory is that if the TCP children can't process a message and reply faster than messages are coming in (in this case, faster than one every 20 ms), then the deadlock will occur.
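The 20 ms figure follows from the arrival rate: at 50 messages per second, one message arrives every 1000 / 50 = 20 ms. A rough sketch of that arithmetic is below. The function names and the fixed per-message service time are assumptions for illustration, not measurements from this report.

```python
def per_message_budget_ms(rate_per_sec):
    """Time available per message if the server is to keep pace."""
    return 1000.0 / rate_per_sec


def backlog_after(rate_per_sec, service_ms, seconds, workers=1):
    """Messages still waiting after `seconds` of constant load.

    Simplified model: arrivals are evenly spaced and every worker
    needs exactly `service_ms` per message (a hypothetical value).
    """
    arrived = rate_per_sec * seconds
    capacity = workers * seconds * 1000.0 / service_ms
    return max(0.0, arrived - capacity)


print(per_message_budget_ms(50))   # 20.0
print(backlog_after(50, 20, 30))   # 0.0
print(backlog_after(100, 20, 30))  # 1500.0
```

In this model, any sustained rate above the combined capacity of the workers produces a backlog that grows without bound, which is consistent with the theory that buffers eventually fill.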

For completeness, the GDB backtraces of the deadlocked processes, captured while running two TCP children, are attached.

Discussion

(Page 1 of 2)
  • David Sanders
    2012-11-09

    Backtrace of deadlocked TCP main process

     
  • David Sanders
    2012-11-09

    Backtrace of first deadlocked TCP child process

     
  • David Sanders
    2012-11-09

    Backtrace of second deadlocked TCP child process

     
  • David Sanders
    2012-11-09

    • priority: 5 --> 7
     
  • David Sanders
    2012-11-09

    I took the liberty of upgrading this to a higher priority bug since it can completely deadlock TCP traffic for a server if the call rate gets too high.

     
  • Hi David, thank you for the report. I will look into it ASAP!

    Regards,
    Bogdan

     
    • assigned_to: nobody --> bogdan_iancu
    • priority: 7 --> 9
     
  • David Sanders
    2012-11-24

    Hi Bogdan,

    Is there any news on this?

    Could you give some kind of prediction on when this could be fixed? Before the end of the year, or not until 1.9?

    Any information on a timeline would help me figure out how to proceed with my project in a timely manner.

    Thanks,
    - David

     
  • David Sanders
    2012-12-03

    Until this can be fixed, I've found that it can be minimized by tweaking "net.unix.max_dgram_qlen" in sysctl. This defaults to 10, and I've seen the issue greatly reduced by increasing it to 100.
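To check the current value before changing it, something like the following Python sketch could be used. It assumes a Linux system where the sysctl is exposed under `/proc/sys`; the helper name `read_max_dgram_qlen` is hypothetical.

```python
from pathlib import Path

# procfs location of the net.unix.max_dgram_qlen sysctl on Linux.
QLEN_PATH = Path("/proc/sys/net/unix/max_dgram_qlen")


def read_max_dgram_qlen(path=QLEN_PATH):
    """Return the current queue length limit, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


current = read_max_dgram_qlen()
if current is None:
    print("net.unix.max_dgram_qlen is not readable on this system")
elif current < 100:
    print(f"net.unix.max_dgram_qlen = {current}; the workaround above used 100")
else:
    print(f"net.unix.max_dgram_qlen = {current}")
```

Changing the value requires root privileges, for example with `sysctl -w net.unix.max_dgram_qlen=100`.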

     
  • Hi David,

    I spent some time debugging this issue and came to the conclusion that it is not an actual deadlock but rather an overload: because all TCP worker processes are busy, the main TCP process keeps assigning tasks to them until the buffer of the socket used for sending tasks fills up and the write operation blocks.

    As a fix, I see making that write non-blocking; if the main TCP process detects that the TCP workers are overloaded, it will start dropping new connect events and connections with incoming traffic that cannot be assigned for processing.

    I will work more on this.

    Regards,
    Bogdan
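The approach described in the post above (a non-blocking write, with tasks dropped when every worker is overloaded) might be sketched in Python as follows. This is not the actual OpenSIPS fix: the functions `make_worker_channels` and `dispatch`, the socket pairs, and the 256-byte task are all hypothetical and only illustrate the idea.

```python
import socket


def make_worker_channels(n):
    """Create n (main_end, worker_end) pairs; main ends are non-blocking."""
    channels = []
    for _ in range(n):
        main_end, worker_end = socket.socketpair()
        main_end.setblocking(False)
        channels.append((main_end, worker_end))
    return channels


def dispatch(channels, task):
    """Try each worker once; return its index, or None if the task is dropped.

    The send never blocks: a full buffer raises BlockingIOError, which is
    treated as "this worker is overloaded". Partial writes are ignored in
    this sketch because the task is small.
    """
    for index, (main_end, _) in enumerate(channels):
        try:
            main_end.send(task)
            return index
        except BlockingIOError:
            continue
    return None


channels = make_worker_channels(2)
accepted = dropped = 0
# No worker reads here, which simulates children that are all busy.
for _ in range(100_000):
    if dispatch(channels, b"t" * 256) is None:
        dropped += 1
        if dropped >= 10:
            break
    else:
        accepted += 1
print(f"accepted={accepted} dropped={dropped}; the main process never blocked")
```

The trade-off is that overload turns into dropped work instead of a stalled main process, so the remaining connections keep being served once the workers catch up.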

     