Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#646 Offiline RTP Proxy causes system freeze

1.10.x
open
modules (454)
7
2013-08-28
2013-04-23
Digipigeon
No

I have a setup running multiple RTP Proxy instances, when multiple rtp proxies fail, CPU utilisation approaches 100%, this does not usually happen with 1 or 2 failures, but when 4+ fail (this can even be 4 instances on an 8 core/instance machine), the issues arrise.

The severity of this event happening has caused the system to stop responding for short periods, observations notice intermittent periods of no logging information and substantial packet drops (due to high CPU usage).

Once RTP Proxy instances that have failed come back online, CPU utilisation drops right down and everything functions as expected.

Discussion

  • Razvan Crainea
    Razvan Crainea
    2013-04-24

    • assigned_to: nobody --> razvancrainea
     
  • Digipigeon
    Digipigeon
    2013-04-24

    Hello Răzvan,

    I would make the same assumption, which is expected, however I don't know why this would lead to 100% CPU usage, surely probing only involved sending and waiting for a udp packet.

    I am using 32 processes on an 8 core machine.
    I am not using rtpproxy_disable_tout parameter.
    I have not tried to disable via MI command.

    Regards Jonathan

     
  • Razvan Crainea
    Razvan Crainea
    2013-04-24

    Hi!

    When opensips is in 100% CPU usage, can you run the following command:

    gdb $OPENSIPS $PID -batch --eval-command="bt full"

    Note you have to replace $OPENSIPS with OpenSIPS' executable path and $PID with the value of the pid that holds the CPU in 100%.

    Best regards,
    Răzvan

     
  • Digipigeon
    Digipigeon
    2013-04-24

    Hi Răzvan,

    Thank you, I will make a note of this, however it is my aim not to let my system enter this state again and I have only been able to replicate with live traffic, so I am not sure when I will have the opportunity to run this again. I will keep you updated though.

    Kind Regards Jonathan

     
  • Digipigeon
    Digipigeon
    2013-04-24

    Dump 1

     
    Attachments
  • Digipigeon
    Digipigeon
    2013-04-24

    Dump 2

     
    Attachments
  • Digipigeon
    Digipigeon
    2013-04-24

    Well this happened sooner than I was expecting, I managed to run the commands on the processes that were taking up the CPU usage, however I cant gurentee that I was able to run the script before the process completed its high usage.

    Please see files, Dump 1 and Dump 2

     
  • Digipigeon
    Digipigeon
    2013-05-02

    • priority: 5 --> 7
     
  • Digipigeon
    Digipigeon
    2013-05-02

    I have updated my code in an attempt to circumvent the issue to include:
    modparam("rtpproxy", "rtpproxy_disable_tout", -1)

    My observations on the above modification are as follows:
    1. CPU regularly runs ah a higher value than before modification.
    2. The CPU still reaches 100% utilisation which then causes kernal packet drops (SIP, just because of high load).
    3. The 100% CPU usage (through my limited observations) does not appear last as long in locked up state.

    Further to this I can observe that multiple RTP servers fail successively, however the chance that 5 RTP servers go offline at the time is unlikely, however I am considering that the successive fails of multiple rtp servers could be a effect of the high CPU utilisation then causing packet drops rather than multiple.

    However in addition this, I have setup a ICMP ping as well as a V ping via UDP in the control channel, both with return negligible packet loss.

     
  • Digipigeon
    Digipigeon
    2013-05-31

    When locked up, if I run an strace against any of the main processes I get

    write(6, "\10\232\4\33\276\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)

    constantly been repeated.

    I have been told that this may be fd #6, if I run lsof on this process I get

    opensips 26254 opensips 6w FIFO 0,8 0t0 8497654 pipe

    Since first observing FIFO, I removed mi_fifo completely just in case it was related, however the problem still persists.

    Not sure if this information will help any.

    Kind Regards Regards Jonathan

     
  • Razvan Crainea
    Razvan Crainea
    2013-05-31

    Hi, Jonathan!

    So basically this happens only when multiple rtpproxy servers are down, right? It's more like a spike in the CPU, it is not a permanent deadblock?
    I think the problem is that having that many proxies disabled, OpenSIPS tries to re-establish the connection with all of them, blocking the signalling.
    Have you tried setting the "rtpproxy_disable_tout" to a higher value (like 3600 seconds)? If my assumptions are correct, the spikes will appear more rare (every hour for a 3600 seconds disable timeout). Also do you have the "rtpproxy_timeout" parameter set? What value? Have you tried setting it to a lower value, let's say "0.3"?

    Best regards,
    Răzvan

     
  • Digipigeon
    Digipigeon
    2013-05-31

    Hi Răzvan,

    Yes that's correct, all cores sit at 100% indefinitely unless I stop it.

    Unfortunately I have not been able to create any strong correlation with RTP Proxy parameters. Any changes to rtpproxy_disable_tout and rtpproxy_timeout don't seem to produce any distinctive results. (I question myself RTP Proxy Module is the sole cause of the issue).

    With rtpproxy_disable_tout=10 and rtpproxy_timeout=0.05 I have experienced almost no lockups for 14 days, then yesterday and today without any config change the problem is happening every 30 minutes. I have reset the RTP Proxy but the problem persists, I am wondering if there is some external trigger, possibly induced by incoming packets.

    I have tried on some of my traffic to bypass the RTP Proxy, but I am not able to do this with all traffic for privacy reasons. When traffic without engaging the RTP proxy, it appears to trigger less often.

    Kind Regards Jonathan

     
  • Digipigeon
    Digipigeon
    2013-05-31

    I have setup the servers pointing to a single <1ms latency RTP Proxy, I have experienced a lockup of all 32 threads.

    each thread is returning

    write(6, "\20Xp\236=\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
    or
    write(6, "\270\340\332\236=\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
    or different data inside.
    with strace.

    However RTP Ticks is 0.

    I this issue might not be related to RTP Proxy after all.

    Kind Regards Jonathan

     
  • Razvan Crainea
    Razvan Crainea
    2013-05-31

    Can you detail who's file descriptor is 6?

     
  • Digipigeon
    Digipigeon
    2013-05-31

    I did

    lsof -p [opensips_id]

    and I got

    opensips 26254 opensips 6w FIFO 0,8 0t0 8497654 pipe

    Please let me know if there is any other commands that you would like me to run.

     
  • Razvan Crainea
    Razvan Crainea
    2013-05-31

    When blocked, please try to fetch the backtrace of all processes. You should use a bash script similar to this one:

    for PID in $(pidof opensips); do gdb $OPENSIPS $PID -batch --eval-command="bt full"; done

     
  • Digipigeon
    Digipigeon
    2013-05-31

    Hi Razvan,

    After looking at this gdb with my limited knowledge I have identified that rabbitmq was the culprit this time around.

    I still believe that there is an issue with RTP Proxy as this issue started happening before I implemented RabbitMQ. I will wait until I can see a lockup which I believe is caused by RTP Proxy, then I will run the same script again and update.

    Kind Regards Jonathan

     
  • Digipigeon
    Digipigeon
    2013-05-31

    Hi Razvan,

    Thanks to that little bit of bash script, I was able to capture the following

    http://www.digipigeon.com/debug_gdb2.gz

    I have had a quick look at it and it appears that this is catching the original issue that I was getting and it triggered in relation to the RTP Proxy Engagement.

    Similar issues happened, 100% utilisation, etc.

    Please take a look.

    Kind Regards Jonathan

     
  • Digipigeon
    Digipigeon
    2013-06-03

    Hi Razvan,

    Has this been able to capture the bit that you require? or is there any other commands that I can run for you?

    Kind Regards

    Jonathan

     
  • Razvan Crainea
    Razvan Crainea
    2013-06-05

    Hi, Jonathan!

    Unfortunately the problem you have here is not simple at all. Basically the idea is that most of your SIP workers are stuck waiting for responses from the rtpproxy servers that are down. And they also acquire the transaction lock. Therefore any reply within that transaction can't be processed (because the previous process was not finished), and the process busy-waits (that's why you see the spikes).

    I saw that you are using 'rtpproxy_autobridge' feature of the RTPProxy module. Is this mandatory to your platform? If not, can you try to remove it for a while?

    Also, you said that when you experience this, the proxy does not recover at all, right? Here's a script that also adds the PID of the process to the output:

    for PID in $(pidof opensips); do
      echo $PID
      gdb $OPENSIPS $PID -batch --eval-command="bt full"
    done
    

    I'd like you to run it two times, with a few seconds between them (around 10 seconds). What I want to find out is if OpenSIPS is really stuck (and where) or it is still able to process traffic.

    Also, run the output of the following command would be useful:

    opensipsctl ps
    

    Finally, can you tell me how many rtpproxy servers your using and how many of them are down?

    Best regards,
    Răzvan

     
    Last edit: Razvan Crainea 2013-06-05
  • Digipigeon
    Digipigeon
    2013-06-27

    Hi Razvan,

    The problem has just recurred for me, however there is sensitive information in these dumps, can you please provide me a direct address that I can send this to you on?

    Kind Regards Jonathan

     
  • Hi all,

    I think that I experience the same issue.
    Have you identified the problem during your privates discussions?

    Kind Regards,

    Igor

     
  • Digipigeon
    Digipigeon
    2013-08-28

    Hello Igor,

    Unfortunately this problem still persists, it is due to some very outdated code inside the RTP module. I have gone through it and lots of gdb traces. It seems to be a collection of alot of things inside not working correctly.

    • Locking waiting for replies.
    • Multiple retries of previously disabled RTP servers.

    I have pretty much given up on this been fixed, so I have made some script modifications (which basically remove some features) from the RTP Proxy module, it dosnt do all the fancy retries now. I have also deployed 7 RTP servers (more than we really need), but its increased my stability now. However I can still cause this crash by increasing the load on single servers.

    http://www.kamailio.org/wiki/devel/rtpproxy-ng - I am hoping that this will get ported over to opensips, then we will have something ALOT better to use.

    Best of luck.

    Kind Regards Jonathan

     
  • Hello Jonathan,

    Thank you for your reply. So, as I understand, you can reproduce the problem by increasing the number of concurrent calls on a single RTP Proxy server?
    Just to know, how much calls cause the crash?

    Kinds Regards, Igor

     
  • Digipigeon
    Digipigeon
    2013-08-28

    I can reproduce it if I set our dispatcher to go to a single server. The problem is worsened by using a single RTP server as well.

    I can reproduce this with as little as 300 calls on a 2X Quad Core 2Ghz box.

    Kind Regards Jonathan