I have a setup running multiple RTP Proxy instances, when multiple rtp proxies fail, CPU utilisation approaches 100%, this does not usually happen with 1 or 2 failures, but when 4+ fail (this can even be 4 instances on an 8 core/instance machine), the issues arrise.
The severity of this event happening has caused the system to stop responding for short periods, observations notice intermittent periods of no logging information and substantial packet drops (due to high CPU usage).
Once RTP Proxy instances that have failed come back online, CPU utilisation drops right down and everything functions as expected.
Hello!
My assumption is that OpenSIPS is trying to probe the other RTPProxy servers without any success. That's probably why the CPU goes that high.
How many OpenSIPS processes are you using? Are you using the rtpproxy_disable_tout [1] parameter? If yes, what value are you setting?
Have you tried to permanently disable via MI commands [2 ]the RTPProxy servers that failed?
[1] http://www.opensips.org/html/docs/modules/1.9.x/rtpproxy#id250074
[2] http://www.opensips.org/html/docs/modules/1.9.x/rtpproxy#id293384
Best regards,
Răzvan
Hello Răzvan,
I would make the same assumption, which is expected, however I don't know why this would lead to 100% CPU usage, surely probing only involved sending and waiting for a udp packet.
I am using 32 processes on an 8 core machine.
I am not using rtpproxy_disable_tout parameter.
I have not tried to disable via MI command.
Regards Jonathan
Hi!
When opensips is in 100% CPU usage, can you run the following command:
gdb $OPENSIPS $PID -batch --eval-command="bt full"
Note you have to replace $OPENSIPS with OpenSIPS' executable path and $PID with the value of the pid that holds the CPU in 100%.
Best regards,
Răzvan
Hi Răzvan,
Thank you, I will make a note of this, however it is my aim not to let my system enter this state again and I have only been able to replicate with live traffic, so I am not sure when I will have the opportunity to run this again. I will keep you updated though.
Kind Regards Jonathan
Dump 1
Dump 2
Well this happened sooner than I was expecting, I managed to run the commands on the processes that were taking up the CPU usage, however I cant gurentee that I was able to run the script before the process completed its high usage.
Please see files, Dump 1 and Dump 2
I have updated my code in an attempt to circumvent the issue to include:
modparam("rtpproxy", "rtpproxy_disable_tout", -1)
My observations on the above modification are as follows:
1. CPU regularly runs ah a higher value than before modification.
2. The CPU still reaches 100% utilisation which then causes kernal packet drops (SIP, just because of high load).
3. The 100% CPU usage (through my limited observations) does not appear last as long in locked up state.
Further to this I can observe that multiple RTP servers fail successively, however the chance that 5 RTP servers go offline at the time is unlikely, however I am considering that the successive fails of multiple rtp servers could be a effect of the high CPU utilisation then causing packet drops rather than multiple.
However in addition this, I have setup a ICMP ping as well as a V ping via UDP in the control channel, both with return negligible packet loss.
When locked up, if I run an strace against any of the main processes I get
write(6, "\10\232\4\33\276\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
constantly been repeated.
I have been told that this may be fd #6, if I run lsof on this process I get
opensips 26254 opensips 6w FIFO 0,8 0t0 8497654 pipe
Since first observing FIFO, I removed mi_fifo completely just in case it was related, however the problem still persists.
Not sure if this information will help any.
Kind Regards Regards Jonathan
Hi, Jonathan!
So basically this happens only when multiple rtpproxy servers are down, right? It's more like a spike in the CPU, it is not a permanent deadblock?
I think the problem is that having that many proxies disabled, OpenSIPS tries to re-establish the connection with all of them, blocking the signalling.
Have you tried setting the "rtpproxy_disable_tout" to a higher value (like 3600 seconds)? If my assumptions are correct, the spikes will appear more rare (every hour for a 3600 seconds disable timeout). Also do you have the "rtpproxy_timeout" parameter set? What value? Have you tried setting it to a lower value, let's say "0.3"?
Best regards,
Răzvan
Hi Răzvan,
Yes that's correct, all cores sit at 100% indefinitely unless I stop it.
Unfortunately I have not been able to create any strong correlation with RTP Proxy parameters. Any changes to rtpproxy_disable_tout and rtpproxy_timeout don't seem to produce any distinctive results. (I question myself RTP Proxy Module is the sole cause of the issue).
With rtpproxy_disable_tout=10 and rtpproxy_timeout=0.05 I have experienced almost no lockups for 14 days, then yesterday and today without any config change the problem is happening every 30 minutes. I have reset the RTP Proxy but the problem persists, I am wondering if there is some external trigger, possibly induced by incoming packets.
I have tried on some of my traffic to bypass the RTP Proxy, but I am not able to do this with all traffic for privacy reasons. When traffic without engaging the RTP proxy, it appears to trigger less often.
Kind Regards Jonathan
I have setup the servers pointing to a single <1ms latency RTP Proxy, I have experienced a lockup of all 32 threads.
each thread is returning
write(6, "\20Xp\236=\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
or
write(6, "\270\340\332\236=\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
or different data inside.
with strace.
However RTP Ticks is 0.
I this issue might not be related to RTP Proxy after all.
Kind Regards Jonathan
Can you detail who's file descriptor is 6?
I did
lsof -p [opensips_id]
and I got
opensips 26254 opensips 6w FIFO 0,8 0t0 8497654 pipe
Please let me know if there is any other commands that you would like me to run.
When blocked, please try to fetch the backtrace of all processes. You should use a bash script similar to this one:
for PID in $(pidof opensips); do gdb $OPENSIPS $PID -batch --eval-command="bt full"; done
This is available here: http://www.digipigeon.com/debug_gdb.gz
Hi Razvan,
After looking at this gdb with my limited knowledge I have identified that rabbitmq was the culprit this time around.
I still believe that there is an issue with RTP Proxy as this issue started happening before I implemented RabbitMQ. I will wait until I can see a lockup which I believe is caused by RTP Proxy, then I will run the same script again and update.
Kind Regards Jonathan
Hi Razvan,
Thanks to that little bit of bash script, I was able to capture the following
http://www.digipigeon.com/debug_gdb2.gz
I have had a quick look at it and it appears that this is catching the original issue that I was getting and it triggered in relation to the RTP Proxy Engagement.
Similar issues happened, 100% utilisation, etc.
Please take a look.
Kind Regards Jonathan
Hi Razvan,
Has this been able to capture the bit that you require? or is there any other commands that I can run for you?
Kind Regards
Jonathan
Hi, Jonathan!
Unfortunately the problem you have here is not simple at all. Basically the idea is that most of your SIP workers are stuck waiting for responses from the rtpproxy servers that are down. And they also acquire the transaction lock. Therefore any reply within that transaction can't be processed (because the previous process was not finished), and the process busy-waits (that's why you see the spikes).
I saw that you are using 'rtpproxy_autobridge' feature of the RTPProxy module. Is this mandatory to your platform? If not, can you try to remove it for a while?
Also, you said that when you experience this, the proxy does not recover at all, right? Here's a script that also adds the PID of the process to the output:
I'd like you to run it two times, with a few seconds between them (around 10 seconds). What I want to find out is if OpenSIPS is really stuck (and where) or it is still able to process traffic.
Also, run the output of the following command would be useful:
Finally, can you tell me how many rtpproxy servers your using and how many of them are down?
Best regards,
Răzvan
Last edit: Razvan Crainea 2013-06-05
Hi Razvan,
The problem has just recurred for me, however there is sensitive information in these dumps, can you please provide me a direct address that I can send this to you on?
Kind Regards Jonathan
Hi all,
I think that I experience the same issue.
Have you identified the problem during your privates discussions?
Kind Regards,
Igor
Hello Igor,
Unfortunately this problem still persists, it is due to some very outdated code inside the RTP module. I have gone through it and lots of gdb traces. It seems to be a collection of alot of things inside not working correctly.
I have pretty much given up on this been fixed, so I have made some script modifications (which basically remove some features) from the RTP Proxy module, it dosnt do all the fancy retries now. I have also deployed 7 RTP servers (more than we really need), but its increased my stability now. However I can still cause this crash by increasing the load on single servers.
http://www.kamailio.org/wiki/devel/rtpproxy-ng - I am hoping that this will get ported over to opensips, then we will have something ALOT better to use.
Best of luck.
Kind Regards Jonathan