OpenSIPS/OpenSER-a versatile SIP Server / Bugs / #646 Offiline RTP Proxy causes system freeze

Razvan Crainea - 2013-04-24

Hello!

My assumption is that OpenSIPS is trying to probe the other RTPProxy servers without any success. That's probably why the CPU goes that high.
How many OpenSIPS processes are you using? Are you using the rtpproxy_disable_tout [1] parameter? If yes, what value are you setting?
Have you tried to permanently disable via MI commands [2 ]the RTPProxy servers that failed?

[1] http://www.opensips.org/html/docs/modules/1.9.x/rtpproxy#id250074
[2] http://www.opensips.org/html/docs/modules/1.9.x/rtpproxy#id293384

Best regards,
Răzvan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Razvan Crainea - 2013-04-24

assigned_to: nobody --> razvancrainea
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-04-24

Hello Răzvan,

I would make the same assumption, which is expected, however I don't know why this would lead to 100% CPU usage, surely probing only involved sending and waiting for a udp packet.

I am using 32 processes on an 8 core machine.
I am not using rtpproxy_disable_tout parameter.
I have not tried to disable via MI command.

Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Razvan Crainea - 2013-04-24

Hi!

When opensips is in 100% CPU usage, can you run the following command:

gdb $OPENSIPS $PID -batch --eval-command="bt full"

Note you have to replace $OPENSIPS with OpenSIPS' executable path and $PID with the value of the pid that holds the CPU in 100%.

Best regards,
Răzvan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-04-24

Hi Răzvan,

Thank you, I will make a note of this, however it is my aim not to let my system enter this state again and I have only been able to replicate with live traffic, so I am not sure when I will have the opportunity to run this again. I will keep you updated though.

Kind Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-04-24

Dump 1

dump1.txt

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-04-24

Dump 2

dump2.txt

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-04-24

Well this happened sooner than I was expecting, I managed to run the commands on the processes that were taking up the CPU usage, however I cant gurentee that I was able to run the script before the process completed its high usage.

Please see files, Dump 1 and Dump 2

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-02

priority: 5 --> 7
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-02

I have updated my code in an attempt to circumvent the issue to include:
modparam("rtpproxy", "rtpproxy_disable_tout", -1)

My observations on the above modification are as follows:
1. CPU regularly runs ah a higher value than before modification.
2. The CPU still reaches 100% utilisation which then causes kernal packet drops (SIP, just because of high load).
3. The 100% CPU usage (through my limited observations) does not appear last as long in locked up state.

Further to this I can observe that multiple RTP servers fail successively, however the chance that 5 RTP servers go offline at the time is unlikely, however I am considering that the successive fails of multiple rtp servers could be a effect of the high CPU utilisation then causing packet drops rather than multiple.

However in addition this, I have setup a ICMP ping as well as a V ping via UDP in the control channel, both with return negligible packet loss.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

When locked up, if I run an strace against any of the main processes I get

write(6, "\10\232\4\33\276\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)

constantly been repeated.

I have been told that this may be fd #6, if I run lsof on this process I get

opensips 26254 opensips 6w FIFO 0,8 0t0 8497654 pipe

Since first observing FIFO, I removed mi_fifo completely just in case it was related, however the problem still persists.

Not sure if this information will help any.

Kind Regards Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Razvan Crainea - 2013-05-31

Hi, Jonathan!

So basically this happens only when multiple rtpproxy servers are down, right? It's more like a spike in the CPU, it is not a permanent deadblock?
I think the problem is that having that many proxies disabled, OpenSIPS tries to re-establish the connection with all of them, blocking the signalling.
Have you tried setting the "rtpproxy_disable_tout" to a higher value (like 3600 seconds)? If my assumptions are correct, the spikes will appear more rare (every hour for a 3600 seconds disable timeout). Also do you have the "rtpproxy_timeout" parameter set? What value? Have you tried setting it to a lower value, let's say "0.3"?

Best regards,
Răzvan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

Hi Răzvan,

Yes that's correct, all cores sit at 100% indefinitely unless I stop it.

Unfortunately I have not been able to create any strong correlation with RTP Proxy parameters. Any changes to rtpproxy_disable_tout and rtpproxy_timeout don't seem to produce any distinctive results. (I question myself RTP Proxy Module is the sole cause of the issue).

With rtpproxy_disable_tout=10 and rtpproxy_timeout=0.05 I have experienced almost no lockups for 14 days, then yesterday and today without any config change the problem is happening every 30 minutes. I have reset the RTP Proxy but the problem persists, I am wondering if there is some external trigger, possibly induced by incoming packets.

I have tried on some of my traffic to bypass the RTP Proxy, but I am not able to do this with all traffic for privacy reasons. When traffic without engaging the RTP proxy, it appears to trigger less often.

Kind Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

I have setup the servers pointing to a single <1ms latency RTP Proxy, I have experienced a lockup of all 32 threads.

each thread is returning

write(6, "\20Xp\236=\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
or
write(6, "\270\340\332\236=\177\0\0", 8) = -1 EAGAIN (Resource temporarily unavailable)
or different data inside.
with strace.

However RTP Ticks is 0.

I this issue might not be related to RTP Proxy after all.

Kind Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Razvan Crainea - 2013-05-31

Can you detail who's file descriptor is 6?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

I did

lsof -p [opensips_id]

and I got

opensips 26254 opensips 6w FIFO 0,8 0t0 8497654 pipe

Please let me know if there is any other commands that you would like me to run.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Razvan Crainea - 2013-05-31

When blocked, please try to fetch the backtrace of all processes. You should use a bash script similar to this one:

for PID in $(pidof opensips); do gdb $OPENSIPS $PID -batch --eval-command="bt full"; done

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

This is available here: http://www.digipigeon.com/debug_gdb.gz

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

Hi Razvan,

After looking at this gdb with my limited knowledge I have identified that rabbitmq was the culprit this time around.

I still believe that there is an issue with RTP Proxy as this issue started happening before I implemented RabbitMQ. I will wait until I can see a lockup which I believe is caused by RTP Proxy, then I will run the same script again and update.

Kind Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-05-31

Hi Razvan,

Thanks to that little bit of bash script, I was able to capture the following

http://www.digipigeon.com/debug_gdb2.gz

I have had a quick look at it and it appears that this is catching the original issue that I was getting and it triggered in relation to the RTP Proxy Engagement.

Similar issues happened, 100% utilisation, etc.

Please take a look.

Kind Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-06-03

Hi Razvan,

Has this been able to capture the bit that you require? or is there any other commands that I can run for you?

Kind Regards

Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Razvan Crainea - 2013-06-05

Hi, Jonathan!

Unfortunately the problem you have here is not simple at all. Basically the idea is that most of your SIP workers are stuck waiting for responses from the rtpproxy servers that are down. And they also acquire the transaction lock. Therefore any reply within that transaction can't be processed (because the previous process was not finished), and the process busy-waits (that's why you see the spikes).

I saw that you are using 'rtpproxy_autobridge' feature of the RTPProxy module. Is this mandatory to your platform? If not, can you try to remove it for a while?

Also, you said that when you experience this, the proxy does not recover at all, right? Here's a script that also adds the PID of the process to the output:

for PID in $(pidof opensips); do echo $PID gdb $OPENSIPS $PID -batch --eval-command="bt full" done

I'd like you to run it two times, with a few seconds between them (around 10 seconds). What I want to find out is if OpenSIPS is really stuck (and where) or it is still able to process traffic.

Also, run the output of the following command would be useful:

opensipsctl ps

Finally, can you tell me how many rtpproxy servers your using and how many of them are down?

Best regards,
Răzvan

Last edit: Razvan Crainea 2013-06-05
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-06-27

Hi Razvan,

The problem has just recurred for me, however there is sensitive information in these dumps, can you please provide me a direct address that I can send this to you on?

Kind Regards Jonathan

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Igor Potjevlesch - 2013-08-28

Hi all,

I think that I experience the same issue.
Have you identified the problem during your privates discussions?

Kind Regards,

Igor

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Digipigeon - 2013-08-28

Hello Igor,

Unfortunately this problem still persists, it is due to some very outdated code inside the RTP module. I have gone through it and lots of gdb traces. It seems to be a collection of alot of things inside not working correctly.

Locking waiting for replies.

Multiple retries of previously disabled RTP servers.

I have pretty much given up on this been fixed, so I have made some script modifications (which basically remove some features) from the RTP Proxy module, it dosnt do all the fancy retries now. I have also deployed 7 RTP servers (more than we really need), but its increased my stability now. However I can still cause this crash by increasing the load on single servers.

http://www.kamailio.org/wiki/devel/rtpproxy-ng - I am hoping that this will get ported over to opensips, then we will have something ALOT better to use.

Best of luck.

Kind Regards Jonathan
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Offiline RTP Proxy causes system freeze

Group

Searches

Help

#646 Offiline RTP Proxy causes system freeze

Discussion