After enabling dialog pings toward the caller (via create_dialog("PB")) on a fairly busy system, I have experienced a couple of instances where OpenSIPS locks up and stops processing messages. During this time, CPU usage also skyrockets.
At debug=3, nothing interesting shows up in the logs.
Restarting OpenSIPS leaves it operational for a while, but it can lock up again. Removing the "P" option from create_dialog() stops the lockups altogether.
ping_interval was set to 60 seconds; timeout_avp is used and set in the script.
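For reference, the relevant configuration looked roughly like this (a sketch with a hypothetical AVP name and timeout value, not the exact production script):

```
modparam("dialog", "ping_interval", 60)
modparam("dialog", "timeout_avp", "$avp(dlg_timeout)")

route {
    ...
    $avp(dlg_timeout) = 7200;   # dialog lifetime in seconds (illustrative)
    create_dialog("PB");        # "P" = SIP OPTIONS ping toward the caller
    ...
}
```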
Seems like there might be a deadlock/livelock issue with the dialog pings.
Opensips build information:
version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9084M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 14:07:51 Jun 13 2012 with gcc 4.4.6
Sorry if the details are light, but not much information was generated other than messages not being processed and high CPU usage (the logs stopped at the time of the lockup as well).
Hello,
When OpenSIPS gets stuck at 100% CPU again, can you please grab the PID of a stuck OpenSIPS process and run
gdb [path_to_opensips_binary] [pid]
Inside gdb, run
bt full
and paste the output here so I can debug this further.
Regards,
Vlad
bt full, missing symbols
Probably not useful: I got a bt from a process, but unfortunately gdb couldn't load the debug symbols for opensips. I'm trying to figure out why that is, and will try to catch it again and provide a better backtrace.
bt full symbols
Full bt added from a runaway process.
I updated opensips as well, new version information:
version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9141M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 13:06:46 Jul 13 2012 with gcc 4.4.6
Hello Ryan,
Could you please try the attached patch? It should give some more information about the deadlock.
The patch prints some info to the logging facility, so when OpenSIPS dies out, please send the log file.
Regards,
Vlad
Deadlock with extra logs
Hey Vlad,
I've been doing some stress testing in a small test lab to reproduce this, with good success.
I attached the log from the run with the patch you requested applied.
Since I can reproduce this in the lab now, I should be able to turn on more logging if you think that would help.
Regards,
Ryan
BT of the process that fetched the lock in get_timeout_dlgs
Attached a full backtrace of the process in get_timeout_dlgs(), which seems to be where the lockup occurs. The deadlock seems to occur when this function calls dlg_lock_dlg(current);
Ok, I think I see why this is happening.
get_timeout_dlgs() FIRST grabs the ping_timer lock and SECOND grabs the lock on the current dialog.
When a timeout happens, unref_dlg() FIRST grabs the lock on the current dialog; remove_ping_timer() is then called and grabs the ping_timer lock SECOND.
Therefore one process can hold the ping_timer lock while stuck waiting for the current dialog's lock, while another process that has already locked the current dialog (for teardown) is stuck waiting for the ping_timer lock.
Hope this helps.
Hello,
Can you please update to the latest 1.8 branch ?
I have just committed a fix for this. Let me know if the issue still happens.
Regards,
Vlad
Looks like that fixed it.
Put opensips under heavy load in the lab and it ran clean for 6 hours. Normally I would get the deadlock almost immediately or after just a few minutes of running traffic.
I will be able to test it in production next week as well, but this looks good.
Thanks for all the help.