After enabling dialog pings toward the caller (via create_dialog("PB")) on a fairly busy system, I have experienced a couple of instances where OpenSIPS locks up and stops processing messages. During this time, CPU usage also skyrockets.
At debug=3, nothing interesting shows up in the logs.
Restarting OpenSIPS leaves it operational for a while, but it can lock up again. Removing the "P" option from create_dialog() stops the lockups altogether.
ping_interval was set to 60 seconds; timeout_avp is used and set in the script.
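For reference, the relevant configuration looked roughly like this (a sketch with a hypothetical AVP name and timeout value, not the exact production script):

```
modparam("dialog", "ping_interval", 60)
modparam("dialog", "timeout_avp", "$avp(dlg_timeout)")

route {
    ...
    $avp(dlg_timeout) = 7200;   # dialog lifetime in seconds (illustrative)
    create_dialog("PB");        # "P" = SIP OPTIONS ping toward the caller
    ...
}
```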
Seems like there might be a deadlock/livelock issue with the dialog pings.
Opensips build information:
version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9084M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 14:07:51 Jun 13 2012 with gcc 4.4.6
Sorry if the details are light, but not much information was generated other than messages not being processed and high CPU usage (the logs stopped at the time of the lockup as well).
Hello,
When OpenSIPS gets stuck at 100% CPU again, can you please grab the PID of a stuck OpenSIPS process and run
gdb [path_to_opensips_binary] [pid]
Inside gdb, run
bt full
and paste the output here so I can debug this further.
Regards,
Vlad
bt full, missing symbols
Probably not useful: I got a bt from a process, but unfortunately gdb couldn't load the debug symbols for opensips. I'm trying to figure out why that is, and will try to catch it again and provide a better backtrace.
bt full symbols
Full bt added from a runaway process.
I updated opensips as well, new version information:
version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9141M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 13:06:46 Jul 13 2012 with gcc 4.4.6
Hello Ryan,
Could you please try the attached patch? It should give some more information about the deadlock.
The patch prints some info to the logging facility, so when OpenSIPS dies out, please send the log file.
Regards,
Vlad
Deadlock with extra logs
Hey Vlad,
I've been doing some stress testing in a small test lab to reproduce this, with good success.
I attached the log from the run with the patch you requested applied.
Since I can reproduce this in the lab now, I should be able to turn on more logging if you think that would help.
Regards,
Ryan
BT of the process that fetched the lock in get_timeout_dlgs
Attached a full backtrace of the process in get_timeout_dlgs(), which seems to be where the lockup occurs. The deadlock seems to occur when this function calls dlg_lock_dlg(current);
Ok, I think I see why this is happening.
get_timeout_dlgs() FIRST grabs the ping_timer lock and SECOND grabs the lock on the current dialog.
When a timeout happens, unref_dlg() FIRST grabs the lock on the current dialog; remove_ping_timer() is then called and grabs the ping_timer lock SECOND.
Therefore one process can hold the ping_timer lock while stuck waiting for the current dialog's lock, while another process that has already locked the current dialog (for teardown) is stuck waiting for the ping_timer lock.
Hope this helps.
Hello,
Can you please update to the latest 1.8 branch ?
I have just committed a fix for this. Let me know if the issue still happens.
Regards,
Vlad
Looks like that fixed it.
Put opensips under heavy load in the lab and it ran clean for 6 hours. Normally I would get the deadlock almost immediately or after just a few minutes of running traffic.
I will be able to test it in production next week as well, but this looks good.
Thanks for all the help.