Menu

#541 Lockup when using dialog pings

1.8.x
closed-fixed
modules (454)
5
2012-08-08
2012-07-11
No

After enabling dialog pings on a fairly busy system facing the caller (via create_dialog("PB")) I have experienced a couple instances where opensips locks up and stops processing messages. During this time CPU usage also sky rockets.

At debug=3 nothing interesting seems to show up in the logs.

Restarting opensips leaves it operational for awhile, but can lockup again. Removing the "P" option from create_dialog stops the lockups all together.

ping_interval was set to 60 seconds, timeout_avp is used and set in the script.

Seems like there might be a deadlock/livelock issue with the dialog pings.

Opensips build information:
version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9084M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 14:07:51 Jun 13 2012 with gcc 4.4.6

Sorry if the details are light, but not much information was generated (logs stopped at the time of the lock as well), other than messages not being processed and high cpu usage.

Discussion

  • Vladut-Stefan Paiu

    • assigned_to: nobody --> vladut-paiu
    • status: open --> open-accepted
     
  • Vladut-Stefan Paiu

    Hello,

    When OpenSIPS get stuck again in 100% CPU, can you please get an OpenSIPS PID that's stuck and do
    gdb [path_to_opensips_binary] [pid]
    inside gdb do
    bt full
    and paste here the output so I can further debug this.

    Regards,
    Vlad

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-13

    bt full, missing symbols

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-13

    Probably not useful, got a bt from a process but unfortunately gdb couldn't load the debug symbols for opensips. Trying to figure why that is and will try to catch it again and provide a better backtrace.

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-18

    bt full symbols

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-18

    Full bt added from a runaway process.

    I updated opensips as well, new version information:

    version: opensips 1.8.0-notls (x86_64/linux)
    flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
    ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
    poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
    svnrevision: 2:9141M
    @(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
    main.c compiled on 13:06:46 Jul 13 2012 with gcc 4.4.6

     
  • Vladut-Stefan Paiu

    Hello Ryan,

    Could you please try the attached patch ? It should give out some more information about the dead-lock.
    The patch prints some info to the logging facility, so when OpenSIPS dies out, please send the log file.

    Regards,
    Vlad

     
  • Vladut-Stefan Paiu

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-27

    Deadlock with extra logs

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-27

    Hey Vlad,

    Been doing some stress testing in a small test lab to get this to happen and have had some good success.

    I attached the log with the applied patch that you requested.

    Since I can get this to happen in a lab now, I should be able to turn on some more logging if you that would help.

    Regards,

    Ryan

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-30

    BT of the process that fetched the lock in get_timeout_dlgs

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-30

    Attached a full backtrace for the process doing get_timeout_dlgs() which seems to be where the lock is occurring. Deadlock seems occur when this function calls dlg_lock_dlg(current);

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-30

    Ok, I think I see why this is happening.

    get_timeout_dlgs() FIRST grabs the ping_timer lock and SECOND grabs the lock on the current dialog.

    When a timeout happens unref_dlg() FIRST grabs the lock on the current dialog. remove_ping_timer() is then called and grabs the ping_timer lock SECOND.

    Therefore it is possible for one processes to hold the ping_timer lock and be stuck trying to lock the current dialog, while another processes that has already locked the current dialog (for teardown) and is attempting to get the ping_timer lock.

    Hope this helps.

     
  • Vladut-Stefan Paiu

    Hello,

    Can you please update to the latest 1.8 branch ?
    I have just committed a fix for this. Let me know if the issue still happens.

    Regards,
    Vlad

     
  • Ryan Bullock

    Ryan Bullock - 2012-07-31

    Looks like that fixed it.

    Put opensips under heavy load in the lab and it ran clean for 6 hours. Normally I would get the deadlock almost immediately or after just a few minutes of running traffic.

    I will be able to test it in production next week as well, but this looks good.

    Thanks for all the help.

     
  • Bogdan-Andrei Iancu

    • status: open-accepted --> closed-fixed
     

Log in to post a comment.