#559 crashes in the dialog module with opensips 1.8.1-tls

closed-fixed
None
5
2012-10-29
2012-09-21
Dragos Oancea
No

Hi ,

The setup where I am experiencing these crashes has the following particularities:

1. TLS

My TLS config looks like this:

listen = tls:212.162.x.x:5061
listen = tcp:172.20.x.x:5060

# tls params
disable_tls = no
tls_verify_server = 1
tls_verify_client = 0
tls_require_client_certificate = 0
tls_method = TLSv1
tls_certificate = "/etc/pki/CA/certs/mydomain.crt"
tls_private_key = "/etc/pki/CA/private/mydomain.key"
tls_ca_list = "/etc/pki/CA/certs/ca.crt"

# tcp parameters
tcp_connection_lifetime=1860
open_files_limit=8192
tcp_max_connections=4096

2. SIP ping for NAT bypassing
I create the dialog with
create_dialog("BPp") , sending OPTIONS to both sides .

I also call fix_nated_contact() on requests and replies.

3. I have calls where a protocol change is necessary , like UDP->TLS , TLS->UDP , TLS->TCP .

I use:
modparam("rr", "enable_double_rr", 1)

4. I use the same database for multiple SIP proxies.

Unfortunally opensips was not compiled with debugging symbols.

Here 1st crash ( I saw this one ALOT of times) :

Core was generated by `/usr/sbin/opensips -P /var/run/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
Missing separate debuginfos, use: debuginfo-install opensips-1.8.1-3.el6.x86_64
(gdb) bt full
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#1 0x000000000047d4ca in start_timer_processes ()
No symbol table info available.
#2 0x000000000042a814 in main ()
No symbol table info available.

Core was generated by `/usr/sbin/opensips -P /var/run/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
Missing separate debuginfos, use: debuginfo-install opensips-1.8.1-3.el6.x86_64
(gdb) i r
rax 0x7f15e430a388 139731999433608
rbx 0x1 1
rcx 0x7f15e43020a0 139731999400096
rdx 0x400 1024
rsi 0x1 1
rdi 0x200000009 8589934601
rbp 0x7f15e4792570 0x7f15e4792570
rsp 0x7fff372daaa0 0x7fff372daaa0
r8 0x7f15e4792558 139732004185432
r9 0x6 6
r10 0x4000 16384
r11 0x0 0
r12 0x7f15e47926e8 139732004185832
r13 0x200000009 8589934601
r14 0x769660 7771744
r15 0x734cc8 7556296
rip 0x7f160948c7e8 0x7f160948c7e8 <dlg_ping_routine+280>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) bt
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
#1 0x000000000047d4ca in start_timer_processes ()
#2 0x000000000042a814 in main ()
(gdb) bt full
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#1 0x000000000047d4ca in start_timer_processes ()
No symbol table info available.
#2 0x000000000042a814 in main ()
No symbol table info available.
(gdb)

this crash happends from time to time , unrelated to the load apparently.

machine configuration:

Centos 6.3
Linux somedomain.com 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64 x86_64 x86_64 GNU/Linux

opensips -V
version: opensips 1.8.1-tls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: unknown
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 17:04:54 Sep 8 2012 with gcc 4.4.6

openssl version:
openssl-1.0.0-20.el6_2.5.x86_64

======================

-------------------------------------------------------------------
-------------------------------------------------------------------

Here is the 2nd crash - it could be related to the first. (this is on another box, with a
slightly different config file (it listens to UDP as well)

I did not see this showing like below in gdb on opensips 1.7.2 (it never used to crash in unref_dlg , only in dlg_ping_routing())

Core was generated by `/usr/sbin/opensips -P /var/run/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fe93ab05491 in unref_dlg () from /usr/lib64/opensips/modules/dialog.so
Missing separate debuginfos, use: debuginfo-install opensips-1.8.1-3.el6.x86_64
(gdb) bt full
#0 0x00007fe93ab05491 in unref_dlg () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#1 0x00007fe93ab0e7f5 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#2 0x000000000047d4ca in start_timer_processes ()
No symbol table info available.
#3 0x000000000042a814 in main ()
No symbol table info available.

machine configuration (almost the same config as 2st crash):

Centos 6.3
Linux somedomain.com 2.6.32-279.2.1.el6.x86_64 #1 SMP Fri Jul 20 01:55:29 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

opensips -V
version: opensips 1.8.1-tls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: unknown
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 17:04:54 Sep 8 2012 with gcc 4.4.6

opessl version:
openssl-1.0.0-20.el6_2.5.x86_64

================

syslog of the crashed process:
http://pastebin.com/F3hu1WY5

please notice something that looks like a memory leak in opensips:

CRITICAL:dialog:unref_dlg: bogus ref -1788757169 with cnt 1 for dlg 0x7fe915eccd30 [0:0] with clid '' and tags 'EDH-RSA-DES-CBC3-SHA' ''

tags "EDH-RSA-DES-CBC3-SHA" ? these are ssl ciphers. how could they be related to the dialog module ?

One particular thing related to the second bug is that just before it crashed, opensips was showing alot of messages like this,
coming everytime from the same remote unknown client , but with different source port:

ERROR:core:tls_accept: New TLS connection from 82.113.x.x:43574 failed to accept: rejected by client
ERROR:core:tls_accept: New TLS connection from 82.113.x.x:61586 failed to accept: rejected by client

I have core files, let me know if I can upload them somewhere.

Cheers,
Dragos

Discussion

  • Dragos Oancea
    Dragos Oancea
    2012-09-27

    Hi

    It happened again.

    Program terminated with signal 11, Segmentation fault.
    #0 0x00007fe93ab05491 in unlink_unsafe_dlg (dlg=0x7fe915eccd30, cnt=1) at dlg_hash.c:706
    706 dlg->next->prev = dlg->prev;

    (gdb) bt full
    #0 0x00007fe93ab05491 in unlink_unsafe_dlg (dlg=0x7fe915eccd30, cnt=1) at dlg_h
    ash.c:706
    No locals.
    #1 unref_dlg (dlg=0x7fe915eccd30, cnt=1) at dlg_hash.c:740
    d_entry = 0x7fe915c3f990
    __FUNCTION__ = "unref_dlg"
    #2 0x00007fe93ab0e7f5 in dlg_ping_routine (ticks=<value optimized out>, attr=<v
    alue optimized out>) at dlg_timer.c:529
    expired = <value optimized out>
    it = 0x0
    curr = 0x0
    dlg = 0x7fe915eccd30
    __FUNCTION__ = "dlg_ping_routine"
    #3 0x000000000047d4ca in timer_ticker () at timer.c:355
    t = 0x7fe93c19c810
    #4 run_timer_process () at timer.c:425
    multiple = 10
    cnt = <value optimized out>
    tv = {tv_sec = 0, tv_usec = 0}
    #5 start_timer_processes () at timer.c:522
    tpl = 0x7fe93c20aa60
    pid = <value optimized out>
    __FUNCTION__ = "start_timer_processes"
    #6 0x000000000042a814 in main_loop (argc=<value optimized out>, argv=<value optimized out>) at main.c:933
    i = <value optimized out>
    pid = <value optimized out>
    si = 0x0
    startup_done = 0x0
    chd_rank = 0
    rc = <value optimized out>
    load_p = 0x0
    #7 main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1520
    cfg_log_stderr = <value optimized out>
    cfg_stream = <value optimized out>
    c = <value optimized out>
    r = <value optimized out>
    tmp = 0x1 <Address 0x1 out of bounds>
    tmp_len = <value optimized out>
    port = <value optimized out>
    proto = <value optimized out>
    options = 0x4eb5a8 "f:cCm:M:b:l:n:N:rRvdDETSVhw:t:u:g:P:G:W:o:"
    ret = -1
    seed = 1183817865
    rfd = <value optimized out>
    __FUNCTION__ = "main"

    This function is called when Opensips would try to end the dialog sending BYE to both sides, due to the fact it did not receive a reply to OPTIONS request that it's sending.

    These are the comments of a good samaritan on IRC (freenode / #opensips) , in regard to this bug :

    .......
    > how many process have the run_timer_process ?
    > Only one or multiple ?
    > the dialog may have been deleted after get_timeout_dlgs() which hold the lock for lock_get(ping_timer->lock);
    > OR
    > /* no longer reffed in list */
    > unref_dlg(dlg,1);
    .......
    > how many process have the run_timer_process ?
    > Only one or multiple ?
    > the dialog may have been deleted after get_timeout_dlgs() which hold the lock for lock_get(ping_timer->lock);
    > OR
    > /* no longer reffed in list */
    > unref_dlg(dlg,1);
    > /* dlg is still reffed in TM callback. deletion from memory
    > * will happen only on 408 timeout */
    > so that if you ping a leg
    > and there is no answer
    > and while you called get_timeout_dlgs() you received a 408
    > the dialog may be destroyed

    > if it happens randomly when a pinged leg doesn't answer then it might be related to a race condition with the 408 (or any other path that destroy the dialog)
    > (the dialog extracted from the ping list)

    ..........

    PS: Thank you for the comments, these might really help!

    Cheers,
    Dragos

     
  • Hello,

    Can you apply the attached patch, and see if you still experience these issues ?

    Thanks and Regards,
    Vlad

     
  •  
    Attachments
  • Hello,

    Also, for the TLS memory corruption, can you please try the patch for this trunk revision : http://opensips.svn.sourceforge.net/viewvc/opensips?view=revision&revision=9268

    It added support for proper locking within the openssl library. It's not yet back-ported to the 1.8 branch, as we felt it needed a little bit more testing.

    Regards,
    Vlad

     
  • Dragos Oancea
    Dragos Oancea
    2012-10-01

    Hi

    Thank you for the feedback.
    I applied the patch three days ago. Looks good so far.

    Cheers,
    Dragos

     
  • Hello,

    Ok,cool. Is this with both patches applied ( dlg patch + TLS patch ) ?
    Please keep us up to date if things remain stable in time.

    Regards,
    Vlad

     
  • Dragos Oancea
    Dragos Oancea
    2012-10-02

    Hi

    Yes, both patches were applied at the same time.
    The only thing changed is the ping_timeout option which was 20 seconds, now is 40 seconds.
    But I am not sure this is relevant.

    It is looking good so far. I had no crash but I also had to restart opensips yesterday because I made a script change to disable tcp keepalive which was draining out the battery of our mobile devices.

    I also noticed that tcp keepalive is enabled by default in opensips 1.8.1 , the tcp socket is created with the SO_KEEPALIVE option regardless of the state of the tcp_keepalive switch of the script.

    I will keep you updated.

    Cheers,
    Dragos Oancea

     
  • Dragos Oancea
    Dragos Oancea
    2012-10-08

    Hi

    Still running without a crash.
    I've put the patched opensips with the same routing script on a second box (production).

    Cheers,
    Dragos

     
  • Dragos Oancea
    Dragos Oancea
    2012-10-11

    Hi

    Still no crash.

    Dragos

     
    • assigned_to: nobody --> vladut-paiu
     
  • Hello,

    Ok, thanks for the help.

    Regards,
    Vlad

     
    • status: open --> closed-fixed