#353 segfault in f_malloc.c, chains from dialog module

1.6.x
closed-out-of-date
modules (454)
7
2015-04-06
2011-02-15
No

We've had several crashes now, R7081 (this is pretty much 1.6.3 pulled the day of release with a couple of bug fix patches in textops and loadbalancer applied):

It seems to be fairly consistently happening around free_dlg_dlg in dlg_hash.c which is triggering it -- we have several cores showing this behavior.

What are the next steps for debugging this? This is on a production machine and we haven't been able to reproduce it elsewhere, so we are somewhat sensitive to load issues while trying to troubleshoot this.

Thanks.

Program terminated with signal 11, Segmentation fault.
#0 0x00000000004a2bcf in fm_insert_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:155
155 (*f)->prev = &(frag->u.nxt_free);
(gdb) bt full
#0 0x00000000004a2bcf in fm_insert_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:155
f = 0x2b875eba4070
hash = 3
#1 fm_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:460
f = 0x2b875f41be40
n = <value optimized out>
__FUNCTION__ = "fm_free"
#2 0x00002b875e14dcde in free_dlg_dlg () at dlg_hash.c:168
i = 0
#3 destroy_dlg_table () at dlg_hash.c:234
dlg = 0x2b875f319a80
i = 5
#4 0x00002b875e13a651 in mod_destroy () at dialog.c:742
No locals.
#5 0x0000000000476ff4 in destroy_modules () at sr_module.c:370
t = 0x785328
foo = 0x785258
#6 0x00000000004252e0 in cleanup (show_status=1) at main.c:336
No locals.
#7 0x00000000004261fb in handle_sigs () at main.c:533
chld = 0
chld_status = 139
i = <value optimized out>
do_exit = 1
__FUNCTION__ = "handle_sigs"
#8 0x000000000042a5a9 in main_loop (argc=3, argv=0x4) at main.c:913
i = 4
pid = <value optimized out>
si = 0x0
startup_done = 0x0
chd_rank = 16
__FUNCTION__ = "main_loop"
#9 main (argc=3, argv=0x4) at main.c:1388
cfg_log_stderr = <value optimized out>
cfg_stream = 0x831e010
c = <value optimized out>
r = <value optimized out>
tmp = 0x4fc447 "H\215\005\262\333#"
tmp_len = <value optimized out>
port = <value optimized out>
proto = <value optimized out>
ret = <value optimized out>
seed = 586369539
rfd = <value optimized out>
__FUNCTION__ = "main"
(gdb)

(gdb) frame 0
#0 0x00000000004a2bcf in fm_insert_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:155
155 (*f)->prev = &(frag->u.nxt_free);
(gdb) list
150
151 /*insert it here*/
152 frag->prev = f;
153 frag->u.nxt_free=*f;
154 if( *f )
155 (*f)->prev = &(frag->u.nxt_free);
156
157 *f=frag;
158 qm->free_hash[hash].no++;
159 free_plus(qm , frag->size);
(gdb) info locals
f = 0x2b875eba4070
hash = 3
(gdb) print *f
$1 = (struct fm_frag *) 0x2b865f235290
(gdb) print *(*f)
Cannot access memory at address 0x2b865f235290
(gdb) frame 1
#1 fm_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:460
460 fm_insert_free(qm, f);
(gdb) print f
$2 = (struct fm_frag *) 0x2b875f41be40
(gdb) list
455 goto join;
456 }
457
458 no_join:
459
460 fm_insert_free(qm, f);
461 }
462
463
464 #ifdef DBG_F_MALLOC
(gdb)

And a second crash here a few hours after restarting from above:

Core was generated by `/usr/local/opensips/sbin/opensips -u opensips'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004a2bcf in fm_insert_free (qm=0x2aca30c14000, p=<value optimized out>) at mem/f_malloc.c:155
155 (*f)->prev = &(frag->u.nxt_free);
(gdb) bt full
#0 0x00000000004a2bcf in fm_insert_free (qm=0x2aca30c14000, p=<value optimized out>) at mem/f_malloc.c:155
f = 0x2aca30c14070
hash = 3
#1 fm_free (qm=0x2aca30c14000, p=<value optimized out>) at mem/f_malloc.c:460
f = 0x2aca315c1718
n = <value optimized out>
__FUNCTION__ = "fm_free"
#2 0x00002aca301bdcde in free_dlg_dlg () at dlg_hash.c:168
i = 0
#3 destroy_dlg_table () at dlg_hash.c:234
dlg = 0x2aca31692c70
i = 0
#4 0x00002aca301aa651 in mod_destroy () at dialog.c:742
No locals.
#5 0x0000000000476ff4 in destroy_modules () at sr_module.c:370
t = 0x785328
foo = 0x785258
#6 0x00000000004252e0 in cleanup (show_status=1) at main.c:336
No locals.
#7 0x00000000004261fb in handle_sigs () at main.c:533
chld = 0
chld_status = 139
i = <value optimized out>
do_exit = 1
__FUNCTION__ = "handle_sigs"
#8 0x000000000042a5a9 in main_loop (argc=3, argv=0x4) at main.c:913
i = 4
pid = <value optimized out>
si = 0x0
startup_done = 0x0
chd_rank = 16
__FUNCTION__ = "main_loop"
#9 main (argc=3, argv=0x4) at main.c:1388
cfg_log_stderr = <value optimized out>
cfg_stream = 0x14cfb010
c = <value optimized out>
r = <value optimized out>
tmp = 0x4fc447 "H\215\005\262\333#"

Discussion

  • Bogdan-Andrei Iancu

    Hi Robert,

    The traces you posted show a crash in the shutdown procedure - can you check in the opensips logs to see if the shutdown was a normal one or a result of another crash? I try to exclude the possibility of a double coredump (one of the opensips procs crashes a generates a cores -> entire opensips starts to shutdown a crashes again -> second corefile, overwritting the first one).

    Regards,
    Bogdan

     
  • Bogdan-Andrei Iancu

    • labels: 1134769 --> modules
    • priority: 5 --> 7
    • assigned_to: nobody --> bogdan_iancu
     
  • Robert Smith

    Robert Smith - 2011-02-16

    The shutdown was a sig11 which caused the attendant process to exit on sig15. I'll try to look at the root cause, but I originally thought this was off of the sip receiver process.

     
  • Bogdan-Andrei Iancu

    Hi Robert,

    to get both core file, you can just set the /proc/sys/kernel/core_pattern to some name which will be unique per process.
    Or just set /proc/sys/kernel/core_uses_pid to '1', so you'll get a core.PID file for every process that crashed.

    And post the backtrace of the first corefile.

    Regards,
    Bogdan

     
  • Robert Smith

    Robert Smith - 2011-02-18

    Hi bogdan -- new crash file is attached, uac_crash.corefile.txt.

    Thanks,

     
  • Bogdan-Andrei Iancu

    Hi Robert,

    Thanks for the info. It seams to be a generic memory corruption which is a a bit impossible to trace only from core file.

    The best way to trace this is by enabling and running with the memory debugger:

    1. edit Makefile.defs file and remove from the DEFS string the F_MALLOC define (delete "-DF_MALLOC" line) and add the DBG_QM_MALLOC define (insert a "-DDBG_QM_MALLOC" line).

    2. recompile everything

    3. set memlog=6 and memdump=1 in your configuration script (to get the memory dump).

    Start opensips and if memory corruption is detected, opensips will stop generating core file and a memory dump.

    Regards,
    Bogdan

     
  • Robert Smith

    Robert Smith - 2011-03-03

    Putting memory support in, and only memdumping (level 2 as log level is currently 3 and this is a production machine) is causing a sigabrt to occur. The pertinent info:

    /usr/local/opensips/sbin/opensips[27869]: CRITICAL:core:qm_debug_frag: qm_*: prev. fragm. tail overwritten(c0c0c0c0, abcdefe5)[0x2b71c0960470:0x2b71c09604a0]!

    <aidanna> seems like the fragment is here: 2011-03-03T15:01:29.496041+00:00 /usr/local/opensips/sbin/opensips[27846]: 17533. N address=0x2b71c09604a0 frag=0x2b71c0960470 size=40 used=1
    <aidanna> 2011-03-03T15:01:29.496049+00:00 /usr/local/opensips/sbin/opensips[27846]: alloc'd from proxy.c: hostent_shm_cpy(83)

     
  • Robert Smith

    Robert Smith - 2011-03-03

    I have the full contents of memdump=2 in a logfile (it's quite large) that I can send privately if it's needed. I don't know if this is providing a lot of detail about the crash though, it's pretty ambiguously to the memory manager.

     
  • Robert Smith

    Robert Smith - 2011-03-04

    (gdb) bt
    #0 fm_remove_free (qm=0x2ade7185d000, p=0x2ade71fd2450, size=<value optimized out>) at mem/f_malloc.c:175
    #1 fm_realloc (qm=0x2ade7185d000, p=0x2ade71fd2450, size=<value optimized out>) at mem/f_malloc.c:530
    #2 0x00002ade70dee4b4 in shm_realloc (dlg=0x2ade722260c8, tag=0x7fffbd4d4240, rr=0x7fffbd4d40f0, contact=0x7fffbd4d4100, cseq=0x7fffbd4d4110, sock=0x784710) at ../../mem/shm_mem.h:212
    #3 dlg_add_leg_info (dlg=0x2ade722260c8, tag=0x7fffbd4d4240, rr=0x7fffbd4d40f0, contact=0x7fffbd4d4100, cseq=0x7fffbd4d4110, sock=0x784710) at dlg_hash.c:309
    #4 0x00002ade70de472c in init_leg_info (dlg=0x2ade722260c8, msg=0x7a3968, t=<value optimized out>, tag=0x7fffbd4d4240) at dlg_handlers.c:242
    #5 0x00002ade70de9a6b in push_reply_in_dialog (t=0x2ade7205c1d8, type=<value optimized out>, param=<value optimized out>) at dlg_handlers.c:297
    #6 dlg_onreply (t=0x2ade7205c1d8, type=<value optimized out>, param=<value optimized out>) at dlg_handlers.c:342
    #7 0x00002ade6f4ed8db in run_trans_callbacks (type=16, trans=0x2ade7205c1d8, req=0x2ade71ed8ac0, rpl=0x7a3968, code=<value optimized out>) at t_hooks.c:208
    #8 0x00002ade6f508107 in relay_reply (t=0x2ade7205c1d8, p_msg=0x7a3968, branch=<value optimized out>, msg_status=183, cancel_bitmap=0x7fffbd4d44e8) at t_reply.c:1147
    #9 0x00002ade6f508990 in reply_received (p_msg=0x7a3968) at t_reply.c:1494
    #10 0x0000000000421e98 in forward_reply (msg=0x7a3968) at forward.c:561
    #11 0x00000000004570c2 in receive_msg (
    buf=0x758f60 "XXXXXXXXXXX"..., len=1208, rcv_info=0x7fffbd4d45d0) at receive.c:200
    #12 0x000000000049be14 in udp_rcv_loop () at udp_server.c:492
    #13 0x000000000042a65d in main_loop (argc=3, argv=<value optimized out>) at main.c:818
    #14 main (argc=3, argv=<value optimized out>) at main.c:1388

    In this circumstance, it looks like in dlg_add_leg_info, the callID in dialog is getting overwritten:

    print dlg->callid.s
    $10 = 0x2ade72226178 "d71c112bbc80838a0410c89217c4968d@10.2.1.68sip:6785008449@10.2.1.68sip:19186128830@my.sipprovider.comgent: OpenSIPS SIP (1.6.3-notls (x86_64/linux))\r\n\r\n\r\n\r\nngP\002"

     
  • Bogdan-Andrei Iancu

    Hi Robert,

    Do you still experience the crash with the latest opensips version ?

    Regards,
    Bogdan

     
  • Bogdan-Andrei Iancu

    • status: open --> closed-out-of-date
     

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks