#404 Segfault while freeing dialog

trunk
closed-fixed
modules (454)
5
2011-10-12
2011-08-11
No

Hi,

OpenSIPs crashes randomly while processing Bye reply, when destroying / freeing the associated dialog. So far, this has happened 4 times in last 2 days on one of my production servers. There are two core dumps generated upon each crash,

BT on first core dump gives following,

Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 -u root -g root'.
Program terminated with signal 11, Segmentation fault.
#0 0x97761500 in ?? ()
(gdb) bt
#0 0x97761500 in ?? ()
#1 0xb723abd3 in free_dlg_dlg (dlg=<value optimized out>) at dlg_hash.c:163
#2 0xb7243f5f in destroy_dlg (dlg=0x979d04a0, cnt=1) at dlg_hash.c:225
#3 unref_dlg (dlg=0x979d04a0, cnt=1) at dlg_hash.c:742
#4 0xb7247a42 in dual_bye_event (dlg=0x979d04a0, req=<value optimized out>, extra_unref=1) at dlg_req_within.c:274
#5 0xb72491c0 in bye_reply_cb (t=0x976af43c, type=256, ps=0xbfddce54) at dlg_req_within.c:294
#6 0xb728e657 in run_trans_callbacks (type=256, trans=0x976af43c, req=0x0, rpl=0x82d241c, code=200) at t_hooks.c:212
#7 0xb72a6d93 in local_reply (t=0x976af43c, p_msg=0x82d241c, branch=0, msg_status=200, cancel_bitmap=0xbfddcf88) at t_reply.c:1358
#8 0xb72a82c2 in reply_received (p_msg=0x82d241c) at t_reply.c:1503
#9 0x0806c913 in forward_reply (msg=0x82d241c) at forward.c:568
#10 0x080a4d32 in receive_msg (
buf=0x81d7480 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 77.66.2.136;branch=z9hG4bK2aa.63be8597.0;received=77.66.2.136\r\nFrom: <sip:+642102908458@77.66.2.136>;tag=as2bde889d\r\nTo: <sip:01100165022863605@77.66.2.136:5060;user=p"..., len=428, rcv_info=<value optimized out>) at receive.c:203
#11 0x080fbd5e in udp_rcv_loop () at udp_server.c:419
#12 0x08074d28 in main_loop (argc=9, argv=0xbfddd2f4) at main.c:879
#13 main (argc=9, argv=0xbfddd2f4) at main.c:1497

BT on second core dump gives following,

Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 -u root -g root'.
Program terminated with signal 6, Aborted.
#0 0xb78ab424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb78ab424 in __kernel_vsyscall ()
#1 0xb75c3751 in raise () from /lib/i686/cmov/libc.so.6
#2 0xb75c6b82 in abort () from /lib/i686/cmov/libc.so.6
#3 0x08070d64 in sig_alarm_abort (signo=14) at main.c:435
#4 <signal handler called>
#5 get_lock (ticks=0, param=0x0) at ../../evi/../fastlock.h:178
#6 dialog_update_db (ticks=0, param=0x0) at dlg_db_handler.c:972
#7 0xb72249e1 in mod_destroy () at dialog.c:755
#8 0x080ce108 in destroy_modules () at sr_module.c:371
#9 0x08070858 in cleanup (show_status=1) at main.c:344
#10 0x080714a9 in handle_sigs () at main.c:545
#11 0x08075245 in main_loop (argc=9, argv=0xbfddd2f4) at main.c:990
#12 main (argc=9, argv=0xbfddd2f4) at main.c:1497

The most disturbing part of these crashes for me is that it causes all running calls to hang up. Previously in case of OpenSIPs crash running calls use to remain unaffected.

I am using SVN trunk revision 8215, and have create_dialog("BPp") on very first INVITE in my dial plan.

Feel free to ask for more info as needed.

Thank you.

Discussion

  • Muhammad Shahzad

    Another crash, this time i was watching opensips logs and got this additional info,

    Aug 11 23:46:05 nghw-2674 osip-service[1803]: CRITICAL:core:qm_free: freeing already freed pointer, first free: dlg_hash.c: free_dlg_dlg(171) - aborting
    Aug 11 23:46:21 nghw-2674 osip-service[1838]: CRITICAL:core:receive_fd: EOF on 37
    Aug 11 23:46:21 nghw-2674 osip-service[1779]: NOTICE:presence:destroy: destroy module ...

     
  • Vladut-Stefan Paiu

    Hello Muhammad,

    Is it possible that you also attach an OpenSIPS log in full debug for a call where this crash happens ? I have an idea about the crash, but I would need a log in debug=4 to be sure.

    Thanks and Regards,
    Vlad

     
  • Muhammad Shahzad

    I am afraid i won't be able to do that since crashes are random and the server where its happening have a lot of production traffic (1500+ sip registrations, 300+ calls etc.) on it.

    I do have the core dump though, but their sizes are around 500+MB.

     
  • Muhammad Shahzad

    i updated the trunk to revision 8242, and now it crashes due to some memory leak, may be its a different bug. Anyway, here is BT.

    Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 -u root -g root'.
    Program terminated with signal 6, Aborted.
    #0 0xb7706424 in __kernel_vsyscall ()
    (gdb) bt
    #0 0xb7706424 in __kernel_vsyscall ()
    #1 0xb741e751 in raise () from /lib/i686/cmov/libc.so.6
    #2 0xb7421b82 in abort () from /lib/i686/cmov/libc.so.6
    #3 0x0810344e in qm_insert_free (qm=0xb70b0337, p=0x970dac6c, file=0xb70b0337 "L:dialog:%s: extract_ftc_hdrs ok but no from extracted : [%.*s]\n",
    func=0xb70b0dbb "g:%s: no more shm\n", line=171) at mem/q_malloc.c:169
    #4 qm_free (qm=0xb70b0337, p=0x970dac6c, file=0xb70b0337 "L:dialog:%s: extract_ftc_hdrs ok but no from extracted : [%.*s]\n", func=0xb70b0dbb "g:%s: no more shm\n",
    line=171) at mem/q_malloc.c:491
    #5 0xb7095cf6 in dlg_onreply (dialog=0x800) at dlg_handlers.c:477
    #6 unreference_dialog_create (dialog=0x800) at dlg_handlers.c:776
    #7 0xb709ef5f in unref_dlg (dlg=0x970da634, cnt=1) at dlg_hash.c:739
    #8 0xb70a2a42 in new_dlg_profile (profiles=0x1 <Address 0x1 out of bounds>, has_value=3070347656) at dlg_profile.c:278
    #9 add_profile_definitions (profiles=0x1 <Address 0x1 out of bounds>, has_value=3070347656) at dlg_profile.c:168
    #10 0xb70a41c0 in send_leg_msg (dlg=0x979541a8, method=0x100, src_leg=-1078038796, dst_leg=0, hdrs=0x97369734, body=0x979ba664, func=0x81a9e68 <debug>,
    param=0x972ac048, release=0) at dlg_req_within.c:494
    #11 0xb70e9657 in run_trans_callbacks (type=256, trans=0x979541a8, req=0x0, rpl=0xffffffff, code=408) at t_hooks.c:212
    #12 0xb7101d93 in local_reply (t=0x979541a8, p_msg=0xffffffff, branch=0, msg_status=408, cancel_bitmap=0xbfbe702c) at t_reply.c:1358
    #13 0xb70ec0a3 in fake_reply (ticks=70, attr=0x0) at timer.c:253
    #14 final_response_handler (ticks=70, attr=0x0) at timer.c:364
    #15 timer_routine (ticks=70, attr=0x0) at timer.c:1005
    #16 0x080e42fa in utimer_ticker () at timer.c:369
    #17 run_timer_process () at timer.c:407
    #18 start_timer_processes () at timer.c:522
    #19 0x080751bd in main_loop (argc=9, argv=0xbfbe71e4) at main.c:936
    #20 main (argc=9, argv=0xbfbe71e4) at main.c:1497

    It has crashes 3 times in last six hours, so i am think of reverting it to last stable revision 8201. :(

     
  • Vladut-Stefan Paiu

    Hello again,

    This second backtrace seems somehow corrupt.

    You are saying the 8201 is ok, no crashes as the reported ones ? While 8215 shows these problems ?

     
  • Muhammad Shahzad

    well, as far as i remember 8201 never crashed on its own, however, i did observed crash a couple of time while trying to stop the service (for maintenance reasons), but ignored them. The back trace of those crashes showed mysql update failed while sync dialog run time info to db. I didn't saved those core dumps to can't give much details.

    Anyways, my production server is now on 8201 and let see if it crashes, i will send you its BT.

     
  • Muhammad Shahzad

    It crashed again. This time with revision 8201. BT indicates same problem. Here is BT full.

    #0 0x97879f00 in ?? ()
    No symbol table info available.
    #1 0xb71ec703 in free_dlg_dlg (dlg=<value optimized out>) at dlg_hash.c:163
    i = 0
    __FUNCTION__ = "free_dlg_dlg"
    #2 0xb71f5a7f in destroy_dlg (dlg=0x9784ae04, cnt=1) at dlg_hash.c:225
    ret = <value optimized out>
    #3 unref_dlg (dlg=0x9784ae04, cnt=1) at dlg_hash.c:742
    d_entry = 0x970a8c50
    __FUNCTION__ = "unref_dlg"
    #4 0xb71f9562 in dual_bye_event (dlg=0x9784ae04, req=<value optimized out>, extra_unref=1) at dlg_req_within.c:274
    old_state = 5
    new_state = 5
    unref = 1
    ret = <value optimized out>
    __FUNCTION__ = "dual_bye_event"
    #5 0xb71face0 in bye_reply_cb (t=0x97760f68, type=256, ps=0xbfd4f9f4) at dlg_req_within.c:294
    __FUNCTION__ = "bye_reply_cb"
    #6 0xb7240657 in run_trans_callbacks (type=256, trans=0x97760f68, req=0x0, rpl=0x82d23b4, code=200) at t_hooks.c:212
    params = {req = 0x0, rpl = 0x82d23b4, code = 200, param = 0x97451d2c, extra1 = 0x0, extra2 = 0x0}
    cbp = 0x97451d20
    backup = 0x81e7488
    trans_backup = 0x97760f68
    __FUNCTION__ = "run_trans_callbacks"
    #7 0xb7258d93 in local_reply (t=0x97760f68, p_msg=0x82d23b4, branch=0, msg_status=200, cancel_bitmap=0xbfd4fb28) at t_reply.c:1358
    local_store = 0
    local_winner = 0
    reply_status = RPS_COMPLETED
    winning_msg = 0x974278c0
    winning_code = 200
    totag_retr = 0
    __FUNCTION__ = "local_reply"
    #8 0xb725a2c2 in reply_received (p_msg=0x82d23b4) at t_reply.c:1503
    last_uac_status = <value optimized out>
    branch = 0
    reply_status = <value optimized out>
    timer = <value optimized out>
    cancel_bitmap = 0
    t = 0x97760f68
    backup_list = 0x0
    has_reply_route = <value optimized out>
    __FUNCTION__ = "reply_received"
    ---Type <return> to continue, or q <return> to quit---
    #9 0x0806c913 in forward_reply (msg=0x82d23b4) at forward.c:568
    new_buf = <value optimized out>
    to = <value optimized out>
    new_len = <value optimized out>
    mod = 0x822ed98
    proto = 136148112
    id = <value optimized out>
    send_sock = 0x97879eff
    len = 137175988
    __FUNCTION__ = "forward_reply"
    #10 0x080a4d32 in receive_msg (
    buf=0x81d7480 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 77.66.2.136;branch=z9hG4bK6a98.fe2e7586.0;received=77.66.2.136\r\nFrom: <sip:+61430278886@77.66.2.136>;tag=as0939115b\r\nTo: <sip:011008801715609898@77.66.2.136:5060;user="..., len=429, rcv_info=<value optimized out>) at receive.c:203
    msg = 0x82d23b4
    start = {tv_sec = -1219149052, tv_usec = 137175912}
    __FUNCTION__ = "receive_msg"
    #11 0x080fbd5e in udp_rcv_loop () at udp_server.c:419
    len = 429
    buf = "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 77.66.2.136;branch=z9hG4bK6a98.fe2e7586.0;received=77.66.2.136\r\nFrom: <sip:+61430278886@77.66.2.136>;tag=as0939115b\r\nTo: <sip:011008801715609898@77.66.2.136:5060;user="...
    tmp = <value optimized out>
    from = 0x82d2368
    fromlen = 16
    ri = {src_ip = {af = 2, len = 4, u = {addrl = {2231517773, 3079082512, 3218406760, 1}, addr32 = {2231517773, 3079082512, 3218406760, 1}, addr16 = {16973,
    34050, 4624, 46983, 64872, 49108, 1, 0}, addr = "MB\002\205\020\022\207\267h\375Ô¿\001\000\000"}}, dst_ip = {af = 2, len = 4, u = {addrl = {
    2281849421, 0, 0, 0}, addr32 = {2281849421, 0, 0, 0}, addr16 = {16973, 34818, 0, 0, 0, 0, 0, 0}, addr = "MB\002\210", '\000' <repeats 11 times>}},
    src_port = 5060, dst_port = 5060, proto = 1, proto_reserved1 = 0, proto_reserved2 = 0, src_su = {s = {sa_family = 2,
    sa_data = "\023\304MB\002\205\000\000\000\000\000\000\000"}, sin = {sin_family = 2, sin_port = 50195, sin_addr = {s_addr = 2231517773},
    sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 50195, sin6_flowinfo = 2231517773, sin6_addr = {__in6_u = {
    __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}},
    bind_address = 0x822dae0}
    p = <value optimized out>
    __FUNCTION__ = "udp_rcv_loop"
    #12 0x08074d28 in main_loop (argc=9, argv=0xbfd4fe94) at main.c:879
    i = <value optimized out>
    pid = <value optimized out>
    si = <value optimized out>
    startup_done = 0x0
    chd_rank = 4
    #13 main (argc=9, argv=0xbfd4fe94) at main.c:1497
    cfg_log_stderr = 136501984
    cfg_stream = 0x3
    ---Type <return> to continue, or q <return> to quit---
    c = <value optimized out>
    r = <value optimized out>
    tmp = 0xbfd51970 ""
    tmp_len = <value optimized out>
    port = <value optimized out>
    proto = <value optimized out>
    ret = <value optimized out>
    seed = 734867912
    rfd = <value optimized out>
    __FUNCTION__ = "main"

     
  • Vladut-Stefan Paiu

    Hello,

    Could you please check the logs for critical messages saying "bogus ref" on the calls that you experience the crashes ?

     
  • Muhammad Shahzad

    couldn't found any thing matching "bogus ref" but do find a critical warning matching "bogus event", just before opensips crash.

    CRITICAL:dialog:log_next_state_dlg: bogus event 7 in state 2 for dlg 0x970b8874 [138:1657090481] with clid '584da4f976b14ce053d3c9345a09391e@77.66.2.136' and tags 'as6fd68670' 'SDopvfa99-9302873325129137220'
    CRITICAL:core:receive_fd: EOF on 37
    NOTICE:presence:destroy: destroy module ...
    CRITICAL:core:sig_alarm_abort: BUG - shutdown timeout triggered, dying...

     
  • Muhammad Shahzad

    I have removed all parameters from create_dialog function and system seems bit stable now. Though i am still watching for crash.

    However, this has created a new problem, that is, i am also using load_balancer module which is relaying calls to my media servers array. Now the problem is that whenever i execute lb_list command via any management interface i get high number of calls on each media server, while checking the media servers i get much less number of running calls. Which implies something is wrong with load_balancer module stats. I wonder what will happen if this wrong call count reaches to call capacity mentioned in load_balancer db table?

    Should i open a separate bug for this problem?

    Thank you.

     
  • Vladut-Stefan Paiu

    Removing the "PpB" parameters "solved" your problem because that's where the bug originated. OpenSIPS encounters a double free while processing the reply to a 200 Ok to a BYE generated internally by OpenSIPS. Which can be the case either when pinging is enabled and one endpoint does not respond in time, or when the overall dialog time limit is exceeded.

    I have failed to reproduce your scenario, so It would help a lot if you could either provide a full debug log, or if that isn't possible, maybe a full SIP trace for the call that makes OpenSIPS crash.

    And yes, the load_balancer issue seems like a different issue. Please open a separate bug report for this.

     
  • Muhammad Shahzad

    Since, my production machine is stable now so, i won't risk it to crash again. However, i do have a production clone, with exact same configuration as production. Let me put my team on it and see if they could reproduce it there while debug logs and sip traces are enabled.

    Thank you.

     
  • Bogdan-Andrei Iancu

    • assigned_to: nobody --> vladut-paiu
     
  • Vladut-Stefan Paiu

    Hello again,

    Any success in reproducing the scenario ?

    Also, another thing that you can do, both on your production machine and on your close, would be to enable memory debugging in your script. First of all, you would have to edit Makefile.defs, uncomment the line with
    -DDBG_QM_MALLOC \ and comment the line with
    -DF_MALLOC \

    and then recompile OpenSIPS.

    Last, you would have to edit your .cfg and add the following two lines :
    memlog=6
    memdump=1

    This will not do excessive logging of any kind, like turning up the general log level, but it would give us extra information about the memory in case of the crash.

    Thanks and Regards,
    Vlad

     
  • Muhammad Shahzad

    So far not able to reproduce it, will ask team to apply the changes you mentioned and try again.

    The production server on the other hand seems extremely stable now. :)

     
  • Vladut-Stefan Paiu

    Hello,

    We are trying to do a stable release of OpenSIPS 1.7 ( we're in Beta phase now ), so it would mean a lot to us if you could provide more debugging info ( log + trace ) of a scenario where this crash happens, so we can fix this issue, as this is a very critical bug.

    Thanks alot for your help.

    Regards,
    Vlad

     
  • Muhammad Shahzad

    OK then lets do this.

    This line is already enabled in Makefile.defs file from trunk of 8201.
    -DDBG_QM_MALLOC \

    This line is already disabled in Makefile.defs file from trunk of 8201.
    -DF_MALLOC \

    You may want to reverse it in stable release.

    In Makefile i additionally have following settings,
    #define PKG_MEM_POOL_SIZE 1024*1024*2*8 /*!< Used only if PKG_MALLOC is defined*/
    #define SHM_MEM_SIZE 32*8 /*!< Used if SH_MEM is defined*/

    In opensips.cfg, i have set,
    debug=3
    memlog=6
    memdump=1

    For sip traces i am using siptrace module with trace_dialog method immediately after create_dialog("BPp"), hope this serves the purpose.

    I am gona restart opensips with this setting at 0730hours GMT. Lets see when it crashes.

    Thank you.

     
  • Muhammad Shahzad

    its been over 56 hours and opensips still working fine without the crash, very strange..!

    Previously, with this same config, it was crashing almost every 6th hour. :-(

     
  • Vladut-Stefan Paiu

    Hello,

    Any more news about the crashes ?

    Regards,
    Vlad

     
  • Muhammad Shahzad

    Due to some other crash problems (already reported in this system), i decided to upgrade to opensips 1.7.0 stable release and its working perfect ever since. So i guess the whatever the problem was, its gone now, or at least not reproducing for 3-4 weeks.

    So, i suggest you close this bug now, i will open a new one or reopen this one again (if possible) if it happens again.

    Thank you.

     
  • Vladut-Stefan Paiu

    • status: open --> closed-fixed
     
  • Vladut-Stefan Paiu

    Hello,

    Ok, please return with a new bug report if this issue appears again.

    Regards,
    Vlad

     

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks