
uftp & segfaults

banuchka
2015-04-10
2015-04-30
  • banuchka

    banuchka - 2015-04-10

    Hi, I sometimes (often) have a problem with uftp. My setup is:

    one server for the transmit process, one server running the aggregation proxy, and a lot of clients; some of them go through the proxy and some talk to the server directly.

    The proxy process is run with:

    "/local/uftp/usr/sbin/uftpproxyd -r -L /local/logs/uftpproxy.log -x 2 -M 230.4.3.100 -N -20 -k key -t 2 -B 2097152"
    

    The server starts the file transfer process with:

    "/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6:0.01:90 -t 2 -B $((2*1024*1024)) -b 1200 -I eth0 -L #logfile# -S #statfile# -M 230.4.3.100 -P 230.5.5.x -j /local/uftp/proxy.txt -H #hosts# -D #dest# #source#"
    

    where #hosts# is the list of the clients' IP addresses.

    Here is a core dump under gdb:

    # gdb /local/uftp/bin/uftp /local/tmp/php-cores/core-uftp.23104
    GNU gdb (GDB) SUSE (7.5.1-0.7.29)
    Copyright (C) 2012 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-suse-linux".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /local/uftp/bin/uftp...done.
    [New LWP 23104]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6 0.01 90 -t 2 -B 2097152 -b'.
    Program terminated with signal 6, Aborted.
    #0  0x00007f058a756885 in raise () from /lib64/libc.so.6
    (gdb) bt
    #0  0x00007f058a756885 in raise () from /lib64/libc.so.6
    #1  0x00007f058a757e61 in abort () from /lib64/libc.so.6
    #2  0x00007f058a79787f in __libc_message () from /lib64/libc.so.6
    #3  0x00007f058a79d088 in malloc_printerr () from /lib64/libc.so.6
    #4  0x00007f058a7a20cc in free () from /lib64/libc.so.6
    #5  0x000000000040b275 in send_regconf (finfo=0x7fff822528f0, attempt=7, do_regconf=1) at server_announce.c:242
    #6  0x00000000004155b0 in announce_phase (finfo=0x7fff822528f0) at server_phase.c:286
    #7  0x00000000004140f5 in send_files () at server_send.c:521
    #8  0x000000000041ec76 in main (argc=32, argv=0x7fff82252a98) at server_main.c:42
    

    Dennis, maybe you know something about this and can comment. Thanks in advance.

     

    Last edit: Dennis Bush 2015-04-10
  • Dennis Bush

    Dennis Bush - 2015-04-10

    I ran through a few tests while using valgrind, however I didn't see any memory leaks or invalid reads/writes come up.

    Can you try running the server under valgrind and see if it reports any memory related errors? That should help narrow down the problem.

    Also, what version are you using, and did you make any changes at all to the code?

     
  • banuchka

    banuchka - 2015-04-10

    No, I haven't changed anything in the code.

    The version is 4.6.1, but the segfault happened on 4.6 too (it looks like the faults started after we began using uftpproxy).

    And one more thing: it doesn't fault under valgrind :)

     
  • Dennis Bush

    Dennis Bush - 2015-04-10

    Crashes aren't something that happens every time. It depends on exactly how the process's memory is laid out, and can change from one run to the next. But valgrind should at least mention something about whether some memory was stepped on that shouldn't have been or freed memory that was reused, and I would expect that to show up on every run. Whether it causes a segfault or not is where the randomness comes in.

    Give it another run, adding "--leak-check=full --show-reachable=yes" as parameters to valgrind, and post the results.

     
  • banuchka

    banuchka - 2015-04-16

    Hi, sorry for the slow response. Here is the output from Valgrind:

    ===
    ==32345== Parent PID: 32343
    ==32345==
    ==32345==
    ==32345== HEAP SUMMARY:
    ==32345== in use at exit: 120 bytes in 2 blocks
    ==32345== total heap usage: 8,284 allocs, 8,282 frees, 1,170,890 bytes allocated
    ==32345==
    ==32345== 32 bytes in 1 blocks are still reachable in loss record 1 of 2
    ==32345== at 0x4C281D8: calloc (vg_replace_malloc.c:618)
    ==32345== by 0x59E63AF: _dlerror_run (in /lib64/libdl-2.11.3.so)
    ==32345== by 0x59E5F00: dlopen@@GLIBC_2.2.5 (in /lib64/libdl-2.11.3.so)
    ==32345== by 0x51B176C: ??? (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x51B1B42: FIPS_mode_set (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x517F5F8: OPENSSL_init (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x51A23A8: EVP_add_cipher (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x51394F6: OpenSSL_add_all_ciphers (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x51394DC: OPENSSL_add_all_algorithms_noconf (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x408528: crypto_init (encrypt_openssl.c:76)
    ==32345== by 0x41D4EC: pre_initialize (server_init.c:168)
    ==32345== by 0x41EC5F: main (server_main.c:39)
    ==32345==
    ==32345== 88 bytes in 1 blocks are possibly lost in loss record 2 of 2
    ==32345== at 0x4C2A0F5: malloc (vg_replace_malloc.c:291)
    ==32345== by 0x400C644: _dl_map_object_deps (in /lib64/ld-2.11.3.so)
    ==32345== by 0x4012452: dl_open_worker (in /lib64/ld-2.11.3.so)
    ==32345== by 0x400DE75: _dl_catch_error (in /lib64/ld-2.11.3.so)
    ==32345== by 0x4011E2A: _dl_open (in /lib64/ld-2.11.3.so)
    ==32345== by 0x59E5F9A: dlopen_doit (in /lib64/libdl-2.11.3.so)
    ==32345== by 0x400DE75: _dl_catch_error (in /lib64/ld-2.11.3.so)
    ==32345== by 0x59E633B: _dlerror_run (in /lib64/libdl-2.11.3.so)
    ==32345== by 0x59E5F00: dlopen@@GLIBC_2.2.5 (in /lib64/libdl-2.11.3.so)
    ==32345== by 0x51B176C: ??? (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x51B1B42: FIPS_mode_set (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345== by 0x517F5F8: OPENSSL_init (in /usr/lib64/libcrypto.so.0.9.8)
    ==32345==
    ==32345== LEAK SUMMARY:
    ==32345== definitely lost: 0 bytes in 0 blocks
    ==32345== indirectly lost: 0 bytes in 0 blocks
    ==32345== possibly lost: 88 bytes in 1 blocks
    ==32345== still reachable: 32 bytes in 1 blocks
    ==32345== suppressed: 0 bytes in 0 blocks
    ==32345==
    ==32345== For counts of detected and suppressed errors, rerun with: -v
    ==32345== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
    ===

     
  • Dennis Bush

    Dennis Bush - 2015-04-16

    This trace is showing some unfreed memory within the OpenSSL library at the time the program exited. While not optimal, something like this wouldn't cause a crash.

    You may need to do more extensive testing before the error shows up again while being traced.

     
  • banuchka

    banuchka - 2015-04-17

    OK, thank you. Maybe I need to update OpenSSL and statically link it into uftp? What do you think? And is that possible?

     
  • Dennis Bush

    Dennis Bush - 2015-04-17

    I don't think that will help. The unfreed memory in OpenSSL probably isn't causing any issues.

    For now, continue to run normally. When it crashes again, post the full log of the server. There might be something in there that could point to where the problem is.

     
  • banuchka

    banuchka - 2015-04-22

    The latest core and crash log:

    # gdb /local/uftp/bin/uftp /local/tmp/php-cores/core-uftp.32271
    GNU gdb (GDB) SUSE (7.5.1-0.7.29)
    Copyright (C) 2012 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-suse-linux".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /local/uftp/bin/uftp...done.
    [New LWP 32271]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6 0.01 90 -t 2 -B 2097152 -b'.
    Program terminated with signal 6, Aborted.
    #0  0x00007f531c100885 in raise () from /lib64/libc.so.6
    (gdb) bt
    #0  0x00007f531c100885 in raise () from /lib64/libc.so.6
    #1  0x00007f531c101e61 in abort () from /lib64/libc.so.6
    #2  0x00007f531c14187f in __libc_message () from /lib64/libc.so.6
    #3  0x00007f531c147088 in malloc_printerr () from /lib64/libc.so.6
    #4  0x00007f531c14c0cc in free () from /lib64/libc.so.6
    #5  0x000000000040b275 in send_regconf (finfo=0x7fff0dc5aaa0, attempt=7, do_regconf=1) at server_announce.c:242
    #6  0x00000000004155b0 in announce_phase (finfo=0x7fff0dc5aaa0) at server_phase.c:286
    #7  0x00000000004140f5 in send_files () at server_send.c:521
    #8  0x000000000041ec76 in main (argc=32, argv=0x7fff0dc5ac48) at server_main.c:42
    
     

    Last edit: Dennis Bush 2015-04-22
  • Dennis Bush

    Dennis Bush - 2015-04-22

    Do you have the log from the server process? That is also needed for troubleshooting.

     
  • Dennis Bush

    Dennis Bush - 2015-04-27

    Thanks, that was a big help. I found the problem.

    When the server reads an ANNOUNCE from a proxy, it goes through the list of clients in the message and adds them to the list of clients for that proxy. However, it doesn't first check whether the client is already in the list. Each proxy only handles 1000 clients, and in your case over 350 clients pass through a proxy. After two retransmissions of a REGISTER, each client has been added three times (more than 1050 entries), so the server writes past the end of the client list for the proxy. The result is unpredictable behavior, which sometimes means a crash.
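
    To make the arithmetic concrete, here is a minimal C sketch of the failure mode described above. It is not uftp code: `PROXY_CLIENT_CAP`, `struct proxy_clients`, `add_without_dup_check` and `overflow_writes` are hypothetical names, and the capacity of 1000 and the client count of 350 are simply the figures mentioned in this thread. The writes are bounds-checked so the sketch itself stays well defined while it counts how many writes would have gone past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacity, taken from the "1000 clients" figure in the thread. */
#define PROXY_CLIENT_CAP 1000

/* Simplified stand-in for a proxy's client list (not the real uftp struct). */
struct proxy_clients {
    int clients[PROXY_CLIENT_CAP];
    size_t count;     /* slots actually written */
    size_t attempted; /* slots an unguarded append would have written */
};

/*
 * Record every client named in one proxy message without checking whether
 * the client is already listed. The real bug had no capacity check; this
 * sketch keeps one so that it only counts the out-of-range writes instead
 * of performing them.
 */
static void add_without_dup_check(struct proxy_clients *p, size_t nclients)
{
    assert(p != NULL);
    for (size_t i = 0; i < nclients; i++) {
        p->attempted++;
        if (p->count < PROXY_CLIENT_CAP) {
            p->clients[p->count++] = (int)i;
        }
    }
}

/* Number of writes that would have landed past the end of the array. */
static size_t overflow_writes(const struct proxy_clients *p)
{
    assert(p != NULL);
    return p->attempted > PROXY_CLIENT_CAP
               ? p->attempted - PROXY_CLIENT_CAP
               : 0;
}
```

    Calling `add_without_dup_check(&p, 350)` three times on a zero-initialized `p` (the original message plus two retransmissions) leaves `p.attempted` at 1050, so `overflow_writes(&p)` reports 50 writes beyond the 1000-entry array.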

    In server_announce.c, near the end of the add_proxy_dests function, you'll see the following lines:

            destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
            addcnt++;
    

    Change it to this:

            if (dupmsg) {
                destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
                addcnt++;
            }
    

    Give that a run and let me know if you see any other problems.

    I'll make a fix on my end for the next version. Instead of the above change, I'll actually just get rid of the client list for each proxy, since the server doesn't use that information anyway.

     
  • banuchka

    banuchka - 2015-04-28

    Hi,
    Thanks so much, Dennis. Here is the patch:

    =====
    --- server_announce.c 2014-12-31 04:34:02.000000000 +0000
    +++ server_announce.c 2015-04-28 12:38:25.862299068 +0000
    @@ -465,8 +465,10 @@
                 finfo->deststate[hostidx].conf_sent = 0;
                 log1(finfo->group_id, finfo->file_id,
                      " For client%s %s", dupmsg ? "+" : "", destlist[hostidx].name);
    -            destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
    -            addcnt++;
    +            if (dupmsg) {
    +                destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
    +                addcnt++;
    +            }
             }
             destlist[proxyidx].clientcnt += addcnt;
         }
    =====

    I updated the binary on the server and the proxy. Is that enough? I mean, I didn't update the client binaries.

     

    Last edit: banuchka 2015-04-28
  • banuchka

    banuchka - 2015-04-28

    A few test runs without a segfault, it's amazing :) I need to wait a few days under the working/prod load.

     
  • Dennis Bush

    Dennis Bush - 2015-04-28

    This change only affects the server. Clients and proxies are not affected.

     
  • banuchka

    banuchka - 2015-04-29

    Hi, there is a new segfault:

    gdb /local/uftp/bin/uftp /local/tmp/php-cores/core-uftp.14363
    GNU gdb (GDB) SUSE (7.5.1-0.7.29)
    Copyright (C) 2012 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-suse-linux".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /local/uftp/bin/uftp...done.
    [New LWP 14363]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6 0.01 90 -t 2 -B 2097152 -b'.
    Program terminated with signal 6, Aborted.
    #0  0x00007fa98047f885 in raise () from /lib64/libc.so.6
    (gdb) bt
    #0  0x00007fa98047f885 in raise () from /lib64/libc.so.6
    #1  0x00007fa980480e61 in abort () from /lib64/libc.so.6
    #2  0x00007fa9804c087f in __libc_message () from /lib64/libc.so.6
    #3  0x00007fa9804c6088 in malloc_printerr () from /lib64/libc.so.6
    #4  0x00007fa9804cb0cc in free () from /lib64/libc.so.6
    #5  0x000000000040b275 in send_regconf (finfo=0x7fff48cb31a0, attempt=9, do_regconf=1) at server_announce.c:242
    #6  0x00000000004155b8 in announce_phase (finfo=0x7fff48cb31a0) at server_phase.c:286
    #7  0x00000000004140fd in send_files () at server_send.c:521
    #8  0x000000000041ec7e in main (argc=32, argv=0x7fff48cb3348) at server_main.c:42
    

    and here is the log: https://dl.dropboxusercontent.com/u/21671126/aida-20150429-082957.log

     
  • Dennis Bush

    Dennis Bush - 2015-04-29

    My mistake, the fix I gave you was incorrect. Instead of:

            if (dupmsg) {
    

    It should be:

            if (!dupmsg) {
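
    As an illustration of what the corrected condition is meant to achieve, the following C sketch appends a client only when it has not been seen before. It assumes that `dupmsg` is nonzero when the client is already known from an earlier copy of the message; the real server may derive the flag differently. `CLIENT_CAP`, `struct client_list`, `already_listed` and `add_client_once` are hypothetical names, not uftp functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacity, taken from the "1000 clients" figure in the thread. */
#define CLIENT_CAP 1000

/* Simplified stand-in for a per-proxy client list (not the real uftp struct). */
struct client_list {
    int clients[CLIENT_CAP];
    size_t count;
};

/* Returns nonzero if hostidx is already recorded in the list. */
static int already_listed(const struct client_list *l, int hostidx)
{
    for (size_t i = 0; i < l->count; i++) {
        if (l->clients[i] == hostidx) {
            return 1;
        }
    }
    return 0;
}

/*
 * Append hostidx only the first time it is seen.
 * Returns 1 if appended, 0 if skipped as a duplicate, -1 if the list is full.
 */
static int add_client_once(struct client_list *l, int hostidx)
{
    assert(l != NULL);
    int dupmsg = already_listed(l, hostidx); /* assumed meaning of the flag */
    if (!dupmsg) {
        if (l->count >= CLIENT_CAP) {
            return -1; /* refuse to write past the end */
        }
        l->clients[l->count++] = hostidx;
        return 1;
    }
    return 0;
}
```

    With this guard, feeding the same 350 clients through `add_client_once` three times leaves `count` at 350 instead of growing with every retransmission. With the inverted test (`if (dupmsg)`), first-time clients would be skipped and only the repeats appended, so the list would still grow on each retransmission.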
    
     
  • banuchka

    banuchka - 2015-04-29

    Thanks, will try.

     
  • banuchka

    banuchka - 2015-04-30

    Today's upload went fine; I hope that finally fixes it. Dennis, thank you so much!

     
