Hi, I often have a problem with uftp. My setup is:
one server for the transmit process, one server for the aggregation proxy, and a lot of clients; some of them work through the proxy and some directly with the server.
Proxy process is running with:
Server starts the file transfer process with:
where #hosts# is the list of the clients' IP addresses.
Here is a core dump under gdb:
# gdb /local/uftp/bin/uftp /local/tmp/php-cores/core-uftp.23104
GNU gdb (GDB) SUSE (7.5.1-0.7.29)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /local/uftp/bin/uftp...done.
[New LWP 23104]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6 0.01 90 -t 2 -B 2097152 -b'.
Program terminated with signal 6, Aborted.
#0 0x00007f058a756885 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f058a756885 in raise () from /lib64/libc.so.6
#1 0x00007f058a757e61 in abort () from /lib64/libc.so.6
#2 0x00007f058a79787f in __libc_message () from /lib64/libc.so.6
#3 0x00007f058a79d088 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007f058a7a20cc in free () from /lib64/libc.so.6
#5 0x000000000040b275 in send_regconf (finfo=0x7fff822528f0, attempt=7, do_regconf=1) at server_announce.c:242
#6 0x00000000004155b0 in announce_phase (finfo=0x7fff822528f0) at server_phase.c:286
#7 0x00000000004140f5 in send_files () at server_send.c:521
#8 0x000000000041ec76 in main (argc=32, argv=0x7fff82252a98) at server_main.c:42
Dennis, maybe you know something about this and can comment. Thanks in advance.
Last edit: Dennis Bush 2015-04-10
# gdb /local/uftp/bin/uftp /local/tmp/php-cores/core-uftp.32271
GNU gdb (GDB) SUSE (7.5.1-0.7.29)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /local/uftp/bin/uftp...done.
[New LWP 32271]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6 0.01 90 -t 2 -B 2097152 -b'.
Program terminated with signal 6, Aborted.
#0 0x00007f531c100885 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f531c100885 in raise () from /lib64/libc.so.6
#1 0x00007f531c101e61 in abort () from /lib64/libc.so.6
#2 0x00007f531c14187f in __libc_message () from /lib64/libc.so.6
#3 0x00007f531c147088 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007f531c14c0cc in free () from /lib64/libc.so.6
#5 0x000000000040b275 in send_regconf (finfo=0x7fff0dc5aaa0, attempt=7, do_regconf=1) at server_announce.c:242
#6 0x00000000004155b0 in announce_phase (finfo=0x7fff0dc5aaa0) at server_phase.c:286
#7 0x00000000004140f5 in send_files () at server_send.c:521
#8 0x000000000041ec76 in main (argc=32, argv=0x7fff0dc5ac48) at server_main.c:42
gdb /local/uftp/bin/uftp /local/tmp/php-cores/core-uftp.14363
GNU gdb (GDB) SUSE (7.5.1-0.7.29)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /local/uftp/bin/uftp...done.
[New LWP 14363]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/local/uftp/bin/uftp -R 100000 -W 10000 -s 10 -r 0.6 0.01 90 -t 2 -B 2097152 -b'.
Program terminated with signal 6, Aborted.
#0 0x00007fa98047f885 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fa98047f885 in raise () from /lib64/libc.so.6
#1 0x00007fa980480e61 in abort () from /lib64/libc.so.6
#2 0x00007fa9804c087f in __libc_message () from /lib64/libc.so.6
#3 0x00007fa9804c6088 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007fa9804cb0cc in free () from /lib64/libc.so.6
#5 0x000000000040b275 in send_regconf (finfo=0x7fff48cb31a0, attempt=9, do_regconf=1) at server_announce.c:242
#6 0x00000000004155b8 in announce_phase (finfo=0x7fff48cb31a0) at server_phase.c:286
#7 0x00000000004140fd in send_files () at server_send.c:521
#8 0x000000000041ec7e in main (argc=32, argv=0x7fff48cb3348) at server_main.c:42
I ran through a few tests using valgrind, but I didn't see any memory leaks or invalid reads/writes come up.
Can you try running the server under valgrind and see if it reports any memory related errors? That should help narrow down the problem.
Also, what version are you using, and did you make any changes at all to the code?
No, I haven't changed anything in the code.
The version is 4.6.1, but the segfault happened on 4.6 too (it looks like the faults started after we began using uftpproxy).
And one more thing: it doesn't fault under valgrind :)
Crashes aren't something that happens every time. It depends on exactly how the process's memory is laid out, and can change from one run to the next. But valgrind should at least mention something about whether some memory was stepped on that shouldn't have been or freed memory that was reused, and I would expect that to show up on every run. Whether it causes a segfault or not is where the randomness comes in.
Give it another run, adding "--leak-check=full --show-reachable=yes" as parameters to valgrind, and post the results.
Hi, sorry for the slow response. Here is the output from Valgrind:
===
==32345== Parent PID: 32343
==32345==
==32345==
==32345== HEAP SUMMARY:
==32345== in use at exit: 120 bytes in 2 blocks
==32345== total heap usage: 8,284 allocs, 8,282 frees, 1,170,890 bytes allocated
==32345==
==32345== 32 bytes in 1 blocks are still reachable in loss record 1 of 2
==32345== at 0x4C281D8: calloc (vg_replace_malloc.c:618)
==32345== by 0x59E63AF: _dlerror_run (in /lib64/libdl-2.11.3.so)
==32345== by 0x59E5F00: dlopen@@GLIBC_2.2.5 (in /lib64/libdl-2.11.3.so)
==32345== by 0x51B176C: ??? (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x51B1B42: FIPS_mode_set (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x517F5F8: OPENSSL_init (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x51A23A8: EVP_add_cipher (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x51394F6: OpenSSL_add_all_ciphers (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x51394DC: OPENSSL_add_all_algorithms_noconf (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x408528: crypto_init (encrypt_openssl.c:76)
==32345== by 0x41D4EC: pre_initialize (server_init.c:168)
==32345== by 0x41EC5F: main (server_main.c:39)
==32345==
==32345== 88 bytes in 1 blocks are possibly lost in loss record 2 of 2
==32345== at 0x4C2A0F5: malloc (vg_replace_malloc.c:291)
==32345== by 0x400C644: _dl_map_object_deps (in /lib64/ld-2.11.3.so)
==32345== by 0x4012452: dl_open_worker (in /lib64/ld-2.11.3.so)
==32345== by 0x400DE75: _dl_catch_error (in /lib64/ld-2.11.3.so)
==32345== by 0x4011E2A: _dl_open (in /lib64/ld-2.11.3.so)
==32345== by 0x59E5F9A: dlopen_doit (in /lib64/libdl-2.11.3.so)
==32345== by 0x400DE75: _dl_catch_error (in /lib64/ld-2.11.3.so)
==32345== by 0x59E633B: _dlerror_run (in /lib64/libdl-2.11.3.so)
==32345== by 0x59E5F00: dlopen@@GLIBC_2.2.5 (in /lib64/libdl-2.11.3.so)
==32345== by 0x51B176C: ??? (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x51B1B42: FIPS_mode_set (in /usr/lib64/libcrypto.so.0.9.8)
==32345== by 0x517F5F8: OPENSSL_init (in /usr/lib64/libcrypto.so.0.9.8)
==32345==
==32345== LEAK SUMMARY:
==32345== definitely lost: 0 bytes in 0 blocks
==32345== indirectly lost: 0 bytes in 0 blocks
==32345== possibly lost: 88 bytes in 1 blocks
==32345== still reachable: 32 bytes in 1 blocks
==32345== suppressed: 0 bytes in 0 blocks
==32345==
==32345== For counts of detected and suppressed errors, rerun with: -v
==32345== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
===
This trace is showing some unfreed memory within the OpenSSL library at the time the program exited. While not optimal, something like this wouldn't cause a crash.
You may need to do more extensive testing before the error shows up again while being traced.
OK, thank you. Maybe I need to update OpenSSL and statically link it into uftp? What do you think? And is that possible?
I don't think that will help. The unfreed memory in OpenSSL probably isn't causing any issues.
For now, continue to run normally. When it crashes again, post the full log of the server. There might be something in there that could point to where the problem is.
Last core and the last crash log
Last edit: Dennis Bush 2015-04-22
Do you have the log from the server process? That is also needed for troubleshooting.
Here it is: https://dl.dropboxusercontent.com/u/21671126/aida-20150427-095520.log
Thanks, that was a big help. I found the problem.
When the server reads an ANNOUNCE from a proxy, it goes through the list of clients in the message and adds them to the list of clients for that proxy. However, it doesn't first check whether the client is already in the list, so every retransmission adds the same clients again. Each proxy only handles 1000 clients, and in your case over 350 clients pass through one proxy. After two retransmissions of a REGISTER the list would need to hold three copies of those clients (more than 1050 entries), so the server writes past the end of the client list for the proxy. That results in unpredictable behavior, which sometimes means a crash.
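In server_announce.c, near the end of the add_proxy_dests function, you'll see the following lines:
destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
addcnt++;
Change it to this:
if (dupmsg) {
destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
addcnt++;
}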
Give that a run and let me know if you see any other problems.
I'll make a fix on my end for the next version. Instead of the above change, I'll actually just get rid of the client list for each proxy, since the server doesn't use that information anyway.
Hi,
Thanks so much, Dennis. Here is the patch:
=====
--- server_announce.c 2014-12-31 04:34:02.000000000 +0000
+++ server_announce.c 2015-04-28 12:38:25.862299068 +0000
@@ -465,8 +465,10 @@
finfo->deststate[hostidx].conf_sent = 0;
log1(finfo->group_id, finfo->file_id,
" For client%s %s", dupmsg ? "+" : "", destlist[hostidx].name);
- destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
- addcnt++;
+ if (dupmsg) {
+ destlist[proxyidx].clients[addcnt + startcnt] = hostidx;
+ addcnt++;
+ }
}
destlist[proxyidx].clientcnt += addcnt;
}
=====
I updated the binary on the server and the proxy. Is that enough? I mean, I didn't update the client binaries.
Last edit: banuchka 2015-04-28
A few test runs without a segfault, it's amazing :) I need to wait a few days under the working/production load.
This change only affects the server. Clients and proxies are not affected.
Hi, there is a new segfault,
and here is the log: https://dl.dropboxusercontent.com/u/21671126/aida-20150429-082957.log
My mistake, the fix I gave you was incorrect. Instead of:
It should be:
Thanks, will try.
Today's upload went fine; I hope that finally fixes it. Dennis, thank you so much!