From: Mantis B. T. <no...@bu...> - 2015-05-07 16:19:02
|
The following issue has been REOPENED. ====================================================================== http://bugs.bacula.org/view.php?id=2131 ====================================================================== Reported By: gnosek Assigned To: kern ====================================================================== Project: Bacula Bug Reports Issue ID: 2131 Category: Director Reproducibility: sometimes Severity: crash Priority: normal Status: feedback ====================================================================== Date Submitted: 2015-04-27 09:11 BST Last Modified: 2015-05-07 17:18 BST ====================================================================== Summary: [7.0.5] Director crash under high load Description: The director crashes under high load* repeatedly in the same place: [27709967.186206] bacula-dir[11925]: segfault at 8 ip 00007f7233e722ce sp 00007f71f0ff86b0 error 4 in libbac-7.0.5.so[7f7233e56000+61000] [27796283.993584] bacula-dir[23494]: segfault at 7300340181d0 ip 00007f52c8b162ce sp 00007f52517f96b0 error 4 in libbac-7.0.5.so[7f52c8afa000+61000] [27828469.670787] bacula-dir[3589]: segfault at 38 ip 00007f227d0682ce sp 00007f21dcff06b0 error 4 in libbac-7.0.5.so[7f227d04c000+61000] [27829425.776910] bacula-dir[5186]: segfault at 8 ip 00007f938ab1b2ce sp 00007f9320ff86b0 error 4 in libbac-7.0.5.so[7f938aaff000+61000] [27882600.617727] bacula-dir[7518]: segfault at 8 ip 00007f158f7092ce sp 00007f15417f96b0 error 4 in libbac-7.0.5.so[7f158f6ed000+61000] All these crashes point to BSOCK::recv: int32_t BSOCK::recv() { int32_t nbytes; int32_t pktsiz; bool locked = false; msg[0] = 0; // <-- crash msglen = 0; if (errors || is_terminated() || is_closed()) { return BNET_HARDEOF; } Three of these suggest a NULL BSOCK instance (the crash address at 8 is the offset of msg in the BSOCK struct), the other two are completely bogus and look like an overwritten (freed?) pointer to the BSOCK somewhere. * high load may be any of these factors: - lots of scheduled tasks (around 900 scheduled at the same time, up to 20 concurrently running) - storage daemon being slow to respond (high I/O) - director being slow to respond (it's on the same machine as the storage daemon) - database being slow to respond (still the same machine) The crashes usually appear a couple of minutes after the scheduled time, as the I/O load ramps up. After a crash, if I resubmit the tasks via bacula-console *slowly* (one every 20-30 seconds), the director tends to keep running. Shorter intervals, like 2 seconds, tend to crash the director after some time. Still, the task queue builds up mostly the same (basically no job has a chance to finish in 30 seconds). Steps to Reproduce: Schedule 1000 I/O heavy jobs at the same time (with 20-way concurrency on the storage daemon) and wait a couple of minutes. Not 100% reproducible as we had a decent peaceful period after upgrading to 7.0.5 (5.2.x crashed left and right) but apparently we have now crossed another threshold and the director started crashing again. Additional Information: Disassembling a snippet of BSOCK::recv where the crash occurs (objdump -dCS ./debian/tmp-build-pgsql/src/lib/.libs/bsock.o): msg[0] = 0; 12ee: 48 8b 47 08 mov 0x8(%rdi),%rax 12f2: c6 00 00 movb $0x0,(%rax) The crash is in the first of these two instructions, which roughly translate to: char *msg_ptr = *(((char*) this) + 8); *msg_ptr = 0; So it's actually a dereference of "this" that crashes, we don't even reach the "dereference msg" part. Sadly, I haven't been able to get a core dump yet, so I only got the %rip value to work from. ====================================================================== ---------------------------------------------------------------------- (0007075) gnosek (reporter) - 2015-04-27 23:35 http://bugs.bacula.org/view.php?id=2131#c7075 ---------------------------------------------------------------------- OK, got a core dump. Stack trace looks like this (the BSOCK is clearly bogus): http://bugs.bacula.org/view.php?id=0 BSOCK::recv (this=0x6d824804ca68) at bsock.c:496 http://bugs.bacula.org/view.php?id=1 0x00007f835dd6c8b4 in cram_md5_respond (bs=0x6d824804ca68, password=0x1ef9738 "(snipped 32 or so hex digits)", tls_remote_need=0x7f82df7fdb68, compatible=0x5b02db6d9248) at cram-md5.c:140 http://bugs.bacula.org/view.php?id=2 0x000000000040fb05 in authenticate_storage_daemon (jcr=0x28c2248, store=0x1efcc68) at authenticate.c:95 http://bugs.bacula.org/view.php?id=3 0x000000000042f60e in connect_to_storage_daemon (jcr=0x28c2248, retry_interval=10, max_retry_time=1800, verbose=1) at msgchan.c:132 http://bugs.bacula.org/view.php?id=4 0x0000000000413c01 in do_backup (jcr=0x28c2248) at backup.c:457 http://bugs.bacula.org/view.php?id=5 0x0000000000424019 in job_thread (arg=0x28c2248) at job.c:303 http://bugs.bacula.org/view.php?id=6 0x0000000000429b63 in jobq_server (arg=0x689620) at jobq.c:439 http://bugs.bacula.org/view.php?id=7 0x00007f835dd9646d in lmgr_thread_launcher (x=0x23dcf58) at lockmgr.c:1091 http://bugs.bacula.org/view.php?id=8 0x00007f835db35e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 http://bugs.bacula.org/view.php?id=9 0x00007f835d34d2ed in clone () from /lib/x86_64-linux-gnu/libc.so.6 http://bugs.bacula.org/view.php?id=10 0x0000000000000000 in ?? () Investigating further. ---------------------------------------------------------------------- (0007076) kern (administrator) - 2015-04-28 06:24 http://bugs.bacula.org/view.php?id=2131#c7076 ---------------------------------------------------------------------- I am unable to duplicate this problem and can run 1000 jobs here with no problem other than using a lot of resources. I am running on Ubuntu 14.04 There are two possible explanations: 1. You have a bad build (such as too much optimization in gnu c++) or you have hit gnu c++ bug. 2. Your processes may being killed by the kernel oom handler because you are probably way overloading your kernel/hardware. 3. It is possible that the stack is getting overwritten, but for the moment this has a very low probability in my mind. For item http://bugs.bacula.org/view.php?id=1: where did you get your binaries and what are the build options? For item http://bugs.bacula.org/view.php?id=2: Try "grep -i kill /var/log/messages" The link: http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html will explain a bit more about oom. For item http://bugs.bacula.org/view.php?id=3: To be looked at if none of the above are the problem. ---------------------------------------------------------------------- (0007077) gnosek (reporter) - 2015-04-28 07:05 http://bugs.bacula.org/view.php?id=2131#c7077 ---------------------------------------------------------------------- 1. This is a locally built package based on Ubuntu upstream package with the following changes: - 5.2.x sources replaced with 7.0.5 - Debian patches massaged to fit (mostly trivial stuff around the build process and default configs, nothing that actually touches the source) - disabled the internal signal handler (just let it crash via the OS): http://pastebin.com/xwhRP5Xt These are my notes from the package changelog: * New upstream release * Removed libbacpy* from installation (it's gone) * Removed signal handler for core-generating signals: SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGBUS, SIGFPE and last, but not least, SIGSEGV * Kept debug symbols enabled (no dh_strip) * Removed bacula-tray-monitor package (removed upstream) * Removed ABOUT-NLS and projects from docs (removed upstream) * Added -- --ignore-missing-info to dh_shlibdeps calls (this multiple-version stuff is _weird_ and libtool explicitly ignores soname and sets its own) Effective ./configure line (from config.log): $ ./configure --config-cache --host=x86_64-linux-gnu --build=x86_64-linux-gnu --prefix=/usr --with-archivedir=/nonexistant/path/to/file/archive/dir --sysconfdir=/etc/bacula --with-scriptdir=/etc/bacula/scripts --sharedstatedir=/var/lib/bacula --localstatedir=/var/lib/bacula --with-pid-dir=/var/run/bacula --with-smtp-host=localhost --with-working-dir=/var/lib/bacula --with-subsys-dir=/var/lock --mandir=${prefix}/share/man --infodir=${prefix}/share/info --enable-smartalloc --with-python --with-tcp-wrappers --with-openssl --with-libiconv-prefix=/usr/include --with-readline=/usr/include/readline --disable-conio --with-libintl-prefix=/usr/include --with-x --docdir=${prefix}/share/doc/bacula-common --htmldir=${prefix}/share/doc/bacula-common/html --libdir=${prefix}/lib/bacula --enable-batch-insert --disable-bwx-console --without-qwt --enable-ipv6 --with-dir-passowrd=XXX_DIRPASSWORD_XXX --with-fd-password=XXX_FDPASSWORD_XXX --with-sd-password=XXX_SDPASSWORD_XXX --with-mon-dir-password=XXX_MONDIRPASSWORD_XXX --with-mon-fd-password=XXX_MONFDPASSWORD_XXX --with-mon-sd-password=XXX_MONSDPASSWORD_XXX --with-db-name=XXX_DBNAME_XXX --with-db-user=XXX_DBUSER_XXX --with-db-password=XXX_DBPASSWORD_XXX --with-postgresql --without-sqlite --without-mysql --without-sqlite3 --disable-gnome --disable-bwx-console --disable-tray-monitor --disable-bat Compiler: g++-4.6.real (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 2. I've seen many OOM kills in my life, this is not one of them :) I had a slightly closer look at the core dump and jcr->store_sock in the authenticate_storage_daemon stack frame (which is passed directly as BSOCK* sd) is valid (points to valid memory, doesn't look obviously wrong).I guess I'll have to walk through the assembly in authenticate_storage_daemon and look for weird things (though I am the last to suggest compiler bugs). Are there any known issues with certain compiler options? A quick grep over the build tree suggests the following flags: 'CFLAGS=-g -Wall -D_FORTIFY_SOURCE=0 -O2' 'LDFLAGS=' 'CPPFLAGS=-D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security' (currently not sure why the flags are wildly different for C and C++ and whether it matters) ---------------------------------------------------------------------- (0007078) gnosek (reporter) - 2015-04-28 10:45 http://bugs.bacula.org/view.php?id=2131#c7078 ---------------------------------------------------------------------- No obvious miscompilation in authenticate_storage_daemon, sd is loaded into %rbp at the very start of the function and copied into %rdi before the call to cram_md5_respond. Still, the %rdx value (&compatible) seems bogus too and that comes straight from lea 0x28(%rsp),%rdx. I wonder how trustworthy the argument values in the core dump are if they're passed in registers. Hmm, if we (apparently) have %rsp and %rbp overwritten, could it be that the stack was overwritten in a function called from authenticate_storage_daemon and moved into registers in that function's epilogue? ---------------------------------------------------------------------- (0007079) kern (administrator) - 2015-04-28 11:17 http://bugs.bacula.org/view.php?id=2131#c7079 ---------------------------------------------------------------------- Would you try rebuilding the source with the following flags? CFLAGS = -g -O2 -Wall -fno-strict-aliasing -fno-exceptions -fno-rtti CPPFLAGS = -fno-strict-aliasing -fno-exceptions -fno-rtti This is what I use. I haven't looked at Intel assembly code since 1985 so currently, I do not understand the details you are discussing such as what %rsp and %rbp. Aside from build options, which you can try above to see if it resolves the problems, the other major possibility is that the stack is getting wiped out. I particulary dislike the FORTIFY_SOURCE option as even its documentation points out that it can make mistakes, which I have seen some time ago in Bacula. ---------------------------------------------------------------------- (0007080) gnosek (reporter) - 2015-04-28 11:24 http://bugs.bacula.org/view.php?id=2131#c7080 ---------------------------------------------------------------------- OK, I'll build a new package and report. ---------------------------------------------------------------------- (0007081) gnosek (reporter) - 2015-04-28 11:50 http://bugs.bacula.org/view.php?id=2131#c7081 ---------------------------------------------------------------------- (from bacula.h) /* Disable FORTIFY_SOURCE, because bacula uses is own memory * manager */ #ifdef _FORTIFY_SOURCE #undef _FORTIFY_SOURCE #endif Hmm, wouldn't that confuse the compiled binaries if e.g. bacula.h isn't included absolutely everywhere at the top? Still, I built the package with: 'CFLAGS=-g -Wall -D_FORTIFY_SOURCE=0 -fno-strict-aliasing -fno-exceptions -fno-rtti -O2' 'LDFLAGS=' 'CPPFLAGS=-fno-strict-aliasing -fno-exceptions -fno-rtti ' 'CXXFLAGS=-g -Wall -D_FORTIFY_SOURCE=0 -fno-strict-aliasing -fno-exceptions -fno-rtti -O2' and will give it a spin later today. ---------------------------------------------------------------------- (0007082) kern (administrator) - 2015-04-28 12:20 http://bugs.bacula.org/view.php?id=2131#c7082 ---------------------------------------------------------------------- No, the code in bacula.h to disable fortify source will not confuse binaries. First, bacula.h is included everywhere, and even if it is not, only the files compiled without it will make the calls to fortify library subroutines. My personal view is that Bacula already handles the problems the fortify source code attempts to fix, and it does so without the extra complexity/complications. ---------------------------------------------------------------------- (0007083) gnosek (reporter) - 2015-04-30 10:08 http://bugs.bacula.org/view.php?id=2131#c7083 ---------------------------------------------------------------------- The rebuild apparently helped (no director crash so far) but right now I have the storage daemon crashed twice (with a straight NULL dereference, no core yet). Though I guess this is a separate issue (I'll report my findings, if any, in a separate ticket). Thanks. ---------------------------------------------------------------------- (0007084) kern (administrator) - 2015-04-30 10:50 http://bugs.bacula.org/view.php?id=2131#c7084 ---------------------------------------------------------------------- OK, the fact that the Director is not crashing is very good (and normal for me). What is particularly worrisome is that the Ubuntu standard build seems to create unstable binaries :-) For your SD, please ensure that you use a SD built with the same options that you used on the Dir. There should not be any incompatibility between the Dir and SD no matter how you build them, but if you are using the Ubuntu build flags for the SD it could create the same problems that you saw for the DIR. ---------------------------------------------------------------------- (0007085) gnosek (reporter) - 2015-04-30 10:58 http://bugs.bacula.org/view.php?id=2131#c7085 ---------------------------------------------------------------------- Yep, rebuilt and reinstalled all bacula components. Will investigate further once it crashes again and I get a core dump (why oh why is ulimit -c 0 the default :/). FWIW, the distribution packages of 5.2.5 (or .6) were completely unstable with hundreds of jobs, 5.2.13 (built with same compiler options) were somewhat better and 7.0.5 was rock-solid until about a week ago. ---------------------------------------------------------------------- (0007086) gnosek (reporter) - 2015-04-30 12:35 http://bugs.bacula.org/view.php?id=2131#c7086 ---------------------------------------------------------------------- Both of the SD segfaults were deliberate crashes in logging code (e_msg and Jmsg functions). Possibly related to running out of disk space on the volume containing log files :/ Running a full batch of backups right now, with fingers crossed. Should be OK, at least the director seems solid now. Thanks again. ---------------------------------------------------------------------- (0007087) kern (administrator) - 2015-05-01 07:11 http://bugs.bacula.org/view.php?id=2131#c7087 ---------------------------------------------------------------------- Could you tell me the exact place or exact messages where these deliberate crashes occur? In general, Bacula should not crash for out of space conditions unless it is critical to its operation. Instead it should just fail jobs. ---------------------------------------------------------------------- (0007088) gnosek (reporter) - 2015-05-02 07:36 http://bugs.bacula.org/view.php?id=2131#c7088 ---------------------------------------------------------------------- I spoke too soon :/ Exact same crash (although compatible looks like a valid pointer now) after recompiling with the options indicated. http://bugs.bacula.org/view.php?id=0 BSOCK::recv (this=0x20) at bsock.c:496 http://bugs.bacula.org/view.php?id=1 0x00007f637d017294 in cram_md5_respond (bs=0x20, password=0x264e738 "(...)", tls_remote_need=0x7f6301ff2a28, compatible=0x7f6301ff2a2c) at cram-md5.c:140 http://bugs.bacula.org/view.php?id=2 0x000000000040fad5 in authenticate_storage_daemon (jcr=0x7f62f210f6c8, store=0x2651c68) at authenticate.c:95 http://bugs.bacula.org/view.php?id=3 0x000000000042ec8e in connect_to_storage_daemon (jcr=0x7f62f210f6c8, retry_interval=10, max_retry_time=1800, verbose=1) at msgchan.c:132 http://bugs.bacula.org/view.php?id=4 0x000000000042663b in cancel_storage_daemon_job (jcr=0x7f62f1f4f4c8) at job.c:622 http://bugs.bacula.org/view.php?id=5 0x0000000000412744 in wait_for_job_termination (jcr=0x7f62f1f4f4c8, timeout=<optimized out>) at backup.c:673 http://bugs.bacula.org/view.php?id=6 0x0000000000413c53 in do_backup (jcr=0x7f62f1f4f4c8) at backup.c:602 http://bugs.bacula.org/view.php?id=7 0x0000000000423d79 in job_thread (arg=0x7f62f1f4f4c8) at job.c:303 http://bugs.bacula.org/view.php?id=8 0x0000000000429723 in jobq_server (arg=0x686640) at jobq.c:439 http://bugs.bacula.org/view.php?id=9 0x00007f637d040a52 in lmgr_thread_launcher (x=0x7f62f02c7ad8) at lockmgr.c:1091 http://bugs.bacula.org/view.php?id=10 0x00007f637cde0e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 http://bugs.bacula.org/view.php?id=11 0x00007f637c80e2ed in clone () from /lib/x86_64-linux-gnu/libc.so.6 http://bugs.bacula.org/view.php?id=12 0x0000000000000000 in ?? () BTW, sorry, no, I can't tell you where the SD crashes came from, only that they were caused by a log message with MSG_ABORT (which I don't have) ---------------------------------------------------------------------- (0007089) kern (administrator) - 2015-05-02 08:14 http://bugs.bacula.org/view.php?id=2131#c7089 ---------------------------------------------------------------------- No, the pointer is 0x20, which is invalid. Although this could be a Bacula problem, t the moment, it seems to me that the most likely cause is some memory corruption, either from hardware problems or from Bacula itself. Were you running 1000 jobs at the time this happened? This looks supiciously like an old problem that I am 99% sure is corrected in a more recent 7.0.5. In addition, from the trace back, Bacula has already detected a problem attempting a backup job. Either you have canceled the job yourself or some fatal error has been found in the DIR or FD. This is not at all normal behavior for Bacula, and I do not believe that I will be able to resolve the problem, other than telling you to make sure all your components of Bacula are on the same version and use the simplified compile options that I gave you. ---------------------------------------------------------------------- (0007090) kern (administrator) - 2015-05-02 08:16 http://bugs.bacula.org/view.php?id=2131#c7090 ---------------------------------------------------------------------- There is something seriously wrong here, which is either related to overstressing Bacula or the OS with too many jobs or it is a bug that I cannot reproduce. Unsolvable. Sorry. ---------------------------------------------------------------------- (0007091) gnosek (reporter) - 2015-05-07 17:18 http://bugs.bacula.org/view.php?id=2131#c7091 ---------------------------------------------------------------------- FWIW (if anybody else hits a similar issue), looks stable after building with -fno-omit-frame-pointer. Quite a bit of cargo culting though, not sure if it really fixes the problem (bad code is generated only without frame pointers) or works around it (shuffles things around enough not to crash right now). Issue History Date Modified Username Field Change ====================================================================== 2015-04-27 09:11 gnosek New Issue 2015-04-27 17:31 gnosek Note Added: 0007074 2015-04-27 17:31 gnosek Note Deleted: 0007074 2015-04-27 23:35 gnosek Note Added: 0007075 2015-04-28 06:24 kern Note Added: 0007076 2015-04-28 06:24 kern Assigned To => kern 2015-04-28 06:24 kern Status new => feedback 2015-04-28 07:05 gnosek Note Added: 0007077 2015-04-28 07:05 gnosek Status feedback => assigned 2015-04-28 10:45 gnosek Note Added: 0007078 2015-04-28 11:17 kern Note Added: 0007079 2015-04-28 11:24 gnosek Note Added: 0007080 2015-04-28 11:50 gnosek Note Added: 0007081 2015-04-28 12:20 kern Note Added: 0007082 2015-04-30 10:08 gnosek Note Added: 0007083 2015-04-30 10:50 kern Note Added: 0007084 2015-04-30 10:58 gnosek Note Added: 0007085 2015-04-30 12:35 gnosek Note Added: 0007086 2015-05-01 07:11 kern Note Added: 0007087 2015-05-02 07:36 gnosek Note Added: 0007088 2015-05-02 08:14 kern Note Added: 0007089 2015-05-02 08:16 kern Note Added: 0007090 2015-05-02 08:16 kern Status assigned => closed 2015-05-02 08:16 kern Resolution open => unable to reproduce 2015-05-07 17:18 gnosek Note Added: 0007091 2015-05-07 17:18 gnosek Status closed => feedback 2015-05-07 17:18 gnosek Resolution unable to reproduce => reopened ====================================================================== |