From: Mantis B. T. <no...@bu...> - 2015-04-27 16:31:17
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=2131 ====================================================================== Reported By: gnosek Assigned To: ====================================================================== Project: Bacula Bug Reports Issue ID: 2131 Category: Director Reproducibility: sometimes Severity: crash Priority: normal Status: new ====================================================================== Date Submitted: 2015-04-27 09:11 BST Last Modified: 2015-04-27 17:31 BST ====================================================================== Summary: [7.0.5] Director crash under high load Description: The director crashes under high load* repeatedly in the same place: [27709967.186206] bacula-dir[11925]: segfault at 8 ip 00007f7233e722ce sp 00007f71f0ff86b0 error 4 in libbac-7.0.5.so[7f7233e56000+61000] [27796283.993584] bacula-dir[23494]: segfault at 7300340181d0 ip 00007f52c8b162ce sp 00007f52517f96b0 error 4 in libbac-7.0.5.so[7f52c8afa000+61000] [27828469.670787] bacula-dir[3589]: segfault at 38 ip 00007f227d0682ce sp 00007f21dcff06b0 error 4 in libbac-7.0.5.so[7f227d04c000+61000] [27829425.776910] bacula-dir[5186]: segfault at 8 ip 00007f938ab1b2ce sp 00007f9320ff86b0 error 4 in libbac-7.0.5.so[7f938aaff000+61000] [27882600.617727] bacula-dir[7518]: segfault at 8 ip 00007f158f7092ce sp 00007f15417f96b0 error 4 in libbac-7.0.5.so[7f158f6ed000+61000] All these crashes point to BSOCK::recv: int32_t BSOCK::recv() { int32_t nbytes; int32_t pktsiz; bool locked = false; msg[0] = 0; // <-- crash msglen = 0; if (errors || is_terminated() || is_closed()) { return BNET_HARDEOF; } Three of these suggest a NULL BSOCK instance (the crash address at 8 is the offset of msg in the BSOCK struct), the other two are completely bogus and look like an overwritten (freed?) pointer to the BSOCK somewhere. * high load may be any of these factors: - lots of scheduled tasks (around 900 scheduled at the same time, up to 20 concurrently running) - storage daemon being slow to respond (high I/O) - director being slow to respond (it's on the same machine as the storage daemon) - database being slow to respond (still the same machine) The crashes usually appear a couple of minutes after the scheduled time, as the I/O load ramps up. After a crash, if I resubmit the tasks via bacula-console *slowly* (one every 20-30 seconds), the director tends to keep running. Shorter intervals, like 2 seconds, tend to crash the director after some time. Still, the task queue builds up mostly the same (basically no job has a chance to finish in 30 seconds). Steps to Reproduce: Schedule 1000 I/O heavy jobs at the same time (with 20-way concurrency on the storage daemon) and wait a couple of minutes. Not 100% reproducible as we had a decent peaceful period after upgrading to 7.0.5 (5.2.x crashed left and right) but apparently we have now crossed another threshold and the director started crashing again. Additional Information: Disassembling a snippet of BSOCK::recv where the crash occurs (objdump -dCS ./debian/tmp-build-pgsql/src/lib/.libs/bsock.o): msg[0] = 0; 12ee: 48 8b 47 08 mov 0x8(%rdi),%rax 12f2: c6 00 00 movb $0x0,(%rax) The crash is in the first of these two instructions, which roughly translate to: char *msg_ptr = *(((char*) this) + 8); *msg_ptr = 0; So it's actually a dereference of "this" that crashes, we don't even reach the "dereference msg" part. Sadly, I haven't been able to get a core dump yet, so I only got the %rip value to work from. ====================================================================== ---------------------------------------------------------------------- (0007074) gnosek (reporter) - 2015-04-27 17:31 http://bugs.bacula.org/view.php?id=2131#c7074 ---------------------------------------------------------------------- Got a core dump, looks like the stack is overwritten :/ Core was generated by `/usr/sbin/bacula-dir -c /etc/bacula/bacula-dir.conf -u bacula -g bacula'. Program terminated with signal 11, Segmentation fault. http://bugs.bacula.org/view.php?id=0 0x0000000000020000 in ?? () (gdb) bt http://bugs.bacula.org/view.php?id=0 0x0000000000020000 in ?? () http://bugs.bacula.org/view.php?id=1 0x00000000553e6175 in ?? () http://bugs.bacula.org/view.php?id=2 0x0000000000020aef in ?? () http://bugs.bacula.org/view.php?id=3 0x00000000553e6175 in ?? () http://bugs.bacula.org/view.php?id=4 0x0000000000020aef in ?? () http://bugs.bacula.org/view.php?id=5 0x00000000ffffff88 in ?? () http://bugs.bacula.org/view.php?id=6 0x5d4dd7a09fcb3188 in ?? () http://bugs.bacula.org/view.php?id=7 0x353332383730323c in ?? () http://bugs.bacula.org/view.php?id=8 0x303334312e353630 in ?? () http://bugs.bacula.org/view.php?id=9 0x6240313435313531 in ?? () http://bugs.bacula.org/view.php?id=10 0x67656d2e326b6361 in ?? () http://bugs.bacula.org/view.php?id=11 0x6f632e6d61657469 in ?? () http://bugs.bacula.org/view.php?id=12 0x0000003e6c702e6d in ?? () http://bugs.bacula.org/view.php?id=13 0x547f48f68a4297a0 in ?? () http://bugs.bacula.org/view.php?id=14 0x99e1f6405ceb1518 in ?? () http://bugs.bacula.org/view.php?id=15 0xdf358d1929f4651f in ?? () http://bugs.bacula.org/view.php?id=16 0x4b4329aba43e7ca6 in ?? () http://bugs.bacula.org/view.php?id=17 0x9d61d7cec40bcc0c in ?? () http://bugs.bacula.org/view.php?id=18 0xfe20b72d8eb5320f in ?? () http://bugs.bacula.org/view.php?id=19 0xc06c3882b3baa008 in ?? () http://bugs.bacula.org/view.php?id=20 0x19491aee00000000 in ?? () http://bugs.bacula.org/view.php?id=21 0x00007fbf68026100 in ?? () http://bugs.bacula.org/view.php?id=22 0x0000000000000000 in ?? () Issue History Date Modified Username Field Change ====================================================================== 2015-04-27 09:11 gnosek New Issue 2015-04-27 17:31 gnosek Note Added: 0007074 ====================================================================== |