From: Mantis B. T. <no...@bu...> - 2009-08-26 17:26:03
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=1343 ====================================================================== Reported By: hbrown Assigned To: ====================================================================== Project: bacula Issue ID: 1343 Category: Storage Daemon Reproducibility: random Severity: crash Priority: urgent Status: feedback ====================================================================== Date Submitted: 2009-08-05 20:02 BST Last Modified: 2009-08-26 18:25 BST ====================================================================== Summary: SD segfaults and dies during jobs Description: bacula-sd has been dying on me at seemingly random times while running jobs (spooling or despooling data). Here's the latest output with debug=200: ---- cbs-01-sd: askdir.c:205-854 do_reqest_vol_info return true slot=7 Volume=000006 cbs-01-sd: acquire.c:479-854 dir_update_vol_info. Release vol=000006 dev="Drive-0" (/dev/nst1) cbs-01-sd: acquire.c:496-854 1 writers, 1 reserve, dev="Drive-0" (/dev/nst1) cbs-01-sd: vol_mgr.c:212-854 List acquire:release_device(): 000006 in_use=1 on device "Drive-0" (/dev/nst1) Bacula interrupted by signal 11: Segmentation violation Kaboom! bacula-sd, cbs-01-sd got signal 11 - Segmentation violation. Attempting traceback. Kaboom! exepath=/usr/sbin/ Calling: /usr/sbin/btraceback /usr/sbin/bacula-sd 10026 cbs-01-sd: jcr.c:834-851 jid=851 leave set_old_job_status=R new_set=T cbs-01-sd: fd_cmds.c:156-851 <filed: append end session 20 cbs-01-sd: fd_cmds.c:217-851 store<file: append end session 20 cbs-01-sd: fd_cmds.c:156-851 <filed: append close session 20 cbs-01-sd: fd_cmds.c:260-851 <filed: append close session 20 cbs-01-sd: fd_cmds.c:268-851 >filed: 3000 OK close Status = 84 ---- ====================================================================== ---------------------------------------------------------------------- (0004496) hbrown (reporter) - 2009-08-05 20:03 http://bugs.bacula.org/view.php?id=1343#c4496 ---------------------------------------------------------------------- Forgot to say: thanks very much for Bacula and for any help you can offer. Please let me know if you require any further information. ---------------------------------------------------------------------- (0004497) kern (administrator) - 2009-08-05 21:20 http://bugs.bacula.org/view.php?id=1343#c4497 ---------------------------------------------------------------------- We need a good traceback with symbols to be able to debug this. You might try running the SD manually under the debugger as described in the Kaboom chapter of the manual. However, before doing that, I suggest that you install the Bacula project 3.0.2 el5 rpms that are posted on Source Forge. There may be a problem in your build. We haven't done lots of testing with TLS recently, so perhaps it is causing some problems. ---------------------------------------------------------------------- (0004498) hbrown (reporter) - 2009-08-05 22:52 http://bugs.bacula.org/view.php?id=1343#c4498 ---------------------------------------------------------------------- I've installed the 3.0.2 el5 RPMs from Sourceforge as you suggest: ---- $ rpm -qa |grep bacula bacula-mysql-3.0.2-1 bacula-mtx-3.0.2-1 ---- However, there doesn't appear to be any debugging info in them. Here's what I get when I run btraceback as UID bacula again: ---- Subject: Bacula GDB traceback of bacula-sd on agnatha.chibi.ubc.ca Sender: ba...@ch... To: ro...@ch... Date: Wed, 05 Aug 2009 14:43:01 -0700 (PDT) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) [Thread debugging using libthread_db enabled] [New Thread 0x2b1defa98230 (LWP 7013)] [New Thread 0x41663940 (LWP 7017)] (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) 0x00000031d72cc4c2 in select () from /lib64/libc.so.6 $1 = 762536547 $2 = 170349864 $3 = 170349928 /usr/lib64/bacula/btraceback.gdb:4: Error in sourced command file: No symbol table is loaded. Use the "file" command. ---- I grabbed the rpmbuild command from the platforms/contrib-rpm/build_rpm.sh command in the source tree, which I understand is what's used to build the official RPMs. However, the build_rpm.sh script deletes the debuginfo package, and I can't find any such package on Sourceforge. Is there somewhere else I should look? (For the record, I've modified btraceback just by adding "sudo" at the beginning of the gdb line.) ---------------------------------------------------------------------- (0004499) kern (administrator) - 2009-08-06 07:22 http://bugs.bacula.org/view.php?id=1343#c4499 ---------------------------------------------------------------------- Sorry, my advice was not the best. I had forgotten that the project does not upload the debug packages. I will make a note to ask them to do so in the future. At least we know that your build was no worse than the project's rpms. At this point, there are two choices: 1. put back your rpms. 2. Build from source. I prefer to build from source since it avoids the possibility that some of the numerous that RedHat throws onto the rpm builds are creating problems. I've uploaded a recommended productionconfig file (very different installation locations) that I recommend if you build from source (you may need to tweak it to turn off bat or something). Please also upload your bacula-dir.conf and bacula-sd.conf, but be sure to zap the passwords. I will be away a few days so Eric will most likely fill in. ---------------------------------------------------------------------- (0004500) hbrown (reporter) - 2009-08-06 17:54 http://bugs.bacula.org/view.php?id=1343#c4500 ---------------------------------------------------------------------- I'm attaching -dir and -sd.conf with no passwords and sanitized hostnames (except for noc, blinky and jgdesk-01, which were already mentioned earlier). I've used productionconfig to build bacula, but with these options commented out: ---- --enable-gnome --enable-wx-console --enable-tray-monitor --enable-bat ---- I've changed the "mailcommand", "operatorcommand" and "Changer Command" directives in the config files to reflect the new paths to bsmtp and mtx-changer, but have otherwise left them unchanged. I'm running the daemons by hand like so: ---- /opt/bacula/bin/bacula-dir -c /opt/bacula/etc/bacula-dir.conf -u bacula /opt/bacula/bin/bacula-fd -c /opt/bacula/etc/bacula-fd.conf -u bacula /opt/bacula/bin/bacula-sd -c /opt/bacula/etc/bacula-sd.conf -u bacula -d200 ---- I've tried killing -sd with signal 11, and got a successful traceback...didn't think earlier to try the traceback by killing, rather than interactively. I've run a couple small jobs just to make sure things are working, and it looks good. The regular backups will run in about an hour, so we'll see what happens then. ---------------------------------------------------------------------- (0004505) kern (administrator) - 2009-08-14 17:19 http://bugs.bacula.org/view.php?id=1343#c4505 ---------------------------------------------------------------------- If something comes up, please reopen the bug report and post the new information. ---------------------------------------------------------------------- (0004516) hbrown (reporter) - 2009-08-17 17:15 http://bugs.bacula.org/view.php?id=1343#c4516 ---------------------------------------------------------------------- Hi there -- bacula-sd crashed over the weekend, while it was in the middle of running differentials. From sd's log, it looks like one job had just finished, and the director was calling to cancel another. The btraceback script ran into problems, but I'm hopeful that I've got enough information to be useful to you. First off, here's what the btraceback email looked like: ---- Subject: Bacula GDB traceback of bacula-sd on agnatha.chibi.ubc.ca Date: Sun, 16 Aug 2009 11:20:27 -0700 (PDT) ptrace: No such process. /var/lib/bacula/20176: No such file or directory. $1 = 0 /opt/bacula/scripts/btraceback.gdb:2: Error in sourced command file: No symbol "exename" in current context. ---- (/var/lib/bacula is the working directory for the previous, RPM-based installation of bacula; the working directory for the new installation of bacula (which you got me to build, and which is still in use) is /opt/bacula/working. It's not mentioned in the btraceback script, which is completely stock except for prepending the gdb command with "sudo". The only thing I can figure is that, since the home directory for the user bacula is still /var/lib/bacula, it was looking there somehow. I'll change the home directory to point to /opt/bacula/working.) However, I *did* manage to recover the actual dump file itself, and I'm attaching that. I'm also attaching the last 100 lines of the debugging log for sd. Thanks again for your help, and please let me know if you need any further information. ---------------------------------------------------------------------- (0004517) hbrown (reporter) - 2009-08-17 17:33 http://bugs.bacula.org/view.php?id=1343#c4517 ---------------------------------------------------------------------- Sorry, scratch that -- the backtrace I attached was from last week. I should have checked things more carefully. I'm not sure that I have permissions to close this issue, so please feel free to do so...I'll reopen if I can get a backtrace. My apologies for the confusion. ---------------------------------------------------------------------- (0004534) hbrown (reporter) - 2009-08-25 18:07 http://bugs.bacula.org/view.php?id=1343#c4534 ---------------------------------------------------------------------- bacula-sd died again, some time last night. I was running it under gdb, so I was able to get traceback info by running the commands from btraceback.gdb manually. I'm attaching the file; I hope this will be useful. I'm still running it under a debugger. Please let me know if you need any further information. ---------------------------------------------------------------------- (0004535) kern (administrator) - 2009-08-25 21:53 http://bugs.bacula.org/view.php?id=1343#c4535 ---------------------------------------------------------------------- >From your productionconfig, I am pleased to see you using the recommended one directory :-) That should simplify your life for doing bare metal recovery. However, in looking at the SD traceback, I don't see any indication of a seg fault. Since all the threads seem to be exiting, I can only guess that you caught it quite a while after any seg fault. Normally, if Bacula runs into a fatal problem such as a seg fault, the signal will be trapped and Bacula will attempt to produce a dump of itself, then it will proceed to shut down the daemon. Does that make any sense to you? Bottom line, we need a dump just at the point of any seg fault. Though some of the other information is very nice to see, the only commands that are necessary to run at that time are: where thread apply all bt with the second being the most important. Within the output produced by "thread apply all bt" there should be one of the threads that show that a signal was intercepted. Best regards, Kern ---------------------------------------------------------------------- (0004536) hbrown (reporter) - 2009-08-25 22:19 http://bugs.bacula.org/view.php?id=1343#c4536 ---------------------------------------------------------------------- Crap -- sorry about that. I wondered about the lack of a signal myself... I was running -sd under gdb as shown in the Kaboom! chapter -- ie, "sudo gdb ./bacula-sd" and then "run -s -f -c /opt/bacula/etc/bacula-sd.conf" from within gdb. The -s option disabled the signal handling, but as I've written before I've found I'm unable to get a reliable backtrace when -sd crashes on me...which is weird, because I can send the signal manually and get a nice backtrace. I'm not sure exactly when -sd died, as I'm running jobs at night and found the crash this morning. I had thought that gdb would sort of hold -sd in place, as it was at the time of the crash, and allow me to get a useful backtrace -- was I wrong about that? Anyhow -- thanks again for all your help with this. I hope I can come up with some useful info, because I'd really like to get this working. ---------------------------------------------------------------------- (0004537) kern (administrator) - 2009-08-26 10:37 http://bugs.bacula.org/view.php?id=1343#c4537 ---------------------------------------------------------------------- I am not sure what went wrong, but when you run under the debugger and use: run -s -f ... the debugger will immediately stop on any signal (seg fault, ...) and return control to you and wait indefinitely. At that point if you do the "thread apply all bt" we should have what we need -- well about 5% of the time it requires more info. ---------------------------------------------------------------------- (0004543) hbrown (reporter) - 2009-08-26 17:07 http://bugs.bacula.org/view.php?id=1343#c4543 ---------------------------------------------------------------------- bacula-sd crashed again last night while running under gdb. By way of background, it looks like it crashed when it was trying to cancel a job on a machine that was not responding; the client has been uninstalled temporarily (unrelated issue), and I had not cancelled the job itself. I'm pretty certain this is where it crashed the previous night as well. I'm attaching the backtrace. I still have the debugger going with the original process, and I can leave it for the rest of today ('til, say, 16:00 PDT, or 23:30 UTC). Let me know if there's anything else I can do. ---------------------------------------------------------------------- (0004546) kern (administrator) - 2009-08-26 18:21 http://bugs.bacula.org/view.php?id=1343#c4546 ---------------------------------------------------------------------- Can you enter the following command into the debugger? where then post the output it generates? Thanks. ---------------------------------------------------------------------- (0004547) kern (administrator) - 2009-08-26 18:24 http://bugs.bacula.org/view.php?id=1343#c4547 ---------------------------------------------------------------------- Oh, sorry. Ignore my last request. I see it is at the bottom of the output already. ---------------------------------------------------------------------- (0004548) hbrown (reporter) - 2009-08-26 18:25 http://bugs.bacula.org/view.php?id=1343#c4548 ---------------------------------------------------------------------- I think it was in the traceback I included, but here it is again: ---- (gdb) where http://bugs.bacula.org/view.php?id=0 0x00000038522056b6 in ?? () from /usr/lib64/libz.so.1 http://bugs.bacula.org/view.php?id=1 0x0000003852204c70 in deflate () from /usr/lib64/libz.so.1 http://bugs.bacula.org/view.php?id=2 0x0000003853ebfb9d in ?? () from /lib64/libcrypto.so.6 http://bugs.bacula.org/view.php?id=3 0x0000003853ebf822 in COMP_compress_block () from /lib64/libcrypto.so.6 http://bugs.bacula.org/view.php?id=4 0x0000003855e201ac in ssl3_do_compress () from /lib64/libssl.so.6 http://bugs.bacula.org/view.php?id=5 0x0000003855e202fe in ?? () from /lib64/libssl.so.6 http://bugs.bacula.org/view.php?id=6 0x0000003855e206f8 in ssl3_write_bytes () from /lib64/libssl.so.6 http://bugs.bacula.org/view.php?id=7 0x00002b1fe116be93 in tls_bsock_writen (bsock=0x2aaaac098a18, ptr=0x12dc6b2c "", nbytes=174) at tls.c:609 http://bugs.bacula.org/view.php?id=8 0x00002b1fe114eaa0 in write_nbytes (bsock=0x12d89950, ptr=0x12dc6b2c "", nbytes=174) at bnet.c:128 http://bugs.bacula.org/view.php?id=9 0x00002b1fe11518a6 in BSOCK::send (this=0x2aaaac098a18) at bsock.c:360 http://bugs.bacula.org/view.php?id=10 0x00002b1fe1151b69 in BSOCK::fsend (this=0x2aaaac098a18, fmt=0x2b1fe117acd0 "Jmsg Job=%s type=%d level=%lld %s") at bsock.c:415 http://bugs.bacula.org/view.php?id=11 0x00002b1fe115fef1 in dispatch_message (jcr=0x12cf3768, type=6, mtime=1251260680, msg=0x41aba870 "cbs-01-sd JobId 1492: Job wellscope-ws-c_drive.2009-08-25_21.05.00_36 marked to be canceled.\n") at message.c:788 http://bugs.bacula.org/view.php?id=12 0x00002b1fe1160858 in Jmsg (jcr=0x12cf3768, type=6, mtime=0, fmt=0x43d9d0 "Job %s marked to be canceled.\n") at message.c:1171 http://bugs.bacula.org/view.php?id=13 0x000000000041c279 in cancel_cmd (cjcr=<value optimized out>) at dircmd.c:328 http://bugs.bacula.org/view.php?id=14 0x000000000041c710 in handle_connection_request (arg=<value optimized out>) at dircmd.c:233 http://bugs.bacula.org/view.php?id=15 0x00002b1fe1172cae in workq_server (arg=<value optimized out>) at workq.c:346 http://bugs.bacula.org/view.php?id=16 0x0000003851e06367 in start_thread () from /lib64/libpthread.so.0 http://bugs.bacula.org/view.php?id=17 0x00000038512d309d in clone () from /lib64/libc.so.6 ---- Let me know if you need anything else. Issue History Date Modified Username Field Change ====================================================================== 2009-08-05 20:02 hbrown New Issue 2009-08-05 20:02 hbrown File Added: sd-log-2-shortened.txt 2009-08-05 20:03 hbrown Note Added: 0004496 2009-08-05 21:20 kern Note Added: 0004497 2009-08-05 21:20 kern Status new => feedback 2009-08-05 22:52 hbrown Note Added: 0004498 2009-08-06 07:18 kern File Added: productionconfig 2009-08-06 07:22 kern Note Added: 0004499 2009-08-06 07:23 kern Priority normal => urgent 2009-08-06 17:54 hbrown Note Added: 0004500 2009-08-06 17:55 hbrown File Added: bacula-dir.conf 2009-08-06 17:56 hbrown File Added: bacula-sd.conf 2009-08-14 17:19 kern Note Added: 0004505 2009-08-14 17:19 kern Status feedback => closed 2009-08-14 17:19 kern Resolution open => unable to duplicate 2009-08-17 17:15 hbrown Note Added: 0004516 2009-08-17 17:15 hbrown Status closed => feedback 2009-08-17 17:15 hbrown Resolution unable to duplicate => reopened 2009-08-17 17:22 hbrown File Added: sd-log.3.shortened 2009-08-17 17:23 hbrown File Added: bacula.4887.traceback 2009-08-17 17:33 hbrown Note Added: 0004517 2009-08-25 18:07 hbrown Note Added: 0004534 2009-08-25 18:07 hbrown File Added: sd.traceback 2009-08-25 21:53 kern Note Added: 0004535 2009-08-25 22:19 hbrown Note Added: 0004536 2009-08-26 10:37 kern Note Added: 0004537 2009-08-26 17:07 hbrown Note Added: 0004543 2009-08-26 17:07 hbrown Issue Monitored: hbrown 2009-08-26 17:08 hbrown File Added: sd.backtrace.aug_26_2009 2009-08-26 18:21 kern Note Added: 0004546 2009-08-26 18:24 kern Note Added: 0004547 2009-08-26 18:25 hbrown Note Added: 0004548 ====================================================================== |