From: <bac...@li...> - 2006-07-31 13:58:17
|
A BUGNOTE has been added to this bug. ====================================================================== http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000645 ====================================================================== Reported By: quelrod Assigned To: ====================================================================== Project: bacula Bug ID: 645 Category: Director Reproducibility: random Severity: crash Priority: normal Status: feedback ====================================================================== Date Submitted: 07-19-2006 07:40 PDT Last Modified: 07-31-2006 06:58 PDT ====================================================================== Summary: bacula crashes some nights when backups kick off Description: Some nights backups are fine, other nights they aren't. I ran bacula-dir in gdb and waited for a crash. The first night it was running in gdb it didn't crash. However, last night it did. bconsole is hung and it is sort of still running but it can't do anything and you can't connect to it. ====================================================================== ---------------------------------------------------------------------- quelrod - 07-19-2006 07:43 PDT ---------------------------------------------------------------------- I forgot to mention some other strange behavior related to this. After it crashes doing a status dir in bconsole doesn't show the last successful backups. Before the crash they showed up in the status dir output. Also, I failed to mention the log shows absolutely nothing. ---------------------------------------------------------------------- quelrod - 07-21-2006 07:09 PDT ---------------------------------------------------------------------- I verified the crash is still present in 1.38.11. bacula-dir ran in the debugger fine the first night, but the second night crashed. Latest gdb bt below: Program received signal SIGPIPE, Broken pipe. [Switching to Thread 1077783456 (LWP 28820)] 0xffffe410 in __kernel_vsyscall () (gdb) Thread 4 (Thread 1084386224 (LWP 28840)): http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000000 0xffffe410 in __kernel_vsyscall () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000001 0x4015c3fb in __read_nocancel () from /lib/tls/libpthread.so.0 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000002 0x080963d3 in read_nbytes () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000003 0x080977cd in bnet_recv () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000004 0x0807f846 in handle_UA_client_request () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000005 0x080b4b79 in workq_server () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000006 0x40157297 in start_thread () from /lib/tls/libpthread.so.0 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000007 0x4033037e in clone () from /lib/tls/libc.so.6 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000008 0x40a26bb0 in ?? () Thread 3 (Thread 1082284976 (LWP 28826)): http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000000 0xffffe410 in __kernel_vsyscall () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000001 0x40159cec in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000002 0x080b3e3a in watchdog_thread () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000003 0x40157297 in start_thread () from /lib/tls/libpthread.so.0 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000004 0x4033037e in clone () from /lib/tls/libc.so.6 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000005 0x40825bb0 in ?? () Thread 2 (Thread 1080183728 (LWP 28825)): http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000000 0xffffe410 in __kernel_vsyscall () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000001 0x40328f61 in ___newselect_nocancel () from /lib/tls/libc.so.6 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000002 0x08099694 in bnet_thread_server () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000003 0x0807f556 in connect_thread () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000004 0x40157297 in start_thread () from /lib/tls/libpthread.so.0 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000005 0x4033037e in clone () from /lib/tls/libc.so.6 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000006 0x40624bb0 in ?? () Thread 1 (Thread 1077783456 (LWP 28820)): http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000000 0xffffe410 in __kernel_vsyscall () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000001 0x4015c37b in __write_nocancel () from /lib/tls/libpthread.so.0 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000002 0x4005d96d in vio_write () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000003 0x4005e0a8 in net_real_write () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000004 0x4005e2f8 in net_flush () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000005 0x4005e855 in net_write_command () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000006 0x4005c16b in cli_advanced_command () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000007 0x400593bd in mysql_send_query () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000008 0x40059460 in mysql_real_query () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000009 0x40030cf5 in mysql_query () from /usr/lib/libmysqlclient_r.so.14 http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000010 0x08088958 in InsertDB () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000011 0x08089da7 in db_create_job_record () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000012 0x0805d506 in run_job () http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000013 0x0804d3c2 in main () Here is the stranger part and again I have disappearing backups in status dir. Before the crash I had backups from 19-Jul and 20-Jul that ran. After the crash all of those backups fail to appear in status dir. ---------------------------------------------------------------------- kern - 07-23-2006 01:55 PDT ---------------------------------------------------------------------- I don't believe that this is a bug. You have not shown any job report output, and you have shown no evidence that Bacula has "crashed". What appears to have happened is a SIGPIPE, which a communications error that is out of Bacula's control, and does not cause Bacula to crash. I suggest you submit this as a support request. See the support page of the web site: www.bacula.org Re-open this bug only after getting support and only if you have solid evidence of a Bacula crash. ---------------------------------------------------------------------- quelrod - 07-27-2006 09:39 PDT ---------------------------------------------------------------------- I have tried other support channels: irc, mailing lists, etc. In general no one seems interested in the problem. In any case shouldn't bacula handle the SIGPIPE instead of having the console hang? After the problem: treebeard:/usr/bacula/bin # ./bconsole Connecting to Director treebeard:9101 and it sits there. I cannot ctrl-c the thing and have to kill the pid. Of course if it couldn't connect it would immediately indicate that. Also, here is part of my logfile: 26-Jul 08:22 treebeard-dir: Start Backup JobId 249, Job=NS2_Backup.2006-07-25_20 .00.08 26-Jul 08:22 treebeard-dir: Bacula 1.38.11 (28Jun06): 26-Jul-2006 08:22:30 JobId: 249 Job: NS2_Backup.2006-07-25_20.00.08 Backup Level: Incremental, since=2006-07-24 10:02:04 Client: "ns2-fd" i386-pc-linux-gnu,debian,3.1 FileSet: "NS2" 2006-07-12 10:21:43 Pool: "Inc-Pool" Storage: "Dell132t" Scheduled time: 25-Jul-2006 20:00:07 Start time: 26-Jul-2006 08:22:12 End time: 26-Jul-2006 08:22:30 Elapsed time: 18 secs Priority: 16 FD Files Written: 79 SD Files Written: 79 FD Bytes Written: 20,488,692 (20.48 MB) SD Bytes Written: 20,496,742 (20.49 MB) Rate: 1138.3 KB/s Software Compression: None Volume name(s): 000016 Volume Session Id: 28 Volume Session Time: 1153320579 Last Volume Bytes: 86,255,010,163 (86.25 GB) Non-fatal FD errors: 0 SD Errors: 0 FD termination status: OK SD termination status: OK Termination: Backup OK Notice the date Jul 26? status dir shows the last backup as: 199 Full 30,492 327,135,920 OK 12-Jul-06 10:30 NS2_Backup Whether the SIGPIPE can be fully handled the issue of jobs running but then not showing up in all status output is certainly problematic. Here is what else is curious the system was running happily for a good month without this problem. I wonder if some underlying library is causing problems now? All backups are done locally so communications errors shouldn't be network driven but would be some sort of software problem. Tell me what to do, try, information to provide, etc. as I want to get this bug solved and get you everything you need to aide in that. Also, the problem only seems to mainfest during nightly runs and I haven't found a way to reproduce it. Also, sometimes it happens every night, other times every other night. It's really tempting to think the systems are cursed given how erratic the behavior is :P ---------------------------------------------------------------------- quelrod - 07-31-2006 06:41 PDT ---------------------------------------------------------------------- Same problem but slightly different behavior. *status dir treebeard-dir Version: 1.38.11 (28 June 2006) i686-pc-linux-gnu suse 10.0 Daemon started 27-Jul-06 11:25, 19 Jobs run since started. Scheduled Jobs: Level Type Pri Scheduled Name Volume =================================================================================== At this point I get a Program received signal SIGPIPE, Broken pipe. [Switching to Thread 1086487472 (LWP 27358)] 0xffffe410 in __kernel_vsyscall () I haven't ever had a command such as status dir directly trigger this before. Interestingly enough I can also cont in gdb and the status dir output finishes and it is running again. Does this indicate a SIGPIPE that bacula should handle by ignoring it instead of hanging? ---------------------------------------------------------------------- kern - 07-31-2006 06:58 PDT ---------------------------------------------------------------------- I don't think you are following the instructions in the Kaboom chapter to the letter when you are running Bacula under gdb. You must execute it with: run -s -f -c .../bacula-dir.conf In that case, neither Bacula nor gdb will ever get a SIGPIPE. SIGPIPE are normal signals that Bacula intercepts and handles quite nicely. They happen all the time when one end suddenly disconnects. Bacula never crashes because of a SIGPIPE, but it may fail the running program depending on the circumstances. Bug History Date Modified Username Field Change ====================================================================== 07-19-06 07:40 quelrod New Bug 07-19-06 07:40 quelrod Bug Monitored: quelrod 07-19-06 07:43 quelrod Bugnote Added: 0001791 07-21-06 07:09 quelrod Bugnote Added: 0001792 07-23-06 01:55 kern Bugnote Added: 0001793 07-23-06 01:55 kern Resolution open => not a bug 07-23-06 01:55 kern Status new => closed 07-27-06 09:39 quelrod Bugnote Added: 0001804 07-27-06 09:39 quelrod Resolution not a bug => reopened 07-27-06 09:39 quelrod Status closed => feedback 07-31-06 06:41 quelrod Bugnote Added: 0001807 07-31-06 06:58 kern Bugnote Added: 0001808 ====================================================================== |