From: Mantis B. T. <no...@bu...> - 2009-05-27 21:08:20
|
The following issue has been CLOSED ====================================================================== http://bugs.bacula.org/view.php?id=1287 ====================================================================== Reported By: bobhetzel Assigned To: ebollengier ====================================================================== Project: bacula Issue ID: 1287 Category: Storage Daemon Reproducibility: have not tried Severity: block Priority: normal Status: closed Resolution: fixed Fixed in Version: ====================================================================== Date Submitted: 2009-05-12 18:27 BST Last Modified: 2009-05-27 22:07 BST ====================================================================== Summary: sd hangs Description: bacula-sd crashed only a few minutes after starting backups. This is not 100% repeatable but has happened once with bacula 3.0.0 and now once with 3.0.1 which has only been installed for a week. I'll restart the sd with debugging turned on but perhaps these traceback files will be of some help in the mean time. ====================================================================== ---------------------------------------------------------------------- (0004192) bobhetzel (reporter) - 2009-05-12 19:15 http://bugs.bacula.org/view.php?id=1287#c4192 ---------------------------------------------------------------------- I also forgot to mention that doing a 'status storage' produces the following... *status storage Automatically selected Storage: Dell-PV136T Connecting to Storage daemon Dell-PV136T at gyrus:9103 Failed to connect to Storage daemon Dell-PV136T. ==== You have messages. * 12-May 13:10 gyrus-dir JobId 0: Fatal error: authenticate.c:120 Director unable to authenticate with Storage daemon at "gyrus:9103". Possible causes: Passwords or names not the same or Maximum Concurrent Jobs exceeded on the SD or SD networking messed up (restart daemon). Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000 for help. ---------------------------------------------------------------------- (0004195) ebollengier (administrator) - 2009-05-13 09:08 http://bugs.bacula.org/view.php?id=1287#c4195 ---------------------------------------------------------------------- Can you attach your SD and DIR configuration files (without password) ? (mostly storage and device sections) Can you also compile bacula with the --enable-lockmgr ? Any lock problems will stop the daemon and generate traces on working dir (we need them *traceback, *backtrace) Thanks ---------------------------------------------------------------------- (0004198) bobhetzel (reporter) - 2009-05-13 15:15 http://bugs.bacula.org/view.php?id=1287#c4198 ---------------------------------------------------------------------- I've uploaded those two files, recompiled with the --enable-lockmgr config option and am now running that version so I'll report back when I next see the problem. One question though, should I also turn on debugging output or is that more automatic with that option? ---------------------------------------------------------------------- (0004199) ebollengier (administrator) - 2009-05-13 15:22 http://bugs.bacula.org/view.php?id=1287#c4199 ---------------------------------------------------------------------- It could be useful to turn debug at level 50 on the storage. setdebug trace=1 storage=xxx level=50 It should not overload your system. ---------------------------------------------------------------------- (0004200) ebollengier (administrator) - 2009-05-13 15:33 http://bugs.bacula.org/view.php?id=1287#c4200 ---------------------------------------------------------------------- It seems that you are still using the Prefer Mounted Volumes = no, and as in http://bugs.bacula.org/view.php?id=1213, we repeat that: Please be aware that this option is not recommended as it creates a lot of volume movement and can lead to these kinds of deadlock situations. You can accomplish the same thing (spreading jobs among several volumes) by careful use of pools. New deadlock manager and traces can permit to track this problem, however, please note that you are using a *dangerous* option for production. ---------------------------------------------------------------------- (0004201) ebollengier (administrator) - 2009-05-13 16:01 http://bugs.bacula.org/view.php?id=1287#c4201 ---------------------------------------------------------------------- If you are in a blocking situation, note that you can use "kill -SEGV $(pidof bacula-sd)" to generate all traces. ---------------------------------------------------------------------- (0004222) kern (administrator) - 2009-05-19 12:40 http://bugs.bacula.org/view.php?id=1287#c4222 ---------------------------------------------------------------------- Problem identified: This is a deadlock that involves four different threads and five different locks. Each thread hold a lock that another thread wants. No solution yet. ---------------------------------------------------------------------- (0004230) kern (administrator) - 2009-05-21 16:52 http://bugs.bacula.org/view.php?id=1287#c4230 ---------------------------------------------------------------------- Please try applying the patch attached to this bug report. I believe that it will fix your bug. The instructions for applying it are at the top of the bug report. ---------------------------------------------------------------------- (0004233) kern (administrator) - 2009-05-25 08:59 http://bugs.bacula.org/view.php?id=1287#c4233 ---------------------------------------------------------------------- I'm closing the bug because I believe that it is fixed with the attached patch. If it is not, please re-open the bug and provide another nice traceback like your previous one, but with the patch applied. ---------------------------------------------------------------------- (0004249) bobhetzel (reporter) - 2009-05-27 18:23 http://bugs.bacula.org/view.php?id=1287#c4249 ---------------------------------------------------------------------- I just noticed that backups appear to have stalled again on me. This doesn't happen every day but happened once before since the patch. Here's the console log of what I just noted. *status storage Automatically selected Storage: Dell-PV136T Connecting to Storage daemon Dell-PV136T at gyrus:9103 gyrus-sd Version: 3.0.1 (30 April 2009) x86_64-unknown-linux-gnu redhat Daemon started 27-May-09 10:45, 20 Jobs run since started. Heap: heap=2,408,448 smbytes=2,209,052 max_bytes=2,340,126 bufs=505 max_bufs=525 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 Running Jobs: Writing: Incremental Backup job bjf7 JobId=82100 Volume="LTO313L2" pool="Default" device="IBMLTO2-1" (/dev/nst0) spooling=1 despooling=0 despool_wait=0 Files=49 Bytes=21,441,060 Bytes/sec=264,704 FDReadSeqNo=715 in_msg=585 out_msg=5 fd=11 Writing: Incremental Backup job cat41-d610 JobId=82105 Volume="LTO313L2" pool="Default" device="IBMLTO2-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=1 Files=65 Bytes=1,148,468,626 Bytes/sec=517,561 FDReadSeqNo=18,083 in_msg=17894 out_msg=9 fd=26 Writing: Incremental Backup job crh6-gx280 JobId=82112 Volume="LTO313L2" pool="Default" device="IBMLTO2-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=0 Files=136 Bytes=2,388,250,864 Bytes/sec=6,187,178 FDReadSeqNo=37,560 in_msg=37180 out_msg=9 fd=22 Writing: Incremental Backup job d8q96p61 JobId=82117 Volume="LTO300L2" pool="Default" device="IBMLTO2-2" (/dev/nst1) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDReadSeqNo=6 in_msg=6 out_msg=4 fd=15 Writing: Incremental Backup job ded3-gx280 JobId=82120 Volume="LTO313L2" pool="Default" device="IBMLTO2-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=1 Files=94 Bytes=32,070,390 Bytes/sec=395,930 FDReadSeqNo=1,272 in_msg=1006 out_msg=9 fd=14 Writing: Incremental Backup job dollyd JobId=82121 Volume="LTO300L2" pool="Default" device="IBMLTO2-2" (/dev/nst1) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDReadSeqNo=6 in_msg=6 out_msg=4 fd=17 Writing: Incremental Backup job drdavisoffice JobId=82123 Volume="LTO313L2" pool="Default" device="IBMLTO2-1" (/dev/nst0) spooling=1 despooling=0 despool_wait=0 Files=155 Bytes=2,030,668,315 Bytes/sec=25,069,979 FDReadSeqNo=32,264 in_msg=31828 out_msg=5 fd=16 Writing: Incremental Backup job djh5-gx270 JobId=82124 Volume="" pool="Default" device="IBMLTO2-2" (/dev/nst1) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed ==== Jobs waiting to reserve a drive: 3605 JobId=82124 wants free drive but device "IBMLTO2-1" (/dev/nst0) is busy. ==== Terminated Jobs: JobId Level Files Bytes Status Finished Name =================================================================== 82110 Incr 0 0 Cancel 27-May-09 12:38 cks-gx620 82109 Incr 0 0 Error 27-May-09 12:40 ceg3-e6500 82108 Incr 0 0 Error 27-May-09 12:45 cdm15-1tpzlc1 82111 Incr 0 0 Error 27-May-09 12:50 clifford3 82113 Incr 0 0 Error 27-May-09 12:56 cshs 82089 Incr 147 2.125 G OK 27-May-09 12:56 adahil 82116 Incr 0 0 Error 27-May-09 13:01 d820-8lc62d1 82103 Incr 172 14.84 M OK 27-May-09 13:01 bxo2-gx280 82114 Incr 0 0 Error 27-May-09 13:06 cva9-o755 82115 Incr 0 0 Error 27-May-09 13:11 cxj57-gx270 ==== Device status: Autochanger "Dell-PV136T" with devices: "IBMLTO2-1" (/dev/nst0) "IBMLTO2-2" (/dev/nst1) Device "IBMLTO2-1" (/dev/nst0) is mounted with: Volume: LTO313L2 Pool: Default Media type: LTO-2 Slot 31 is loaded in drive 0. Configured device capabilities: EOF BSR BSF FSR FSF EOM REM !RACCESS AUTOMOUNT LABEL !ANONVOLS ALWAYSOPEN Device state: OPENED TAPE LABEL !MALLOC APPEND !READ !EOT !WEOT !EOF !NEXTVOL !SHORT !MOUNTED num_writers=5 reserved=0 block=0 Device parameters: Archive name: /dev/nst0 Device name: IBMLTO2-1 File=261 block=11529 Min block=0 Max block=0 Total Bytes=250,326,107,136 Blocks=3,880,302 Bytes/block=64,512 Positioned at File=261 Block=11,529 Device "IBMLTO2-2" (/dev/nst1) is not open. Device is being initialized. Drive 1 status unknown. Configured device capabilities: EOF BSR BSF FSR FSF EOM REM !RACCESS AUTOMOUNT LABEL !ANONVOLS ALWAYSOPEN Device state: !OPENED TAPE !LABEL !MALLOC !APPEND !READ !EOT !WEOT !EOF !NEXTVOL !SHORT !MOUNTED num_writers=0 reserved=2 block=3 Device parameters: Archive name: /dev/nst1 Device name: IBMLTO2-2 File=0 block=0 Min block=0 Max block=0 ==== Used Volume status: LTO300L2 on device "IBMLTO2-2" (/dev/nst1) Reader=0 writers=0 devres=1 volinuse=1 LTO313L2 on device "IBMLTO2-1" (/dev/nst0) Reader=0 writers=4 devres=0 volinuse=1 ==== Data spooling: 4 active jobs, 7,239,890,570 bytes; 5 total jobs, 7,239,890,570 max bytes/job. Attr spooling: 5 active jobs, 212,536 bytes; 4 total jobs, 212,536 max bytes. ==== You have messages. *mess 27-May 13:12 ded3-gx280 JobId 82120: VSS Writer (BackupComplete): "MSDEWriter", State: 0x1 (VSS_WS_STABLE) 27-May 13:12 gyrus-sd JobId 82120: Job write elapsed time = 00:01:10, Transfer rate = 458.1 K bytes/second 27-May 13:12 ded3-gx280 JobId 82120: VSS Writer (BackupComplete): "Microsoft Writer (Service State)", State: 0x1 (VSS_WS_STABLE) 27-May 13:12 gyrus-sd JobId 82120: Committing spooled data to Volume "LTO313L2". Despooling 32,103,472 bytes ... 27-May 13:12 ded3-gx280 JobId 82120: VSS Writer (BackupComplete): "Microsoft Writer (Bootable State)", State: 0x1 (VSS_WS_STABLE) 27-May 13:12 ded3-gx280 JobId 82120: VSS Writer (BackupComplete): "WMI Writer", State: 0x1 (VSS_WS_STABLE) 27-May 13:14 bjf7 JobId 82100: VSS Writer (BackupComplete): "Microsoft Writer (Service State)", State: 0x1 (VSS_WS_STABLE) 27-May 13:14 gyrus-sd JobId 82100: Job write elapsed time = 00:03:18, Transfer rate = 1.889 M bytes/second 27-May 13:14 gyrus-sd JobId 82100: Committing spooled data to Volume "LTO313L2". Despooling 374,433,173 bytes ... 27-May 13:14 bjf7 JobId 82100: VSS Writer (BackupComplete): "MSDEWriter", State: 0x1 (VSS_WS_STABLE) 27-May 13:14 bjf7 JobId 82100: VSS Writer (BackupComplete): "Microsoft Writer (Bootable State)", State: 0x1 (VSS_WS_STABLE) 27-May 13:14 bjf7 JobId 82100: VSS Writer (BackupComplete): "WMI Writer", State: 0x1 (VSS_WS_STABLE) 27-May 13:15 drdavisoffice: VSS Writer (BackupComplete): "Microsoft Writer (Service State)", State: 0x1 (VSS_WS_STABLE) 27-May 13:15 gyrus-sd JobId 82123: Job write elapsed time = 00:03:52, Transfer rate = 24.47 M bytes/second 27-May 13:15 drdavisoffice: VSS Writer (BackupComplete): "Microsoft Writer (Bootable State)", State: 0x1 (VSS_WS_STABLE) 27-May 13:15 gyrus-sd JobId 82123: Committing spooled data to Volume "LTO313L2". Despooling 5,683,817,423 bytes ... 27-May 13:15 drdavisoffice: VSS Writer (BackupComplete): "MSDEWriter", State: 0x1 (VSS_WS_STABLE) 27-May 13:15 drdavisoffice: VSS Writer (BackupComplete): "WMI Writer", State: 0x1 (VSS_WS_STABLE) 27-May 13:16 gyrus-sd JobId 82121: Fatal error: 3992 Bad autochanger "load slot 21, drive 1": ERR=Child died from signal 15: Termination. Results=Program killed by Bacula (timeout) 27-May 13:16 gyrus-sd JobId 82121: Fatal error: fd_cmds.c:170 Command error with FD, hanging up. Append data error. 27-May 13:16 dollyd JobId 82121: Fatal error: ../../filed/job.c:1817 Bad response to Append Data command. Wanted 3000 OK data , got 3903 Error append data and then there was more job log output. ---------------------------------------------------------------------- (0004250) bobhetzel (reporter) - 2009-05-27 18:44 http://bugs.bacula.org/view.php?id=1287#c4250 ---------------------------------------------------------------------- I should note that it seems to hang on the status storage command right after it writes out "Used Volume status:". However it seemed to eventually time something out and continue. Now, on looking even further into it, it seems to have loaded a tape into a drive I don't have bacula configured to use. Perhaps something went wrong with the scsi bus. I'll re-boot everything now. Go ahead and re-close this again--sorry about that. ---------------------------------------------------------------------- (0004256) kern (administrator) - 2009-05-27 22:07 http://bugs.bacula.org/view.php?id=1287#c4256 ---------------------------------------------------------------------- Yes, I don't think the new problems you had are related to the previous (hopefully fixed bug). From what I see, it looks like your autochanger stalled or at least took longer than the timeout set for it -- but that is another problem hopefully hardware and not Bacula related. Issue History Date Modified Username Field Change ====================================================================== 2009-05-12 18:27 bobhetzel New Issue 2009-05-12 18:27 bobhetzel File Added: btb-dir.txt 2009-05-12 18:27 bobhetzel File Added: btb-sd.txt 2009-05-12 19:15 bobhetzel Note Added: 0004192 2009-05-13 09:08 ebollengier Note Added: 0004195 2009-05-13 09:09 ebollengier Status new => assigned 2009-05-13 09:09 ebollengier Assigned To => ebollengier 2009-05-13 09:09 ebollengier Status assigned => feedback 2009-05-13 15:11 bobhetzel File Added: bacula-dir.conf.txt 2009-05-13 15:11 bobhetzel File Added: bacula-sd.conf 2009-05-13 15:15 bobhetzel Note Added: 0004198 2009-05-13 15:22 ebollengier Note Added: 0004199 2009-05-13 15:33 ebollengier Note Added: 0004200 2009-05-13 16:01 ebollengier Note Added: 0004201 2009-05-19 12:40 kern Note Added: 0004222 2009-05-19 12:40 kern Status feedback => confirmed 2009-05-19 12:40 kern Summary sd hangs. could be similar to bug 1213 => sd hangs 2009-05-19 12:40 kern Additional Information Updated 2009-05-21 16:51 kern File Added: 3.0.1-sd-hang.patch 2009-05-21 16:52 kern Note Added: 0004230 2009-05-21 16:52 kern Status confirmed => resolved 2009-05-25 08:59 kern Note Added: 0004233 2009-05-25 08:59 kern Resolution open => fixed 2009-05-25 09:00 kern Status resolved => closed 2009-05-27 18:23 bobhetzel Note Added: 0004249 2009-05-27 18:23 bobhetzel Status closed => feedback 2009-05-27 18:23 bobhetzel Resolution fixed => reopened 2009-05-27 18:44 bobhetzel Note Added: 0004250 2009-05-27 22:07 kern Note Added: 0004256 2009-05-27 22:07 kern Status feedback => closed 2009-05-27 22:07 kern Resolution reopened => fixed ====================================================================== |