From: Mantis B. T. <no...@bu...> - 2009-05-27 22:50:07
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=1298 ====================================================================== Reported By: slords Assigned To: ====================================================================== Project: bacula Issue ID: 1298 Category: Storage Daemon Reproducibility: random Severity: minor Priority: normal Status: feedback ====================================================================== Date Submitted: 2009-05-27 19:26 BST Last Modified: 2009-05-27 23:50 BST ====================================================================== Summary: Storage daemon crash during copy job Description: While trying to copy jobs to tape I had the storage daemon crash. I restarted the SD under gdb and tried to do the copy job again. It crashed again and I was able to capture a backtrace. ====================================================================== ---------------------------------------------------------------------- (0004252) slords (reporter) - 2009-05-27 20:04 http://bugs.bacula.org/view.php?id=1298#c4252 ---------------------------------------------------------------------- Not sure if this helps at all but I've noticed that most of the times just before it crashes the status storage shows something like this first (notice the "" Volume): Running Jobs: Reading: Full Copy job archive JobId=273 Volume="Monthly-0002" pool="Monthly" device="VTL-Drive0" (/var/spool/bacula/drive0) Writing: Full Copy job archive JobId=273 Volume="WA3004" pool="Archive" device="Bullfrog-Drive0" (/dev/nst0) spooling=1 despooling=0 despool_wait=0 Files=104,423 Bytes=3,747,038,174 Bytes/sec=4,158,754 FDSocket closed Writing: Full Copy job archive JobId=283 Volume="" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Reading: Full Copy job archive JobId=285 Volume="" pool="Monthly" device="VTL-Drive0" (/var/spool/bacula/drive0) Writing: Full Copy job archive JobId=285 Volume="WA3004" pool="Archive" device="Bullfrog-Drive0" (/dev/nst0) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Writing: Full Copy job archive JobId=295 Volume="" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Reading: Full Copy job archive JobId=299 Volume="Monthly-0004" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) Writing: Full Copy job archive JobId=299 Volume="WA3004" pool="Archive" device="Bullfrog-Drive0" (/dev/nst0) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Writing: Full Copy job archive JobId=305 Volume="" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Reading: Full Copy job archive JobId=307 Volume="Monthly-0003" pool="Monthly" device="VTL-Drive1" (/var/spool/bacula/drive1) Writing: Full Copy job archive JobId=307 Volume="WA3005" pool="Archive" device="Bullfrog-Drive1" (/dev/nst1) spooling=0 despooling=1 despool_wait=0 Files=31,914 Bytes=342,206,430 Bytes/sec=9,777,326 FDSocket closed Reading: Full Copy job archive JobId=309 Volume="Monthly-0003" pool="Monthly" device="VTL-Drive2" (/var/spool/bacula/drive2) Writing: Full Copy job archive JobId=309 Volume="WA3005" pool="Archive" device="Bullfrog-Drive1" (/dev/nst1) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed ==== This usually leads to a drive blocking like this: Device status: Autochanger "VTL" with devices: "VTL-Drive0" (/var/spool/bacula/drive0) "VTL-Drive1" (/var/spool/bacula/drive1) "VTL-Drive2" (/var/spool/bacula/drive2) "VTL-Drive3" (/var/spool/bacula/drive3) Autochanger "Bullfrog" with devices: "Bullfrog-Drive0" (/dev/nst0) "Bullfrog-Drive1" (/dev/nst1) Device "VTL-Drive0" (/var/spool/bacula/drive0) is not open. Device is BLOCKED waiting for mount of volume "Monthly-0002", Pool: Monthly Media type: File Drive 0 status unknown. Device "VTL-Drive1" (/var/spool/bacula/drive1) is mounted with: Volume: Monthly-0003 Pool: *unknown* Media type: File Slot 278 is loaded in drive 1. Total Bytes Read=343,526,400 Blocks Read=5,325 Bytes/block=64,512 Positioned at File=0 Block=1,574,218,947 Device "VTL-Drive2" (/var/spool/bacula/drive2) is not open. Device is being initialized. Drive 2 is not loaded. Device "VTL-Drive3" (/var/spool/bacula/drive3) is not open. Device is BLOCKED waiting for mount of volume "Monthly-0004", Pool: Monthly Media type: File Drive 3 status unknown. Device "Bullfrog-Drive0" (/dev/nst0) is mounted with: Volume: WA3004 Pool: Archive Media type: AIT-3 Slot 4 is loaded in drive 0. Total Bytes=793,368,576 Blocks=12,297 Bytes/block=64,517 Positioned at File=0 Block=12,298 Device "Bullfrog-Drive1" (/dev/nst1) is mounted with: Volume: WA3005 Pool: Archive Media type: AIT-3 Slot 5 is loaded in drive 1. Total Bytes=4,721,181,696 Blocks=73,182 Bytes/block=64,512 Positioned at File=8 Block=2,033 ==== Usually after 3-5 minutes (time it takes to finish a copy job) of mounting the correct volume in the right drive the SD will crash. ---------------------------------------------------------------------- (0004253) slords (reporter) - 2009-05-27 20:34 http://bugs.bacula.org/view.php?id=1298#c4253 ---------------------------------------------------------------------- Here is what it looks like when I mount the requested volumes above. Notice the two jobs (273, 285) both reading from the same drive/volume: Running Jobs: Reading: Full Copy job archive JobId=273 Volume="Monthly-0002" pool="Monthly" device="VTL-Drive0" (/var/spool/bacula/drive0) Writing: Full Copy job archive JobId=273 Volume="WA3004" pool="Archive" device="Bullfrog-Drive0" (/dev/nst0) spooling=1 despooling=0 despool_wait=0 Files=104,423 Bytes=3,747,101,035 Bytes/sec=1,281,060 FDSocket closed Writing: Full Copy job archive JobId=283 Volume="" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Reading: Full Copy job archive JobId=285 Volume="Monthly-0002" pool="Monthly" device="VTL-Drive0" (/var/spool/bacula/drive0) Writing: Full Copy job archive JobId=285 Volume="WA3004" pool="Archive" device="Bullfrog-Drive0" (/dev/nst0) spooling=1 despooling=0 despool_wait=0 Files=4,347 Bytes=48,181,188 Bytes/sec=8,030,198 FDSocket closed Writing: Full Copy job archive JobId=295 Volume="" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Reading: Full Copy job archive JobId=299 Volume="Monthly-0004" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) Writing: Full Copy job archive JobId=299 Volume="WA3004" pool="Archive" device="Bullfrog-Drive0" (/dev/nst0) spooling=1 despooling=0 despool_wait=0 Files=5,823 Bytes=259,644,285 Bytes/sec=19,972,637 FDSocket closed Writing: Full Copy job archive JobId=305 Volume="" pool="Monthly" device="VTL-Drive3" (/var/spool/bacula/drive3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed Reading: Full Copy job archive JobId=315 Volume="Monthly-0017" pool="Monthly" device="VTL-Drive1" (/var/spool/bacula/drive1) Writing: Full Copy job archive JobId=315 Volume="WA3005" pool="Archive" device="Bullfrog-Drive1" (/dev/nst1) spooling=0 despooling=1 despool_wait=0 Files=73 Bytes=26,818,699,173 Bytes/sec=14,808,779 FDSocket closed Reading: Full Copy job archive JobId=317 Volume="Monthly-0034" pool="Monthly" device="VTL-Drive2" (/var/spool/bacula/drive2) Writing: Full Copy job archive JobId=317 Volume="WA3005" pool="Archive" device="Bullfrog-Drive1" (/dev/nst1) spooling=0 despooling=0 despool_wait=1 Files=54,055 Bytes=26,816,843,743 Bytes/sec=14,973,112 FDSocket closed ==== However lower down on that same status it shows only 1 reader for the used volume status: Used Volume status: Monthly-0002 on device "VTL-Drive0" (/var/spool/bacula/drive0) Reader=1 writers=0 devres=0 volinuse=1 Monthly-0004 on device "VTL-Drive3" (/var/spool/bacula/drive3) Reader=1 writers=0 devres=0 volinuse=1 Monthly-0017 on device "VTL-Drive1" (/var/spool/bacula/drive1) Reader=1 writers=0 devres=0 volinuse=1 Monthly-0034 on device "VTL-Drive2" (/var/spool/bacula/drive2) Reader=1 writers=0 devres=0 volinuse=1 WA3004 on device "Bullfrog-Drive0" (/dev/nst0) Reader=0 writers=3 devres=0 volinuse=1 WA3005 on device "Bullfrog-Drive1" (/dev/nst1) Reader=0 writers=2 devres=0 volinuse=1 ---------------------------------------------------------------------- (0004254) slords (reporter) - 2009-05-27 20:37 http://bugs.bacula.org/view.php?id=1298#c4254 ---------------------------------------------------------------------- sd-backtrace2 is what was taken after the crash from the notes above ---------------------------------------------------------------------- (0004257) kern (administrator) - 2009-05-27 22:20 http://bugs.bacula.org/view.php?id=1298#c4257 ---------------------------------------------------------------------- Yes the tracebacks show that the SD is "blowing itself up" with a failed ASSERT. This is because it has detected an inconsistent condition. I know why it is doing that, but what is difficult to know is exactly how it got into that state. It appears that it might be a race condition where two different Jobs are trying to do the same thing and it is not properly protected by a semaphore. Do you have a log of the Job output up to that point? I need to know something from a higher level (i.e. what jobs started; when; and what drives and volumes they were using). Would you create a debug listing with debug level=200 in the SD? It might help, but it looks like there is a lot of things happening at the same time, so it may not be so easy to decipher ... The Volume="" probably just means that the device has been closed or perhaps not fully opened yet. ---------------------------------------------------------------------- (0004258) slords (reporter) - 2009-05-27 23:07 http://bugs.bacula.org/view.php?id=1298#c4258 ---------------------------------------------------------------------- I've uploaded the bactrace and log files. The debug=200 is to large to attach so when I get home I'll set it up and provide a url to download it. ---------------------------------------------------------------------- (0004259) slords (reporter) - 2009-05-27 23:50 http://bugs.bacula.org/view.php?id=1298#c4259 ---------------------------------------------------------------------- The debug file can be downloaded from: http://lordsfam.net/server.lordsfam.net.trace.gz Issue History Date Modified Username Field Change ====================================================================== 2009-05-27 19:26 slords New Issue 2009-05-27 19:26 slords File Added: sd-backtrace.txt 2009-05-27 20:04 slords Note Added: 0004252 2009-05-27 20:34 slords Note Added: 0004253 2009-05-27 20:36 slords File Added: sd-backtrace2.txt 2009-05-27 20:37 slords Note Added: 0004254 2009-05-27 22:20 kern Note Added: 0004257 2009-05-27 22:20 kern Status new => feedback 2009-05-27 23:02 slords File Added: server.lordsfam.net.5868.bactrace 2009-05-27 23:07 slords File Added: log.gz 2009-05-27 23:07 slords Note Added: 0004258 2009-05-27 23:50 slords Note Added: 0004259 ====================================================================== |