From: Mantis B. T. <no...@bu...> - 2012-06-11 08:20:32
A NOTE has been added to this issue.
======================================================================
http://bugs.bacula.org/view.php?id=1885
======================================================================
Reported By:                sistemas
Assigned To:                kern
======================================================================
Project:                    bacula
Issue ID:                   1885
Category:                   Storage Daemon
Reproducibility:            sometimes
Severity:                   major
Priority:                   normal
Status:                     assigned
======================================================================
Date Submitted:             2012-06-07 10:38 BST
Last Modified:              2012-06-11 09:20 BST
======================================================================
Summary:                    Some COPY jobs finish with error (Error: block.c:1001 Read error on fd=-1 at file:blk 0:0 on device...)
Description:
We run dozens of COPY jobs every day. Most of them finish correctly, but some of them (apparently at random) finish with an error. The error message that appears in the log is:

Error: block.c:1001 Read error on fd=-1 at file:blk 0:0 on device XXXXXXXX. ERR=Invalid File Descriptor

* The phrase "Invalid File Descriptor" may not be exact, because I have translated it from Spanish.

Later, when Bacula retries the copy job, it finishes OK.

Additional Information:
This error has appeared only in Bacula 5.2.6 (we have been using Bacula for years). I've attached the complete log of a failed copy job.
======================================================================

----------------------------------------------------------------------
 (0006365) sistemas (reporter) - 2012-06-07 10:39
 http://bugs.bacula.org/view.php?id=1885#c6365
----------------------------------------------------------------------
One more thing: we are using disk volumes, NOT tapes. The filesystem that stores the disk volumes is NOT FULL, so it is not a space problem.

----------------------------------------------------------------------
 (0006366) sistemas (reporter) - 2012-06-07 11:23
 http://bugs.bacula.org/view.php?id=1885#c6366
----------------------------------------------------------------------
One more thing (II): when a copy job fails, Bacula has already written some data to the copy volumes.
* In some multi-volume jobs, the copy job even fails after writing several volumes.

----------------------------------------------------------------------
 (0006372) kern (administrator) - 2012-06-09 10:22
 http://bugs.bacula.org/view.php?id=1885#c6372
----------------------------------------------------------------------
At first look, it would seem that you have some volume corruption. Is there always some error or warning prior to the file descriptor being -1, such as the one in your log which says:

Warning: Got MD5 digest but not same File as attributes

This particular warning should never happen and could indicate volume data corruption.

What kind of disk storage are you using (local, NFS, NAS, ...)? You might check your kernel logs to see if any disk errors are being reported.

Unless you can find a way to reproduce it or generate some debug trace of what happened before the read error, it will be difficult to track it down if it is a Bacula bug rather than a hardware error. I'll add some additional debug code in the next version.

----------------------------------------------------------------------
 (0006384) sistemas (reporter) - 2012-06-11 09:20
 http://bugs.bacula.org/view.php?id=1885#c6384
----------------------------------------------------------------------
The "Got MD5 digest..." message appears only in some cases: I've selected 7 copy jobs that failed with the indicated error, and it appears in only 2 of them...

Anyway: is there some method to check a volume in order to detect possible corruption? Also: is there some method to verify a finished job in order to detect possible corruption?

We are using 2 separate storages via iSCSI: one only for "normal" jobs and the other only for copy jobs.

Today I've seen that the Bacula version has been upgraded to 5.2.8 and that you have included the additional debug code you mentioned. One question: how can I activate it (modifying config files / adding a startup option)?
* I suppose debug mode will put extra load on Bacula and the system, so I want to restrict it as much as I can.

As for finding a way to reproduce the error, I think it is not possible due to its random nature. Coincidentally, another failed copy job appeared a few minutes ago :-( But I can give you some additional data about this new failed job, hoping it can help you:

1) This time the "Got MD5 digest..." message has NOT appeared.
2) The "source job" used ONLY the volume affected by the "Read error on fd=-1 at file:blk 0:0" message, no other volumes.
3) I've re-run the failed copy job (so: same source job/volume) and this time the copy job completed successfully.
4) The affected volume contains several jobs and ALL of them have been copied successfully.

Issue History
Date Modified    Username       Field                    Change
======================================================================
2012-06-07 10:38 sistemas       New Issue
2012-06-07 10:38 sistemas       File Added: copy_job.err
2012-06-07 10:39 sistemas       Note Added: 0006365
2012-06-07 11:23 sistemas       Note Added: 0006366
2012-06-09 10:22 kern           Note Added: 0006372
2012-06-09 10:22 kern           Assigned To               => kern
2012-06-09 10:22 kern           Status                   new => feedback
2012-06-11 09:20 sistemas       Note Added: 0006384
2012-06-11 09:20 sistemas       Status                   feedback => assigned
======================================================================