From: Mantis B. T. <no...@bu...> - 2013-01-29 19:52:20
|
The following issue has been CLOSED ====================================================================== http://bugs.bacula.org/view.php?id=1885 ====================================================================== Reported By: sistemas Assigned To: kern ====================================================================== Project: bacula Issue ID: 1885 Category: Storage Daemon Reproducibility: sometimes Severity: major Priority: normal Status: closed Resolution: fixed Fixed in Version: ====================================================================== Date Submitted: 2012-06-07 09:38 UTC Last Modified: 2013-01-29 19:52 UTC ====================================================================== Summary: Some COPY jobs finish with error (Error: block.c:1001 Read error on fd=-1 at file:blk 0:0 on device...) Description: We make dozens COPY jobs every day. Most of them finish correctly but some of them (apparently randomly) finish with error. The error message that appears in the log is: Error: block.c:1001 Read error on fd=-1 at file:blk 0:0 on device XXXXXXXX. ERR=Invalid File Descriptor * The phrase "Invalid File Descriptor" perhaps is not correct cause I have translated it from spanish. Later, when Bacula retries again the copy job, it finish OK. Additional Information: This error has appeared only in Bacula 5.2.6 (we have been using bacula for years). I've attach a complete erroneous copy job log. ====================================================================== ---------------------------------------------------------------------- (0006365) sistemas (reporter) - 2012-06-07 09:39 http://bugs.bacula.org/view.php?id=1885#c6365 ---------------------------------------------------------------------- One more thing: we are using disk volumes, NOT tapes. The filesystem that stores the disk volumes is NOT FULL, so is not a space problem. ---------------------------------------------------------------------- (0006366) sistemas (reporter) - 2012-06-07 10:23 http://bugs.bacula.org/view.php?id=1885#c6366 ---------------------------------------------------------------------- One more thing (II): when copy job fails, bacula has written some data in the copy volumes. * Even in some multi-volume jobs copy job fails after write several volumes ---------------------------------------------------------------------- (0006372) kern (administrator) - 2012-06-09 09:22 http://bugs.bacula.org/view.php?id=1885#c6372 ---------------------------------------------------------------------- At a first look, it would seem that you have some volume corruption. Is there always some error or warning prior to the file descriptor being -1 such as the one on your log which says: Warning: Got MD5 digest but not same File as attributes This particular warning should never happen and could indicate volume data corruption. What kind of disk storage are you using (local, NFS, NAS, ...?). You might check your kernel logs to see if any disk errors are being reported. Unless you can find a way to reproduce it or generate some debug trace of what happened before the read error, it will be difficult to track it down if it is a Bacula bug rather than a hardware error. I'll add some additional debug code in the next version. ---------------------------------------------------------------------- (0006384) sistemas (reporter) - 2012-06-12 06:41 http://bugs.bacula.org/view.php?id=1885#c6384 ---------------------------------------------------------------------- The "Got MD5 digest..." message appears only in some cases: I've selected 7 copy jobs failed with the indicated error and only appears on 2 jobs... Anyway: there is some method to check a volume in order to detect a possible corruption? More: there is some method to verify a finished job in order to detect a possible corruption? We are using 2 separate storages via ISCSI: one only for "normal" jobs and the other only for copy jobs. Today I've seen that the Bacula version has been upgrade to 2.5.8 and that you have included the additional debug code you have told about. One question: How can I activate it (modifying config files / adding startup option) ? * I suppose the debug mode will overload bacula/system so...I wan to restrict it as much as I can About find a way to reproduce the error...I think is not possible due to its random nature: by chance some minutes ago has appeared another erroneus copy job :-( But I can give you some additional data about this new failed job, hoping this can help you: 1) this time the "Got MD5 digest..." message has NOT appeared 2) the "source job" used ONLY the volume affected by the "Read error on fd=-1 at file:blk 0:0" message, not other volumes 3) I've re-runed the failed copy job (so: same source job/volume) ant this time copy job has been completed succesfully 4) the affected volume contains several jobs and ALL of them has been copied succesfully ---------------------------------------------------------------------- (0006386) sistemas (reporter) - 2012-06-11 10:16 http://bugs.bacula.org/view.php?id=1885#c6386 ---------------------------------------------------------------------- OK I've downloaded/compiled/installed the new bacula version (obviously 5.2.8, not 2.5.8 as I've noted in my previous note), so now bacula is up to date. About debug mode: after some reading if I need to activate it... - how: -d option in the launcher? - what level? - I suppose only in the Storage Daemon, right? ---------------------------------------------------------------------- (0006392) sistemas (reporter) - 2012-06-12 07:21 http://bugs.bacula.org/view.php?id=1885#c6392 ---------------------------------------------------------------------- More data...I hope this time more useful for you: Yesterday I've activated debug mode in SD with level 100 and "fortunately" after that, they appeared FOUR new erroneous copy jobs. * I've uploaded the SD DEBUG file (bacula-sd.gz) and the updated NORMAL LOG file (bacula.log.gz) that includes data about recent jobs. Additional info: - Original jobs being "copied": 153949 153920 153783 154012 - Copy jobs failed: 154205/154206 154133/154134 154161/154162 154042/154043 * for every job I note 2 jobids: "current JobId"/"New Backup JobId" - In one of them (154133/154134) it appears a message about MD5 issues: "Warning: catreq.c:545 MD5 digest not same File=150 as attributes=149" * this text is slightly different from previous MD5 messages...probably due to changes done by you in the 5.2.8 version * MORE IMPORTANT: this morning I've re-runed copy job for the source job (153920) and now the copy job finish CORRECTLY without any MD5 issues...so apparently source job/volume are not corrupted (as you pointed yesterday as one option), at least in the source job/volume * the new copy job for 153920 jobid are 154269/154270 ---------------------------------------------------------------------- (0006399) kern (administrator) - 2012-06-17 17:19 http://bugs.bacula.org/view.php?id=1885#c6399 ---------------------------------------------------------------------- Would you upload your bacula-dir.conf Job, Pool, Storage, and FileSet definitions for one or two of the Copy jobs that fail. I want to see if I can reproduce it here. ---------------------------------------------------------------------- (0006406) sistemas (reporter) - 2012-06-18 13:02 http://bugs.bacula.org/view.php?id=1885#c6406 ---------------------------------------------------------------------- OK Our bacula configuration is distributed over several independent files (for storages, clients, jobs, etc). I've joined all data for one of the failed job/copy_job and I've attached to the Issue (with self explanatory filenames): - bacula-dir - bacula-sd * If you want/need our complete bacula configuration, I can give it to you but in a more "private" way (i.e. email). ---------------------------------------------------------------------- (0006604) kern (administrator) - 2013-01-29 19:52 http://bugs.bacula.org/view.php?id=1885#c6604 ---------------------------------------------------------------------- This turned out to be a race condition that was not at all easy to find. Only once I had a nice script that simulated this submitted by Arno, was I with great difficulty able to find and fix it. This *should* be fixed in version 5.2.13 Issue History Date Modified Username Field Change ====================================================================== 2012-06-07 09:38 sistemas New Issue 2012-06-07 09:38 sistemas File Added: copy_job.err 2012-06-07 09:39 sistemas Note Added: 0006365 2012-06-07 10:23 sistemas Note Added: 0006366 2012-06-09 09:22 kern Note Added: 0006372 2012-06-09 09:22 kern Assigned To => kern 2012-06-09 09:22 kern Status new => feedback 2012-06-11 08:20 sistemas Note Added: 0006384 2012-06-11 08:20 sistemas Status feedback => assigned 2012-06-11 10:16 sistemas Note Added: 0006386 2012-06-12 06:41 sistemas Note Edited: 0006384 2012-06-12 07:12 sistemas Note Added: 0006392 2012-06-12 07:13 sistemas Note Edited: 0006392 2012-06-12 07:13 sistemas File Added: bacula-sd.gz 2012-06-12 07:13 sistemas File Added: bacula.log.gz 2012-06-12 07:14 sistemas Note Edited: 0006392 2012-06-12 07:21 sistemas Note Edited: 0006392 2012-06-17 17:19 kern Note Added: 0006399 2012-06-17 17:19 kern Status assigned => feedback 2012-06-18 12:59 sistemas Note Added: 0006406 2012-06-18 12:59 sistemas Status feedback => assigned 2012-06-18 13:00 sistemas File Added: bacula-dir 2012-06-18 13:00 sistemas File Added: bacula-sd 2012-06-18 13:02 sistemas Note Edited: 0006406 2013-01-29 19:52 kern Note Added: 0006604 2013-01-29 19:52 kern Status assigned => closed 2013-01-29 19:52 kern Resolution open => fixed ====================================================================== |