From: Mantis B. T. <no...@bu...> - 2012-12-09 09:47:17
|
The following issue requires your FEEDBACK. ====================================================================== http://bugs.bacula.org/view.php?id=1958 ====================================================================== Reported By: waa Assigned To: ====================================================================== Project: bacula Issue ID: 1958 Category: Storage Daemon Reproducibility: random Severity: trivial Priority: urgent Status: feedback ====================================================================== Date Submitted: 2012-11-28 16:53 UTC Last Modified: 2012-12-09 09:47 UTC ====================================================================== Summary: Bacula-sd page allocation error during a job, but job terminates with status "OK" Description: Due to wordwrap I have linked a couple pastebin posts in this email rather than have them possibly be unreadable here. Also, one is over 400 lines long. Hope that is OK. I have a situation where the Bacula storage daemon is having a page allocation error during a job, but the very bad part is that the backups terminate with a status of "OK" - and the next job to try to use the volume complains about a volume/catalog size mismatch and then chooses another volume and continues on. I am very concerned that jobs may be terminating as "OK", but we may not be able to retrieve the data when/if it is actually needed. I have not tried yet, and I will shortly, but this causes me some concern. dmesg output showing bacula-sd page allocation failure: http://pastebin.com/XWsJxseG I was alerted to the problem when a job just ran and Bacula complained that the catalog size of a volume did not match the size of the volume on disk: --[snip]-- 28-Nov 02:30 backup-sd JobId 12913: Volume "6TB-FileTape_0239" previously written, moving to end of data. 28-Nov 02:30 backup-sd JobId 12913: Error: Bacula cannot write on disk Volume "6TB-FileTape_0239" because: The sizes do not match! Volume=290845229 Catalog=2023098160 28-Nov 02:30 backup-sd JobId 12913: Marking Volume "6TB-FileTape_0239" in Error in Catalog. --[snip]-- This same volume 6TB-FileTape_0239 was last written to the night before by two jobs, with no errors: http://pastebin.com/eSXfgsY4 But in Job 12898, the FD and SD bytes written differ: FD Bytes Written: 706,612,738 (706.6 MB) SD Bytes Written: 708,902,081 (708.9 MB) A query shows those two jobs on the volume: http://pastebin.com/bQ7qPaNJ Then, the very next job to use the 6TB-FileTape_0239 volume, shows the size mismatch error: http://pastebin.com/0KTxBE2M After Jobs 12898 and 12899 ran the previous night, nothing has touched the file "/opt/BaculaTapesDir/ZimbraBackups/6TB-FileTape_0239" The timestamp on the file is 2:37am, but Job 12899 shows an End Time of 27-Nov-2012 03:16:53, so I think that some data was not written to this volume, but Bacula did not notice the bacula-sd page allocation error(s) and the job was incorrectly terminated with an OK status. ls -la /opt/BaculaTapesDir/ZimbraBackups/*FileTape_0239 -rw-r----- 1 root bacula 290845229 Nov 27 02:37 /opt/BaculaTapesDir/ZimbraBackups/6TB-FileTape_0239 I would be happy to provide any additional information if it might help solve this problem. Server is Gentoo Linux with v3.6.6 kernel and Bacula v5.2.12 While there may actually be a problem with the backup server's hardware, I hope that Bacula will be able to detect such an issue with the storage daemon and fail the job, or give some indication rather than terminate jobs as OK when there is definitely something wrong. -- Bill Arlofski Reverse Polarity, LLC Steps to Reproduce: The page allocation errors are random, no way to accurately or consistently reproduce this issue. Additional Information: Thanks for any comments and thank you so much for the hard work that has been done to create and maintain Bacula! ====================================================================== ---------------------------------------------------------------------- (0006549) waa (reporter) - 2012-11-28 18:05 http://bugs.bacula.org/view.php?id=1958#c6549 ---------------------------------------------------------------------- I just performed restores for the two jobs that were on the volume flagged in error. Both jobs appear to have restored properly. Job 12899 was a bacula catalog (bacula.sql) file and it was complete, ending with a line: -- Dump completed on 2012-11-27 3:00:19 the job before it, Job 12898 restored 14,004 files which is what the catalog shows. I am still worried though and I do not understand how the SQL dump can complete at 3:00:19, but the volume it was written to has a much earlier timestamp. Thanks again! -- Bill Arlofski Reverse Polarity, LLC ---------------------------------------------------------------------- (0006559) kern (administrator) - 2012-12-09 09:47 http://bugs.bacula.org/view.php?id=1958#c6559 ---------------------------------------------------------------------- What is the page allocation error you are referring to? I see no page allocation error -- only a report of a difference in the size of the volume and the size, which is "normal" if you have communications line terminations. Issue History Date Modified Username Field Change ====================================================================== 2012-11-28 16:53 waa New Issue 2012-11-28 18:05 waa Note Added: 0006549 2012-12-09 09:47 kern Note Added: 0006559 2012-12-09 09:47 kern Severity major => trivial 2012-12-09 09:47 kern Status new => feedback ====================================================================== |