From: Mantis B. T. <no...@bu...> - 2011-03-02 23:01:59
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=1666 ====================================================================== Reported By: craigmiskell Assigned To: ====================================================================== Project: bacula Issue ID: 1666 Category: bscan Reproducibility: always Severity: minor Priority: normal Status: new ====================================================================== Date Submitted: 2010-11-22 21:08 GMT Last Modified: 2011-03-02 23:01 GMT ====================================================================== Summary: Bscan recreates jobmedia records unnecessarily, and only creates one Description: When bscan is run on a volume with a job which has hit it's File Retention period, but not the Job Retention period, bscan will create a single new jobmedia record, even though there are existing jobmedia records in the database. Further, this record covers the entire section of the media where the job is, rather than inserting one record per on-media 'file', as is done by the backup job. Consequently, when running a restore from files on that job, there will be entries in the bsr as many times as there are jobmedia records that cover the required fileindexes (2, after a backup and a single bscan; one added for each subsequent bscan). The restore will complete, but often using the newly inserted jobmedia records rather than the more efficient ones from the original backup, which can be very slow due to having to scan the whole tape rather than fast forwarding. The final log for the restore also reports: "warning file count mismatch", and the expected file count differs from the restored file count, which can be concerning to some operators. The attached patch creates jobmedia records in the database equivalent to the original backup, but only if they don't already exist. In creating this patch, I saw what appears to be some oddness in the block numbers inserted into these records by the original backup, which I'll raise in another ticket. As arrogant as it sounds, I think the values inserted by this patch are more correct than the originals. I think that worries me slightly; either there's a true bug in the original record creation, or I don't understand what I'm doing. The patch has been reasonably well tested in some simple cases involving: a) single job on one volume b) three jobs (two clients) on one volume varying from a File Retention purge, missing JobMedia records, through to a full Purge of the original volume. All jobs were small (~30GB or less). I'm not in a position to easily test a backup that spans volumes, although I've attempted to handle that correctly and hope that it will work. Please let me know if you require a contributor agreement be signed for this patch; it could well be too large to just accept, although is still just a patch and is well contained. I'm happy either way. Steps to Reproduce: Perform a backup. Allow that backup to have it's File records pruned by File Retention, but not the Job records. (or just delete the job specific records from the File table for that job; I believe that's equivalent) bscan the media Look in the jobmedia table for that jobid to see the single new additional record. Prepare a restore and see that the bsr has multiple entries for the same file, one with limited VolAddr values, one with expansive values. Additional Information: The misleading status message, from restoring a single file, after having run bscan twice (records got nuked by File Retention again after the first bscan): 19-Nov 10:17 ausv01-dir JobId 553: Bacula ausv01-dir 5.0.0 (26Jan10): 19-Nov-201 > 0 10:17:39 > Build OS: i486-pc-linux-gnu debian 4.0 > JobId: 553 > Job: RestoreFiles.2010-11-19_09.30.55_11 > Restore Client: ausv01 > Start time: 19-Nov-2010 09:30:57 > End time: 19-Nov-2010 10:17:39 > Files Expected: 3 > Files Restored: 1 > Bytes Restored: 127,488 > Rate: 0.0 KB/s > FD Errors: 0 > FD termination status: OK > SD termination status: OK > Termination: Restore OK -- warning file count mismatch > ====================================================================== ---------------------------------------------------------------------- (0005670) craigmiskell (reporter) - 2010-11-22 21:26 http://bugs.bacula.org/view.php?id=1666#c5670 ---------------------------------------------------------------------- New version of the patch uploaded; who knew you needed semicolons at the end of *every* line :) (Got messed up in copying the patch from my existing dev code to my git repository; I'm fixing my processes now :)) ---------------------------------------------------------------------- (0005775) kern (administrator) - 2011-02-25 10:00 http://bugs.bacula.org/view.php?id=1666#c5775 ---------------------------------------------------------------------- After looking at this, I have the following comments: 1. Yes, something this size will require an FLA 2. Yes, I agree, bscan could use a bit of work. It is really meant to be a last resort tool, so I haven't put much time into it. 3. You have a good point that bscan should see if there are already JobMedia records and not add new ones in that case. 4. Your variable names are very hard to read because they do not follow either of our two conventions to separate parts with _ or by using judicious capitalization. 5. You have a number of globals such as prevmediafileindex, jobmedialastindex, ... This is doomed to failure, because bscan must simulate multiple simultaneous jobs, thus these variables must be "local" to the job, which is normally done by having them in the jcr. 6. It is *very* dangerous to attempt to compute indexes and blocks outsize the core code that does it correctly at least in writes. For reads these indexes and block numbers are not used much so they may need some tuning. Unless it is totally impossible to do this with our core routines (read_record), I wouldn't seriously consider including code that attempts to compute those values differently. 7. db_get_num_job_media... seems to me so specialized that it is far better to do it via a special SQL statement as we do for all such cases it Bacula core code. Otherwise we will end up with too many SQL API calls that are one-shot only. 8. I am slightly undecided on the following: I question the wisdom of writing a lot of code for properly restoring all the JobMedia records. Having one big one is certainly inefficient, but it gets the job done, and the number of people using bscan should really be minimal -- just increase your retention periods, the extra disk space is negligible these days. 9. I see that you have put a lot of time and effort into this so please don't get discouraged by the above comments. Bottom line: - We would appreciate a patch that just corrects the problem of doubling up the JobMedia records. - For the rest, you will need to convince me as the code looks like it would be more work to maintain than it is worth, and given the static variables I don't think it will work for anyone who used multiple simultaneous backup jobs. ---------------------------------------------------------------------- (0005780) craigmiskell (reporter) - 2011-02-28 22:51 http://bugs.bacula.org/view.php?id=1666#c5780 ---------------------------------------------------------------------- Thanks for the feedback, I appreciate it's constructive nature. Speaking philosophically to start with: bscan isn't quite as "last resort" for us as you consider it. For largely local political reasons, the additional disk space requirement for upping the retention time isn't negligible or low cost. It sucks, but it's true for us. When restoring off a bscan'd volume/job, we've already read the tape through once, taking X hours (3 or 4 is not unheard of on full LTO-4 tapes). Then on the inevitable restore, we have, on average, read half the tape (taking X/2) hours to get to the bit we're interested in. Sometimes more. As opposed to having all the JobMedia records and fast-forwarding to the correct block and reading it in really quickly. If this takes more than a working day (and with 3-4 hours per operation, it's not unlikely), this can impact on the ability to run the next nights backups (or make our people stay late etc.). Plus it's quicker/better for the user who's waiting for some probably important data. So I'm really rather keen on getting a patch together that is acceptable. I have a bunch of other stuff that's slightly higher in my queue first though, so I'd appreciate it if you could keep this bug open for a week or two. Then I'll submit: 1) A patch that just avoids the doubling up of records. I *think* that can be pulled out as separate, and it's well worth getting in while we argue^Wdiscuss the merits of the more substantial change: 2) A better patch that addresses your various concerns. This is what will take me some time; I've obviously missed some available existing code (block calculations), mainly due to lack of general familiarity with the entire code base. I'll hunt around, and may have some more questions; where's the best forum for asking them? ---------------------------------------------------------------------- (0005782) craigmiskell (reporter) - 2011-03-02 23:01 http://bugs.bacula.org/view.php?id=1666#c5782 ---------------------------------------------------------------------- Ok, I've been looking at the code a bit more, and I'm getting what you mean about the variables in the JCR now (multiple jobs on a tape). It appears to me that I'm going to have to check whether there are existing job media records on a per-job basis, around the time that we find the SOS_LABEL, and then store that, presumably in the JCR. So, is it ok to add bscan specific fields to the JCR? My gut instinct is to add a pointer to a structure, defined and owned by bscan, to keep the JCR clear of extraneous cruftage. It's not so bad for this one field, but the other code I hope to get in would need a bunch of fields that are *very* bscan specific. I'm interested in your opinion on this approach. Issue History Date Modified Username Field Change ====================================================================== 2010-11-22 21:08 craigmiskell New Issue 2010-11-22 21:08 craigmiskell File Added: 0001-Patch-to-make-bscan-do-better-with-jobmedia-records.patch 2010-11-22 21:25 craigmiskell File Added: 0001-Patch-to-make-bscan-do-better-with-jobmedia-records.patch2 2010-11-22 21:26 craigmiskell Note Added: 0005670 2010-11-23 08:49 mnalis Issue Monitored: mnalis 2010-12-30 11:31 thrasherbcn Issue Monitored: thrasherbcn 2011-01-18 21:56 slords Issue Monitored: slords 2011-02-25 10:00 kern Note Added: 0005775 2011-02-25 10:00 kern Status new => feedback 2011-02-28 22:51 craigmiskell Note Added: 0005780 2011-02-28 22:51 craigmiskell Status feedback => new 2011-03-02 23:01 craigmiskell Note Added: 0005782 ====================================================================== |