From: Robin O'L. <ro...@eq...> - 2008-10-14 16:18:03
On Wednesday, October 01, 2008 6:25 PM Kern Sibbald wrote:
> On Wednesday 01 October 2008 17:50:24 Eli Shemer wrote:
> > I wanted to know whether someone actually got to implement an incremental
> > backup to act like rsync does upon file changes ?
>
> It is a project we are currently discussing as possible in a release following
> 3.0.0 scheduled around the end of the year -- i.e. we hope to begin the
> project sometime in 2009. If there is commercial funding of this project
> (already one company interested) we could probably speed it up.

We've had a preliminary look at doing this in an off-list discussion; I don't know Kern's latest thoughts, but this prompts me to think it would be good for us to have further discussion here.

> > I'm looking to write a patch to the save_file callback in the bacula-fd's
> > backup procedure to utilize the gnu's diff and patches and to send those
> > over the line to the bacula-sd. Of course a corresponding implementation in
> > the storage device will also be required.
> ...
> For it to work, you would need both the original file and the new
> modified file, and I am not sure how you will accomplish that.

This is the real problem, and pretty much excludes things like diff or xdelta. But rsync (and rdiff, as I'll mention later) gets around that need in an ingenious way.

rsync assumes that you have a new version of a file locally and an old version of the file at the remote end. It then uses a clever rolling checksum algorithm to efficiently find identical blocks at any byte offset common to the new file and the old file, so only unique new blocks need be sent across the wire. The effect is to efficiently update the remote copy to be identical with the local copy.
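
As a rough illustration of the rolling checksum idea, here is a minimal Python sketch. It is not rsync's actual code: the function names (weak_checksum, roll, find_matches), the use of SHA-256 as the strong hash, and the block size are all assumptions made for the example. The point it shows is that the weak checksum of a window can be updated in constant time as the window slides one byte, so blocks of the old file can be found at any byte offset in the new file.

```python
import hashlib

M = 1 << 16  # modulus for both halves of the weak checksum (an assumption of this sketch)


def weak_checksum(block):
    """Weak checksum of a block as a pair (a, b), in the style of rsync's."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_matches(old, new, n):
    """Return (offset_in_new, old_block_index) for old blocks found in new."""
    # Index the old file's non-overlapping, full-size blocks by weak checksum.
    table = {}
    for start in range(0, len(old) - n + 1, n):
        block = old[start:start + n]
        strong = hashlib.sha256(block).digest()
        table.setdefault(weak_checksum(block), []).append((strong, start // n))

    matches = []
    if len(new) < n:
        return matches
    pos = 0
    a, b = weak_checksum(new[:n])
    while True:
        hit = None
        for strong, block_index in table.get((a, b), ()):
            # Confirm a weak match with the strong hash before trusting it.
            if hashlib.sha256(new[pos:pos + n]).digest() == strong:
                hit = block_index
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += n  # skip past the matched block
            if pos + n > len(new):
                break
            a, b = weak_checksum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1
    return matches
```

If three bytes are inserted at the front of a file, find_matches should still locate every old block, each shifted by three bytes, so only the inserted bytes would need to be sent.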

The least possible intrusion to bacula's way of working would be to have the client use rsync as a pre-processing step to efficiently mirror the files of interest on to the machine hosting the storage daemon, where they could be backed up as normal (presumably using a file daemon running on the storage machine but pretending to be the client).

Slightly better integration would be to still run the bacula file daemon on the client, but use rsync to shift those files across to the mirror area on the storage device, which would then continue the normal file-daemon processing from the copy.

A further improvement would be for the storage daemon to extract the last version of each file from an old backup just as required; rsync would update it, then the new version would be stored in the new backup. This removes the need to have a complete mirror, but at the expense of having the director work out which old file needs to be retrieved and the cost of extracting it.

All these options seem quite easy to do, but they have various drawbacks: the storage device has to have space for a mirror of the client's files; those files are unencrypted; we might have to worry about preserving metadata separately (rsync will do its best, but that may not be good enough if the source and destination have different filesystem types); things like VSS support might be tricky.

> If you do figure out some clever way to have both files, then it would be
> *much* better to add some code that uses the librsync library calls to
> effectively implement what rsync does.

Using rdiff (or librsync, on which it is built) gives us a couple more options. rdiff has the ability to extract the rolling checksum 'signature' of the first version of a file and save it for later use; this is typically about 10% of the file size, though you can tune it up or down in a trade-off between signature file size and compression efficiency.
This signature can then be used with a later version of the file to work out the 'delta': the minimal set of blocks that need to be sent over to the remote end.

If the client file daemon keeps a local copy of the signature file, it can compute and send the delta over to the storage daemon without any further input from the storage daemon or director.

But a better alternative would be to hold the signature file on another storage daemon (ideally not the remote storage daemon, but there could also be a local storage daemon available) or on the director. When the time comes to back up the latest version of the file, the client retrieves the appropriate signature file according to instructions from the director, which can determine which signature file should be the basis for this new delta by knowing the level (full, differential or incremental) of the new backup. The client then uses librsync to compute the new delta from the new file and the old signature, and sends it to the remote storage daemon.

Having sent the delta across to the remote storage daemon, we now have a choice of what to do with it. Again, the less intrusive method from a bacula code point of view is for the storage daemon to apply the delta to a previous version of the file it has mirrored or extracted from an old backup; the reconstructed new version is then backed up as normal.

A much more appealing possibility is to store only the delta in the backup. This also has the advantage that the delta can be encrypted by the client. However, this changes a fundamental assumption in bacula and that is probably going to have consequences I haven't foreseen. Restoring a single file must now consist of retrieving the first version of the file, plus all the deltas and applying them in the right order. It's also fragile in the sense that if any of the intermediate backups is unavailable, the chain of deltas is broken and the file cannot be reconstructed.
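
To make the signature / delta / restore flow concrete, here is a deliberately simplified Python sketch. It does not use librsync or its file formats, and every name in it (signature, make_delta, apply_delta, restore, BLOCK) is hypothetical. It also compares block-aligned chunks only, so unlike real rdiff it would not cope with insertions that shift offsets. What it does show is that a delta can be computed from the new file plus the old signature alone, and that a restore has to replay every delta in order, failing if one link is missing.

```python
import hashlib

BLOCK = 4  # tiny block size, chosen only to keep the example readable


def digest(data):
    return hashlib.sha256(data).digest()


def signature(data, n=BLOCK):
    """Per-block digests of a file; this is all the client needs to keep."""
    return [digest(data[i:i + n]) for i in range(0, len(data), n)]


def make_delta(old_sig, new, n=BLOCK):
    """Build a delta from the old signature and the new file (no old file needed)."""
    index = {d: j for j, d in enumerate(old_sig)}
    ops = []
    for i in range(0, len(new), n):
        chunk = new[i:i + n]
        j = index.get(digest(chunk))
        # Reference an old block if its digest is known, else ship the bytes.
        ops.append(("copy", j) if j is not None else ("data", chunk))
    return ops


def apply_delta(basis, ops, n=BLOCK):
    """Rebuild the new version from the basis file and a delta."""
    out = bytearray()
    for kind, arg in ops:
        out += basis[arg * n:(arg + 1) * n] if kind == "copy" else arg
    return bytes(out)


def restore(full, deltas):
    """Apply (expected_basis_digest, ops) pairs in order to a full backup."""
    data = full
    for expected_basis, ops in deltas:
        # Each delta only makes sense against the exact version it was made from.
        if digest(data) != expected_basis:
            raise ValueError("delta chain broken: basis version is missing")
        data = apply_delta(data, ops)
    return data


# Three versions of a file: one full backup followed by two stored deltas.
v1 = b"aaaabbbbccccdddd"
v2 = b"aaaaXXXXccccdddd"
v3 = b"aaaaXXXXccccddddeeee"
d2 = (digest(v1), make_delta(signature(v1), v2))
d3 = (digest(v2), make_delta(signature(v2), v3))
```

Here restore(v1, [d2, d3]) rebuilds v3, while restore(v1, [d3]) raises ValueError because the intermediate backup is absent, which is the fragility described above. Storing the basis digest alongside each delta is a design choice of this sketch, not something the proposal specifies.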

Still, I'm confident this is the best approach: to get the client to produce new rdiff signature files (kept on the client if there's space, or on some other storage local to the client, or on the director), to retrieve appropriate old rdiff signature files, to compute and send deltas to the remote storage daemon, to store deltas in the backup, to consolidate virtual backups (or to restore) by grouping the basis file and all required patches, to restore by sending the basis+deltas to the client file daemon, which has to apply the deltas in sequence.

If everyone agrees this is the best way forward, I'd be happy to flesh out a proper project plan and see how we and others can help contribute.

Robin O'Leary.

-- 
email: ro...@eq...      Equiinet Ltd., Edison Road, Dorcan,
Tel.: +44 1793 603708   Swindon, SN3 5JX, U.K. 51.5558N,1.7286W