From: Kern S. <ke...@si...> - 2010-07-02 20:39:59
On Friday 02 July 2010 22:21:08 Kern Sibbald wrote:
> On Friday 02 July 2010 14:56:35 Howard Thomson wrote:
> > Hi Kern,
> >
> > On Friday 02 July 2010, Kern Sibbald wrote:
> > > On Thursday 01 July 2010 21:46:50 Howard Thomson wrote:
> > > > Hi Kern,
> > > >
> > > > On Thursday 01 July 2010, Kern Sibbald wrote:
> > > > > Hello Howard,
> > > > >
> > > > > What does "chunked" backup mean exactly? I am not sure what the
> > > > > high-level concept is here. Bacula can already back up
> > > > > multi-gigabyte virtual disks, so obviously you are thinking
> > > > > about something different.
> > > >
> > > > The concept that I am calling 'chunked backup' is sub-file
> > > > incremental backup.
> > > >
> > > > Currently, for a 10 GB VirtualBox virtual disk, a Full backup
> > > > will back up the whole file.
> > > >
> > > > Subsequent incremental backups, where perhaps only 1 MB of the
> > > > virtual disk has changed, will back up the entire [10 GB] single
> > > > file, because it has changed.
> > > >
> > > > Bacula currently records a hash value for the entire file. I
> > > > intend, in addition and for appropriately large files, to record
> > > > a hash value for each sub-file chunk, so that an incremental /
> > > > differential backup can selectively skip the chunks that have
> > > > not changed.
> > >
> > > OK, now I understand. This is a feature that we are working on --
> > > it is actually a form of deduplication. Before implementing it,
> > > there are a number of things that need to be decided and some
> > > important changes in Bacula that need to be made.
> > >
> > > 1. By the way, I call these "deltas"; that is, each is some change
> > > to the originally backed-up image that must be applied. However, a
> > > delta differs from an Incremental in two ways: 1. only a part of
> > > the file is saved; 2. *all* the deltas must be restored (not just
> > > the most recent, as happens for incremental backups).
> > >
> > > 2. From the above, you can see that we need some way of marking
> > > these as deltas rather than incrementals. Perhaps it could simply
> > > be called a "delta" backup level rather than Incremental.
> > >
> > > 3. We need to decide how the "deltas" are going to be generated --
> > > there needs to be something to figure out what has changed, which
> > > means, in general, you need access to the previous backups or some
> > > form of hashing done by deduplication code.
> > >
> > > 4. Determine how the deltas are going to be stored -- actually,
> > > IMO, that is trivial; it just needs a very small amount of code
> > > that looks much like the sparse file handling code -- we may even
> > > be able to use the same code.
> > >
> > > > I want to use Bacula to do full + incremental backups of my own
> > > > system, to disk, without separating out virtual disks into
> > > > separate backups with different recycle criteria, for
> > > > space-constraint reasons.
> > > >
> > > > Current [admittedly] simple-minded incremental backups of my
> > > > file-tree are much larger than they need to be ...
> > >
> > > Yes, much larger. We have some Bacula Systems scripts that help
> > > with this for VirtualBox, but they are not integrated with Bacula
> > > as deltas would be.
> > >
> > > This whole subject is non-trivial.
> >
> > It is certainly non-trivial ...
> >
> > Delta backup, to use your terminology, requires:
> >
> > 1/ Retrieve file-offset / hash-code pairs for the file being
> > backed up
>
> That is pretty straightforward. One just needs to do something similar
> to what we do for Accurate backup and Base jobs, where information on
> prior backups is sent to the FD. Of course, currently, we don't keep
> the file offset as such for data that is backed up.
>
> In addition, I believe that you need one more item -- the length of
> the delta. This would allow us to easily deal with different
> filesystems or different filesystem block sizes.
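To make the chunk comparison in 1/ to 3/ concrete, a minimal sketch of
the FD-side loop might look like the following. This is illustration
only, not Bacula code: the 64 KB chunk size, the toy FNV-1a hash, and
every name in it are assumptions; a real implementation would
presumably reuse the signature code the FD already has.

    // Illustrative sketch only -- not Bacula source.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    static const size_t CHUNK_SIZE = 64 * 1024;  // assumed fixed chunk size

    // Toy FNV-1a hash, used only to keep the example self-contained;
    // in practice this would be MD5/SHA1 or similar.
    static uint64_t chunk_hash(const unsigned char *buf, size_t len)
    {
       uint64_t h = 1469598103934665603ULL;
       for (size_t i = 0; i < len; i++) {
          h ^= buf[i];
          h *= 1099511628211ULL;
       }
       return h;
    }

    // prior: the offset -> hash pairs retrieved for this file (step 1/).
    // Returns the offsets of the chunks that must be backed up (step 3/).
    std::vector<uint64_t> changed_chunks(FILE *fp,
                             const std::map<uint64_t, uint64_t> &prior)
    {
       std::vector<uint64_t> to_send;
       std::vector<unsigned char> buf(CHUNK_SIZE);
       uint64_t off = 0;
       size_t n;
       while ((n = fread(buf.data(), 1, CHUNK_SIZE, fp)) > 0) {
          uint64_t h = chunk_hash(buf.data(), n);      // step 2/: hash chunk
          auto it = prior.find(off);
          if (it == prior.end() || it->second != h) {  // new or changed
             to_send.push_back(off);
          }
          off += n;
       }
       return to_send;
    }

Note that the final chunk of a file may be short, which is one more
reason to record the delta length mentioned above alongside the offset
and hash.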
> > 2/ Generate a hash-code for each file-offset otherwise selected for
> > backup
>
> That is also straightforward. It could be passed to the SD as a
> special stream, much like Unix attributes, that would then be passed
> on to the Director for insertion in the database.
>
> > 3/ Look up the file-offset in the retrieved list and proceed with
> > the backup if it is either not found [sparse file chunk not backed
> > up] or found but different
>
> OK, but again, I think we need a delta length. We might want to vary
> the length of the delta found according to file systems, and such ...
>
> > 4/ Store all newly generated file-offset / hash-code pairs to the
> > database.
>
> That is also straightforward. We would just implement a new stream
> that comes to the Director from the SD -- much like Unix attributes.
> It would be just a different kind of database update.
>
> > Restore of a delta backed-up file requires:
> >
> > 5/ Retrieve jobid (?) / file-offset pairs from the database
> >
> > 6/ For each backup stream read, selectively restore deltas as
> > needed. Restoring all deltas, in the right order, would work but
> > would be bandwidth inefficient.
> >
> > In looking at all the relevant code, I am finding that the
> > interaction with the database, directly and indirectly, is the
> > least obvious structure to extend and change ...
>
> Well, the most complicated and sensitive part is knowing in what
> table to put the information and designing the database records for
> it. Then one has to modify the database and write the new routines to
> put the new data into it. It isn't really hard, but it requires
> careful checking. I recently added RestoreObjects to the database for
> Bacula Enterprise, and if it isn't already in Branch-5.1 (the main
> Community development branch), it will be there sometime in July as
> we start finalizing the 5.0.3 release, because we will carefully
> check what items in Branch-5.1 need to be backported to Branch-5.0
> for the 5.0.3 release.
>
> One *big* question is exactly how to store this information. Bacula
> currently has only one means of storing multiple records of
> information about a particular File, and that is the JobMedia
> records, which effectively serve as the index to where the file data
> is on the given Volume.
>
> I think we will need something similar to the JobMedia record to
> store the hash, the offset, and the size. Compared to the current
> Bacula tables, this one could potentially hold an enormous number of
> records. In typical deduplication software, from what I have read,
> such tables represent about 30% of the size of all the data backed
> up. Of course, I don't expect to be doing deltas on every file on the
> filesystem, but it certainly would be useful for VM images and log
> files.
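Tying together the sparse-file-style record and the restore
requirement above, here is a rough sketch of applying a delta chain at
restore time. The record layout and the function are illustrative
assumptions, not Bacula's actual stream format, and fseeko() assumes a
POSIX system.

    // Illustrative sketch only. A delta record is assumed to carry the
    // file offset and length discussed above, followed by the changed
    // bytes -- much like Bacula's sparse-file records, which prefix
    // each data block with its file offset.
    #include <cstdint>
    #include <cstdio>
    #include <sys/types.h>   // off_t (POSIX assumed)
    #include <vector>

    struct DeltaRec {
       uint64_t offset;                  // where in the file this belongs
       std::vector<unsigned char> data;  // changed bytes (len = data.size())
    };

    // Restore: start from the Full image, then apply the deltas of
    // every delta-level job in job order (oldest first). Unlike an
    // Incremental, no job in the chain may be skipped, or the restored
    // file is corrupt.
    bool apply_deltas(FILE *out, const std::vector<DeltaRec> &job_deltas)
    {
       for (const DeltaRec &d : job_deltas) {
          if (fseeko(out, (off_t)d.offset, SEEK_SET) != 0) {
             return false;               // seek failed
          }
          if (fwrite(d.data.data(), 1, d.data.size(), out)
                != d.data.size()) {
             return false;               // short write
          }
       }
       return true;
    }

A smarter restore, as 6/ suggests, would first merge the offset lists
of all jobs in the chain and fetch only the newest version of each
chunk, trading database work for bandwidth.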
> > The comment on sparse file handling is, of course, correct, and I
> > am treating delta file backup as a special case of sparse file
> > backup.
> >
> > It seems to be the responsibility of the SD to send relevant
> > updates to the Director, currently at the end of each file.
> > However, the SD has no knowledge of which file-offsets of a sparse
> > file it has processed on behalf of an FD, so I am unclear at
> > present as to how, and when, the database updates of 4/ will occur.
>
> As I mentioned above, I suspect that would be best handled by a new
> stream that is sent to the SD, which will then know how to send it to
> the Director. Obviously, as with Unix attributes, the SD will have to
> have some knowledge of this stream.
>
> > When you say that this is being worked on, is it worth me
> > continuing with my current work-in-progress?
>
> Probably yes, if it interests you, but we need to get the design
> nailed down and agreed on before doing any serious coding.

After a little more thought, I think I should have been a good deal
more positive in the above response. As I said, I have thought about
this a lot, and so parts of it are well thought out. However, you have
some very interesting new ideas that I had not thought of, because I
was thinking about doing something a bit more complicated, which is
sliding-block deduplication, but I think your idea of "chunks" (i.e.
fixed block sizes) could drastically simplify the problem. I would
like to see you continue working on this.

Kern

> > I haven't altered many files yet in my git repo; I've spent more
> > time reading code than writing it so far ...!
>
> I should have been more precise and said that we are in the design
> phase of this project but have not yet started programming -- we must
> finish the Enterprise 4.0 and Community 5.0.3 releases first ...
>
> What do you think?
>
> Kern
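For contrast with the fixed-size chunks discussed above, the following
sketch shows the extra machinery that sliding-block (content-defined)
chunking requires: a rolling hash updated once per byte, with a chunk
boundary cut wherever the low bits of the hash are zero. The window
size, multiplier, and boundary mask are illustrative assumptions; real
deduplicators typically use Rabin fingerprints and enforce minimum and
maximum chunk sizes.

    // Illustrative sketch of content-defined (sliding-block) chunking.
    // The window hash is updated in O(1) per byte; a boundary is cut
    // wherever the low 13 bits of the hash are zero, giving chunks of
    // about 8 KB on average.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    static const size_t WIN = 48;                 // window size (assumed)
    static const uint64_t BASE = 2654435761ULL;   // odd multiplier (assumed)
    static const uint64_t MASK = (1u << 13) - 1;  // ~8 KB average chunks

    std::vector<size_t> chunk_boundaries(const unsigned char *buf,
                                         size_t len)
    {
       std::vector<size_t> cuts;
       uint64_t pow = 1;                  // BASE^(WIN-1) mod 2^64
       for (size_t i = 0; i + 1 < WIN; i++) {
          pow *= BASE;
       }
       uint64_t h = 0;
       for (size_t i = 0; i < len; i++) {
          if (i >= WIN) {                 // drop byte leaving the window
             h -= (uint64_t)buf[i - WIN] * pow;
          }
          h = h * BASE + buf[i];          // bring in the new byte
          if (i + 1 >= WIN && (h & MASK) == 0) {
             cuts.push_back(i + 1);       // content-defined boundary
          }
       }
       return cuts;
    }

The payoff of this complexity is that an insertion near the start of a
file only shifts the boundaries around it, so the chunks after it
still match the previous backup. Fixed-size chunks skip all of this,
which should work well for VM images, where writes happen in place
rather than by insertion.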