From: Kern S. <ke...@si...> - 2010-07-02 20:39:59
On Friday 02 July 2010 22:21:08 Kern Sibbald wrote:
> On Friday 02 July 2010 14:56:35 Howard Thomson wrote:
> > Hi Kern,
> >
> > On Friday 02 July 2010, Kern Sibbald wrote:
> > > On Thursday 01 July 2010 21:46:50 Howard Thomson wrote:
> > > > Hi Kern,
> > > >
> > > > On Thursday 01 July 2010, Kern Sibbald wrote:
> > > > > Hello Howard,
> > > > >
> > > > > What does "chunked" backup mean exactly? I am not sure what the
> > > > > high-level concept is here. Bacula can already back up
> > > > > multi-gigabyte virtual disks, so obviously you are thinking
> > > > > about something different.
> > > >
> > > > The concept that I am calling 'chunked backup' is sub-file
> > > > incremental backup.
> > > >
> > > > Currently, for a 10 GB VirtualBox virtual disk, a Full backup
> > > > will back up the whole file.
> > > >
> > > > Subsequent incremental backups, where perhaps only 1 MB of the
> > > > virtual disk has changed, will back up the entire [10 GB] single
> > > > file, because it has changed.
> > > >
> > > > Bacula currently records a hash value for the entire file. I
> > > > intend, in addition and for appropriately large files, to record
> > > > a hash value for each sub-file chunk, so that an incremental /
> > > > differential backup can selectively skip the chunks that have
> > > > not changed.
> > >
> > > OK, now I understand. This is a feature that we are working on --
> > > it is actually a form of deduplication. Before implementing it,
> > > there are a number of things that need to be decided and some
> > > important changes in Bacula that need to be made.
> > >
> > > 1. By the way, I call these "deltas"; that is, each is some change
> > > to the originally backed-up image that must be applied. However, a
> > > delta differs from an Incremental in two ways: 1. only a part of
> > > the file is saved; 2. *all* the deltas must be restored (not just
> > > the most recent, as happens for incremental backups).
> > >
> > > 2. From the above, you can see that we need some way of marking
> > > these as deltas rather than incrementals. Perhaps it could simply
> > > be called a "delta" backup level rather than Incremental.
> > >
> > > 3. We need to decide how the "deltas" are going to be generated --
> > > there needs to be something to figure out what has changed, which
> > > means, in general, you need access to the previous backups or some
> > > form of hashing done by deduplication code.
> > >
> > > 4. Determine how the deltas are going to be stored -- actually,
> > > IMO, that is trivial; it just needs a very small amount of code
> > > that looks much like the sparse file handling code -- we may even
> > > be able to use the same code.
> > >
> > > > I want to use Bacula to do full + incremental backups of my own
> > > > system, to disk, without separating out virtual disks into
> > > > separate backups with different recycle criteria, for
> > > > space-constraint reasons.
> > > >
> > > > Current [admittedly] simple-minded incremental backups of my
> > > > file-tree are much larger than they need to be ...
> > >
> > > Yes, much larger. We have some Bacula Systems scripts that help
> > > with this for VirtualBox, but they are not integrated with Bacula
> > > as deltas would be.
> > >
> > > This whole subject is non-trivial.
> >
> > It is certainly non-trivial ...
> >
> > Delta backup, to use your terminology, requires:
> >
> > 1/ Retrieve file-offset / hash-code pairs for the file being
> > backed up
>
> That is pretty straightforward. One just needs to do something similar
> to what we do for Accurate backup and Base jobs, where information on
> prior backups is sent to the FD. Of course, currently, we don't keep
> the file offset as such for data that is backed up.
>
> In addition, I believe that you need one more item -- the length of
> the delta. This would allow us to easily deal with different
> filesystems or different filesystem block sizes.
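To make the chunk comparison in 1/ to 3/ concrete, a minimal sketch of
the FD-side loop might look like the following. This is illustration
only, not Bacula code: the 64 KB chunk size, the toy FNV-1a hash, and
every name in it are assumptions; a real implementation would
presumably reuse the signature code the FD already has.

    // Illustrative sketch only -- not Bacula source.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    static const size_t CHUNK_SIZE = 64 * 1024;  // assumed fixed chunk size

    // Toy FNV-1a hash, used only to keep the example self-contained;
    // in practice this would be MD5/SHA1 or similar.
    static uint64_t chunk_hash(const unsigned char *buf, size_t len)
    {
       uint64_t h = 1469598103934665603ULL;
       for (size_t i = 0; i < len; i++) {
          h ^= buf[i];
          h *= 1099511628211ULL;
       }
       return h;
    }

    // prior: the offset -> hash pairs retrieved for this file (step 1/).
    // Returns the offsets of the chunks that must be backed up (step 3/).
    std::vector<uint64_t> changed_chunks(FILE *fp,
                             const std::map<uint64_t, uint64_t> &prior)
    {
       std::vector<uint64_t> to_send;
       std::vector<unsigned char> buf(CHUNK_SIZE);
       uint64_t off = 0;
       size_t n;
       while ((n = fread(buf.data(), 1, CHUNK_SIZE, fp)) > 0) {
          uint64_t h = chunk_hash(buf.data(), n);      // step 2/: hash chunk
          auto it = prior.find(off);
          if (it == prior.end() || it->second != h) {  // new or changed
             to_send.push_back(off);
          }
          off += n;
       }
       return to_send;
    }

Note that the final chunk of a file may be short, which is one more
reason to record the delta length mentioned above alongside the offset
and hash.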
> > 2/ Generate a hash-code for each file-offset otherwise selected for
> > backup
>
> That is also straightforward. It could be passed to the SD as a
> special stream, much like Unix attributes, that would then be passed
> on to the Director for insertion in the database.
>
> > 3/ Look up the file-offset in the retrieved list and proceed with
> > the backup if it is either not found [sparse file chunk not backed
> > up] or found but different
>
> OK, but again, I think we need a delta length. We might want to vary
> the length of the delta found according to file systems, and such ...
>
> > 4/ Store all newly generated file-offset / hash-code pairs to the
> > database.
>
> That is also straightforward. We would just implement a new stream
> that comes to the Director from the SD -- much like Unix attributes.
> It would be just a different kind of database update.
>
> > Restore of a delta backed-up file requires:
> >
> > 5/ Retrieve jobid (?) / file-offset pairs from the database
> >
> > 6/ For each backup stream read, selectively restore deltas as
> > needed. Restoring all deltas, in the right order, would work but
> > would be bandwidth inefficient.
> >
> > In looking at all the relevant code, I am finding that the
> > interaction with the database, directly and indirectly, is the
> > least obvious structure to extend and change ...
>
> Well, the most complicated and sensitive part is knowing in what
> table to put the information and designing the database records for
> it. Then one has to modify the database and write the new routines to
> put the new data into it. It isn't really hard, but it requires
> careful checking. I recently added RestoreObjects to the database for
> Bacula Enterprise, and if it isn't already in Branch-5.1 (the main
> Community development branch), it will be there sometime in July as
> we start finalizing the 5.0.3 release, because we will carefully
> check what items in Branch-5.1 need to be backported to Branch-5.0
> for the 5.0.3 release.
>
> One *big* question is exactly how to store this information. Bacula
> currently has only one means of storing multiple records of
> information about a particular File, and that is the JobMedia
> records, which effectively serve as the index to where the file data
> is on the given Volume.
>
> I think we will need something similar to the JobMedia record to
> store the hash, the offset, and the size. Compared to the current
> Bacula tables, this one could potentially hold an enormous number of
> records. In typical deduplication software, from what I have read,
> such tables represent about 30% of the size of all the data backed
> up. Of course, I don't expect to be doing deltas on every file on the
> filesystem, but it certainly would be useful for VM images and log
> files.
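Tying together the sparse-file-style record and the restore
requirement above, here is a rough sketch of applying a delta chain at
restore time. The record layout and the function are illustrative
assumptions, not Bacula's actual stream format, and fseeko() assumes a
POSIX system.

    // Illustrative sketch only. A delta record is assumed to carry the
    // file offset and length discussed above, followed by the changed
    // bytes -- much like Bacula's sparse-file records, which prefix
    // each data block with its file offset.
    #include <cstdint>
    #include <cstdio>
    #include <sys/types.h>   // off_t (POSIX assumed)
    #include <vector>

    struct DeltaRec {
       uint64_t offset;                  // where in the file this belongs
       std::vector<unsigned char> data;  // changed bytes (len = data.size())
    };

    // Restore: start from the Full image, then apply the deltas of
    // every delta-level job in job order (oldest first). Unlike an
    // Incremental, no job in the chain may be skipped, or the restored
    // file is corrupt.
    bool apply_deltas(FILE *out, const std::vector<DeltaRec> &job_deltas)
    {
       for (const DeltaRec &d : job_deltas) {
          if (fseeko(out, (off_t)d.offset, SEEK_SET) != 0) {
             return false;               // seek failed
          }
          if (fwrite(d.data.data(), 1, d.data.size(), out)
                != d.data.size()) {
             return false;               // short write
          }
       }
       return true;
    }

A smarter restore, as 6/ suggests, would first merge the offset lists
of all jobs in the chain and fetch only the newest version of each
chunk, trading database work for bandwidth.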
> > The comment on sparse file handling is, of course, correct, and I
> > am treating delta file backup as a special case of sparse file
> > backup.
> >
> > It seems to be the responsibility of the SD to send relevant
> > updates to the Director, currently at the end of each file.
> > However, the SD has no knowledge of which file-offsets of a sparse
> > file it has processed on behalf of an FD, so I am unclear at
> > present as to how, and when, the database updates of 4/ will occur.
>
> As I mentioned above, I suspect that would be best handled by a new
> stream that is sent to the SD, which will then know how to send it to
> the Director. Obviously, as with Unix attributes, the SD will have to
> have some knowledge of this stream.
>
> > When you say that this is being worked on, is it worth me
> > continuing with my current work-in-progress?
>
> Probably yes, if it interests you, but we need to get the design
> nailed down and agreed on before doing any serious coding.

After a little more thought, I think I should have been a good deal
more positive in the above response. As I said, I have thought about
this a lot, and so parts of it are well thought out. However, you have
some very interesting new ideas that I had not thought of, because I
was thinking about doing something a bit more complicated, which is
sliding-block deduplication, but I think your idea of "chunks" (i.e.
fixed block sizes) could drastically simplify the problem. I would
like to see you continue working on this.

Kern

> > I haven't altered many files yet in my git repo; I've spent more
> > time reading code than writing it so far ...!
>
> I should have been more precise and said that we are in the design
> phase of this project but have not yet started programming -- we must
> finish the Enterprise 4.0 and Community 5.0.3 releases first ...
>
> What do you think?
>
> Kern
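For contrast with the fixed-size chunks discussed above, the following
sketch shows the extra machinery that sliding-block (content-defined)
chunking requires: a rolling hash updated once per byte, with a chunk
boundary cut wherever the low bits of the hash are zero. The window
size, multiplier, and boundary mask are illustrative assumptions; real
deduplicators typically use Rabin fingerprints and enforce minimum and
maximum chunk sizes.

    // Illustrative sketch of content-defined (sliding-block) chunking.
    // The window hash is updated in O(1) per byte; a boundary is cut
    // wherever the low 13 bits of the hash are zero, giving chunks of
    // about 8 KB on average.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    static const size_t WIN = 48;                 // window size (assumed)
    static const uint64_t BASE = 2654435761ULL;   // odd multiplier (assumed)
    static const uint64_t MASK = (1u << 13) - 1;  // ~8 KB average chunks

    std::vector<size_t> chunk_boundaries(const unsigned char *buf,
                                         size_t len)
    {
       std::vector<size_t> cuts;
       uint64_t pow = 1;                  // BASE^(WIN-1) mod 2^64
       for (size_t i = 0; i + 1 < WIN; i++) {
          pow *= BASE;
       }
       uint64_t h = 0;
       for (size_t i = 0; i < len; i++) {
          if (i >= WIN) {                 // drop byte leaving the window
             h -= (uint64_t)buf[i - WIN] * pow;
          }
          h = h * BASE + buf[i];          // bring in the new byte
          if (i + 1 >= WIN && (h & MASK) == 0) {
             cuts.push_back(i + 1);       // content-defined boundary
          }
       }
       return cuts;
    }

The payoff of this complexity is that an insertion near the start of a
file only shifts the boundaries around it, so the chunks after it
still match the previous backup. Fixed-size chunks skip all of this,
which should work well for VM images, where writes happen in place
rather than by insertion.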