Thanks for your Feature request.
- Bacula currently has file level deduplication, implemented in version 5.0.0.
It is done in a novel way (at least I have not seen it in any other product)
that permits the sys admin to optimize deduplication at the file level.
- Bacula Systems has a research and development project on filesystem block
level deduplication (quite different from your suggestion) that is showing a
lot of promise.
- The Bacula project is just now discussing a new deduplication scheme based
on partial file deduplication using sliding blocks (not at all the same as
filesystem block level deduplication).
- It is possible that, as part of the above mentioned partial file
deduplication project, we will design a new Volume format, but this is
not a strict requirement for doing partial file deduplication.
- Bacula's current Volume format is not designed to handle filesystem block
level deduplication, so any kind of project that attempts to "align" data in
the current Bacula Volume format based on filesystem blocks, then do
deduplication is, in my opinion, doomed to failure.
So, there are at least three different major techniques that can be used for
deduplication:
1. File level deduplication, which Bacula has and which works with files
backed up to tape as well as disk using Bacula's current Volume format.
2. Filesystem level block deduplication (using snapshot technology).
3. Individual file block deduplication (there are multiple variations of this).
Item 1 is already implemented in Bacula version 5.0.0.
Item 2 has been successfully demonstrated internally by Bacula Systems for
Linux systems and we are working on the same for Windows systems -- we hope
this will be ready for release 3Q2010 with testing possibly sooner. To
totally automate it we may need some extensions to the current Volume format,
but they are rather minor and do not require any Volume design changes.
Item 3 is a new project only in early discussion phase on the bacula-devel
list. It looks very promising but needs a lot of work.
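For readers unfamiliar with the idea behind item 3, the sketch below shows one
common reading of "sliding blocks": content-defined chunking with a rolling
hash. This is only an assumption about the general technique, not the design
under discussion on bacula-devel, and the window size, mask and function names
are invented for the example. Chunk boundaries are chosen from the content of
the last few bytes rather than from a fixed offset, so inserting bytes in
front of the data does not move the later boundaries.

```python
import hashlib
import random

WINDOW = 16            # bytes covered by the sliding window (illustrative)
MASK = (1 << 8) - 1    # cut when the low 8 bits are zero (~256-byte chunks)
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunks(data):
    """Split data where a rolling hash of the last WINDOW bytes hits a pattern."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        # Only cut once the window is full, so the decision depends purely
        # on the WINDOW bytes ending at position i.
        if i >= WINDOW - 1 and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def chunk_hashes(data):
    """Return the set of SHA-256 digests of the content-defined chunks."""
    return {hashlib.sha256(c).digest() for c in chunks(data)}


rng = random.Random(2010)
payload = bytes(rng.getrandbits(8) for _ in range(16384))
shifted = b"H" * 100 + payload  # same data behind a hypothetical header

original = chunk_hashes(payload)
lost = original - chunk_hashes(shifted)  # chunks not found after the shift
print(len(original), len(lost))
```

Because each cut decision depends only on the bytes inside the window, at most
the first chunk of the original data can fail to reappear in the shifted copy.
A real implementation would also want minimum and maximum chunk sizes, which
this sketch leaves out.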
With all the above, I do not think that it is yet time to discuss changing the
Bacula Volume format (though a new (second) Volume format is one of the
options I am considering for item 3).
On Wednesday 10 February 2010 02:29:48 Darren Mackay wrote:
> Item : Support for file-system / volume / san dedup for file devices
> Date: 10 Feb 2010
> Origin: Darren Mackay (Velitium)
> What: File devices should provide support for block based deduplication
> provided by the underlying file-systems / volume manager / san.
> Why: A number of file-systems / volume managers / sans now provide block
> based deduplication. For block level dedup, it is not uncommon for
> deduplication ratios to be 3x, 4x, or 5x for unstructured data.
> Currently it appears (forgive me and advise if this is actually incorrect,
> as this is drawn upon a number of forum posts) that the bacula storage
> daemon is packing the data-stream back-2-back, which prevents block based
> deduplication as the data-stream is not aligned to blocks as defined by the
> underlying storage device. I have also read several posts that indicate
> that bacula may multiplex data streams, which in the case of underlying
> dedup, would further prevent dedup from being performed.
> Allowing for dedup in the underlying file-system / volume / san would also
> alleviate the need for sysadmins to tune baselines between different hosts
> which use the same storage daemon file device(s).
> Based on limited testing, some dedup is able to be performed, but the number
> of duplicate blocks detected is limited. For instance, consecutive full
> backups from a single client machine (approx 200GB, both o/s and
> unstructured file data) with only a single concurrent job should have
> resulted in a significant portion of the backup being detected as duplicate
> blocks by the underlying storage (OpenSolaris ZFS in this case). However,
> the actual amount of dedup detected for the 2nd full backup was approx 70k
> blocks (~8.5GB). Subsequent runs of the full backup yielded similar results.
> Allowing for metadata, I would have expected at least 80% of the full
> backup to dedup.
> Several levels of dedup support could be implemented in stages:
> Phase 1 - File device dedup support
> - This would allow for dedup between file devices on the same system.
> - Add padding at the end of each file to a user configurable block size.
> DedupBlockSize = 8k (configurable, in bytes)
> - If the configuration option is missing, then disable all support for
> underlying dedup for file devices.
> Phase 2 - Autodetection of dedup supported file-systems
> - When dedup is provided by the host o/s of the file system device, the
> storage daemon should detect if dedup is enabled for the file device
> location. For Solaris / OpenSolaris ZFS, this value is available through
> the filesystem extended properties. In this case, if dedup is enabled for
> the ZFS filesystem, the storage daemon should read the filesystem block
> size and use this value. (Note: ZFS also uses variable block sizes, and
> thus will only allocate the required size if the requirement is less than
> the actual block size.)
> Phase 3 - Alignment of the datastream to underlying file-system blocks and
> separation of bacula metadata into separate blocks
> - This would allow for underlying storage system deduplication between both
> bacula file devices and real data stored elsewhere on the file-system /
> volume / san.
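
The padding proposed in Phase 1 of the quoted request comes down to rounding
each file's length up to the next multiple of the block size. A minimal Python
sketch of that arithmetic follows. The 8k value is taken from the request's
"DedupBlockSize = 8k" example; the function names and the choice of zero bytes
as filler are assumptions for illustration only, not an existing Bacula option
or API.

```python
DEDUP_BLOCK_SIZE = 8 * 1024  # "DedupBlockSize = 8k" from the request, in bytes


def padding_needed(length, block_size=DEDUP_BLOCK_SIZE):
    """Bytes of padding needed to extend length to the next block boundary."""
    return (-length) % block_size


def pad_to_block(data, block_size=DEDUP_BLOCK_SIZE):
    """Append zero bytes (an assumed filler) so len(result) is block-aligned."""
    return data + b"\0" * padding_needed(len(data), block_size)


padded = pad_to_block(b"x" * 10000)
print(len(padded), padding_needed(10000))
```

Note that padding the end of a file only aligns the start of the next file if
everything written in between is also a multiple of the block size, which is
why the request treats the separation of metadata as a later phase.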