Wessel, great to hear someone is tinkering with improving backuppc!
I've spent a fair bit of time thinking about backuppc and have pondered
the write-something-from-scratch vs. try-to-improve-backuppc question.
I'm quite fond of backuppc because of:
* VERY nice web interface
* Multiple restore methods, nice browsing, etc.
* Fairly robust
* compatible with standard client side tools like rsync/tar
* Fairly painless to back up 10-100 windows/linux/osx machines.
My main complaints:
* The pool is hard to scale in size or performance
* It's hard to do offsite backups (all backups in 2 or more places)
* management issues with how big a pool can be, how hard it is to
replicate, and how hard it is to fsck (depending on size and
filesystem of course).
Especially these days, when 36x3TB isn't hard to fit in a server, nor
are multiple gbit/10G/40G network connections.
Second on my wish list is encryption: ideally, people backing up files
from their clients wouldn't have to trust the admin not to violate
their privacy.
To this end I've been writing some code to:
* Keep a client-side sqlite database for metadata and the file -> blob mapping
* Use all available client side CPU cores (even my $200 tablet has 4)
for compression, encryption, and checksum. Keeping up with a GigE
network connection is non-trivial.
* implement a network protocol (using protobufs) for a dedupe enabled
upload to server
* Protocol allows for "I have these new encrypted blobs, server which
ones don't you have?".
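A toy sketch of that dedupe negotiation, with plain Python sets standing
in for the protobuf messages (`blob_id` and `negotiate_upload` are names
I made up for illustration, not the actual protocol):

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Content address: hex sha512 of the (possibly encrypted) blob."""
    return hashlib.sha512(data).hexdigest()

def negotiate_upload(client_blobs: dict, server_has: set) -> list:
    """Client asks: 'I have these new blobs -- which ones don't you have?'
    Returns the blob ids the client still needs to upload."""
    return [bid for bid in client_blobs if bid not in server_has]

# Toy exchange: the server already holds one of the client's two blobs.
a, b = b"chunk-a", b"chunk-b"
blobs = {blob_id(a): a, blob_id(b): b}
server_has = {blob_id(a)}
missing = negotiate_upload(blobs, server_has)  # only chunk-b needs uploading
```

Only the ids cross the wire in the query, so unchanged data costs one
hash lookup rather than an upload.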
That client-side code is written, benchmarked, and tested, but certainly
not finished.
What isn't written is the ability to let servers trade encrypted blobs
to ensure replication goals are met. Ideally it would support the use
case of 3 sysadmins in an organization trading 10TB each so they can
all have 3 copies of their backups. Assuming GigE connections and
reasonably low churn it seems feasible.
> The ideas overlap to a limited extent with the ideas that Craig posted
> to this list. For instance, no more hardlinks, and garbage collection is
> done using flat-file databases. Some things are quite different. I'll try
> to explain my ideas here.
I strongly agree with Craig's identification of the weaknesses and with
his proposed solution of content addressable storage as a replacement
for hardlinks.
I cringe at the thought of a flat-file database. There seems little
reason not to use sqlite or similar, and then let the folks who prefer
mysql, postgres, or the SQL flavor of the day implement whatever is
needed. Most importantly it should be reliable, and not freak out if
the sum of the metadata is larger than RAM. IMO sqlite or similar is
the easiest way to do that.
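For concreteness, a minimal sketch of what such a client-side sqlite
metadata store could look like (the schema and column names are my
invention, not anything BackupPC-specific):

```python
import sqlite3

# Minimal client-side metadata store: files plus their file -> blob mapping.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (
    path  TEXT PRIMARY KEY,
    mtime INTEGER,
    size  INTEGER
);
CREATE TABLE file_blobs (
    path  TEXT,
    seq   INTEGER,          -- chunk order within the file
    blob  TEXT,             -- sha512 of the chunk
    PRIMARY KEY (path, seq)
);
""")
db.execute("INSERT INTO files VALUES (?,?,?)", ("/etc/motd", 1700000000, 42))
db.execute("INSERT INTO file_blobs VALUES (?,?,?)", ("/etc/motd", 0, "ab" * 64))
db.commit()
# Reassembling a file is an ordered walk over its blob ids.
rows = db.execute("SELECT blob FROM file_blobs WHERE path=? ORDER BY seq",
                  ("/etc/motd",)).fetchall()
```

Since sqlite keeps this on disk with its own paging, the metadata can
comfortably exceed RAM, which is exactly the property flat files lack.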
> 1) I/O: writes are fast, reads are slow. Random reads are very slow.
> Writing can be done asynchronously. Even random writes can be elevatored to
> the platters fairly efficiently. Reading, on the other hand, is blocking to
> the process doing it. Running many processes in parallel can alleviate the
> random-read bottleneck a little but it's still best to do sequential reads
> whenever possible.
Not sure in what context you are discussing this. It's not an issue
client side (IMO), and server side you mostly want fast metadata reads,
and fast blob writes. After all I'd expect 100s of backups (writes to
the pool) for every read (a restore).
Ideally I could have N disks and put 1/N of the sha512 blobs on each.
Or maybe 2/N if I want to survive a single disk failure. Much like how
Hadoop file system works.
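The placement idea can be sketched as deriving a blob's disk(s) directly
from its hash (a toy scheme of my own, not how HDFS actually places
blocks):

```python
import hashlib

def shards_for(blob_hex: str, n_disks: int, copies: int = 2):
    """Place a blob on `copies` of the `n_disks` disks, chosen
    deterministically from its content hash."""
    h = int(blob_hex[:16], 16)              # first 64 bits of the hash
    return [(h + i) % n_disks for i in range(copies)]

blob = hashlib.sha512(b"some chunk").hexdigest()
where = shards_for(blob, n_disks=8, copies=2)  # survives one disk failure
```

Because placement is a pure function of the hash, any node can locate a
blob without a central index, and adding disks scales both capacity and
spindle count.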
Of course even better would be a network service like, say, the Hadoop
distributed file system or a similar blob/object store. Sadly I don't
know of a nice, robust, easy-to-use content addressable storage system
that would be ideal for backuppc.
> 2) CPU and RAM are cheap.
> BackupPC parallelizes well and for larger installations 12-core or even
> 24-core systems can be had for reasonable prices. Memory is dirt cheap.
Agreed on cores and RAM being cheap. Ideally a replacement for the
hardlinked pool would allow:
* file dedupe
* scaling performance and storage with additional disks
* Easy ability to check integrity
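The integrity-check point falls out naturally from CAS: since a blob's
name is its checksum, verification is just rehashing. A sketch, with
the pool modeled as a dict and the function name my own:

```python
import hashlib

def verify_pool(pool: dict) -> list:
    """Rehash every blob and compare against its content address.
    Returns the ids whose data no longer matches (i.e. corruption)."""
    return [bid for bid, data in pool.items()
            if hashlib.sha512(data).hexdigest() != bid]

good = b"intact blob"
pool = {hashlib.sha512(good).hexdigest(): good,
        hashlib.sha512(b"original").hexdigest(): b"bit-rotted"}
corrupt = verify_pool(pool)  # flags the blob whose bytes changed
```

Unlike an fsck over millions of hardlinks, this is a sequential scan
that parallelizes trivially across disks.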
So basically it sounds like the ideal system would be CAS:
> File data
> All files are to be split into chunks of (currently) 2 megabytes. These
> chunks are addressed by their sha512 hashes (that is, the filename for each
> chunk is simply the base64 encoded hash). Any compression is applied after
> splitting. This provides dedup even for large files that change or
> otherwise differ only slightly.
Seems low, why not 10-100 times that? More sequential I/O is better
right? In most cases I've seen where block level dedupe would be a big
win they were doing something unsafe in the first place. Like say for
instance backing up a large live database. (like Mysql, Zodb, or exchange).
The main reason for conserving seeks is that in most use cases churn is
low, so most of the work is identifying what pieces need to be uploaded
from a client and if the server already has any of those pieces.
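For reference, the quoted chunking scheme can be sketched like this
(urlsafe base64 is my assumption, since plain base64 output contains
'/' which can't appear in a filename; compression applied after
splitting, as Craig describes):

```python
import base64, hashlib, io, zlib

CHUNK = 2 * 1024 * 1024  # Craig's proposed chunk size

def chunk_names(stream):
    """Split a stream into 2MB chunks; each chunk's pool filename is the
    base64 of its sha512, and compression happens after splitting."""
    out = []
    while True:
        data = stream.read(CHUNK)
        if not data:
            break
        name = base64.urlsafe_b64encode(hashlib.sha512(data).digest()).decode()
        out.append((name, zlib.compress(data)))
    return out

# A file one byte over 2MB yields exactly two chunks.
names = chunk_names(io.BytesIO(b"x" * (CHUNK + 1)))
```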
> The use of sha512 will eliminate collisions completely, and provide safe
> dedup without expensive whole-file comparisons. This obviates the need for
> a lot of disk reading.
I suspect about half the people will argue that's too much, and about
half will argue that's too little. Ideally this would be settable. In
particular there are some promising candidates that are likely to be
much faster than sha512, yet still very collision resistant... thus the
argument for making the hash configurable.
I don't follow how sha512/CAS avoids the need for a lot of disk reading.
So say a 1GB file on the client changes. With file level dedupe the
client basically has to say:
"Hey server, do you have checksum foo?"
The server then does a single flat file or sql lookup and says yes or no.
With 2MB blocks you have to do the above 512 times, right?
Additionally any store or retrieve will have to do 512 seeks. Keep in
mind that an average disk these days does 120MB/sec or so, and even a
small RAID can do many times that.
In either dedupe case, on the client side you either have to keep a
cache of timestamp, filename -> hash and trust the timestamps, or read
the entire file and compute checksums.
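The timestamp-trusting cache can be sketched as follows (a hypothetical
helper, not existing BackupPC code):

```python
import hashlib, os

def file_hash(path: str, cache: dict) -> str:
    """Return the file's sha512, consulting a path -> (mtime, hash) cache.
    If the mtime is unchanged we trust it and skip reading the file;
    otherwise we read the whole file and recompute."""
    st = os.stat(path)
    hit = cache.get(path)
    if hit and hit[0] == st.st_mtime_ns:
        return hit[1]                       # cache hit: no disk read
    with open(path, "rb") as f:
        digest = hashlib.sha512(f.read()).hexdigest()
    cache[path] = (st.st_mtime_ns, digest)
    return digest
```

The trade-off is exactly as stated above: a stat per file if you trust
timestamps, a full read per file if you don't.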
Seems like the biggest win would be to have the client send you the
complete file checksum. I think Craig mentioned some magic to be
checksum compatible with rsync using some particular flag or another.