From: Craig B. <cba...@us...> - 2011-03-03 07:24:28
Jeffrey,

> Were there previous topics that you posted to the mailing list that I
> missed? :P

They might have arrived out of order.

> Hopefully, the code will expose building-block subroutines for
> reconstructing the current backup from the reverse time deltas so that
> people can create their own extensions for accessing the contents of
> specific backups (e.g., like the backuppc fuse file system).

Yes, it's pretty easy to reconstruct a backup view.

> By extension, would it be possible to write all the routines in
> checksum agnostic fashion so that the choice of MD5 or sha256sum or
> whatever could be made by the user by just supplying the right perl
> calling routine or wrapper. This would make it possible for users to
> use whatever checksum they feel comfortable with -- potentially
> trading off speed for reliability.

Let me think about this. I want to match the full-file checksum in rsync
because it allows some clever things to be done. For example, with some
hacks, it would allow a new client file to be matched even the first time
if it is in the pool and you use the --checksum option to rsync (this
requires some changes to the server-side rsync).

> 1. Would it be possible to allow for simple user-extensions to add other
>    fields. This would make it easier and more robust to extend should the
>    need or desirability of adding other file attributes arise.

Good idea. Let me think about that.

> 2. Does the attrib file also presumably contain a pool file name so
>    that you can determine the pool file location?

The digest directly allows the pool name to be constructed. The digest
includes an extension in case of collisions.

> Will there be a routine to convert legacy 3.x stored backups to 4.x
> format?

The pool effectively gets migrated as each pool file is needed for 4.x.
But the old backups are not.

> In some ways having to keep around all the 3.x hard-links would defeat
> a lot of the easy copying benefits of 4.x or would require using a
> separate partition.

It's a good point. However, such a utility would probably have a very
long running time. It wouldn't be reversible, and I would be very
concerned about cases where it was killed or failed before finishing.

I was thinking I would add a utility that lists all the 4.x directories
(below $TOPDIR) that should be backed up. Those wouldn't have hardlinks.
That could be used with a user-supplied script to back up or replicate
all the 4.x backups.

> Couple of questions:
> 1. Does that mean that you are limited to 2^14 client hard links?

No. It's 2^14 files, each of which could, in theory (subject to memory),
hold a large number of entries.

> 2. Does the hardlink attrib file contain a list of the full file paths
>    of all the similarly hard-linked files?

No. Just like a regular file system, the reverse lookup isn't easy.

> This would be helpful for programs like my BackupPC_deleteFile
> since if you want to delete one hard linked file, it would be
> necessary presumably to decrement the nlinks count in all the
> attrib files of the other linked files.

No, that's not necessary. The convention is that if nlinks >= 2 in a file
entry, all that means is the real data is in the inode entry (and the
nlinks could actually be 1).
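
The pool-name construction described above ("the digest directly allows
the pool name to be constructed", with an extension in case of
collisions) can be pictured with a short Perl sketch. The two-level
fan-out directories and the "_N" collision extension below are
illustrative assumptions, not the actual 4.x on-disk layout.

    use strict;
    use warnings;

    sub pool_path_for_digest {
        my ($pool_top, $digest_hex, $collision_ext) = @_;
        # Fan the pool out by the first two pairs of hex characters so
        # that no single directory holds too many entries.
        my $d1 = substr($digest_hex, 0, 2);
        my $d2 = substr($digest_hex, 2, 2);
        my $name = $digest_hex;
        $name .= "_$collision_ext" if defined $collision_ext;  # chain extension
        return "$pool_top/$d1/$d2/$name";
    }

    # Example: the first (non-colliding) entry for a digest
    print pool_path_for_digest("/var/lib/backuppc/cpool",
                               "d41d8cd98f00b204e9800998ecf8427e"), "\n";
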
> Plus in general, it is nice
> to know what else is hard-linked without having to do an exhaustive
> search through the entire backup.
>
> In fact, I would imagine that the attribYY file would only need to
> contain such a list and little if any other information since that
> is all that is required presumably to restore the hard-links since
> the actual inode properties are stored (redundantly) in each of the
> individual attrib file representations.

True. But you would have to replicate and update all the file attributes
in that case. So any operation (like chmod etc.) would require all file
attributes to be changed.

> Other questions/suggestions:
> 1. Are the pool files now labeled with the full file md5sums?
>    If so, are you able to get that right out of the rsync checksums
>    (for protocol >=30 or so) or do they need to be computed
>    separately?

The path name of the pool file gives the md5 sum.

> How is the (unlikely) event of an md5sum collision handled?

Chains are still used for collisions. They are obviously unlikely in
typical use, but I have been testing it with the now well-known files
that have collisions.

> Is it always still necessary to compare actual file contents when
> adding a new file just in case the md5sums collide?

Yes.

> If so, would it be better to just use a "better" checksum so that
> you wouldn't need to worry about collisions and wouldn't have to
> always do a disk IO and CPU consuming file content comparison to
> check for the unlikely event of collision?

Good point, but as I mentioned above I would like to take advantage of
the fact that it is the same full-file checksum as rsync.

> 2. If you are changing the appended rsync digest format for cpool
>    files using rsync, I think it might be helpful to also store the
>    uncompressed filesize in the digest. There are several use cases
>    (including verifying rsync checksums where the filesize is required
>    to determine the blocksize) where I have needed to decompress the
>    entire file just to find out its size (and since I am in the pool
>    tree I don't have access to the attrib file to know its size).

Currently 4.x won't use the appended checksums. I'll explain how I'm
implementing rsync on the server side in another email. I could add that
for later, but it is more complex.

> Similarly, it might be nice to *always* have the md5sum checksum
> (or other) appended to the file even when not using the rsync
> transfer method. This would help with validating file
> integrity. Even if the md5sum is used for the pool file name, it
> may be nice to allow for an alternative more reliable checksum to
> be stored in the file envelope itself.

Interesting idea.

> 3. In the absence of pool hard links, how do you know when a pool file
>    can be deleted? Is there some table where the counts are
>    incremented/decremented as new backups are added or deleted?

Yes - see my other emails (assuming I sent them all out).

> 4. Do you have any idea whether 4.0 will be more or less resource
>    intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
>    - Backing up
>    - Reconstructing deltas in the web interface (or fuse filesystem)
>    - Restoring data

I hope it will be a lot more efficient, but I don't have any data yet.
There are several areas where efficiency will be much better. For
example, with reverse-deltas, a full or incremental backup with no
changes shouldn't need any significant disk writes. In contrast, 3.x has
to create a directory tree and, in the case of a full, make a complete
set of hardlinks. Also, rsync on the server side will be based on a
native C rsync. I'll send another email about that.
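
Going back to the collision handling described above, a rough Perl
sketch of the matching logic: a new file is hashed with the same
full-file MD5 that rsync uses, and a content comparison is only needed
when a pool entry with that digest already exists. The helper and the
"_N" chain suffix are the same illustrative assumptions as in the
earlier sketch, and pool compression is ignored for simplicity.

    use strict;
    use warnings;
    use Digest::MD5;
    use File::Compare qw(compare);

    sub pool_path_for_digest {    # same illustrative helper as above
        my ($pool_top, $digest, $ext) = @_;
        my $name = defined $ext ? "${digest}_$ext" : $digest;
        return "$pool_top/" . substr($digest, 0, 2) . "/"
                            . substr($digest, 2, 2) . "/$name";
    }

    sub match_or_add_digest {
        my ($pool_top, $new_file) = @_;

        open(my $fh, '<', $new_file) or die "can't read $new_file: $!";
        binmode($fh);
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);

        # Walk the collision chain: the base name first, then _0, _1, ...
        # (a real check would compare against the uncompressed pool data).
        for (my $ext = undef; ; $ext = defined $ext ? $ext + 1 : 0) {
            my $candidate = pool_path_for_digest($pool_top, $digest, $ext);
            # Free slot: the new file becomes a new pool entry here.
            return $candidate if !-e $candidate;
            # Identical content: the file is already in the pool.
            return $candidate if compare($new_file, $candidate) == 0;
            # Same MD5 but different content: a genuine collision,
            # so keep walking the chain.
        }
    }
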
> Most importantly, 4.0 sounds great and exciting!

I hope to get some time to work on it again. Unfortunately I haven't
made any progress in the last 4 months. Work has been very busy.

Craig
From: Fresel M. - hi c. e.U. <m.f...@hi...> - 2011-03-23 14:52:47
hi

just signed onto that list :)

RE for a post by Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-03 15:43

> Conversely, I would like to raise a suggestion I mentioned a while
> back with reference to 3.x. I think it would be great to have the
> ability to mark a backup to be saved and not automatically deleted
> based upon the expiry rules. Currently, I can fake it by renaming the
> backup (+/- adding a symlink to the original name). But it would be
> really nice to have an officially-supported convention that allows
> individual backups to be protected. My recommendation would be to add
> a suffix (e.g., .save) to the backup number. The particular use case I
> have in mind is when you upgrade a system (or otherwise make major
> changes) and specifically want to save the last backup of the
> pre-upgrade version.

ACK - it would be great to be able to mark a single backup as
"undeletable" by any kind of cleanup mechanism. Sometimes we create
multiple backups in small timeframes (i.e., before, during, and after
some system changes). Some kind of "protect this backup from deletion"
would be really nice ....
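
The ".save" suffix convention suggested above is not an existing
BackupPC feature, but a minimal Perl sketch of how an expiry pass could
honour it might look like this (the host path is only an example):

    use strict;
    use warnings;

    sub expirable_backups {
        my ($host_dir) = @_;
        opendir(my $dh, $host_dir) or die "can't open $host_dir: $!";
        # Directories named "123" are ordinary backups; "123.save" marks
        # a protected backup that the expiry rules must leave alone.
        my @candidates = grep { /^\d+$/ && -d "$host_dir/$_" } readdir($dh);
        closedir($dh);
        return sort { $a <=> $b } @candidates;
    }

    print join(" ", expirable_backups("/var/lib/backuppc/pc/somehost")), "\n";
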
From: Fresel M. - hi c. e.U. <m.f...@hi...> - 2011-03-23 15:42:58
re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-02 15:59

> If you are changing the appended rsync digest format for cpool
> files using rsync, I think it might be helpful to also store the
> uncompressed filesize in the digest. There are several use cases
> (including verifying rsync checksums where the filesize is required
> to determine the blocksize) where I have needed to decompress the
> entire file just to find out its size (and since I am in the pool
> tree I don't have access to the attrib file to know its size).

re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-03 16:40

> Alternatively, if you want the first time hack to work then you could
> make the pool file name equal to: <md5sum>_<SHA-256sum> which would
> still be smaller than SHA-512sum and I would wager that we are
> unlikely ever to start seeing lots of files with simultaneous
> collisions of the md5 and the SHA-256 checksums. In a sense, the
> SHA-256 checksum would act like a unique chain suffix and since it
> would always be there you never would have to actually decompress and
> compare the files to see if a chain is necessary. Plus you then would
> have two essentially independent checksums built into the file name.

I would propose to extend it to

    <MD5>_<SHA256>_NULL_<uncompressed_FILESIZE>

by default, and as an option (if the user enables it :)

    <MD5>_<SHA256>_<SHA512>_<uncompressed_FILESIZE>

Maybe somebody wants to recalculate the SHA512 sums afterwards (in idle
time?) - therefore the "NULL" in the default name above.

Indeed, this would generate very long filenames. As for the name-length
limit of 255:

    32_64_128_<filesize>

meaning there would be space left for 27 more characters (10^26), so we
could also append filesizes of ... uuh ... wait ...

    10^12 - terabyte
    10^15 - petabyte
    10^18 - exabyte ....

Well ... very big files :)

Having all kinds of checksums and sizes already calculated, this
information may be reused for custom user scripts like:

    # integrity testing of the pool using md5, sha256 AND sha512 :)
    # appending .sha256 or .sha512 files in archive operations
    # post-dump integrity tests on the client ...

Greetings
Mike
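
A sketch of the composite name proposed here, using the core
Digest::MD5 and Digest::SHA modules; the field order and the "NULL"
placeholder follow the suggestion above, everything else (names, the
error handling) is just for illustration:

    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    sub composite_pool_name {
        my ($file, $want_sha512) = @_;

        open(my $fh, '<', $file) or die "can't read $file: $!";
        binmode($fh);
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);

        my $sha256 = Digest::SHA->new(256)->addfile($file, 'b')->hexdigest;
        my $sha512 = $want_sha512
                   ? Digest::SHA->new(512)->addfile($file, 'b')->hexdigest
                   : 'NULL';
        my $size   = -s $file;    # uncompressed size of the source file
        my $name   = join('_', $md5, $sha256, $sha512, $size);

        # The three hex digests alone already take 32 + 64 + 128
        # characters, so keep an eye on the usual 255-byte name limit.
        die "pool name too long: $name\n" if length($name) > 255;
        return $name;
    }

    print composite_pool_name($ARGV[0], 1), "\n";
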
From: Jeffrey J. K. <bac...@ko...> - 2011-03-23 18:41:02
Fresel Michal - hi competence e.U. wrote at about 16:42:50 +0100 on Wednesday, March 23, 2011:

> re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-02 15:59
>
> > If you are changing the appended rsync digest format for cpool
> > files using rsync, I think it might be helpful to also store the
> > uncompressed filesize in the digest. There are several use cases
> > (including verifying rsync checksums where the filesize is required
> > to determine the blocksize) where I have needed to decompress the
> > entire file just to find out its size (and since I am in the pool
> > tree I don't have access to the attrib file to know its size).
>
> re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-03 16:40
>
> > Alternatively, if you want the first time hack to work then you could
> > make the pool file name equal to: <md5sum>_<SHA-256sum> which would
> > still be smaller than SHA-512sum and I would wager that we are
> > unlikely ever to start seeing lots of files with simultaneous
> > collisions of the md5 and the SHA-256 checksums. In a sense, the
> > SHA-256 checksum would act like a unique chain suffix and since it
> > would always be there you never would have to actually decompress and
> > compare the files to see if a chain is necessary. Plus you then would
> > have two essentially independent checksums built into the file name.
>
> i would propose to extend it to
> <MD5>_<SHA256>_NULL_<uncompressed_FILESIZE>
> by default
>
> and an option for (if the user enables it :)
> <MD5>_<SHA256>_<SHA512>_<uncompressed_FILESIZE>
>
> maybe somebody wants to recalculate the SHA512 sums afterwards (in
> idle time?) - therefore the "NULL" in the default name above
>
> indeed ... this would generate very long filenames:

I don't see the advantage of having SHA256 and SHA512. Let users choose
one or the other. The only reason I proposed adding another checksum is
if people are worried about MD5 collisions. So the goal would be to pick
a 2nd checksum, whether SHA256 or SHA512 or any other choice that the
user believes to be sufficiently unique.

Having the uncompressed filesize may be nice, but it is not critical to
unique pool naming, which after all is the purpose of the checksums.

> as for the name-length limit of 255
> 32_64_128_<filesize>
> meaning there would be space left for 27 more characters (10^26)
>
> so we could also append filesizes of ... uuh ... wait ...
> 10^12 - terabyte
> 10^15 - petabyte
> 10^18 - exabyte ....
> well ... very big files :)
>
> Having all kinds of checksums and sizes already calculated, this
> information may be reused for custom user scripts like
> # integrity testing of the pool using md5, sha256 AND sha512 :)
> # appending .sha256 or .sha512 files in archive operations
> # post-dump integrity tests on the client ...
>
> Greetings
> Mike
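
Jeffrey's "let users choose one or the other" could look roughly like
this in Perl, using the generic Digest front-end so the algorithm is a
user setting; $Conf{PoolExtraDigest} is an invented name for
illustration, not a real BackupPC configuration variable:

    use strict;
    use warnings;
    use Digest;

    # Invented setting for illustration only.
    my %Conf = (PoolExtraDigest => 'SHA-256');   # or 'SHA-512', 'MD5', ...

    sub extra_digest {
        my ($file) = @_;
        open(my $fh, '<', $file) or die "can't read $file: $!";
        binmode($fh);
        # Digest->new() dispatches to whichever implementation is installed.
        my $d = Digest->new($Conf{PoolExtraDigest});
        $d->addfile($fh);
        close($fh);
        return $d->hexdigest;
    }

    print extra_digest($ARGV[0]), "\n";
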
From: Fresel M. - hi c. e.U. <m.f...@hi...> - 2011-03-23 19:11:16
hi Jeffrey,

I posted "Thinking of 4.0 - change of compression level" afterwards,
suggesting the creation of some kind of ".info" file. The SHA256 and
SHA512 checksums would be included in that file, and so would the
uncompressed size. The "file_naming" change would thus be irrelevant.

On 23.03.2011 at 19:40, Jeffrey J. Kosowsky wrote:

> I don't see the advantage of having SHA256 and SHA512.

Why not calculate them now (i.e., when the server is idle?) to have them
for future use? Who knows what rsync will use next year? Not in the near
future, but e.g. sha256 for blocks and sha512 for the full file? Then we
would at least have our full-file checksums present.

> Let users choose one or the other.

Can be realized by that info file - it's still the user's decision which
additional checksums are created ....

> The only reason I proposed adding another
> checksum is if people are worried about MD5 collisions. So the goal
> would be to pick a 2nd checksum whether SHA256 or SHA512 or any other
> choice that the user believes to be sufficiently unique.

Not really worried about collisions, but about file integrity of the
server's pooled files + the time to recheck. Today it's quite common to
provide all three of them when downloading via the web ....

> Having the uncompressed filesize may be nice but it is not critical to
> unique pool naming which after all is the purpose of the checksums.

Might be implemented in some kind of "info" file.

Greetings
Mike
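
A rough sketch of the ".info" sidecar idea: a small key/value file
written next to each pool file, holding the extra checksums and the
uncompressed size so they never have to be recomputed on the fly. The
file name and format below are made up for illustration:

    use strict;
    use warnings;
    use Digest::SHA;

    sub write_info_file {
        my ($pool_file, $uncompressed_size) = @_;
        # Checksums of the pooled file as stored on the server.
        my $sha256 = Digest::SHA->new(256)->addfile($pool_file, 'b')->hexdigest;
        my $sha512 = Digest::SHA->new(512)->addfile($pool_file, 'b')->hexdigest;

        open(my $out, '>', "$pool_file.info")
            or die "can't write $pool_file.info: $!";
        print $out "sha256=$sha256\n";
        print $out "sha512=$sha512\n";
        print $out "size=$uncompressed_size\n";   # size before pool compression
        close($out);
    }

    sub read_info_file {
        my ($pool_file) = @_;
        open(my $in, '<', "$pool_file.info") or return undef;
        my %info = map { /^(\w+)=(.*)$/ ? ($1 => $2) : () } <$in>;
        close($in);
        return \%info;
    }
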