From: Craig B. <cba...@us...> - 2011-03-03 07:24:28
Jeffrey,

> Were there previous topics that you posted to the mailing list that I
> missed? :P

They might have arrived out of order.

> Hopefully, the code will expose building-block subroutines for
> reconstructing the current backup from the reverse time deltas so that
> people can create their own extensions for accessing the contents of
> specific backups (e.g., like the backuppc fuse file system).

Yes, it's pretty easy to reconstruct a backup view.

> By extension, would it be possible to write all the routines in
> checksum agnostic fashion so that the choice of MD5 or sha256sum or
> whatever could be made by the user by just supplying the right perl
> calling routine or wrapper. This would make it possible for users to
> use whatever checksum they feel comfortable with -- potentially
> trading off speed for reliability.

Let me think about this. I want to match the full-file checksum in rsync
because it allows some clever things to be done. For example, with some
hacks, it would allow a new client file to be matched even the first time
if it is in the pool and you use the --checksum option to rsync (this
requires some changes to the server-side rsync).

> 1. Would it be possible to allow for simple user-extensions to add other
>    fields. This would make it easier and more robust to extend should the
>    need or desirability of adding other file attributes arise.

Good idea. Let me think about that.

> 2. Does the attrib file also presumably contain a pool file name so
>    that you can determine the pool file location?

The digest directly allows the pool name to be constructed. The digest
includes an extension in case of collisions.

> Will there be a routine to convert legacy 3.x stored backups to 4.x
> format?

The pool effectively gets migrated as each pool file is needed for 4.x.
But the old backups are not.

> In some ways having to keep around all the 3.x hard-links would defeat
> a lot of the easy copying benefits of 4.x or would require using a
> separate partition.

It's a good point. However, such a utility would probably have a very
long running time. It wouldn't be reversible, and I would be very
concerned about cases where it was killed or failed before finishing.

I was thinking I would add a utility that lists all the 4.x directories
(below $TOPDIR) that should be backed up. Those wouldn't have hardlinks.
That could be used with a user-supplied script to back up or replicate
all the 4.x backups.

> Couple of questions:
> 1. Does that mean that you are limited to 2^14 client hard links?

No. It's 2^14 files, each of which could, in theory (subject to memory),
hold a large number of entries.

> 2. Does the hardlink attrib file contain a list of the full file paths
>    of all the similarly hard-linked files?

No. Just like a regular file system, the reverse lookup isn't easy.

> This would be helpful for programs like my BackupPC_deleteFile
> since if you want to delete one hard linked file, it would be
> necessary presumably to decrement the nlinks count in all the
> attrib files of the other linked files.

No, that's not necessary. The convention is that if nlinks >= 2 in a file
entry, all that means is the real data is in the inode entry (and the
nlinks could actually be 1).
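
The pool-name construction described above ("the digest directly allows
the pool name to be constructed", with an extension in case of
collisions) can be pictured with a short Perl sketch. The two-level
fan-out directories and the "_N" collision extension below are
illustrative assumptions, not the actual 4.x on-disk layout.

    use strict;
    use warnings;

    sub pool_path_for_digest {
        my ($pool_top, $digest_hex, $collision_ext) = @_;
        # Fan the pool out by the first two pairs of hex characters so
        # that no single directory holds too many entries.
        my $d1 = substr($digest_hex, 0, 2);
        my $d2 = substr($digest_hex, 2, 2);
        my $name = $digest_hex;
        $name .= "_$collision_ext" if defined $collision_ext;  # chain extension
        return "$pool_top/$d1/$d2/$name";
    }

    # Example: the first (non-colliding) entry for a digest
    print pool_path_for_digest("/var/lib/backuppc/cpool",
                               "d41d8cd98f00b204e9800998ecf8427e"), "\n";
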
> Plus in general, it is nice
> to know what else is hard-linked without having to do an exhaustive
> search through the entire backup.
>
> In fact, I would imagine that the attribYY file would only need to
> contain such a list and little if any other information since that
> is all that is required presumably to restore the hard-links since
> the actual inode properties are stored (redundantly) in each of the
> individual attrib file representations.

True. But you would have to replicate and update all the file attributes
in that case. So any operation (like chmod etc.) would require all file
attributes to be changed.

> Other questions/suggestions:
> 1. Are the pool files now labeled with the full file md5sums?
>    If so, are you able to get that right out of the rsync checksums
>    (for protocol >=30 or so) or do they need to be computed
>    separately?

The path name of the pool file gives the md5 sum.

> How is the (unlikely) event of an md5sum collision handled?

Chains are still used for collisions. They are obviously unlikely in
typical use, but I have been testing it with the now well-known files
that have collisions.

> Is it always still necessary to compare actual file contents when
> adding a new file just in case the md5sums collide?

Yes.

> If so, would it be better to just use a "better" checksum so that
> you wouldn't need to worry about collisions and wouldn't have to
> always do a disk IO and CPU consuming file content comparison to
> check for the unlikely event of collision?

Good point, but as I mentioned above I would like to take advantage of
the fact that it is the same full-file checksum as rsync.

> 2. If you are changing the appended rsync digest format for cpool
>    files using rsync, I think it might be helpful to also store the
>    uncompressed filesize in the digest. There are several use cases
>    (including verifying rsync checksums where the filesize is required
>    to determine the blocksize) where I have needed to decompress the
>    entire file just to find out its size (and since I am in the pool
>    tree I don't have access to the attrib file to know its size).

Currently 4.x won't use the appended checksums. I'll explain how I'm
implementing rsync on the server side in another email. I could add that
for later, but it is more complex.

> Similarly, it might be nice to *always* have the md5sum checksum
> (or other) appended to the file even when not using the rsync
> transfer method. This would help with validating file
> integrity. Even if the md5sum is used for the pool file name, it
> may be nice to allow for an alternative more reliable checksum to
> be stored in the file envelope itself.

Interesting idea.

> 3. In the absence of pool hard links, how do you know when a pool file
>    can be deleted? Is there some table where the counts are
>    incremented/decremented as new backups are added or deleted?

Yes - see my other emails (assuming I sent them all out).

> 4. Do you have any idea whether 4.0 will be more or less resource
>    intensive (e.g., disk IO, network, cpu) than 3.x when it comes to:
>    - Backing up
>    - Reconstructing deltas in the web interface (or fuse filesystem)
>    - Restoring data

I hope it will be a lot more efficient, but I don't have any data yet.
There are several areas where efficiency will be much better. For
example, with reverse-deltas, a full or incremental backup with no
changes shouldn't need any significant disk writes. In contrast, 3.x has
to create a directory tree and, in the case of a full, make a complete
set of hardlinks. Also, rsync on the server side will be based on a
native C rsync. I'll send another email about that.
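
Going back to the collision handling described above, a rough Perl
sketch of the matching logic: a new file is hashed with the same
full-file MD5 that rsync uses, and a content comparison is only needed
when a pool entry with that digest already exists. The helper and the
"_N" chain suffix are the same illustrative assumptions as in the
earlier sketch, and pool compression is ignored for simplicity.

    use strict;
    use warnings;
    use Digest::MD5;
    use File::Compare qw(compare);

    sub pool_path_for_digest {    # same illustrative helper as above
        my ($pool_top, $digest, $ext) = @_;
        my $name = defined $ext ? "${digest}_$ext" : $digest;
        return "$pool_top/" . substr($digest, 0, 2) . "/"
                            . substr($digest, 2, 2) . "/$name";
    }

    sub match_or_add_digest {
        my ($pool_top, $new_file) = @_;

        open(my $fh, '<', $new_file) or die "can't read $new_file: $!";
        binmode($fh);
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);

        # Walk the collision chain: the base name first, then _0, _1, ...
        # (a real check would compare against the uncompressed pool data).
        for (my $ext = undef; ; $ext = defined $ext ? $ext + 1 : 0) {
            my $candidate = pool_path_for_digest($pool_top, $digest, $ext);
            # Free slot: the new file becomes a new pool entry here.
            return $candidate if !-e $candidate;
            # Identical content: the file is already in the pool.
            return $candidate if compare($new_file, $candidate) == 0;
            # Same MD5 but different content: a genuine collision,
            # so keep walking the chain.
        }
    }
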
> Most importantly, 4.0 sounds great and exciting!

I hope to get some time to work on it again. Unfortunately I haven't
made any progress in the last 4 months. Work has been very busy.

Craig
From: Fresel M. - hi c. e.U. <m.f...@hi...> - 2011-03-23 14:52:47
hi

just signed onto that list :)

RE for a post by Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-03 15:43

> Conversely, I would like to raise a suggestion I mentioned a while
> back with reference to 3.x. I think it would be great to have the
> ability to mark a backup to be saved and not automatically deleted
> based upon the expiry rules. Currently, I can fake it by renaming the
> backup (+/- adding a symlink to the original name). But it would be
> really nice to have an officially-supported convention that allows
> individual backups to be protected. My recommendation would be to add
> a suffix (e.g., .save) to the backup number. The particular use case I
> have in mind is when you upgrade a system (or otherwise make major
> changes) and specifically want to save the last backup of the
> pre-upgrade version.

ACK - it would be great to be able to mark a single backup as
"undeletable" by any kind of cleanup mechanism. Sometimes we create
multiple backups in small timeframes (i.e., before, during, and after
some system changes). Some kind of "protect this backup from deletion"
would be really nice ....
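
The ".save" suffix convention suggested above is not an existing
BackupPC feature, but a minimal Perl sketch of how an expiry pass could
honour it might look like this (the host path is only an example):

    use strict;
    use warnings;

    sub expirable_backups {
        my ($host_dir) = @_;
        opendir(my $dh, $host_dir) or die "can't open $host_dir: $!";
        # Directories named "123" are ordinary backups; "123.save" marks
        # a protected backup that the expiry rules must leave alone.
        my @candidates = grep { /^\d+$/ && -d "$host_dir/$_" } readdir($dh);
        closedir($dh);
        return sort { $a <=> $b } @candidates;
    }

    print join(" ", expirable_backups("/var/lib/backuppc/pc/somehost")), "\n";
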
From: Fresel M. - hi c. e.U. <m.f...@hi...> - 2011-03-23 15:42:58
re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-02 15:59

> If you are changing the appended rsync digest format for cpool
> files using rsync, I think it might be helpful to also store the
> uncompressed filesize in the digest. There are several use cases
> (including verifying rsync checksums where the filesize is required
> to determine the blocksize) where I have needed to decompress the
> entire file just to find out its size (and since I am in the pool
> tree I don't have access to the attrib file to know its size).

re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-03 16:40

> Alternatively, if you want the first time hack to work then you could
> make the pool file name equal to: <md5sum>_<SHA-256sum> which would
> still be smaller than SHA-512sum and I would wager that we are
> unlikely ever to start seeing lots of files with simultaneous
> collisions of the md5 and the SHA-256 checksums. In a sense, the
> SHA-256 checksum would act like a unique chain suffix and since it
> would always be there you never would have to actually decompress and
> compare the files to see if a chain is necessary. Plus you then would
> have two essentially independent checksums built into the file name.

I would propose to extend it to

    <MD5>_<SHA256>_NULL_<uncompressed_FILESIZE>

by default, and as an option (if the user enables it :)

    <MD5>_<SHA256>_<SHA512>_<uncompressed_FILESIZE>

Maybe somebody wants to recalculate the SHA512 sums afterwards (in idle
time?) - therefore the "NULL" in the default name above.

Indeed, this would generate very long filenames. As for the name-length
limit of 255:

    32_64_128_<filesize>

meaning there would be space left for 27 more characters (10^26), so we
could also append filesizes of ... uuh ... wait ...

    10^12 - terabyte
    10^15 - petabyte
    10^18 - exabyte ....

Well ... very big files :)

Having all kinds of checksums and sizes already calculated, this
information may be reused for custom user scripts like:

    # integrity testing of the pool using md5, sha256 AND sha512 :)
    # appending .sha256 or .sha512 files in archive operations
    # post-dump integrity tests on the client ...

Greetings
Mike
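
A sketch of the composite name proposed here, using the core
Digest::MD5 and Digest::SHA modules; the field order and the "NULL"
placeholder follow the suggestion above, everything else (names, the
error handling) is just for illustration:

    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    sub composite_pool_name {
        my ($file, $want_sha512) = @_;

        open(my $fh, '<', $file) or die "can't read $file: $!";
        binmode($fh);
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);

        my $sha256 = Digest::SHA->new(256)->addfile($file, 'b')->hexdigest;
        my $sha512 = $want_sha512
                   ? Digest::SHA->new(512)->addfile($file, 'b')->hexdigest
                   : 'NULL';
        my $size   = -s $file;    # uncompressed size of the source file
        my $name   = join('_', $md5, $sha256, $sha512, $size);

        # The three hex digests alone already take 32 + 64 + 128
        # characters, so keep an eye on the usual 255-byte name limit.
        die "pool name too long: $name\n" if length($name) > 255;
        return $name;
    }

    print composite_pool_name($ARGV[0], 1), "\n";
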
From: Jeffrey J. K. <bac...@ko...> - 2011-03-23 18:41:02
Fresel Michal - hi competence e.U. wrote at about 16:42:50 +0100 on Wednesday, March 23, 2011:

> re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-02 15:59
>
> > If you are changing the appended rsync digest format for cpool
> > files using rsync, I think it might be helpful to also store the
> > uncompressed filesize in the digest. There are several use cases
> > (including verifying rsync checksums where the filesize is required
> > to determine the blocksize) where I have needed to decompress the
> > entire file just to find out its size (and since I am in the pool
> > tree I don't have access to the attrib file to know its size).
>
> re to Jeffrey J. Kosowsky <backuppc@ko...> - 2011-03-03 16:40
>
> > Alternatively, if you want the first time hack to work then you could
> > make the pool file name equal to: <md5sum>_<SHA-256sum> which would
> > still be smaller than SHA-512sum and I would wager that we are
> > unlikely ever to start seeing lots of files with simultaneous
> > collisions of the md5 and the SHA-256 checksums. In a sense, the
> > SHA-256 checksum would act like a unique chain suffix and since it
> > would always be there you never would have to actually decompress and
> > compare the files to see if a chain is necessary. Plus you then would
> > have two essentially independent checksums built into the file name.
>
> i would propose to extend it to
> <MD5>_<SHA256>_NULL_<uncompressed_FILESIZE>
> by default
>
> and an option for (if the user enables it :)
> <MD5>_<SHA256>_<SHA512>_<uncompressed_FILESIZE>
>
> maybe somebody wants to recalculate the SHA512 sums afterwards (in
> idle time?) - therefore the "NULL" in the default name above
>
> indeed ... this would generate very long filenames:

I don't see the advantage of having SHA256 and SHA512. Let users choose
one or the other. The only reason I proposed adding another checksum is
if people are worried about MD5 collisions. So the goal would be to pick
a 2nd checksum, whether SHA256 or SHA512 or any other choice that the
user believes to be sufficiently unique.

Having the uncompressed filesize may be nice, but it is not critical to
unique pool naming, which after all is the purpose of the checksums.

> as for the name-length limit of 255
> 32_64_128_<filesize>
> meaning there would be space left for 27 more characters (10^26)
>
> so we could also append filesizes of ... uuh ... wait ...
> 10^12 - terabyte
> 10^15 - petabyte
> 10^18 - exabyte ....
> well ... very big files :)
>
> Having all kinds of checksums and sizes already calculated, this
> information may be reused for custom user scripts like
> # integrity testing of the pool using md5, sha256 AND sha512 :)
> # appending .sha256 or .sha512 files in archive operations
> # post-dump integrity tests on the client ...
>
> Greetings
> Mike
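
Jeffrey's "let users choose one or the other" could look roughly like
this in Perl, using the generic Digest front-end so the algorithm is a
user setting; $Conf{PoolExtraDigest} is an invented name for
illustration, not a real BackupPC configuration variable:

    use strict;
    use warnings;
    use Digest;

    # Invented setting for illustration only.
    my %Conf = (PoolExtraDigest => 'SHA-256');   # or 'SHA-512', 'MD5', ...

    sub extra_digest {
        my ($file) = @_;
        open(my $fh, '<', $file) or die "can't read $file: $!";
        binmode($fh);
        # Digest->new() dispatches to whichever implementation is installed.
        my $d = Digest->new($Conf{PoolExtraDigest});
        $d->addfile($fh);
        close($fh);
        return $d->hexdigest;
    }

    print extra_digest($ARGV[0]), "\n";
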
From: Fresel M. - hi c. e.U. <m.f...@hi...> - 2011-03-23 19:11:16
hi Jeffrey,

I posted "Thinking of 4.0 - change of compression level" afterwards,
suggesting the creation of some kind of ".info" file. The SHA256 and
SHA512 checksums would be included in that file, and so would the
uncompressed size. The "file_naming" change would thus be irrelevant.

On 23.03.2011 at 19:40, Jeffrey J. Kosowsky wrote:

> I don't see the advantage of having SHA256 and SHA512.

Why not calculate them now (i.e., when the server is idle?) to have them
for future use? Who knows what rsync will use next year? Not in the near
future, but e.g. sha256 for blocks and sha512 for the full file? Then we
would at least have our full-file checksums present.

> Let users choose one or the other.

Can be realized by that info file - it's still the user's decision which
additional checksums are created ....

> The only reason I proposed adding another
> checksum is if people are worried about MD5 collisions. So the goal
> would be to pick a 2nd checksum whether SHA256 or SHA512 or any other
> choice that the user believes to be sufficiently unique.

Not really worried about collisions, but about file integrity of the
server's pooled files + the time to recheck. Today it's quite common to
provide all three of them when downloading via the web ....

> Having the uncompressed filesize may be nice but it is not critical to
> unique pool naming which after all is the purpose of the checksums.

Might be implemented in some kind of "info" file.

Greetings
Mike
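
A rough sketch of the ".info" sidecar idea: a small key/value file
written next to each pool file, holding the extra checksums and the
uncompressed size so they never have to be recomputed on the fly. The
file name and format below are made up for illustration:

    use strict;
    use warnings;
    use Digest::SHA;

    sub write_info_file {
        my ($pool_file, $uncompressed_size) = @_;
        # Checksums of the pooled file as stored on the server.
        my $sha256 = Digest::SHA->new(256)->addfile($pool_file, 'b')->hexdigest;
        my $sha512 = Digest::SHA->new(512)->addfile($pool_file, 'b')->hexdigest;

        open(my $out, '>', "$pool_file.info")
            or die "can't write $pool_file.info: $!";
        print $out "sha256=$sha256\n";
        print $out "sha512=$sha512\n";
        print $out "size=$uncompressed_size\n";   # size before pool compression
        close($out);
    }

    sub read_info_file {
        my ($pool_file) = @_;
        open(my $in, '<', "$pool_file.info") or return undef;
        my %info = map { /^(\w+)=(.*)$/ ? ($1 => $2) : () } <$in>;
        close($in);
        return \%info;
    }
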