From: Craig B. <cba...@us...> - 2011-03-02 08:33:05
The next topic is the pool structure in 4.x.

Here are the differences in pool file storage between 3.x and 4.x:

  - The digest changes from partial MD4 to full-file MD5. This will
    significantly reduce pool collisions: in almost all installations
    there will be none. The most common exception will be if someone
    uses the now well-known constructed cases of different files with
    MD5 collisions. In 3.x a partial MD4 digest is used, so
    collisions are more common. The file system's hardlink limit can
    also cause more entries in a 3.x pool file chain. In 4.x
    reference counting is done using a simple database, so the file
    system hardlink limit isn't relevant.

  - If pool files do collide, a chain is created by appending one or
    more bytes to the MD5 digest as a counter. The first instance of
    a pool file will have a regular 16 byte digest. The next file
    that is different but has the same MD5 digest will be stored with
    a 17 byte digest: the 16 byte MD5 plus an extra byte of 0x01. The
    file at index 256 in the chain (unlikely of course) will have two
    bytes appended: 0x0100. The extension is basically the file
    index with leading 0x00 bytes removed.

  - 4.x doesn't use hardlinks (except as inherited from existing 3.x
    pools).

  - In 4.x pool files are never renamed. In 3.x, pool files in a
    chain of repeated digests will be renamed if one of the middle
    files is deleted. In the unlikely event there is a chain of
    repeated files in 4.x, and one of the files is deleted (ie: no
    longer referenced), it is replaced by a zero-length file. That
    acts as a tag that searching through the chain should continue
    past that point, and also as a tag that the file can be replaced
    by a real pool file when the next file is added.

  - In 4.x the pool files are stored two levels deep, with 128
    directories at each level. The directories are numbered in hex
    from 00 to fe in steps of 2. The directory names are based on
    the first two bytes of the MD5 digest, each anded with 0xfe.
    For example, a file with digest 0458d9d0e9ddd2b6b21a1e60b6cdf323
    will be stored in:

        CPOOL_DIR/04/58/0458d9d0e9ddd2b6b21a1e60b6cdf323

    while a file with digest 09682c6df94c87b1e9ee6e1d0d89e8f2 will be
    stored in:

        CPOOL_DIR/08/68/09682c6df94c87b1e9ee6e1d0d89e8f2

    (notice that 0x09 & 0xfe == 0x08).

    In 3.x the directories are three levels deep, with 16 directories
    at each level based on the first 3 hex digits of the partial MD4
    digest. So in 3.x there are 16^3 = 4096 leaf directories, while
    in 4.x there are 128 * 128 = 16384 leaf directories.

  - 3.x and 4.x use the same CPOOL_DIR. The trees below it are
    separate because of the different directory naming conventions.

  - In 4.x, pool file matching needs the full-file MD5 digest. There
    is also a flag, $bpc->{PoolV3}, that determines whether old 3.x
    pool files should be checked too. Currently that flag is
    hardcoded and I need to make it autodetect whether there are any
    old pool files (I guess based on BackupPC_nightly?). If PoolV3
    is set and there are no candidate 4.x files, then the old digest
    is computed too and 3.x candidate pool files are also checked for
    matches. If an old 3.x pool file is matched, then that file is
    renamed to the corresponding 4.x pool file path (based on the MD5
    digest). This file might still have multiple hardlinks due to
    the existing 3.x backups. As those backups are expired, the link
    count on the pool file will eventually decrease to 1.

  - For backing up the BackupPC store in a mixed V3/V4 environment it
    should be possible to just copy the new V4 pool and new V4
    backups (without worrying about hardlinks that might remain on
    pool files from V3 backups). However, I need to devise a way of
    determining the paths of the V4 backups. Perhaps I should add a
    utility that lists all the directories that should be backed up?

Craig
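
To make the 4.x directory naming and the collision-chain extension described above concrete, here is a minimal Python sketch. It is not BackupPC code: the function names chain_extension and pool_path are made up for illustration, and it assumes the on-disk file name is simply the hex encoding of the digest followed by the extension bytes.

```python
def chain_extension(index: int) -> bytes:
    """Counter bytes appended to the 16 byte MD5 digest (hypothetical helper).

    Index 0 (the first file with a given digest) gets no extension.
    Otherwise the index is written big-endian with leading 0x00 bytes
    removed, as the post describes.
    """
    if index < 0:
        raise ValueError("chain index must be non-negative")
    if index == 0:
        return b""
    length = (index.bit_length() + 7) // 8  # minimal number of bytes
    return index.to_bytes(length, "big")


def pool_path(cpool_dir: str, digest: bytes, index: int = 0) -> str:
    """4.x pool path for a full-file MD5 digest (hypothetical helper).

    Each of the two directory levels is one of the first two digest
    bytes anded with 0xfe, giving 128 directories per level.
    Assumption: the file name is the hex of digest + extension.
    """
    if len(digest) != 16:
        raise ValueError("expected a 16 byte MD5 digest")
    top = digest[0] & 0xFE
    sub = digest[1] & 0xFE
    name = (digest + chain_extension(index)).hex()
    return f"{cpool_dir}/{top:02x}/{sub:02x}/{name}"


if __name__ == "__main__":
    d = bytes.fromhex("09682c6df94c87b1e9ee6e1d0d89e8f2")
    print(pool_path("CPOOL_DIR", d))     # first file with this digest
    print(pool_path("CPOOL_DIR", d, 1))  # second, colliding file
```

With the two digests from the example, this yields the CPOOL_DIR/04/58/... and CPOOL_DIR/08/68/... paths shown above. A colliding second file would land in the same directory with 01 appended to the name, under the file-name assumption stated here.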