From: Jeffrey J. K. <bac...@ko...> - 2009-06-03 23:37:05
|
Holger Parplies wrote at about 23:45:35 +0200 on Wednesday, June 3, 2009: > Hi, > > Peter Walter wrote on 2009-06-03 16:15:37 -0400 [Re: [BackupPC-users] Backing up a BackupPC server]: > > [...] > > My understanding is that, if it were not for the > > hardlinks, that rsync transfers to another server would be more > > feasible; > > right. > > > that processing the hardlinks requires significant cpu > > resources, memory resources, and that access times are very slow, > > Memory: yes. CPU: I don't think so. Access times very slow? Well, the inodes > referenced from one directory are probably scattered all over the place, so > traversing the file tree (e.g. "find $TopDir -ls") is probably slower than > in "normal" directories. Or do you mean swapping slows down memory accesses > by several orders of magnitude? > > > compared to processing ordinary files. Is my understanding correct? If > > so, then what I would think of doing is (a) shutting down backuppc (b) > > creating a "dump" file containing the hardlink metadata (c) backing up > > the pooled files and the dump file using rsync (d) restarting backuppc. > > I really don't need a live, working copy of the backuppc file system - > > just a way to recreate it from a backup if necessary, using an "undump" > > program that recreated the hardlinks from the dump file. Is this > > approach feasible? > > Yes. I'm just not certain how you would test it. You can undoubtedly > restore your pool to a new location, but apart from browsing a few random > files, how would you verify it? Maybe create a new "dump" and compare the > two ... > > Have you got the resources to try this? I believe I've got most of the code > we'd need. I'd just need to take it apart ... > Holger, one thing I don't understand is that if you create a dump table associating inodes with pool file hashes, aren't we back in the same situation as using rsync -H? i.e., for large pool sizes, the table ends up using up all memory and bleeding into swap which means that lookups start taking forever causing the system to thrash. Specifically, I would assume that rsync -H basically is constructing a similar table when it deals with hard links, though perhaps there are some savings in this case since we know something about the structure of the BackupPC file data -- i.e., we know that all the hard links have as one of their links a link to a pool file. Alternatively, maybe it would be faster (and more memory efficient) to use two sorted tables as follows: 1. Crawl through the *pool* heirarchy doing the following: - Copy over pool file entries to the new destination - Create a table consisting of ordered pairs: (inode, pool-file-name) Call this the "pool table". This step takes O(n) time where n= number of pool entries. 2. Sort the pool table by inode number. This should take O(nlogn) time 3. Crawl through the pc heirarchy doing the following: - Create corresponding sub directories immediately in the new destination - Copy over non-linked files immediately to the new destination - For linked files create a table consisting of the the ordered pairs: (inode, pc-file-path-name) Call this the "pc table". This step takes O(m) time where m= number of non-zero length backup files. 4. Sort the pc table by inode number. This should take O(mlogm) time 5. "Join" the pc table using the common inode number, to create a new table (call it the Link table) of ordered pairs: pc-file-path-name, pool-file-name This should take O(m) time (note m>n) and is not memory intensive since the tables are both linearly ordered by inode number. 6. Iterate through the Link table to create the new hard links over in the new destination using something like: link(full_path_of(pool-file-name), pc-file-path-name) This should take O(m) time Note: in reality steps 5 & 6 would probably be combined without the need to create an intermediate table. This would allow the entire above algorithm to be done in O(mlogm) time with the only memory intensive steps being those required to sort the pool and pc tables. However, since sorting is a well studied problem, we should be able to use memory efficient algorithms for that. I would be curious to know how how in the real world the time (and memory usage) compares to copy over a large (say multi Terabyte) BackupPC topdir varies for the following methods: 1. cp -ad 2. rsync -H 3. Copy using a single table of pool inode numbers 4. Copy using a sorted table of pool inode numbers and pc hierarchy inode numbers |