Re: [BackupPC-users] Backing up a BackupPC server

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Holger Parplies wrote at about 23:45:35 +0200 on Wednesday, June 3, 2009:
 > Hi,
 > 
 > Peter Walter wrote on 2009-06-03 16:15:37 -0400 [Re: [BackupPC-users] Backing up a BackupPC server]:
 > > [...]
 > > My understanding is that, if it were not for the 
 > > hardlinks, that rsync transfers to another server would be more 
 > > feasible;
 > 
 > right.
 > 
 > > that processing the hardlinks requires significant cpu 
 > > resources, memory resources, and that access times are very slow, 
 > 
 > Memory: yes. CPU: I don't think so. Access times very slow? Well, the inodes
 > referenced from one directory are probably scattered all over the place, so
 > traversing the file tree (e.g. "find $TopDir -ls") is probably slower than
 > in "normal" directories. Or do you mean swapping slows down memory accesses
 > by several orders of magnitude?
 > 
 > > compared to processing ordinary files. Is my understanding correct? If 
 > > so, then what I would think of doing is (a) shutting down backuppc (b) 
 > > creating a "dump" file containing the hardlink metadata (c) backing up 
 > > the pooled files and the dump file using rsync (d) restarting backuppc. 
 > > I really don't need a live, working copy of the backuppc file system - 
 > > just a way to recreate it from a backup if necessary, using an "undump" 
 > > program that recreated the hardlinks from the dump file. Is this 
 > > approach feasible?
 > 
 > Yes. I'm just not certain how you would test it. You can undoubtedly
 > restore your pool to a new location, but apart from browsing a few random
 > files, how would you verify it? Maybe create a new "dump" and compare the
 > two ...
 > 
 > Have you got the resources to try this? I believe I've got most of the code
 > we'd need. I'd just need to take it apart ...
 > 

Holger, one thing I don't understand is that if you create a dump
table associating inodes with pool file hashes, aren't we back in the
same situation as using rsync -H? i.e., for large pool sizes, the
table ends up using up all memory and bleeding into swap which means
that lookups start taking forever causing the system to
thrash. Specifically, I would assume that rsync -H basically is
constructing a similar table when it deals with hard links, though
perhaps there are some savings in this case since we know something
about the structure of the BackupPC file data -- i.e., we know that
all the hard links have as one of their links a link to a pool file.

Alternatively, maybe it would be faster (and more memory efficient) to
use two sorted tables as follows:

1. Crawl through the *pool* heirarchy doing the following:
   	- Copy over pool file entries to the new destination
	- Create a table consisting of ordered pairs: 
	  		 (inode, pool-file-name)
   	  Call this the "pool table". This step takes O(n) time where n=
   	  number of pool entries.

2. Sort the pool table by inode number. This should take O(nlogn) time

3. Crawl through the pc heirarchy doing the following:
	- Create corresponding sub directories immediately in the new destination
	- Copy over non-linked files immediately to the new destination
	- For linked files create a table consisting of the the ordered
	  pairs:
	  		(inode, pc-file-path-name)
   	  Call this the "pc table". This step takes O(m) time where m=
   	  number of non-zero length backup files.

4. Sort the pc table by inode number. This should take O(mlogm) time

5. "Join" the pc table using the common inode number, to create a new
   table (call it the Link table) of ordered pairs:
   		 pc-file-path-name, pool-file-name
   This should take O(m) time (note m>n) and is not memory intensive
   since the tables are both linearly ordered by inode number.

6. Iterate through the Link table to create the new hard links over in
   the new destination using something like:
   	   link(full_path_of(pool-file-name), pc-file-path-name)
   This should take O(m) time

   Note: in reality steps 5 & 6 would probably be combined without the
   need to create an intermediate table.

This would allow the entire above algorithm to be done in O(mlogm)
time with the only memory intensive steps being those required to sort
the pool and pc tables. However, since sorting is a well studied
problem, we should be able to use memory efficient algorithms for
that.

I would be curious to know how how in the real world the time (and
memory usage) compares to copy over a large (say multi Terabyte)
BackupPC topdir varies for the following methods:

1. cp -ad
2. rsync -H
3. Copy using a single table of pool inode numbers
4. Copy using a sorted table of pool inode numbers and pc hierarchy
   inode numbers