From: David <wiz...@gm...> - 2009-08-17 11:53:08
Hi there. Firstly, this isn't a backuppc-specific question, but it is relevant to backuppc users (because of the backuppc architecture), so there might be people here with insight on the subject (or maybe someone can point me to a more relevant project or mailing list). My problem is as follows...

With backup systems based on complete hardlink-based snapshots, you often end up with a very large number of hardlinks: at least one per file, per backup generation. Most of the time this is fine, but there is a problem case. If the servers you're backing up themselves have a huge number of files (hundreds of thousands, or even millions), you end up creating a huge number of hardlinks on your backup server for each backup generation. Although inefficient in some ways (it uses up a large number of inode entries in the filesystem tables), this can work pretty nicely.

Where the real problem comes in is when admins want to run 'updatedb' or 'du' on the backup server. updatedb builds a *huge* database and uses up tonnes of CPU & RAM (so I usually disable it). And 'du' can take days to run and produce multi-GB output files.

So, here's a question for backuppc users (and people who use hardlink snapshot-based backups in general): when your backup server, with millions of hardlinks on it, is running low on space, how do you deal with it? The most obvious approach is to find which host's backups are taking up the most space, and then remove some of the older generations. Normally the simplest way to do that is to run a tool like 'du' and then view the output in something like xdiskusage. (One interesting thing about 'du' is that it's clever about hardlinks, so it doesn't count the same disk usage twice. I think it must keep an in-memory table of visited inodes that have a link count of 2 or greater -- see the sketch of my guess further down.)

However, with a gazillion hardlinks, du takes forever to run and produces massive output: in my case about 3-4 days, and a 4-5 GB output file. My current setup is a basic hardlink snapshot-based backup scheme, but backuppc (due to its pool structure, where hosts have generations of hardlink snapshot dirs) would have the same problems. How do people solve this? (I also imagine that running 'du' over backuppc data is further complicated by the backuppc pool, but at least you can exclude the pool from the scan, e.g. with GNU du's --exclude option, to get more usable results.)

My current fix is an ugly hack: I go through my snapshot backup generations, from oldest to newest, and remove all redundant hardlinks (i.e. ones that point to the same inode as the same path in the next-most-recent generation). That info goes into a compressed text file that the links could be restored from later. Then I compare the next two most-recent generations, and so on. (There's a rough sketch of this in the P.S. below.) But yeah, that's a very ugly hack... I want to do it better and not re-invent the wheel. I'm sure this kind of problem has been solved before.

FWIW, I was using rdiff-backup before. It's very du-friendly, since only the differences between backup generations are stored (rather than a large number of hardlinks). But I had to stop using it: on servers with a huge number of files it uses up a huge amount of memory and CPU, and takes a really long time. The mailing list wasn't much help in fixing that, so I had to switch to something new in order to keep running backups (with history).
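(Aside: here's a rough Python sketch of the hardlink handling I *guess* 'du' does. This is just my mental model, not du's actual code: remember the (device, inode) of every multiply-linked file you visit, and only count each inode's blocks the first time you see it.)

#!/usr/bin/env python3
# Guess at du-style hardlink handling: count each multiply-linked
# inode's blocks only once, keyed on (st_dev, st_ino).
import os
import sys

def dedup_usage(root):
    seen = set()   # (st_dev, st_ino) of multiply-linked files already counted
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue              # file vanished or unreadable; skip it
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue          # this inode was already counted
                seen.add(key)
            total += st.st_blocks * 512   # st_blocks is in 512-byte units
    return total

if __name__ == "__main__":
    print(dedup_usage(sys.argv[1]))

If that guess is right, it would also explain the memory use: the 'seen' table grows with every multiply-linked inode it visits, which on a server like mine means millions of entries.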
Anyway: that's when I changed over to a hardlink snapshots approach, which has the other problems detailed above. And my current hack (removing all redundant hardlinks and empty dir structures) is in some ways similar to what rdiff-backup does, just coming at it from the other direction.

Thanks in advance for ideas and advice.

David.
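P.S. In case it makes the hack clearer, here's a rough Python sketch of what my redundant-hardlink removal does. The command-line interface (older generation dir, newer generation dir, log file) is just how I've framed it for illustration, and it's destructive, so treat it as a sketch rather than a tool:

#!/usr/bin/env python3
# Sketch of my "remove redundant hardlinks" hack: unlink files in the
# OLDER generation whose inode also appears at the same relative path
# in the NEWER generation, logging them to a gzip'd text file so the
# links could be recreated later.  Emptied dirs are removed afterwards.
import gzip
import os
import sys

def prune_redundant(older, newer, logfile):
    with gzip.open(logfile, "wt") as log:
        # topdown=False walks leaves first, so emptied dirs can be
        # removed on the way back up
        for dirpath, dirnames, filenames in os.walk(older, topdown=False):
            rel = os.path.relpath(dirpath, older)
            for name in filenames:
                old_path = os.path.join(dirpath, name)
                new_path = os.path.join(newer, rel, name)
                try:
                    old_st = os.lstat(old_path)
                    new_st = os.lstat(new_path)
                except OSError:
                    continue   # no counterpart in the newer generation
                if (old_st.st_dev, old_st.st_ino) == (new_st.st_dev, new_st.st_ino):
                    log.write(os.path.join(rel, name) + "\n")
                    os.unlink(old_path)   # same inode: drop the older link
            try:
                os.rmdir(dirpath)         # only succeeds if now empty
            except OSError:
                pass                      # still has contents; keep it

if __name__ == "__main__":
    prune_redundant(sys.argv[1], sys.argv[2], sys.argv[3])

Restoring a pruned generation would then mean walking the paths in the compressed log and re-creating each one as a hardlink to the same relative path in the next-most-recent generation.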