From: David <wiz...@gm...> - 2009-08-17 11:53:08
Hi there. Firstly, this isn't a backuppc-specific question, but it is relevant to backuppc users (due to the backuppc architecture), so there might be people here with insight on the subject (or maybe someone can point me to a more relevant project or mailing list).

My problem is as follows... with backup systems based on complete hardlink-based snapshots, you often end up with a very large number of hardlinks, e.g. at least one per file, per backup generation. Now, this is fine most of the time... but there is a problem case that comes up because of this. If the servers you're backing up themselves have a huge number of files (hundreds of thousands, or even millions), you end up making a huge number of hardlinks on your backup server for each backup generation. Although inefficient in some ways (using up a large number of inode entries in the filesystem tables), this can work pretty nicely.

Where the real problem comes in is if admins want to use 'updatedb' or 'du' on the Linux system. updatedb builds a *huge* database and uses up tons of CPU & RAM (so I usually disable it). And 'du' can take days to run and produce multi-GB output files.

Here's a question for backuppc users (and people who use hardlink snapshot-based backups in general)... when your backup server, which has millions of hardlinks on it, is running low on space, how do you correct this? The most obvious thing is to find which host's backups are taking up the most space, and then remove some of the older generations. Normally the simplest way to do this is to run a tool like 'du', and then perhaps view the output in xdiskusage. (One interesting thing about 'du' is that it's clever about hardlinks, so it doesn't count the disk usage twice. I think it must keep an in-memory table of visited inodes that have a link count of 2 or greater.) However, with a gazillion hardlinks, du takes forever to run and produces massive output. In my case, about 3-4 days and a 4-5 GB output file.

My current setup is a basic hardlink snapshot-based backup scheme, but backuppc (due to its pool structure, where hosts have generations of hardlink snapshot dirs) would have the same problems. How do people solve the above problem? (I also imagine that running "du" to check disk usage of backuppc data is complicated by the backuppc pool, but at least you can exclude the pool from the "du" scan to get more usable results.)

My current fix is an ugly hack, where I go through my snapshot backup generations (from oldest to newest) and remove all redundant hard links (i.e. ones that point to the same inodes as the corresponding entries in the next-most-recent generation). That info then goes into a compressed text file that could be restored from later. After that, I compare the next two most-recent generations, and so on. But yeah, that's a very ugly hack... I want to do it better and not reinvent the wheel. I'm sure this kind of problem has been solved before.

fwiw, I was using rdiff-backup before. It's very du-friendly, since only the differences between backup generations are stored (rather than a large number of hardlinks). But I had to stop using it, because on servers with a huge number of files it uses up a huge amount of memory and CPU, and takes a really long time. And the mailing list wasn't very helpful with trying to fix this, so I had to change to something new so that I could keep running backups (with history). That's when I changed over to a hardlink snapshots approach, but that has other problems, detailed above. And my current hack (removing all redundant hardlinks and empty dir structures) is kind of similar to rdiff-backup, but coming from the other direction.

Thanks in advance for ideas and advice.

David.
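For what it's worth, the redundant-hardlink pruning David describes can be sketched in a few lines of Python. This is only an illustration of the idea, not his actual script; the snapshot layout under /backups/<host>/<date> and the log format are invented for the example.

    import os, gzip

    def prune_redundant_links(older, newer, logfile):
        """Remove entries in the older snapshot whose inode is identical to the
        corresponding path in the next-newer snapshot, logging what was removed
        so the generation could be reconstructed later. Illustrative sketch only."""
        removed = 0
        with gzip.open(logfile, "wt") as log:
            for dirpath, dirnames, filenames in os.walk(older):
                rel = os.path.relpath(dirpath, older)
                for name in filenames:
                    old_path = os.path.join(dirpath, name)
                    new_path = os.path.join(newer, rel, name)
                    try:
                        old_st = os.lstat(old_path)
                        new_st = os.lstat(new_path)
                    except OSError:
                        continue   # path does not exist in the newer snapshot
                    if (old_st.st_dev, old_st.st_ino) == (new_st.st_dev, new_st.st_ino):
                        # Same inode: the newer generation already references this
                        # data, so this extra directory entry is redundant.
                        log.write("%s\t%d\n" % (os.path.join(rel, name), old_st.st_ino))
                        os.unlink(old_path)
                        removed += 1
        return removed

    if __name__ == "__main__":
        n = prune_redundant_links("/backups/host1/2009-08-16",
                                  "/backups/host1/2009-08-17",
                                  "/backups/host1/2009-08-16.pruned.gz")
        print("removed %d redundant links" % n)

A complete version would also clean up the directories this leaves empty, which is the other half of the hack David mentions.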
From: Les M. <les...@gm...> - 2009-08-17 13:06:09
David wrote: > > Where the real problem comes in, is if admins want to use 'updatedb', > or 'du' on the linux system. updatedb gets a *huge* database and uses > up tonnes of cpu & ram (so, I usually disable it). And 'du' can take > days to run, and make multi-gb files. You can exclude directories from the updatedb runs. Du doesn't make any files unless you redirect its output - and it can be constrained to the relevant top level directories with the -s option. > Here's a question for backuppc users (and people who use hardlink > snapshot-based backups in general)... when your backup server, that > has millions of hardlinks on it, is running low on space, how do you > correct this? Backuppc maintains its own status showing how much space the pool uses and how much is left on the filesystem. So you just look at that page often enough to not run out of space. > The most obvious thing is to find which host's backups are taking up > the most space, and then remove some of the older generations. > > Normally the simplest method to do this, is to run a tool like 'du', > and then perhaps view the output in xdiskusage. (One interesting thing > about 'du', is that it's clever about hardlinks, so doesn't count the > disk usage twice. I think it must keep a table in memory of visited > inodes, which had a link count of 2 or greater). > > However, with a gazillion hardlinks, du takes forever to run, and has > a massive output. In my case, about 3-4 days, and about 4-5 GB output > file. > > My current setup is a basic hardlink snapshot-based backup scheme, but > backuppc (due to it's pool structure, where hosts have generations of > hardlink snapshot dirs) would have the same problems. > > How do people solve the above problem? Backuppc won't start a backup run if the disk is more than 95% (configurable) full. > (I also imagine that running "du" to check disk usage of backuppc data > is also complicated by the backuppc pool, but at least you can exclude > the pool from the "du" scan to get more usable results). > > My current fix is an ugly hack, where I go through my snapshot backup > generations (from oldest to newest), and remove all redundant hard > links (ie, they point to the same inodes as the same hardlink in the > next-most-recent generation). Then that info goes into a compressed > text file that could be restored from later. And after that, compare > the next 2-most-recent generations and so on. > > But yeah, that's a very ugly hack... I want to do it better and not > re-invent the wheel. I'm sure this kind of problem has been solved > before. It is best done pro-actively, avoiding the problem instead of trying to fix it afterwards because with everything linked, it doesn't help to remove old generations of files that still exist. So generating the stats daily and observing them (both human and your program) before starting the next run is the way to go. > fwiw, I was using rdiff-backup before. It's very du-friendly, since > only the differences between each backup generation is stored (rather > than a large number of hardlinks). But I had to stop using it, because > with servers with a huge number of files it uses up a huge amount of > memory + cpu, and takes a really long time. And the mailing list > wasn't very helpful with trying to fix this, so I had to change to > something new so that I could keep running backups (with history). > That's when I changed over to a hardlink snapshots approach, but that > has other problems, detailed above. 
And my current hack (removing all > redundant hardlinks and empty dir structures) is kind of similar to > rdiff-backup, but coming from another direction. Also, you really want your backup archive on its own mounted filesystem so it doesn't compete with anything else for space and to give you the possibility of doing an image copy if you need a backup since other methods will be too slow to be practical. And 'df' will tell you what you need to know about a filesystem fairly quickly. -- Les Mikesell les...@gm... |
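Les's "don't start a run if the disk is too full" point is easy to reproduce in a home-grown scheme as well. A minimal sketch; the mount point and the 95% threshold are example values, not BackupPC's actual implementation:

    import os, sys

    def usage_pct(path):
        """Percentage of the filesystem holding `path` that is in use."""
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize   # space available to non-root users
        return 100.0 * (total - free) / total

    MOUNT = "/backups"    # example mount point for the backup filesystem
    MAX_PCT = 95.0        # example threshold, like BackupPC's configurable limit

    if usage_pct(MOUNT) > MAX_PCT:
        sys.exit("refusing to start backup: %s is more than %.0f%% full" % (MOUNT, MAX_PCT))
    # ...otherwise kick off the night's backup run here...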
From: David <wiz...@gm...> - 2009-08-18 14:25:55
Thanks for the replies.

On Mon, Aug 17, 2009 at 3:05 PM, Les Mikesell<les...@gm...> wrote:
> You can exclude directories from the updatedb runs

That only works if the data you want to exclude (such as older snapshots) is kept in a relatively small number of directories; otherwise you need to make a lot of exclude rules, like one for each backup. In my case, each backed-up server/user PC/etc. is independent and has its own directory structure with snapshots, etc. And actually backuppc also has a problematic layout for locate rules:

__TOPDIR__/pc/$host/nnn <- One of those directories for each backup version.

So basically, if you have a large number of files on a server, it seems like you need to exclude the server entirely from updatedb, otherwise the snapshot directories are going to cause a huge updatedb database. Which kind of defeats the point of having updatedb running on the backup server. Which is why I've disabled it here :-(.

> Du doesn't make any files unless you redirect its output

Usually I make du files on servers, so I can copy the files back to my workstation and use a graphical tool like xdiskusage to get a better idea of where space is used.

> - and it can be constrained to the relevant top level directories with the -s option.

Yep, but it is still going to take days :-(. And then afterwards you often still need to run 'du' on those lower levels to see where the space is actually going.

> Backuppc maintains its own status showing how much space the pool uses and how much is left on the filesystem. So you just look at that page often enough to not run out of space.

Sounds like a 'df'-like display on the web page, but for the backuppc pool rather than a partition. Please correct me if I'm mistaken, but that doesn't really help people who want to find which files and dirs are taking up the most space, so they can address it (like tweaking the number of backed-up generations, or excluding additional directories/file patterns, etc). Normally people use a tool like 'du' for that, but 'du' itself is next to unusable when you have a massive filesystem, which a hardlink snapshot-based backup system can easily create :-(

> Backuppc won't start a backup run if the disk is more than 95% (configurable) full.

Sounds useful, but it doesn't really address my problem of 'du' (and locatedb, and others) having major problems with this kind of backup layout.

> It is best done pro-actively, avoiding the problem instead of trying to fix it afterwards because with everything linked, it doesn't help to remove old generations of files that still exist. So generating the stats daily and observing them (both human and your program) before starting the next run is the way to go.

1. Removing old generations does help. The idea is to remove old "churn" that took place in that version. In other words, files which no longer have any references once that generation is removed (because all earlier generations referring to those files via hard links are also gone by that point).

2. Proactive is good, but again, with a massive directory structure it's hard to use tools like du to check which backups you need to fine-tune/prune/etc.

> Also, you really want your backup archive on its own mounted filesystem so it doesn't compete with anything else for space and to give you the possibility of doing an image copy if you need a backup since other methods will be too slow to be practical. And 'df' will tell you what you need to know about a filesystem fairly quickly.

Our backups are stored under an LVM which is used only for backups. But again, the problem is not disk usage causing issues for other processes. The problem is, once the allocated area is running out of space, how do you check *where* that space is going, so you can take informed action? 'df' is only going to tell you that you're low on space, not where the space is going.

- David.
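As an aside, the hardlink handling David guesses at is essentially right: du remembers each (device, inode) pair whose link count is greater than one and counts it only once. A rough sketch of that accounting, producing per-top-level-directory totals instead of a full du dump; the /backups path is an example:

    import os
    from collections import defaultdict

    def hardlink_aware_usage(root):
        """Sum bytes under each immediate subdirectory of `root`,
        counting every inode only once (as du does for hardlinks)."""
        seen = set()                  # (st_dev, st_ino) pairs already counted
        totals = defaultdict(int)
        for top in sorted(os.listdir(root)):
            top_path = os.path.join(root, top)
            if not os.path.isdir(top_path):
                continue
            for dirpath, dirnames, filenames in os.walk(top_path):
                for name in filenames:
                    try:
                        st = os.lstat(os.path.join(dirpath, name))
                    except OSError:
                        continue
                    if st.st_nlink > 1:
                        key = (st.st_dev, st.st_ino)
                        if key in seen:
                            continue  # hardlink to something already counted
                        seen.add(key)
                    totals[top] += st.st_size
        return totals

    for name, size in sorted(hardlink_aware_usage("/backups").items()):
        print("%10.1f MB  %s" % (size / 1e6, name))

Note that with millions of multiply-linked files the visited-inode table itself becomes huge, which is a large part of why du is so slow and memory-hungry on this kind of tree.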
From: Les M. <les...@gm...> - 2009-08-18 15:35:17
David wrote: > >> You can exclude directories from the updatedb runs > > Only works if the data you want to exclude (such as older snapshots) > are kept in a relatively small number of directories, or you need to > make a lot of exclude rules, like one for each backup. In my case, > each backed up server/user PC/etc, is independant, and has it's own > directory structure with snaphots, etc. > > And actually backuppc also has a problematic layout for locate rules: > > __TOPDIR__/pc/$host/nnn <- One of those directories for each backup version. > > So basically, if you have a large number of files on a server, it > seems like you need to entirely exclude the server from updatedb, > otherwise the snapshot directories are going to cause a huge updatedb > database. > > Which kind of defeats the point of having updatedb running on the > backup server. Which is why I've disabled it here :-(. Why not just exclude the _TOPDIR_ - or the mount point if this is on its own filesystem? >> Backuppc maintains its own status showing how much space the pool uses and how >> much is left on the filesystem. So you just look at that page often enough to >> not run out of space. > > Sounds like a 'df'- like display on the web page, but for the backuppc > pool rather than a partition. It keeps both a summary of pool usage (current and yesterday) and totals for each backup run of number of files broken down by new and existing files in the pool and the size before and after compression. A glance at the pool percent usage and daily change tells you where you stand. > Please correct me if I'm mistaken, but that doesn't really help people > who want to find which files and dirs are taking up the most space, so > they can address it (like, tweak the number of backed up generations, > or exclude additional directories/file patterns, etc). There's not a good way to figure out which files might be in all of your backups and thus not help space-wise when you remove any instance(s) of it. But the per-host, per-run stats where you can see the rate of new files being picked up and how much they compress is very helpful. > Normally people use a tool like 'du' for that, but 'du' itself is next > to unusable when you have a massive filesystem, which can easily be > created by hardlink snapshot-based backup systems :-( That's probably why backuppc does it internally - that and keeping track of compression stats and which files are new. >> It is best done pro-actively, avoiding the problem instead of trying to fix it >> afterwards because with everything linked, it doesn't help to remove old >> generations of files that still exist. So generating the stats daily and >> observing them (both human and your program) before starting the next run is the >> way to go. >> > > 1. Removing old generations does help. The idea is to remove old > "churn" that took place in that version. In other words, files which > no longer have any references after that generation is removed > (because all previous generations referring to those files via hard > links, are also gone by this point). Of course, but you do it by starting with a smaller number of runs than you expect to be able to hold. Then after you see that the space consumed is staying stable you can adjust the amount of history to keep. > 2. Proactive is good, but again, with a massive directory structure, > it's hard to use tools like du to check which backups you need to > finetune/prune/etc. This may well be a problem with whatever method you use. 
It is handled reasonably well in backuppc. >> Also, you really want your backup archive on its own mounted filesystem so it >> doesn't compete with anything else for space and to give you the possibility of >> doing an image copy if you need a backup since other methods will be too slow to >> be practical. And 'df' will tell you what you need to know about a filesystem >> fairly quickly. >> > > Our backups are stored under a LVM which is used only for backups. But > again, the problem is not disk usage causing issues for other > processes. The problem is, once the allocated area is running out of > space, how to check *where* that space is going to, so you can take > informed action. 'df' is only going to tell you that you're low on > space, not where the space is going. One other thing - backuppc only builds a complete tree of links for full backups, which by default run once a week with incrementals done on the other days. Incremental runs build a tree of directories, but only the new and changed files are populated, with a notation for deletions. The web browser and restore processes merge the backing full on the fly, and the expire process knows not to remove fulls until the incrementals that depend on them have expired as well. That, and the file compression, might take care of most of your problems. -- Les Mikesell les...@gm...
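To illustrate the "merge the backing full on the fly" behaviour Les describes, in a heavily simplified form: real BackupPC stores mangled, compressed file names plus per-directory attrib metadata and deletion markers, so the sketch below is only the concept, with invented paths.

    import os

    def merged_view(full_dir, incr_dirs):
        """Return {relative_path: actual_path} for the view a restore would see:
        start from the full backup, then overlay each incremental in order.
        Deletion markers are ignored here for simplicity."""
        view = {}
        for base in [full_dir] + list(incr_dirs):
            for dirpath, dirnames, filenames in os.walk(base):
                for name in filenames:
                    rel = os.path.relpath(os.path.join(dirpath, name), base)
                    view[rel] = os.path.join(dirpath, name)   # later trees win
        return view

    # e.g. view = merged_view("/backups/host1/full.0",
    #                         ["/backups/host1/incr.1", "/backups/host1/incr.2"])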
From: Jon C. <can...@gm...> - 2009-08-18 15:49:47
On Tue, Aug 18, 2009 at 10:25 AM, David<wiz...@gm...> wrote:
> Sounds useful, but it doesn't really address my problem of 'du' (and locatedb, and others) having major problems with this kind of backup layout.

A personal desire on your part to use a specific tool to get information that is presented in other ways hardly constitutes a problem with BackupPC. The linking structure within BackupPC is the "magic" behind deduping files. That it creates a huge number of directory entries with a resulting smaller number of inode entries is the whole point.

Use the status pages to determine where your space is going. They give you information about the apparent size (the full size if you weren't de-duping) and the unique size (the portion of each backup that was new). This information is a whole lot more useful than whatever you're going to get from du. du takes so long because it's a dumb tool that does what it's told, and you are in effect telling it to iterate across each server multiple times (once per retained backup). If you did this against the actual clients, the time would be similar to doing it against BackupPC's topdir.

As a side note, are you letting available space dictate your retention policy? It sounds like you don't want to fund the retention policy you've specified, otherwise you wouldn't be out of disk space. Buy more disk or reduce your retention numbers for backups.

Look at the Host Summary page. Those servers with the largest "Full Size" or a disproportionate number of retained fulls/incrementals are the hosts to focus pruning efforts on. Now select a candidate and drill into the details for that host. On the "Host ??? Backup Summary" page look at the "File Size/Count Reuse Summary" table. Look for backups with a large "New Files - Size/MB" value. These are the backups where your host gained weight. You can review the "XferLOG" to get a list of files in this backup (note that the number before the filename is the file size). Now you can go to the filesystem and wholesale delete a backup, or pick and choose through a backup for a particular file (user copies a DVD blob to their server). This won't immediately free the space (although someone posted a tool that will), as you will have to wait for the pool cleanup to run. If it's a particular file, you may need to go through several backups to find and kill the file (again, someone posted a tool to do this, I believe).

Voila, you've put your system on a diet. But beware: you do this once and management will expect you to keep solving their under-resourced backup infrastructure by doing it again and again. Each time you're forced to make decisions about whether a file is really junk or whether a user might crawl up your backside when they find it can't be restored. You've also violated the sanctity of your backups, and this could cause problems if you're ever forced to do some forensics on your system for a legal case.

-- Jonathan Craig
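Jon's apparent-size versus unique-size distinction can also be computed for a home-grown hardlink pool if you ever need it outside BackupPC's status pages. A rough sketch: apparent size counts every directory entry, unique size counts each underlying inode once.

    import os

    def apparent_and_unique(host_dir):
        """Return (apparent_bytes, unique_bytes) for one host's backup tree."""
        seen = set()
        apparent = unique = 0
        for dirpath, dirnames, filenames in os.walk(host_dir):
            for name in filenames:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue
                apparent += st.st_size           # every directory entry counts
                key = (st.st_dev, st.st_ino)
                if key not in seen:              # each inode counted only once
                    seen.add(key)
                    unique += st.st_size
        return apparent, unique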
From: David <wiz...@gm...> - 2009-08-19 10:37:44
Thanks for the replies.

Firstly, I think I should reiterate a few things I mentioned in the first post. I haven't actually used BackupPC yet; I've mainly read through its docs, and I'm trying to judge how well it and its storage system would work in our environment. I'm mainly asking questions on this list first, to get an idea of how well it handles the kind of issues I've experienced so far (with things like hardlinks to huge filesystems), before I spend more time playing with BackupPC and looking into migrating our backups to it. And like I said before, this isn't a BackupPC-specific complaint, more a general problem with hardlink-based backup systems (as opposed to rdiffs, or various other schemes). So I'm checking how sysadmins typically handle these kinds of issues.

Also, I'm not too experienced with backup "best practices", methodologies, etc. Still learning, and seeing what works best. And heh, our (relatively small) company didn't even have a real backup system before, and I'm still the only person here that seems to take them seriously >_>. Fortunately, the boss has started seeing the light (after a near disaster in the server room) and acquired some more hardware. But nobody besides me seems to have time to actually set things up and make sure they're running. And I'm not even one of the network admins/tech support; I'm actually a programmer, and I was never actually asked to work on the backups ^^; The actual network admins/tech support don't really know much about backups D: (or have time to work with them).

Anyway, hopefully the above will give you a better idea of my angle on this. I'm not trying to criticize BackupPC, but rather figure out what kind of backup scheme is going to work here (and be easy to admin/diagnose/hack/etc), whether it is BackupPC or something else (that may or may not use hardlinks).

On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell<les...@gm...> wrote:
> Why not just exclude the _TOPDIR_ - or the mount point if this is on its own filesystem?

Because most of the interesting files on the backup server (at least in my case) are the files being backed up. I'm a lot more interested in being able to quickly find those files than random stuff under /etc, /usr, etc.

> There's not a good way to figure out which files might be in all of your backups and thus not help space-wise when you remove any instance(s) of it. But the per-host, per-run stats where you can see the rate of new files being picked up and how much they compress is very helpful.

Thanks for this info. At least with per-host stats, it's easier to narrow down where to run du if I need to, instead of over the entire backup partition.

A couple of random questions:

1) How well does BackupPC work when you manually make changes to the pool behind its back (like removing a host, or some of the host's history, via the command line)? Can you make it "resync/repair" its database?

2) Is there a recommended approach for "backing up" BackupPC databases, in case they go corrupt and so on? Or is a simple rsync safe?

3) Is it possible to use BackupPC's logic on the command line, with a bunch of command-line arguments, without setting up config files? That would be awesome for scripting and so on, for people who want to use just parts of its logic (like the pooled storage, for instance) rather than the entire backup system. I tend to prefer that kind of "unix tool" design.

> Of course, but you do it by starting with a smaller number of runs than you expect to be able to hold. Then after you see that the space consumed is staying stable you can adjust the amount of history to keep.

Ah right. I think this is a fundamental difference in approach. With the backup systems I've used before, space usage is going to keep growing forever, until you take steps to fix it, either manually or by some kind of scripting. So far I haven't added scripting, so I rely on du to know where to manually recover space.

Basically, I was using rdiff-backup for a long time. That tool keeps all the history, until you run it with a command-line argument to prune the oldest revisions. And also, I don't see a great need to proactively recover space most of the time. The large majority of servers/users/etc have a relatively small amount of change. So it's kind of cool to be able to get *any* of the earlier daily snapshots, for the last few years.

Although ironically, the servers with the largest amount of churn (and hard drive usage on the backup server) are the ones you'd actually want to keep old versions for (like yearlies, monthlies, etc). But with rdiff-backup, that isn't really possible without some major repo surgery :-). You end up throwing away all the oldest versions when space runs low.

Also, I'm influenced by revision control tools, like git/svn/etc. I don't like to throw away old versions unless it's really necessary. And if you have a lot of hard drive space on the backup server, then you may as well actually make use of it, to store as many versions as possible, and then only remove the oldest versions where needed.

The above backup philosophy (based partly on rdiff-backup limitations) has served me well so far, but I guess I need to unlearn some of it, particularly if I want to use a hardlink-based backup system.

> One other thing - backuppc only builds a complete tree of links for full backups which by default run once a week with incrementals done on the other days. Incremental runs build a tree of directories but only the new and changed files are populated, with a notation for deletions. The web browser and restore processes merge the backing full on the fly and the expire process knows not to remove fulls until the incrementals that depend on it have expired as well. That, and the file compression might take care of most of your problems.

Ah, very interesting info, thanks. I read the info on incrementals in the docs, and mainly picked up that "rsync is a good thing" :-)

A couple of questions, pardon my noobiness: If rsync is used, then what is the difference between an incremental and a full backup? i.e., do "full" backups copy all the data over (if using rsync), or just the changed files? And what kind of disadvantage is there if you only do (rsync-based) incrementals and don't ever make full backups?

On Tue, Aug 18, 2009 at 5:49 PM, Jon Craig<can...@gm...> wrote:
> A personal desire on your part to use a specific tool to get information that is presented in other ways hardly constitutes a problem with BackupPC.

Again, I'm not criticizing BackupPC specifically. And indeed it seems that BackupPC has ways which can reduce the problem, specifically incremental backups, as opposed to a large number (hundreds/thousands) of "full" snapshot directories, each containing a huge number of hardlinks (possibly millions), for several such servers. My angle is that Linux sysadmins have certain tools they like to use, and saying they can't use them effectively due to the backup architecture is kind of problematic.

I guess, though, that the philosophy behind rdiff-backup (keep every single version, until you want to start removing the oldest) isn't really compatible with BackupPC, or other schemes that keep an actual filesystem entry for every version of every file, even when there are no changes in those files. Probably I need to think more about using a more traditional scheme (keep a fixed number of backups: X daily, Y weekly, Z monthly, etc), instead of "keep versions forever, until you need to start recovering hard drive space".

> The linking structure within BackupPC is the "magic" behind deduping files. That it creates a huge number of directory entries with a resulting smaller number of inode entries is the whole point.

Yeah, I like that. But the problem I see is this (from the BackupPC docs):

"Therefore, every file in the pool will have at least 2 hard links (one for the pool file and one for the backup file below __TOPDIR__/pc). Identical files from different backups or PCs will all be linked to the same file. When old backups are deleted, some files in the pool might only have one link. BackupPC_nightly checks the entire pool and removes all files that have only a single link, thereby recovering the storage for that file."

Therefore, if you want to keep tons of history (like every day for the past 3 years) for a server with lots of files, then it sounds like you need to actually have a huge number of filesystem entries.

I think if I wanted to use BackupPC, and still be able to use du and friends effectively, I'd need to do some combination of:

1) Use incrementals for most of the backups, to limit the number of hardlinks created, as Les Mikesell described.

2) Stop trying to keep history for every single day for years (rather, keep one for the last X days, last Y weeks, Z months, etc).

This would also mean having to spend less time managing space. Although at the moment it only comes up every few weeks/months, and had been pretty fast with du & xdiskusage, at least until I switched over from rdiff-backup to a "make a hardlink snapshot every day" process :-(.

> Use the status pages to determine where your space is going. It gives you information about the apparent size (full size if you weren't de-duping) and the unique size (that portion of each backup that was new). This information is a whole lot more useful than whatever you're going to get from du. du takes so long because it's a dumb tool that does what it's told and you are in effect telling it to iterate across each server multiple times (one per retained backup). If you did this against the actual clients the time would be similar to doing it against BackupPC's topdir.

And furthermore, hardlink-based storage does cause ambiguous du output, even if the time it took to run wasn't an issue. Which is another thing about hardlink-based backups which annoys me (compared to when I was using rdiff-backup), and one of the reasons why I'm currently running my own very hackish "de-duping" script on our backup server.

It's nice that BackupPC maintains these stats separately. Although it's kind of annoying (imo) that you have to go through its frontend to see this info, rather than being able to tell from standard Linux commands (for scripting purposes and so on). And it also bothers me that those kinds of stats can potentially go out of sync with the hard drive (maybe you delete part of the pool by mistake). Is there a way to make BackupPC "repair" its database by re-scanning its pool? Or some kind of recommended procedure for fixing problems like this?

> As a side note, are you letting available space dictate your retention policy? It sounds like you don't want to fund the retention policy you've specified, otherwise you wouldn't be out of disk space. Buy more disk or reduce your retention numbers for backups.

More like, there wasn't a backup or retention policy to begin with D:. I hacked together some scripts that use rdiff-backup and other tools, and then added them to the backup server crontab. And since we have a fairly large backup server (compared to the servers being backed up), I let the older backups build up for a while to take advantage of the space, and then free a chunk of space manually when the scripts email me about space issues. But now I can't "free a chunk of space manually" that easily any more, since "du" doesn't work :-(. At least thanks to the discussions in this thread, I have a few more ideas for my own scripts, even if I don't use BackupPC in the end.

> Look at the Host Summary page. Those servers with the largest "Full Size" or a disproportionate number of retained fulls/incrementals are the hosts to focus pruning efforts on. Now select a candidate and

Ah, thanks. This is very useful info. So you can find which files/transfers/etc caused a given host to use a huge amount of storage.

> Voila, you've put your system on a diet, but beware, you do this once and management will expect you to keep solving their under-resourced backup infrastructure by doing it again and again.

Well, the good news is that nobody here seems to care about the backups much, until the moment they're needed. The fact we have them at all is kind of a bonus D:. At least I'm starting to get the boss (we're a pretty small company) on my side. Just that nobody besides myself has time to work on things like this.

Anyway, thanks again for the replies. This thread has been educational so far :-)

David.

PS: Random question: Does backuppc have tools for making offsite, offline backups? Like copying a subset of the recent BackupPC backups over to a set of external drives (in encrypted format) and then taking the drives home, or something like that. Or alternately, are there recommended tools for this? I made a script for this, but want to see how people here usually handle this.
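The "X daily, Y weekly, Z monthly" retention David is considering is easy to express as a standalone selection step, independent of which tool creates the snapshots. A hypothetical sketch, assuming snapshot directories are named by ISO date:

    import datetime

    def backups_to_keep(dates, daily=14, weekly=8, monthly=12):
        """Given a list of datetime.date snapshot dates, return the set to keep:
        the last `daily` days, the newest snapshot of each of the last `weekly`
        weeks, and the newest of each of the last `monthly` months. Everything
        else is a candidate for deletion."""
        ordered = sorted(dates)
        keep = set(ordered[-daily:])
        by_week, by_month = {}, {}
        for d in ordered:
            wk = d.isocalendar()
            by_week[(wk[0], wk[1])] = d          # latest snapshot in each ISO week
            by_month[(d.year, d.month)] = d      # latest snapshot in each month
        keep.update(sorted(by_week.values())[-weekly:])
        keep.update(sorted(by_month.values())[-monthly:])
        return keep

    dates = [datetime.date(2009, 8, 1) + datetime.timedelta(days=i) for i in range(18)]
    print(sorted(backups_to_keep(dates, daily=5, weekly=2, monthly=1)))

Anything not in the returned set would be handed to something like the pruning script sketched earlier in the thread.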
From: Adam G. <mai...@we...> - 2009-08-19 11:58:30
David wrote: > On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell<les...@gm...> wrote: >> Why not just exclude the _TOPDIR_ - or the mount point if this is on its >> own filesystem? > Because most of the interesting files on the backup server (at least > in my case) are the files being backed up. I'm a lot more interested > in being able to quickly find those files than random stuff under > /etc, /usr, etc. Yes, and this is something I'd like to have in backuppc (please find a file on any host, in any backup number, with the string abc in its filename). This isn't possible without using the standard tools like find, and waiting for it to traverse all the directories and backups etc.. (well, you could use grep on the logfiles to find it, which would probably be faster)... >> There's not a good way to figure out which files might be in all of your >> backups and thus not help space-wise when you remove any instance(s) of >> it. But the per-host, per-run stats where you can see the rate of new >> files being picked up and how much they compress is very helpful. > Thanks for this info. At least with per-host stats, it's easier to > narrow down where to run du if I need to, instead of over the entire > backup partition. > > A couple of random questions: > > 1. How well does BackupPC work when you manually make changes to the > pool behind its back? (like removing a host, or some of the host's > history, via the command line). Can you make it "resync/repair" its > database? Removing hosts or individual backups doesn't affect the pool, and in my experience, this works just fine. Although I would advise against doing it, simply because you never know exactly what might get stuffed up.... I've had a remote client rename about 10G of images, so I simply did a cp -al from the previous full backup into the current partial (aborted full) backup, and then continued the full backup. It then noticed all the old filenames were gone, found the new filenames were already downloaded (hardlinked really), and continued on nicely. I've also deleted individual files (vmware disk image files, dvd images, etc) and not had a problem. Of course, if you are going to do things like that, you should try and use the tools that have recently been written to help do this properly. > 2) Is there a recommended approach for "backing up" BackupPC databases? > In case they go corrupt and so on. Or is a simple rsync safe? Stop backuppc, umount the partition, and use dd to copy to another partition; or else use RAID1 with three members, stop backuppc, umount, remove a member, and you have your backup. Rsync *should* work fine for smaller pools/numbers of files, as long as you have lots of RAM on both ends.... Eventually, you will get a pool size (number of files) where it will stop working... > 3) Is it possible to use BackupPC's logic on the command-line, with a > bunch of command-line arguments, without setting up config files? No, not really. > That would be awesome for scripting and so on, for people who want to > use just parts of its logic (like the pooled system for instance), > rather than the entire backup system. I tend to prefer that kind of > "unix tool" design. You really sound like a programmer <EG> (yes I have read the rest of your post already)... After configuring backuppc, there are some things you can do to basically cancel out all the automated features of backuppc and just use its pieces manually.
Though I think if you actually used backuppc normally first, you would be unlikely to want to do this. >> Of course, but you do it by starting with a smaller number of runs than >> you expect to be able to hold. Then after you see that the space >> consumed is staying stable you can adjust the amount of history to keep. > > Ah right. I think this is a fundamental difference in approach. With > the backup systems I've used before, space usage is going to keep > growing forever, until you take steps to fix it. Either manually, or > by some kind of scripting, and so far I haven't added scripting, so I > rely on du to know where to manually recover space. > > Basically, I was using rdiff-backup for along time. That tool keeps > all the history, until you run it with a command-line argument to > prune the oldest revisions. You specify in advance how many incremental and full backups you want, what period you want to keep them on, etc. Then backuppc *can* automatically prune the relevant backups to keep what you have asked for. One specific point is that you can keep your daily (incremental) backups for the past month, then every second one for two months, and all fulls (weekly) for the past 6 months, every 4th full for the past two years, etc... > And also, I don't see a great need to pro-actively recover space most > of the time. The large majority of servers/users/etc have a relatively > small amount of change. So it's kind of cool to be able to get *any* > of the earlier daily snapshots, for the last few years. I never recover space on any of my backuppc servers either, but sometimes I increase the number of backups I want to keep :) Yes, some things are cool, but they are rarely useful... However, I have one customer whose backuppc server keeps *every* backup it has ever completed, and that has been running for over 3 years now. > Although ironically, the servers with the largest amount of churn (and > harddrive usage on backup server), are the ones you'd actually want to > keep old versions for (like yearlies, monthlies, etc). But with > rdiff-backup, that isn't really possible without some major repo > surgery :-). You end up throwing away all the oldest versions when > space runs low. Which is the problem with those tools. Sometimes you want to keep the backup from 7 years ago, but you don't really need every daily backup for the past 7 years! This is where backuppc is quite helpful... > Also, I'm influenced by revision control tools, like git/svn/etc. I > don't like to throw away old versions, unless it's really necessary. When it is necessary, do you want to always throw away the oldest version though ? > And, if you have a lot of harddrive space on the backup server, then > may as well actually make use of it, to store as many versions as > possible. And then only remove oldest versions where needed. Again, you might not want to remove the oldest, you might want to remove some of the in between backups... > The above backup philosophy (based partly on rdiff-backup limitations) > has served me well so far, but I guess I need to unlearn some of it, > particularly if I want to use a hardlink-based backup system. Or just get more disk space... > If rsync is used, then what is the difference between an incremental > and a full backup? Basically, the full will read every file on the client and backuppc server, and compare checksums. The incremental will skip this full checksum comparison. > ie, do "full" backups copy all the data over (if using rsync), or > just the changed files? 
No, both full and incremental will only transfer the modified portions of the modified files (if using rsync). > And, what kind of disadvantage is there if you only do (rsync-based) > incrementals and don't ever make full backups? In the older versions (which my above client started with, and this is the config I started with), an incremental backup would compare the remote client with the last *full* backup, so over time, you needed to transfer more and more data over the network. In current versions, you can backup compared to the last incremental of a lower level (not sure how many levels you can get, but you can do [0,1,0,0,2,1,1,3,2,2,4,3,3,5,4,4,6] etc.. or whatever you like... not sure how many entries can be included there. After working out how this affected backuppc (along with the huge amount of extra work to "fill in" the backups in the web interface), I just configured full backups every 3 days. The only real difference between a full and incremental is the amount of IO load and CPU load on the client (and backuppc server), and hence the time it takes to complete a backup. You really should schedule a regular full backup anyway. Also, another reason for regular full backups is so you don't need to keep every full backup, you can drop every second (or every fourth etc) backup to recover space... > My angle is that Linux sysadmins have certain tools they like to use, > and saying they can't use them effectively due to the backup > architecture is kind of problematic. It isn't that they can't be used... they are just slow, and there are more efficient methods to obtain the same information. I could use find or grep or du on my massive maildir's, but they suck and there are other methods to get some of the answers I need, other times, I have to use du/find/etc... > Probably I need to think more about using a more traditional scheme > (keep a fixed number of backups, X daily, Y weekly, Z monthly, etc), > instead of "keep versions forever, until you need to start recovering > harddrive space". You can still keep versions forever, just set the keepcnt values to very high values... 15 years, or 50 years, etc... The difference is with backuppc you have more flexibility on *which* backups you remove to recover space... Consider the common case of a growing log file, you backup every day, and the file is rotated each month. So, you have 30 versions of the same file, yet you don't really need 29 of them since all the data is included in the last/30th one... etc.. lots of examples I'm sure you can think of :) > But the problem I see is this: > > (From BackupPC docs) > > "Therefore, every file in the pool will have at least 2 hard links > (one for the pool file and one for the backup file below > __TOPDIR__/pc). Identical files from different backups or PCs will all > be linked to the same file. When old backups are deleted, some files > in the pool might only have one link. BackupPC_nightly checks the > entire pool and removes all files that have only a single link, > thereby recovering the storage for that file." > > Therefore, if you want to keep tonnes of history (like, every day for > the past 3 years), for a server with lots of files, then it sounds > like you need to actually have a huge number of filesystem entries. Yes, but is that a problem? With 5 hosts being backed up, I have 401 full backups, and 3303 incremental backups, using 36TB of storage prior to pooling and compression. (ie, if we didn't have hardlinks or compression). 
We have approx 1.9M unique files in the pool using only 680GB of disk space. I'm not sure how to calculate the actual number of inodes used... (df -i doesn't seem to work as we are using reiserfs, I'm sure you would get major issues doing this on ext2/3 etc..) > I think if I wanted to use BackupPC, and still be able to use du and > friends effectively, I'd need to do some combination of: > > 1) Use incrementals for most of the backups, to limit the number of > hardlinks created, as Les Mikesell described. > > 2) Stop trying to keep history for every single day for years (rather > keep 1 for the last X days, last Y weeks, Z months, etc). or just be more patient with how long those tools take to run, and realise that they might stop working one day if your pool/etc gets too big... > This would also mean having to spend less time managing space. > Although at the moment it only comes up every few weeks/months, and > had been pretty fast with du & xdiskusage, at least until I switched > over from rdiff-backup to a "make a hardlink snapshot every day" > process :-(. or just get more disk space :) > And furthermore, hardlink-based storage does cause ambiguous du > output, even if the time it took to run wasn't an issue. Which is > another thing about hardlink-based backups which annoys me (compared > to when I was using rdiff-backup), and one of the reasons why I'm > currently running my own very hackish "de-duping" script on our backup > server. Or is it that you don't know the right tool for this job which annoys you (a little sarcasm :)... > Nice that BackupPC maintains these stats separately. Although kind of > annoying (imo), that you have to go through it's frontend to see this > info, rather than being able to tell from standard linux commands (for > scripting purposes and so on). As far as I know, the format of the files this information is stored in is well documented, and as such you could write scripts to your hearts content to read/parse this simple text files, and get any information you desire... > And also it bothers me that those kind of stats can potentially go out > of synch with the harddrive (maybe you delete part of the pool by > mistake). Ummm, don't make mistakes :) or if you do, fix the stats... > Is there a way to make BackupPC "repair" it's database, by re-scanning > it's pool? Or some kind of recommended procedure for fixing problems > like this? I am pretty sure there is no such tools... you either live with it until the relevant backups are purged, or you manually stuff around, potentially making the problem even worse (ie, messing it up in a way that you don't know you have messed it up, as opposed to knowing it is wrong). >> As a side note are you letting available space dictate you retention >> policy? It sounds like you don't want to fund the retention policiy >> you've specified otherwise you wouldn't be out of disk space. Buy >> more disk or reduce your retention numbers for backups. > And since we have a fairly large backup server (compared to the > servers being backed up), I let the older backups build up for a while > to take advantage of the space, and then free a chunk of space > manually when the scripts email me about space issues. > > But now I can't "free a chunk of space manually" that easily any more, > since "du" doesn't work :-(. rm -rf TopDir/pc/host/nnn where nnn is a random incr backup number or a full backup which no remaining incr relies on it seems to work pretty well. 
Though I'd advise adjusting the values in the config file and letting backuppc purge the backups itself. > Well, the good news is that nobody here seems to care about the > backups much, until the moment they're needed. The fact we have them > at all is kind of a bonus D:. At least I'm starting to get the boss > (we're a pretty small company) on my side. Just that nobody besides > myself has time to work on things like this. Once you lose all the data, everybody will have plenty of time :) You can't afford not to have good backups! (But hey, *we* all know that....) One other thing that should be considered: the point of using backuppc is that lots of other people use it, and have checked that there are no bugs etc in it. As such, we are somewhat certain that we will get back the correct data as long as we treat it correctly (don't fiddle with its storage behind its back)... Home-grown scripts/programs can be hugely rewarding/etc, but you will never get the same reliability/certainty about the software. Of course, you also have to write all the improvements yourself, instead of just downloading the new version that someone else was nice enough to write for you :) > PS: Random question: Does backuppc have tools for making offsite, > offline backups? Like copying a subset of the recent BackupPC backups > over to a set of external drives (in encrypted format) and then taking > the drives home or something like that. Yes, you can archive backups... One of my customers plugs in an esata drive, crontab runs a script to mount the drive, create the tar files of the most recent backups onto a staging (internal raid array) area, delete the files from the external disk, and then copy the new tar files onto the esata, and finally delete the files from the staging area... Lots of checks/etc to make sure we are doing the correct things, and alerts (or OK's) are reported back to the monitoring system as needed. > Or alternately, are there recommended tools for this? I made a script > for this, but want to see how people here usually handle this. This is where custom scripts/plugins are best utilised. A single program can't determine the possible needs of every user.... :) I hope the above information is useful to you; please note it is just my wordy opinion, and probably hardly worth the electrons used to display it. Please recycle them thoughtfully... Regards, Adam
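A rough sketch of the kind of offsite routine Adam describes: tar the newest backup of each host into a staging area, encrypt it, and move it to an external drive. The paths, passphrase file, and the tar+gpg combination are assumptions for illustration; this is neither Adam's script nor BackupPC's archive-host feature.

    import os, shutil, subprocess

    STAGING = "/var/backups/staging"
    EXTERNAL = "/mnt/esata"                     # already-mounted external drive
    PASSFILE = "/root/offsite.passphrase"       # hypothetical symmetric key file

    def archive_host(host_dir, name):
        tarball = os.path.join(STAGING, name + ".tar.gz")
        encrypted = tarball + ".gpg"
        # 1. tar up the latest backup generation (dirs assumed date-named)
        latest = sorted(os.listdir(host_dir))[-1]
        subprocess.check_call(["tar", "-czf", tarball, "-C", host_dir, latest])
        # 2. symmetric AES encryption so the drive can safely go offsite
        subprocess.check_call(["gpg", "--batch", "--yes", "--symmetric",
                               "--cipher-algo", "AES256",
                               "--passphrase-file", PASSFILE,
                               "-o", encrypted, tarball])
        os.unlink(tarball)
        # 3. move the encrypted archive onto the external drive
        shutil.move(encrypted, os.path.join(EXTERNAL, os.path.basename(encrypted)))

    for host in os.listdir("/backups"):
        path = os.path.join("/backups", host)
        if os.path.isdir(path):
            archive_host(path, host)

A real version would add the mount/unmount handling, free-space checks, and monitoring alerts Adam mentions.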
From: Jim W. <pr...@gm...> - 2009-08-19 12:18:49
On 8/19/09, David <wiz...@gm...> wrote: > 2) Is there a recommended approache for "backing up" BackupPC databases? > In case they go corrupt and so on. Or is a simple rsync safe? ... > PS: Random question: Does backuppc have tools for making offsite, > offline backups? Like copying a subset of the recent BackupPC backups > over to a set of external drives (in encrypted format) and then taking > the drives home or something like that. Hi again David - HashBackup could be used to backup your BackupPC server. Basically, you'd just hook an external USB drive to your server and take a backup. I've tested it with directories containing millions of files with 32000 hard links to each file on a Linux box with only 1GB of memory, so it scales very well and doesn't have rsync's memory problems. The backup is AES encrypted, and with a 3-4 line config file containing userids and passwords, you can send your encrypted backup offsite to Amazon S3, FTP servers, or remote ssh accounts. The beta site is: http://sites.google.com/site/hashbackup Jim |
From: Jim L. <tr...@ol...> - 2009-08-19 13:57:45
David wrote:
> And like I said before, this isn't a BackupPC-specific complaint, more a general problem with hardlink-based backup systems (as opposed to rdiffs, or various other schemes). So I'm checking how sysadmins typically handle these kinds of issues.

Typically, we exclude such application-specific trees from updatedb and other tree-traversal processes. Another best practice is to put the pool on its own filesystem/storage.

> Anyway, hopefully the above will give you a better idea of my angle on this. I'm not trying to criticize BackupPC, but rather figure out what kind of backup scheme is going to work here (and be easy to admin/diagnose/hack/etc), whether it is BackupPC, or something else (that may or may not use hardlinks).

I didn't think open-source backup systems that pool storage even existed before I found BackupPC, so if you're concerned about storage pooling, you only have to worry about BackupPC. Only commercial solutions do that (i.e. Connected Dataprotector), and that is one of the reasons I installed BackupPC last week and am doing my best to commit to it, since it can result in a huge savings win if you're backing up more than 5 machines with identical OSes/configurations.

-- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/
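On many Linux distributions, the exclusion Jim mentions is a one-line change to PRUNEPATHS in /etc/updatedb.conf. The exact file and variable name can vary by distribution and locate implementation, so treat this sketch as an assumption to verify locally:

    import re

    CONF = "/etc/updatedb.conf"
    EXCLUDE = "/backups"          # example mount point for the backup pool

    with open(CONF) as f:
        text = f.read()

    # Append the backup mount point to PRUNEPATHS if it is not already listed.
    m = re.search(r'^PRUNEPATHS="([^"]*)"', text, re.MULTILINE)
    if m and EXCLUDE not in m.group(1).split():
        new_line = 'PRUNEPATHS="%s %s"' % (m.group(1), EXCLUDE)
        text = text[:m.start()] + new_line + text[m.end():]
        with open(CONF, "w") as f:
            f.write(text)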
From: Les M. <les...@gm...> - 2009-08-19 14:02:10
David wrote: > > I haven't actually used BackupPC yet, mainly read through it's docs, > and trying to judge how well it and it's storage system would work in > our environment. Why not set up a test machine? It is trivial to install, especially if you use the ubuntu package or the one from the epel repository on RHEL or Centos. > Also, I'm not too experienced with backup "best practices", > methodologies, etc. Still learning, and seeing what works best. Backuppc is very configurable (browse though the docs and note all the settings you can change) but the defaults are pretty good so you can get reasonable results without changing much. >> Why not just exclude the _TOPDIR_ - or the mount point if this is on its >> own filesystem? >> > > Because most of the interesting files on the backup server (at least > in my case), are the files being backed up. I'm a lot more interested > in being able to quickly find those files, than random stuff under > /etc, /usr, etc. Backuppc provides a web interface for easy browsing, so if you know where something was on the original target you can find it easily. It does mangle the filenames and compress the contents so it is harder - but not impossible to work directly with the filesystem. Where it is appropriate, you can assign 'owners' of the target hosts so they can control and access them directly so you don't have to be involved. > 1. How well does BackupPC work when you manually make changes to the > pool behind it's back? (like removing a host, or some of the host's > history, via the command line). Can you make it "resync/repair" it's > database? Forcing a 'full' run will fix about anything. There are some tricks to keep the stats right - and I think someone on the list has a script to do things cleanly. But, drastic measures like that are rarely necessary because you can control expiration on a per-host basis and normally it takes care of itself. > 2) Is there a recommended approache for "backing up" BackupPC databases? > > In case they go corrupt and so on. Or is a simple rsync safe? This is a big issue. Up to a certain size (depending mostly on the number of files and amount of RAM you have), rsync -H will work, but there are limits. Image copies of the partition will always work. Personally I like to keep the archive small enough to fit on a single disk (so 2 TB or less these days) and raid-mirror to a swappable drive. > 3) Is it possible to use BackupPC's logic on the command-line, with a > bunch of command-line arguments, without setting up config files? It does have command line tools. But they are less convenient than letting the system work as designed. > That would be awesome for scripting and so on, for people who want to > use just parts of it's logic (like the pooled system for instance), > rather than the entire backup system. I tend to prefer that kind of > "unix tool" design. It's all in perl. If you want to change something you might as well do it in the base script... > Ah right. I think this is a fundamental difference in approach. With > the backup systems I've used before, space usage is going to keep > growing forever, until you take steps to fix it. Either manually, or > by some kind of scripting, and so far I haven't added scripting, so I > rely on du to know where to manually recover space. Expiration is designed in and tunable - per host. > And, if you have a lot of harddrive space on the backup server, then > may as well actually make use of it, to store as many versions as > possible. 
And then only remove oldest versions where needed. > > The above backup philosophy (based partly on rdiff-backup limitations) > has served me well so far, but I guess I need to unlearn some of it, > particularly if I want to use a hardlink-based backup system. There is also an 'archive host' concept to generate a fairly standard tar archive out of the backup for one or more of your targets - or you can do it with the command line tool. For really long term storage that is a better approach since you can restore it without any special programs - but you lose the space-sharing storage. > AA couple of questions, pardon my noobiness: > > If rsync is used, then what is the difference between an incremental > and a full backup? > > ie, do "full" backups copy all the data over (if using rsync), or > just the changed files? Fulls add the --ignore-times option to the run and re-reads everything on the target for a block-checksum comparison in addition to rebuilding the backup tree completely. > And, what kind of disadvantage is there if you only do (rsync-based) > incrementals and don't ever make full backups? Unless you do incremental 'levels', each incremental is based on the previous full so you end up copying more and more each run. > My angle is that Linux sysadmins have certain tools they like to use, > and saying they can't use them effectively due to the backup > architecture is kind of problematic. You get over that quickly when you have a system that takes care of itself. > 2) Stop trying to keep history for every single day for years (rather > keep 1 for the last X days, last Y weeks, Z months, etc). You can do an 'exponential' series to keep some old copies but they get farther apart as they get older. but it is better to get the things that need to be kept forever into some sort of version control system so backing up the current version of its repository lets you reconstruct the past. Then let the rest expire. > And also it bothers me that those kind of stats can potentially go out > of synch with the harddrive (maybe you delete part of the pool by > mistake). > > Is there a way to make BackupPC "repair" it's database, by re-scanning > it's pool? Or some kind of recommended procedure for fixing problems > like this? I think this happens nightly. > PS: Random question: Does backuppc have tools for making offsite, > offline backups? Like copying a subset of the recent BackupPC backups > over to a set of external drives (in encrypted format) and then taking > the drives home or something like that. > > Or alternately, are there recommended tools for this? I made a script > for this, but want to see how people here usually handle this. Image copies always work, rsync sometimes works. Even better is to just run another independent instance remotely and let it take care of itself. -- Les Mikesell les...@gm... |
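For a home-grown rsync snapshot scheme, the full-versus-incremental distinction Les describes corresponds roughly to how much re-verification you ask rsync to do on files that look unchanged. The sketch below is not BackupPC's transfer code (it uses its own rsync implementation in Perl); host names and paths are examples, and --checksum is used here as the verification knob analogous to the --ignore-times behaviour Les mentions.

    import datetime, os, subprocess

    def snapshot(host, src, dest_root, prev_snapshot=None, full=False):
        """Pull one hardlink snapshot of host:src with rsync. With --link-dest,
        files unchanged since the previous snapshot become hardlinks to it
        rather than new copies."""
        dest = os.path.join(dest_root, host, datetime.date.today().isoformat())
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        cmd = ["rsync", "-a", "--delete", "--numeric-ids"]
        if prev_snapshot:
            cmd.append("--link-dest=" + prev_snapshot)
        if full:
            # "full"-style run: verify even apparently unchanged files by
            # checksum instead of trusting size and mtime alone
            cmd.append("--checksum")
        cmd += [host + ":" + src.rstrip("/") + "/", dest + "/"]
        subprocess.check_call(cmd)
        return dest

    # snapshot("server1", "/srv/data", "/backups",
    #          prev_snapshot="/backups/server1/2009-08-18", full=False)

The trade-off is exactly the one discussed in the thread: the checksum pass catches files whose timestamps lie, at the cost of reading everything on the client again.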
From: David <wiz...@gm...> - 2009-08-20 08:52:06
|
Thanks for the replies so far :-) They were very informative. About BackupPC itself, I'm still evaluating whether or not to actually use it, but I'm starting to decide against it. Here are my reasons: 1) We're not backing up a lot of machines with a huge amount of duplicate data between machines. Just about every server and user's data is different. The common stuff between servers (/usr/, etc) isn't that large (the vast majority of storage is for unique data). For user machines, we back up their user folders, not the entire C: drive. Pooling common files from different machines isn't a priority. 2) A big problem is user dbx files, 2 GB and the like - I don't want to store multiple copies of those. Actually a reverse diff-based approach works a lot better imo. My current backup system lets me define which kind of "history storage" system to use for backups. rdiff-backup (for most places, where it works), and hardlinks, for servers which cause problems with rdiff-backup (although that led to my current problems with du & locate, which I'm currently researching). I might add more "history storage" systems if I find something more appropriate later (eg HashBackup or gibak), or write my own. I lose that kind of flexibility if I change to most "fully integrated, prepackaged" backup systems (like BackupPC and most others), as opposed to command-line tools which you can script and mix & match to get a backup system that works best for your setup. Which is also why I asked earlier about the ability to mix and match parts of BackupPC separately :-) Yeah, there are downsides to home-brewed stuff, and I prefer to use premade stuff most of the time. But when none of the existing stuff matches my needs, I won't hesitate to throw together scripts that do it better (for my needs), by scripting command-line tools, or writing new tools and then calling them. That's the unixy way :-). You can read more here: http://en.wikipedia.org/wiki/Unix_philosophy#Mike_Gancarz:_The_UNIX_Philosophy 3) The incremental/full system of BackupPC is bothersome. I don't want to copy over full servers later, after the initial rsync (or if I do, relatively infrequently, like once a month). I actually do want most of the backups to be incremental (ie, how rsync does it in hardlink snapshot-like schemes). But a lot can change during incrementals, and dealing with those multiple incremental levels seems kind of annoying, although I'm sure there are good reasons for them. 4) Possibly lots of redundant config in the text files. This is very minor in the grand scheme of things, but it's a pet peeve of mine with backup systems I've seen in general. Every single server or user backup etc has the complete backup details in the config files, even if they are 99% similar. This violates the "DRY" (Don't Repeat Yourself) programming principle (yeah, I treat backup configuration as a programming exercise :-) ). Like if you have 40 servers, then it looks like you need to define all the details for all servers, rather than just defining the parts that changed per-server. In my own config, adding a new server to the backup config is as simple as adding one line like this to my server backups config file: bkp('192.168.0.2 complete backup') # Router This basically adds the complete specification for a backup to a list of backups to be run (after all the backup configs are loaded into memory, and filtered according to the command-line arguments passed to the main backup script, which itself is run daily from cron).
Earlier in the config file for server backups, there is a Python class definition where you define the details, kind of like a template. And those classes can also inherit from other Python classes, to customize a few details, or take advantage of other Python programming constructs. Also, passwords are stored separately, in a secure text file, using a ~/.pgpass-like format that supports wildcards for individual fields (for those of you familiar with PostgreSQL). eg: rsync:192.168.0.2::root:rrbackups:gib5Gryn (gib5Gryn is a password I just generated with apg, for this example) Although, these types of config files are more oriented toward people who prefer to edit text files directly (eg, programmers like myself :-) ) and understand how classes, inheritance, and other programming-related things work, rather than going through a web frontend. And adding a templating-type system can introduce more complexity by itself. Web frontends like BackupPC's are probably a lot more usable in general though, especially for non-programmers :-) David. |
From: Les M. <les...@gm...> - 2009-08-20 12:48:57
|
David wrote: > [... full quote of the previous message snipped ...]
So far the only thing your evaluation is actually right about is that BackupPC isn't great at handling large files that have small changes in each run - although if they are compressible it may still be a win compared to other approaches. It would save everyone a lot of time if you just tried it instead of guessing about the way it works and assuming it is wrong. (For example, the config files only have per-host differences and inherit all other values from the master, and when you add a new host in the web interface you can tell it to copy an existing host config so you don't even have to retype that part.) And everything is just Perl snippets that you can hand-edit if you prefer. -- Les Mikesell les...@gm... |
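For illustration, a per-host override in the pc/ directory only needs to name the values that differ from the master config; everything else is inherited. A minimal sketch (the hostname, shares, and install path are made up for the example):

    # pc/router.pl - anything not set here comes from the master config.pl
    $Conf{XferMethod}     = 'rsync';
    $Conf{RsyncShareName} = ['/etc', '/home'];
    $Conf{FullKeepCnt}    = 2;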
From: Carl W. S. <ch...@re...> - 2009-08-20 13:17:22
|
On 08/20 10:51 , David wrote: > 3) The incremental/full system of BackupPC is bothersome. They're not really 'full' backups in the traditional sense when you use rsync as the transport mechanism. The 'incremental levels' are a one-option configuration setting, and there's an example provided in the config file. > 4) Possibly lots of redundant config in the text files Nope. You only set what you need changing in the per-host config file. BackupPC isn't appropriate for all backup situations, but it works pretty darned well. Try it. Read through config.pl and it will all make much more sense. -- Carl Soderstrom Systems Administrator Real-Time Enterprises www.real-time.com |
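For reference, the 'one-option' setting Carl mentions is $Conf{IncrLevels}. A hedged example (values illustrative; the shipped config.pl carries a commented example with the exact semantics) that makes each day's incremental diff against the previous incremental instead of always against the last full:

    $Conf{IncrPeriod} = 0.97;                  # roughly daily incrementals
    $Conf{IncrLevels} = [1, 2, 3, 4, 5, 6];    # multi-level incrementals between fulls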
From: John R. <rou...@re...> - 2009-08-20 16:38:18
|
On Thu, Aug 20, 2009 at 08:17:10AM -0500, Carl Wilhelm Soderstrom wrote: > On 08/20 10:51 , David wrote: > > 4) Possibly lots of redundant config in the text files > > nope. you only set what you need changing in the per-host config file. Well not quite. It's getting better with $Conf{RsyncArgsExtra} for example. I don't have to copy the whole $Conf{RsyncArgs} stanza into my pc/hostname.pl file. If I want to append to or remove a particular entry from $Conf{RsyncShareName} in config.pl, I have to manually copy the current definition into pc/hostname.pl because I can't directly affect that definition. If I add a new entry to the config.pl copy, it doesn't propagate to the other hosts. Now that being said, I generate the config files as part of the CM system used to manage my systems. It would certainly be possible to eliminate the redundancy to a large extent by using filepp, make etc. -- -- rouilj John Rouillard System Administrator Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111 |
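One way to get that kind of reuse without filepp or make is a small generator that writes minimal pc/*.pl override files from shared recipes, along the lines of what John describes doing from his CM system. A rough, untested sketch - the group names, shares, and the /etc/backuppc/pc path are all invented for the example:

    #!/usr/bin/perl
    # gen-host-configs.pl - emit per-host override files from shared group recipes
    use strict;
    use warnings;

    my %recipe = (
        webserver => { shares => ['/etc', '/var/www'],       fulls => 4 },
        dbserver  => { shares => ['/etc', '/var/lib/pgsql'], fulls => 8 },
    );
    my %hosts = ( web01 => 'webserver', web02 => 'webserver', db01 => 'dbserver' );

    for my $host (sort keys %hosts) {
        my $r = $recipe{ $hosts{$host} };
        open my $fh, '>', "/etc/backuppc/pc/$host.pl" or die "$host: $!";
        print $fh "# generated from group '$hosts{$host}' - do not edit by hand\n";
        print $fh "\$Conf{XferMethod} = 'rsync';\n";
        print $fh "\$Conf{FullKeepCnt} = $r->{fulls};\n";
        print $fh "\$Conf{RsyncShareName} = ["
                . join(', ', map { "'$_'" } @{ $r->{shares} }) . "];\n";
        close $fh;
    }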
From: David <wiz...@gm...> - 2009-08-20 18:07:46
|
By the way, sorry for the tone of my previous mail. I realized afterwards that I came across as condescending. I think I get a bit too obsessed with an idea or mindset sometimes. Thanks for bearing with my noob questions and attitude. And I do need to study BackupPC more before making ignorant assumptions :-( For my immediate problem, I'm probably going to switch back to rdiff-backup, but give it the --no-hard-links option, to reduce memory usage. I totally missed that before. And I seriously don't need to preserve /usr/bin/ etc hardlinks, or I can generate a list of files to hardlink together (after restoration) separately. Longer term, probably switching over to BackupPC is better. I'll probably start by migrating some of the backups in the near future. Still not too sure about the DBX mail files; I'll have to consider that further. Maybe it's not as big an issue as I think it's going to be, especially if I'm not keeping every single daily version of DBX files for the past X years. David. |
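For the 'generate a list of files to hardlink together' idea, a rough, untested sketch of the sort of helper meant here (the output format is just one possibility): it walks a tree and prints one line per group of paths that share an inode, so the links could be recreated after a restore.

    #!/usr/bin/perl
    # list-hardlink-groups.pl [dir] - print groups of paths that share an inode
    use strict;
    use warnings;
    use File::Find;

    my %group;    # "device:inode" => [paths]
    find({ no_chdir => 1, wanted => sub {
        my ($dev, $ino, $mode, $nlink) = lstat($_);
        return unless defined $nlink && -f _ && $nlink > 1;
        push @{ $group{"$dev:$ino"} }, $_;
    } }, $ARGV[0] // '.');

    # Only groups where more than one link falls inside the scanned tree matter.
    for my $paths (values %group) {
        print join("\t", @$paths), "\n" if @$paths > 1;
    }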
From: Jim L. <tr...@ol...> - 2009-08-20 20:15:39
|
John Rouillard wrote: > Well not quite. It's getting better with $Conf{RsyncArgsExtra} for > example. I don't have to copy the whole $Conf{RsyncArgs} stanza into > my pc/hostname.pl file. I haven't ever touched/created a hostname.pl file -- I've done everything in the GUI with "override" checked for a particular host's configuration. It creates very neat and tidy hostname.pl files for me, with only the options I've overridden. -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |
From: John R. <rou...@re...> - 2009-08-20 21:55:46
|
On Thu, Aug 20, 2009 at 03:14:49PM -0500, Jim Leonard wrote: > John Rouillard wrote: > > Well not quite. It's getting better with $Conf{RsyncArgsExtra} for > > example. I don't have to copy the whole $Conf{RsyncArgs} stanza into > > my pc/hostname.pl file. > > I haven't ever touched/created a hostname.pl file -- I've done > everything in the GUI with "override" checked for a particular host's > configuration. It creates very neat and tidy hostname.pl files for me, That doesn't scale well when you are running a couple of hundred hosts across three different backup servers and you have standard backup recipes for particular services on those hosts. When you change the services so that new backups have to be added (i.e. change recipes), or you move services and the configs have to change, it's a lot easier to have a single set of hostname.pl files to distribute to all the backup servers. Doing it this way means not going "oops, you mean we didn't have an off-site backup of that filesystem?" and makes auditing on a regular basis (say weekly) possible. -- -- rouilj John Rouillard System Administrator Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111 |
From: Les M. <les...@gm...> - 2009-08-20 22:33:50
|
John Rouillard wrote: > On Thu, Aug 20, 2009 at 03:14:49PM -0500, Jim Leonard wrote: >> John Rouillard wrote: >>> Well not quite. It's getting better with $Conf{RsyncArgsExtra} for >>> example. I don't have to copy the whole $Conf{RsyncArgs} stanza into >>> my pc/hostname.pl file. >> I haven't ever touched/created a hostname.pl file -- I've done >> everything in the GUI with "override" checked for a particular host's >> configuration. It creates very neat and tidy hostname.pl files for me, > > That doesn't scale well when you are running a couple of hundred hosts > across three different backup servers and you have standard backup > recipes for particular services on those hosts. Unless you get 'owners' to go with the hosts... > When you change the services so that new backups have to be added > (i.e. change recipes), or you move services and the configs have to > change, it's a lot easier to have a single set of hostname.pl files to > distribute to all the backup servers. Doing it this way means not > going "oops, you mean we didn't have an off-site backup of that > filesystem?" and makes auditing on a regular basis (say weekly) > possible. I think it would be a little nicer if there were another layer of inheritance - like a group config file that could be evaluated between the master and per-host configs - so you could control settings that are common to several machines in one place. But auditing should probably be done against the archive filesystem instead of the configs. -- Les Mikesell les...@gm... |
From: Jim L. <tr...@ol...> - 2009-08-21 00:49:55
|
John Rouillard wrote: > That doesn't scale well when you are running a couple of hundred hosts > across three different backup servers and you have standard backup > recipies for particular services on those hosts. Using the GUI doesn't scale well, no, but it is not a requirement to include a copy of the entire config.pl for every host. If you want to do so, nothing is stopping you, but it's not required, which was my point. -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |
From: Michael S. <ms...@ch...> - 2009-08-20 14:47:36
|
> Thanks for the replies so far :-) They were very informative. > About BackupPC itself, I'm still evaluating whether or not to actually > use it, but I'm starting to decide against it. Here are my reasons: Not that I'm trying to sway your opinion either way, but since the majority of your analysis, though detailed, is steeped in ignorance, you're projecting the impression of somebody who has trouble changing paradigms. Not that there's really anything wrong with that; not everybody thinks flexibly, and it's not always useful to do so. I wouldn't use BackupPC to back up my Oracle data, since stepping outside Oracle's backup and recovery paradigm is generally a bad idea. On the other hand, I can't imagine inventing my own overly-complicated system for backing up Outlook Express files unless I really had nothing better to do and there was some kind of biblical passage commanding me not to purchase a cheap 500G drive whenever necessary and stop being a pain in the ass. While I appreciate your brief-albeit-misplaced-and-weirdly-patronizing lecture on the Unix philosophy, I'd recommend starting with your Backup and Recovery goals and priorities. I'd suggest that manual space management and diddling around with low-level tools probably shouldn't be at the top of your list, since for many people in the US, it only takes a few hours of their time to equal the cost of a terabyte of storage. Your mileage may vary. I'll briefly outline my own priorities for a backup system: 1) It must be reliable 2) Files to recover must be less than 24 hours out of date 3) Recovery must be simple 4) It must take very little time and effort to maintain #1 implies a great deal, including sanity checking, notifications, awareness of free space, and so on. Having done a few bare-metal restores and considerably more registry and spot file recovery, I can say without question, and as a professional programmer, that I really do not want to have to worry about writing and maintaining all that by myself. |
From: dan <dan...@gm...> - 2009-08-23 03:25:11
|
Unfortunately, every backup option you have has some limitations or imperfections. Hardlinks have their pros and cons. Really, there are only a few ways of doing incremental managed backups: hardlinks, diff files, diff file lists, and SQL. Hardlinks are nice because they are inexpensive: looking at the directory contents of a backup that uses hard links requires no extra overhead. Diff files and diff file lists (the first being where a diff is taken of each individual file and only the changes are stored, a diff file list being where only the files that have changed are stored) require an algorithm to recurse the other directories that hold the real data and overlay the backup on the previous one. The only option that is more efficient than hardlinks would really be storing files in SQL and also storing an MD5, then linking the rows in SQL. Very similar to a hardlink, but instead it's just a row pointer. This would be many times faster than doing hardlinks in a filesystem, because SQL selects in a hierarchy based on significant data. It would be like BackupPC only having one host with one backup on it when you are looking at the web interface; all the other hosts and backups etc are already excluded. SQL file storage for BackupPC has been discussed extensively on this list, and suffice it to say that opinions are very split, and for good reason. SQL (MySQL specifically, but it applies to all) is much, much better at some tasks than a traditional filesystem (searching for data! - orders of magnitude faster), but a filesystem is also much, much better at simply storing files. Some hybrid could take the pros of each, such as storing all of the pointer data in MySQL and storing the actual files under their MD5 names on a filesystem: simply MD5 a file, push the MD5 off to MySQL with the host and backup date, filename, and file path, and write the file to the filesystem. Incremental backups would MD5 a file and search the database for the MD5; if found, write a pointer to that entry, and if not, write a new entry with the MD5 of the file, the hostname, file path and file name, and the backup number (or date). All the files would just be stored under their MD5 name. Recovering the files would be less transparent, but would only require an SQL query to pull the list of files based on hostname and backup number and then pull those files, renamed, into a zip or tar file. |
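A very rough sketch of the hybrid described above - hash the file, record a pointer row in a database, and store the content once under its digest. Everything here is untested and illustrative: the table layout, paths, and use of SQLite are assumptions, and (as later replies note) you would want a stronger digest than MD5 alone in practice:

    #!/usr/bin/perl
    # pool-file.pl <host> <backupnum> <file> - hybrid SQL-index / filesystem-pool sketch
    use strict;
    use warnings;
    use DBI;
    use Digest::MD5;
    use File::Copy qw(copy);

    my ($host, $backupnum, $file) = @ARGV;
    my $pooldir = '/var/backups/pool';     # content-addressed file store (assumed path)
    my $dbh = DBI->connect('dbi:SQLite:dbname=/var/backups/index.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS files
              (host TEXT, backupnum INTEGER, path TEXT, digest TEXT)');

    # Hash the file contents; the digest becomes the pool filename.
    open my $fh, '<', $file or die "$file: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    # Store the content only if this digest has not been seen before.
    copy($file, "$pooldir/$digest") unless -e "$pooldir/$digest";

    # Record the pointer row; a restore selects by host + backupnum and copies back.
    $dbh->do('INSERT INTO files (host, backupnum, path, digest) VALUES (?, ?, ?, ?)',
             undef, $host, $backupnum, $file, $digest);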
From: Michael S. <ms...@ch...> - 2009-08-23 05:50:08
|
> Unfortunately, every backup option you have has some limitations or > imperfections. Speaking of imperfections, your scheme doesn't take into account that multiple files can (and will) have the same MD5 hashes. > The only option that is more efficient than hardlinks would really be > storing files in SQL and also storing an MD5, then linking the rows in > SQL. One wonders what measure of efficiency you're using, but frankly, I fail to believe that there are only two ways of storing matching files as few times as possible. Hashed pointers to a pseudo-directory system come to mind immediately, a scheme that's in use in at least one system. At any rate, what problem are we trying to address? |
From: dan <dan...@gm...> - 2009-08-23 16:30:27
|
Speed. BackupPC is constrained by I/O performance; a bottleneck on the system is that the storage volume must be a single filesystem due to hardlinks. It has been measured a number of times on this mailing list that I/O is the major bottleneck for BackupPC. Getting faster hardware certainly helps, but the reliance on a single filesystem for all data is a bottleneck for performance, as well as an irritation when upgrading storage, as you either need to add additional raid arrays (since expanding a raid is not generally an option) or just use JBOD with LVM or something. Not ideal. My solution is to break the backup scheme into smaller chunks and have a number of BackupPC servers handling a set number of clients. The issues here are complexity, as I need to admin a number of servers, and loss of the file de-duping. In my organization, like many others, each client will have absolutely identical files. 4 backup machines means that a massive amount of data is duplicated 4 times PLUS whatever redundancy is in the raid. A hybrid platform can use the filesystem's strengths and a database's strengths and not have most of the weaknesses. My example was a simplistic one. Sure, MD5 can have some collisions, so either do MD5+SHA1 or just do SHA2. You would need to store a few more pieces of data, but I think it would be hard to argue against MySQL being many orders of magnitude faster at finding data than a filesystem, just like it is hard to argue against a filesystem being many times faster at simply storing files, and even faster at storing large files. Other benefits of the hybrid system are that the files can be on different volumes than the database. In fact, because you store the file's location on disk in the database, you could store files on many different disks, with no issues with hardlinks. Because of this, you could put two BackupPC machines together in a cluster, and each instance of BackupPC would look at the same database (or replicated data in its own database) and be able to do online replication of the file store on other servers. They could automatically duplicate these files on their own local file store, and because there are not millions of hardlinks to worry about, rsync can actually be useful in syncing up file stores to other BackupPC machines. Sure, you will still have a lot of files, but a lot fewer files for rsync to track. rsync can handle a lot of files. With BackupPC, rsync actually has to track every instance of every file from each host and each backup number, plus the pool. Without the hardlink pooling, rsync would only have to see each file once. |
From: Jim L. <tr...@ol...> - 2009-08-24 00:57:50
|
dan wrote: > Speed. BackupPC is constrained by I/O performance; a bottleneck on > the system is that the storage volume must be a single filesystem due to > hardlinks. Then use a better filesystem. I run BackupPC on an OpenSolaris system that uses ZFS as the storage pool, and I/O is the *last* of my worries (since the box is an older machine with only a single processor, CPU usage is my main worry, as File::RsyncP is not as efficient as binary rsync). > that I/O is the major bottleneck for BackupPC. Getting faster hardware > certainly helps but the reliance on a single filesystem for all data is > a bottleneck for performance as well as an irritation when upgrading > storage as you either need to add additional raid arrays (as expanding a > raid is not generally an option) or just use JBOD with LVM or > something. Like I said, use a more appropriate filesystem. Use ZFS, JFS, or XFS (or Reiser), but not ext2/3, as those are jokes when it comes to performance. > My solution is to break the backup scheme into smaller chunks and have a > number of BackupPC servers handling a set number of clients. The issues > here are complexity as I need to admin a number of servers and loss of > the file de-duping. In my organization like many others, each client > will have absolutely identical files. 4 backup machines means that a > massive amount of data is duplicated 4 times PLUS whatever redundancy is > in the raid. Keep in mind that BackupPC has a limited scope -- small to medium-sized organizations. If you have over 100 clients to back up, it is expected that you will run multiple BackupPC servers. If you have more than 500+ clients to back up, it is expected that you will invest in a commercial solution designed for that kind of enterprise. > Other benefits of the hybrid system are that the files can be on > different volumes than the database. In fact, because you store the > file's location on disk in the database, you could store files on many > different disks, with no issues with hardlinks. If this is your point, then it's somewhat valid in that you are arguing for a system where the storage is modular. There's nothing wrong with that, but that's not the scope of BackupPC. BackupPC's core strength, one that no other open-source backup solution has, is pooling of like data, and that is the reason I've implemented it. If you want a system where the back-end storage is modular, choose Amanda or Bacula. -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |
From: Jim L. <tr...@ol...> - 2009-08-24 01:17:28
|
Jim Leonard wrote: > (since the box is an older machine with only a single processor, CPU > usage is my main worry as File::RsyncP is not as efficient as binary rsync). Actually, since BackupPC_dump does a lot more than just emulate rsync, this was not a fair statement. However, it is a fair statement to complain that BackupPC_dump is not multi-threaded, which would really help on multi-CPU systems (the copy could be one thread, the comparison another, the compression a third, etc.). Hopefully that's on the development roadmap? -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |