From: David <wiz...@gm...> - 2009-08-17 11:53:08
Hi there. Firstly, this isn't a backuppc-specific question, but it is relevant to backuppc users (due to the backuppc architecture), so there might be people here with insight on the subject (or maybe someone can point me to a more relevant project or mailing list).

My problem is as follows... with backup systems based on complete hardlink-based snapshots, you often end up with a very large number of hardlinks, e.g. at least one per file, per backup generation. Now, this is fine most of the time... but there is a problem case that comes up because of this. If the servers you're backing up themselves have a huge number of files (hundreds of thousands, or even millions), you end up making a huge number of hardlinks on your backup server for each backup generation. Although inefficient in some ways (using up a large number of inode entries in the filesystem tables), this can work pretty nicely.

Where the real problem comes in is if admins want to use 'updatedb' or 'du' on the Linux system. updatedb builds a *huge* database and uses up tons of CPU & RAM (so I usually disable it). And 'du' can take days to run and produce multi-GB output files.

Here's a question for backuppc users (and people who use hardlink snapshot-based backups in general)... when your backup server, which has millions of hardlinks on it, is running low on space, how do you correct this? The most obvious thing is to find which host's backups are taking up the most space, and then remove some of the older generations. Normally the simplest way to do this is to run a tool like 'du', and then perhaps view the output in xdiskusage. (One interesting thing about 'du' is that it's clever about hardlinks, so it doesn't count the disk usage twice. I think it must keep an in-memory table of visited inodes that have a link count of 2 or greater.) However, with a gazillion hardlinks, du takes forever to run and produces massive output. In my case, about 3-4 days and a 4-5 GB output file.

My current setup is a basic hardlink snapshot-based backup scheme, but backuppc (due to its pool structure, where hosts have generations of hardlink snapshot dirs) would have the same problems. How do people solve the above problem? (I also imagine that running "du" to check disk usage of backuppc data is complicated by the backuppc pool, but at least you can exclude the pool from the "du" scan to get more usable results.)

My current fix is an ugly hack, where I go through my snapshot backup generations (from oldest to newest) and remove all redundant hard links (i.e. ones that point to the same inodes as the corresponding entries in the next-most-recent generation). That info then goes into a compressed text file that could be restored from later. After that, I compare the next two most-recent generations, and so on. But yeah, that's a very ugly hack... I want to do it better and not reinvent the wheel. I'm sure this kind of problem has been solved before.

fwiw, I was using rdiff-backup before. It's very du-friendly, since only the differences between backup generations are stored (rather than a large number of hardlinks). But I had to stop using it, because on servers with a huge number of files it uses up a huge amount of memory and CPU, and takes a really long time. And the mailing list wasn't very helpful with trying to fix this, so I had to change to something new so that I could keep running backups (with history). That's when I changed over to a hardlink snapshots approach, but that has other problems, detailed above. And my current hack (removing all redundant hardlinks and empty dir structures) is kind of similar to rdiff-backup, but coming from the other direction.

Thanks in advance for ideas and advice.

David.
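For what it's worth, the redundant-hardlink pruning David describes can be sketched in a few lines of Python. This is only an illustration of the idea, not his actual script; the snapshot layout under /backups/<host>/<date> and the log format are invented for the example.

    import os, gzip

    def prune_redundant_links(older, newer, logfile):
        """Remove entries in the older snapshot whose inode is identical to the
        corresponding path in the next-newer snapshot, logging what was removed
        so the generation could be reconstructed later. Illustrative sketch only."""
        removed = 0
        with gzip.open(logfile, "wt") as log:
            for dirpath, dirnames, filenames in os.walk(older):
                rel = os.path.relpath(dirpath, older)
                for name in filenames:
                    old_path = os.path.join(dirpath, name)
                    new_path = os.path.join(newer, rel, name)
                    try:
                        old_st = os.lstat(old_path)
                        new_st = os.lstat(new_path)
                    except OSError:
                        continue   # path does not exist in the newer snapshot
                    if (old_st.st_dev, old_st.st_ino) == (new_st.st_dev, new_st.st_ino):
                        # Same inode: the newer generation already references this
                        # data, so this extra directory entry is redundant.
                        log.write("%s\t%d\n" % (os.path.join(rel, name), old_st.st_ino))
                        os.unlink(old_path)
                        removed += 1
        return removed

    if __name__ == "__main__":
        n = prune_redundant_links("/backups/host1/2009-08-16",
                                  "/backups/host1/2009-08-17",
                                  "/backups/host1/2009-08-16.pruned.gz")
        print("removed %d redundant links" % n)

A complete version would also clean up the directories this leaves empty, which is the other half of the hack David mentions.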
From: Les M. <les...@gm...> - 2009-08-17 13:06:09
David wrote: > > Where the real problem comes in, is if admins want to use 'updatedb', > or 'du' on the linux system. updatedb gets a *huge* database and uses > up tonnes of cpu & ram (so, I usually disable it). And 'du' can take > days to run, and make multi-gb files. You can exclude directories from the updatedb runs. Du doesn't make any files unless you redirect its output - and it can be constrained to the relevant top level directories with the -s option. > Here's a question for backuppc users (and people who use hardlink > snapshot-based backups in general)... when your backup server, that > has millions of hardlinks on it, is running low on space, how do you > correct this? Backuppc maintains its own status showing how much space the pool uses and how much is left on the filesystem. So you just look at that page often enough to not run out of space. > The most obvious thing is to find which host's backups are taking up > the most space, and then remove some of the older generations. > > Normally the simplest method to do this, is to run a tool like 'du', > and then perhaps view the output in xdiskusage. (One interesting thing > about 'du', is that it's clever about hardlinks, so doesn't count the > disk usage twice. I think it must keep a table in memory of visited > inodes, which had a link count of 2 or greater). > > However, with a gazillion hardlinks, du takes forever to run, and has > a massive output. In my case, about 3-4 days, and about 4-5 GB output > file. > > My current setup is a basic hardlink snapshot-based backup scheme, but > backuppc (due to it's pool structure, where hosts have generations of > hardlink snapshot dirs) would have the same problems. > > How do people solve the above problem? Backuppc won't start a backup run if the disk is more than 95% (configurable) full. > (I also imagine that running "du" to check disk usage of backuppc data > is also complicated by the backuppc pool, but at least you can exclude > the pool from the "du" scan to get more usable results). > > My current fix is an ugly hack, where I go through my snapshot backup > generations (from oldest to newest), and remove all redundant hard > links (ie, they point to the same inodes as the same hardlink in the > next-most-recent generation). Then that info goes into a compressed > text file that could be restored from later. And after that, compare > the next 2-most-recent generations and so on. > > But yeah, that's a very ugly hack... I want to do it better and not > re-invent the wheel. I'm sure this kind of problem has been solved > before. It is best done pro-actively, avoiding the problem instead of trying to fix it afterwards because with everything linked, it doesn't help to remove old generations of files that still exist. So generating the stats daily and observing them (both human and your program) before starting the next run is the way to go. > fwiw, I was using rdiff-backup before. It's very du-friendly, since > only the differences between each backup generation is stored (rather > than a large number of hardlinks). But I had to stop using it, because > with servers with a huge number of files it uses up a huge amount of > memory + cpu, and takes a really long time. And the mailing list > wasn't very helpful with trying to fix this, so I had to change to > something new so that I could keep running backups (with history). > That's when I changed over to a hardlink snapshots approach, but that > has other problems, detailed above. 
And my current hack (removing all > redundant hardlinks and empty dir structures) is kind of similar to > rdiff-backup, but coming from another direction. Also, you really want your backup archive on its own mounted filesystem so it doesn't compete with anything else for space and to give you the possibility of doing an image copy if you need a backup since other methods will be too slow to be practical. And 'df' will tell you what you need to know about a filesystem fairly quickly. -- Les Mikesell les...@gm... |
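Les's "don't start a run if the disk is too full" point is easy to reproduce in a home-grown scheme as well. A minimal sketch; the mount point and the 95% threshold are example values, not BackupPC's actual implementation:

    import os, sys

    def usage_pct(path):
        """Percentage of the filesystem holding `path` that is in use."""
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize   # space available to non-root users
        return 100.0 * (total - free) / total

    MOUNT = "/backups"    # example mount point for the backup filesystem
    MAX_PCT = 95.0        # example threshold, like BackupPC's configurable limit

    if usage_pct(MOUNT) > MAX_PCT:
        sys.exit("refusing to start backup: %s is more than %.0f%% full" % (MOUNT, MAX_PCT))
    # ...otherwise kick off the night's backup run here...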
From: David <wiz...@gm...> - 2009-08-18 14:25:55
Thanks for the replies.

On Mon, Aug 17, 2009 at 3:05 PM, Les Mikesell<les...@gm...> wrote:
> You can exclude directories from the updatedb runs

That only works if the data you want to exclude (such as older snapshots) is kept in a relatively small number of directories; otherwise you need to make a lot of exclude rules, like one for each backup. In my case, each backed-up server/user PC/etc. is independent and has its own directory structure with snapshots, etc. And actually backuppc also has a problematic layout for locate rules:

__TOPDIR__/pc/$host/nnn <- One of those directories for each backup version.

So basically, if you have a large number of files on a server, it seems like you need to exclude the server entirely from updatedb, otherwise the snapshot directories are going to cause a huge updatedb database. Which kind of defeats the point of having updatedb running on the backup server. Which is why I've disabled it here :-(.

> Du doesn't make any files unless you redirect its output

Usually I make du files on servers, so I can copy the files back to my workstation and use a graphical tool like xdiskusage to get a better idea of where space is used.

> - and it can be constrained to the relevant top level directories with the -s option.

Yep, but it is still going to take days :-(. And then afterwards you often still need to run 'du' on those lower levels to see where the space is actually going.

> Backuppc maintains its own status showing how much space the pool uses and how much is left on the filesystem. So you just look at that page often enough to not run out of space.

Sounds like a 'df'-like display on the web page, but for the backuppc pool rather than a partition. Please correct me if I'm mistaken, but that doesn't really help people who want to find which files and dirs are taking up the most space, so they can address it (like tweaking the number of backed-up generations, or excluding additional directories/file patterns, etc). Normally people use a tool like 'du' for that, but 'du' itself is next to unusable when you have a massive filesystem, which a hardlink snapshot-based backup system can easily create :-(

> Backuppc won't start a backup run if the disk is more than 95% (configurable) full.

Sounds useful, but it doesn't really address my problem of 'du' (and locatedb, and others) having major problems with this kind of backup layout.

> It is best done pro-actively, avoiding the problem instead of trying to fix it afterwards because with everything linked, it doesn't help to remove old generations of files that still exist. So generating the stats daily and observing them (both human and your program) before starting the next run is the way to go.

1. Removing old generations does help. The idea is to remove old "churn" that took place in that version. In other words, files which no longer have any references once that generation is removed (because all earlier generations referring to those files via hard links are also gone by that point).

2. Proactive is good, but again, with a massive directory structure it's hard to use tools like du to check which backups you need to fine-tune/prune/etc.

> Also, you really want your backup archive on its own mounted filesystem so it doesn't compete with anything else for space and to give you the possibility of doing an image copy if you need a backup since other methods will be too slow to be practical. And 'df' will tell you what you need to know about a filesystem fairly quickly.

Our backups are stored under an LVM which is used only for backups. But again, the problem is not disk usage causing issues for other processes. The problem is, once the allocated area is running out of space, how do you check *where* that space is going, so you can take informed action? 'df' is only going to tell you that you're low on space, not where the space is going.

- David.
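As an aside, the hardlink handling David guesses at is essentially right: du remembers each (device, inode) pair whose link count is greater than one and counts it only once. A rough sketch of that accounting, producing per-top-level-directory totals instead of a full du dump; the /backups path is an example:

    import os
    from collections import defaultdict

    def hardlink_aware_usage(root):
        """Sum bytes under each immediate subdirectory of `root`,
        counting every inode only once (as du does for hardlinks)."""
        seen = set()                  # (st_dev, st_ino) pairs already counted
        totals = defaultdict(int)
        for top in sorted(os.listdir(root)):
            top_path = os.path.join(root, top)
            if not os.path.isdir(top_path):
                continue
            for dirpath, dirnames, filenames in os.walk(top_path):
                for name in filenames:
                    try:
                        st = os.lstat(os.path.join(dirpath, name))
                    except OSError:
                        continue
                    if st.st_nlink > 1:
                        key = (st.st_dev, st.st_ino)
                        if key in seen:
                            continue  # hardlink to something already counted
                        seen.add(key)
                    totals[top] += st.st_size
        return totals

    for name, size in sorted(hardlink_aware_usage("/backups").items()):
        print("%10.1f MB  %s" % (size / 1e6, name))

Note that with millions of multiply-linked files the visited-inode table itself becomes huge, which is a large part of why du is so slow and memory-hungry on this kind of tree.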
From: Les M. <les...@gm...> - 2009-08-18 15:35:17
David wrote: > >> You can exclude directories from the updatedb runs > > Only works if the data you want to exclude (such as older snapshots) > are kept in a relatively small number of directories, or you need to > make a lot of exclude rules, like one for each backup. In my case, > each backed up server/user PC/etc, is independant, and has it's own > directory structure with snaphots, etc. > > And actually backuppc also has a problematic layout for locate rules: > > __TOPDIR__/pc/$host/nnn <- One of those directories for each backup version. > > So basically, if you have a large number of files on a server, it > seems like you need to entirely exclude the server from updatedb, > otherwise the snapshot directories are going to cause a huge updatedb > database. > > Which kind of defeats the point of having updatedb running on the > backup server. Which is why I've disabled it here :-(. Why not just exclude the _TOPDIR_ - or the mount point if this is on its own filesystem? >> Backuppc maintains its own status showing how much space the pool uses and how >> much is left on the filesystem. So you just look at that page often enough to >> not run out of space. > > Sounds like a 'df'- like display on the web page, but for the backuppc > pool rather than a partition. It keeps both a summary of pool usage (current and yesterday) and totals for each backup run of number of files broken down by new and existing files in the pool and the size before and after compression. A glance at the pool percent usage and daily change tells you where you stand. > Please correct me if I'm mistaken, but that doesn't really help people > who want to find which files and dirs are taking up the most space, so > they can address it (like, tweak the number of backed up generations, > or exclude additional directories/file patterns, etc). There's not a good way to figure out which files might be in all of your backups and thus not help space-wise when you remove any instance(s) of it. But the per-host, per-run stats where you can see the rate of new files being picked up and how much they compress is very helpful. > Normally people use a tool like 'du' for that, but 'du' itself is next > to unusable when you have a massive filesystem, which can easily be > created by hardlink snapshot-based backup systems :-( That's probably why backuppc does it internally - that and keeping track of compression stats and which files are new. >> It is best done pro-actively, avoiding the problem instead of trying to fix it >> afterwards because with everything linked, it doesn't help to remove old >> generations of files that still exist. So generating the stats daily and >> observing them (both human and your program) before starting the next run is the >> way to go. >> > > 1. Removing old generations does help. The idea is to remove old > "churn" that took place in that version. In other words, files which > no longer have any references after that generation is removed > (because all previous generations referring to those files via hard > links, are also gone by this point). Of course, but you do it by starting with a smaller number of runs than you expect to be able to hold. Then after you see that the space consumed is staying stable you can adjust the amount of history to keep. > 2. Proactive is good, but again, with a massive directory structure, > it's hard to use tools like du to check which backups you need to > finetune/prune/etc. This may well be a problem with whatever method you use. 
It is handled reasonably well in backuppc. >> Also, you really want your backup archive on its own mounted filesystem so it >> doesn't compete with anything else for space and to give you the possibility of >> doing an image copy if you need a backup since other methods will be too slow to >> be practical. And 'df' will tell you what you need to know about a filesystem >> fairly quickly. >> > > Our backups are stored under a LVM which is used only for backups. But > again, the problem is not disk usage causing issues for other > processes. The problem is, once the allocated area is running out of > space, how to check *where* that space is going to, so you can take > informed action. 'df' is only going to tell you that you're low on > space, not where the space is going. One other thing - backuppc only builds a complete tree of links for full backups, which by default run once a week with incrementals done on the other days. Incremental runs build a tree of directories, but only the new and changed files are populated, with a notation for deletions. The web browser and restore processes merge the backing full on the fly, and the expire process knows not to remove fulls until the incrementals that depend on them have expired as well. That, and the file compression, might take care of most of your problems. -- Les Mikesell les...@gm...
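To illustrate the "merge the backing full on the fly" behaviour Les describes, in a heavily simplified form: real BackupPC stores mangled, compressed file names plus per-directory attrib metadata and deletion markers, so the sketch below is only the concept, with invented paths.

    import os

    def merged_view(full_dir, incr_dirs):
        """Return {relative_path: actual_path} for the view a restore would see:
        start from the full backup, then overlay each incremental in order.
        Deletion markers are ignored here for simplicity."""
        view = {}
        for base in [full_dir] + list(incr_dirs):
            for dirpath, dirnames, filenames in os.walk(base):
                for name in filenames:
                    rel = os.path.relpath(os.path.join(dirpath, name), base)
                    view[rel] = os.path.join(dirpath, name)   # later trees win
        return view

    # e.g. view = merged_view("/backups/host1/full.0",
    #                         ["/backups/host1/incr.1", "/backups/host1/incr.2"])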
From: Jon C. <can...@gm...> - 2009-08-18 15:49:47
On Tue, Aug 18, 2009 at 10:25 AM, David<wiz...@gm...> wrote:
> Sounds useful, but it doesn't really address my problem of 'du' (and locatedb, and others) having major problems with this kind of backup layout.

A personal desire on your part to use a specific tool to get information that is presented in other ways hardly constitutes a problem with BackupPC. The linking structure within BackupPC is the "magic" behind deduping files. That it creates a huge number of directory entries with a resulting smaller number of inode entries is the whole point.

Use the status pages to determine where your space is going. They give you information about the apparent size (the full size if you weren't de-duping) and the unique size (the portion of each backup that was new). This information is a whole lot more useful than whatever you're going to get from du. du takes so long because it's a dumb tool that does what it's told, and you are in effect telling it to iterate across each server multiple times (once per retained backup). If you did this against the actual clients, the time would be similar to doing it against BackupPC's topdir.

As a side note, are you letting available space dictate your retention policy? It sounds like you don't want to fund the retention policy you've specified, otherwise you wouldn't be out of disk space. Buy more disk or reduce your retention numbers for backups.

Look at the Host Summary page. Those servers with the largest "Full Size" or a disproportionate number of retained fulls/incrementals are the hosts to focus pruning efforts on. Now select a candidate and drill into the details for that host. On the "Host ??? Backup Summary" page look at the "File Size/Count Reuse Summary" table. Look for backups with a large "New Files - Size/MB" value. These are the backups where your host gained weight. You can review the "XferLOG" to get a list of files in this backup (note that the number before the filename is the file size). Now you can go to the filesystem and wholesale delete a backup, or pick and choose through a backup for a particular file (user copies a DVD blob to their server). This won't immediately free the space (although someone posted a tool that will), as you will have to wait for the pool cleanup to run. If it's a particular file, you may need to go through several backups to find and kill the file (again, someone posted a tool to do this, I believe).

Voila, you've put your system on a diet. But beware: you do this once and management will expect you to keep solving their under-resourced backup infrastructure by doing it again and again. Each time you're forced to make decisions about whether a file is really junk or whether a user might crawl up your backside when they find it can't be restored. You've also violated the sanctity of your backups, and this could cause problems if you're ever forced to do some forensics on your system for a legal case.

-- Jonathan Craig
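Jon's apparent-size versus unique-size distinction can also be computed for a home-grown hardlink pool if you ever need it outside BackupPC's status pages. A rough sketch: apparent size counts every directory entry, unique size counts each underlying inode once.

    import os

    def apparent_and_unique(host_dir):
        """Return (apparent_bytes, unique_bytes) for one host's backup tree."""
        seen = set()
        apparent = unique = 0
        for dirpath, dirnames, filenames in os.walk(host_dir):
            for name in filenames:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue
                apparent += st.st_size           # every directory entry counts
                key = (st.st_dev, st.st_ino)
                if key not in seen:              # each inode counted only once
                    seen.add(key)
                    unique += st.st_size
        return apparent, unique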
From: David <wiz...@gm...> - 2009-08-19 10:37:44
Thanks for the replies.

Firstly, I think I should reiterate a few things I mentioned in the first post. I haven't actually used BackupPC yet; I've mainly read through its docs, and I'm trying to judge how well it and its storage system would work in our environment. I'm mainly asking questions on this list first, to get an idea of how well it handles the kind of issues I've experienced so far (with things like hardlinks to huge filesystems), before I spend more time playing with BackupPC and looking into migrating our backups to it. And like I said before, this isn't a BackupPC-specific complaint, more a general problem with hardlink-based backup systems (as opposed to rdiffs, or various other schemes). So I'm checking how sysadmins typically handle these kinds of issues.

Also, I'm not too experienced with backup "best practices", methodologies, etc. Still learning, and seeing what works best. And heh, our (relatively small) company didn't even have a real backup system before, and I'm still the only person here that seems to take them seriously >_>. Fortunately, the boss has started seeing the light (after a near disaster in the server room) and acquired some more hardware. But nobody besides me seems to have time to actually set things up and make sure they're running. And I'm not even one of the network admins/tech support; I'm actually a programmer, and I was never actually asked to work on the backups ^^; The actual network admins/tech support don't really know much about backups D: (or have time to work with them).

Anyway, hopefully the above will give you a better idea of my angle on this. I'm not trying to criticize BackupPC, but rather figure out what kind of backup scheme is going to work here (and be easy to admin/diagnose/hack/etc), whether it is BackupPC or something else (that may or may not use hardlinks).

On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell<les...@gm...> wrote:
> Why not just exclude the _TOPDIR_ - or the mount point if this is on its own filesystem?

Because most of the interesting files on the backup server (at least in my case) are the files being backed up. I'm a lot more interested in being able to quickly find those files than random stuff under /etc, /usr, etc.

> There's not a good way to figure out which files might be in all of your backups and thus not help space-wise when you remove any instance(s) of it. But the per-host, per-run stats where you can see the rate of new files being picked up and how much they compress is very helpful.

Thanks for this info. At least with per-host stats, it's easier to narrow down where to run du if I need to, instead of over the entire backup partition.

A couple of random questions:

1) How well does BackupPC work when you manually make changes to the pool behind its back (like removing a host, or some of the host's history, via the command line)? Can you make it "resync/repair" its database?

2) Is there a recommended approach for "backing up" BackupPC databases, in case they go corrupt and so on? Or is a simple rsync safe?

3) Is it possible to use BackupPC's logic on the command line, with a bunch of command-line arguments, without setting up config files? That would be awesome for scripting and so on, for people who want to use just parts of its logic (like the pooled storage, for instance) rather than the entire backup system. I tend to prefer that kind of "unix tool" design.

> Of course, but you do it by starting with a smaller number of runs than you expect to be able to hold. Then after you see that the space consumed is staying stable you can adjust the amount of history to keep.

Ah right. I think this is a fundamental difference in approach. With the backup systems I've used before, space usage is going to keep growing forever, until you take steps to fix it, either manually or by some kind of scripting. So far I haven't added scripting, so I rely on du to know where to manually recover space.

Basically, I was using rdiff-backup for a long time. That tool keeps all the history, until you run it with a command-line argument to prune the oldest revisions. And also, I don't see a great need to proactively recover space most of the time. The large majority of servers/users/etc have a relatively small amount of change. So it's kind of cool to be able to get *any* of the earlier daily snapshots, for the last few years.

Although ironically, the servers with the largest amount of churn (and hard drive usage on the backup server) are the ones you'd actually want to keep old versions for (like yearlies, monthlies, etc). But with rdiff-backup, that isn't really possible without some major repo surgery :-). You end up throwing away all the oldest versions when space runs low.

Also, I'm influenced by revision control tools, like git/svn/etc. I don't like to throw away old versions unless it's really necessary. And if you have a lot of hard drive space on the backup server, then you may as well actually make use of it, to store as many versions as possible, and then only remove the oldest versions where needed.

The above backup philosophy (based partly on rdiff-backup limitations) has served me well so far, but I guess I need to unlearn some of it, particularly if I want to use a hardlink-based backup system.

> One other thing - backuppc only builds a complete tree of links for full backups which by default run once a week with incrementals done on the other days. Incremental runs build a tree of directories but only the new and changed files are populated, with a notation for deletions. The web browser and restore processes merge the backing full on the fly and the expire process knows not to remove fulls until the incrementals that depend on it have expired as well. That, and the file compression might take care of most of your problems.

Ah, very interesting info, thanks. I read the info on incrementals in the docs, and mainly picked up that "rsync is a good thing" :-)

A couple of questions, pardon my noobiness: If rsync is used, then what is the difference between an incremental and a full backup? i.e., do "full" backups copy all the data over (if using rsync), or just the changed files? And what kind of disadvantage is there if you only do (rsync-based) incrementals and don't ever make full backups?

On Tue, Aug 18, 2009 at 5:49 PM, Jon Craig<can...@gm...> wrote:
> A personal desire on your part to use a specific tool to get information that is presented in other ways hardly constitutes a problem with BackupPC.

Again, I'm not criticizing BackupPC specifically. And indeed it seems that BackupPC has ways which can reduce the problem, specifically incremental backups, as opposed to a large number (hundreds/thousands) of "full" snapshot directories, each containing a huge number of hardlinks (possibly millions), for several such servers. My angle is that Linux sysadmins have certain tools they like to use, and saying they can't use them effectively due to the backup architecture is kind of problematic.

I guess, though, that the philosophy behind rdiff-backup (keep every single version, until you want to start removing the oldest) isn't really compatible with BackupPC, or other schemes that keep an actual filesystem entry for every version of every file, even when there are no changes in those files. Probably I need to think more about using a more traditional scheme (keep a fixed number of backups: X daily, Y weekly, Z monthly, etc), instead of "keep versions forever, until you need to start recovering hard drive space".

> The linking structure within BackupPC is the "magic" behind deduping files. That it creates a huge number of directory entries with a resulting smaller number of inode entries is the whole point.

Yeah, I like that. But the problem I see is this (from the BackupPC docs):

"Therefore, every file in the pool will have at least 2 hard links (one for the pool file and one for the backup file below __TOPDIR__/pc). Identical files from different backups or PCs will all be linked to the same file. When old backups are deleted, some files in the pool might only have one link. BackupPC_nightly checks the entire pool and removes all files that have only a single link, thereby recovering the storage for that file."

Therefore, if you want to keep tons of history (like every day for the past 3 years) for a server with lots of files, then it sounds like you need to actually have a huge number of filesystem entries.

I think if I wanted to use BackupPC, and still be able to use du and friends effectively, I'd need to do some combination of:

1) Use incrementals for most of the backups, to limit the number of hardlinks created, as Les Mikesell described.

2) Stop trying to keep history for every single day for years (rather, keep one for the last X days, last Y weeks, Z months, etc).

This would also mean having to spend less time managing space. Although at the moment it only comes up every few weeks/months, and had been pretty fast with du & xdiskusage, at least until I switched over from rdiff-backup to a "make a hardlink snapshot every day" process :-(.

> Use the status pages to determine where your space is going. It gives you information about the apparent size (full size if you weren't de-duping) and the unique size (that portion of each backup that was new). This information is a whole lot more useful than whatever you're going to get from du. du takes so long because it's a dumb tool that does what it's told and you are in effect telling it to iterate across each server multiple times (one per retained backup). If you did this against the actual clients the time would be similar to doing it against BackupPC's topdir.

And furthermore, hardlink-based storage does cause ambiguous du output, even if the time it took to run wasn't an issue. Which is another thing about hardlink-based backups which annoys me (compared to when I was using rdiff-backup), and one of the reasons why I'm currently running my own very hackish "de-duping" script on our backup server.

It's nice that BackupPC maintains these stats separately. Although it's kind of annoying (imo) that you have to go through its frontend to see this info, rather than being able to tell from standard Linux commands (for scripting purposes and so on). And it also bothers me that those kinds of stats can potentially go out of sync with the hard drive (maybe you delete part of the pool by mistake). Is there a way to make BackupPC "repair" its database by re-scanning its pool? Or some kind of recommended procedure for fixing problems like this?

> As a side note, are you letting available space dictate your retention policy? It sounds like you don't want to fund the retention policy you've specified, otherwise you wouldn't be out of disk space. Buy more disk or reduce your retention numbers for backups.

More like, there wasn't a backup or retention policy to begin with D:. I hacked together some scripts that use rdiff-backup and other tools, and then added them to the backup server crontab. And since we have a fairly large backup server (compared to the servers being backed up), I let the older backups build up for a while to take advantage of the space, and then free a chunk of space manually when the scripts email me about space issues. But now I can't "free a chunk of space manually" that easily any more, since "du" doesn't work :-(. At least thanks to the discussions in this thread, I have a few more ideas for my own scripts, even if I don't use BackupPC in the end.

> Look at the Host Summary page. Those servers with the largest "Full Size" or a disproportionate number of retained fulls/incrementals are the hosts to focus pruning efforts on. Now select a candidate and

Ah, thanks. This is very useful info. So you can find which files/transfers/etc caused a given host to use a huge amount of storage.

> Voila, you've put your system on a diet, but beware, you do this once and management will expect you to keep solving their under-resourced backup infrastructure by doing it again and again.

Well, the good news is that nobody here seems to care about the backups much, until the moment they're needed. The fact we have them at all is kind of a bonus D:. At least I'm starting to get the boss (we're a pretty small company) on my side. Just that nobody besides myself has time to work on things like this.

Anyway, thanks again for the replies. This thread has been educational so far :-)

David.

PS: Random question: Does backuppc have tools for making offsite, offline backups? Like copying a subset of the recent BackupPC backups over to a set of external drives (in encrypted format) and then taking the drives home, or something like that. Or alternately, are there recommended tools for this? I made a script for this, but want to see how people here usually handle this.
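The "X daily, Y weekly, Z monthly" retention David is considering is easy to express as a standalone selection step, independent of which tool creates the snapshots. A hypothetical sketch, assuming snapshot directories are named by ISO date:

    import datetime

    def backups_to_keep(dates, daily=14, weekly=8, monthly=12):
        """Given a list of datetime.date snapshot dates, return the set to keep:
        the last `daily` days, the newest snapshot of each of the last `weekly`
        weeks, and the newest of each of the last `monthly` months. Everything
        else is a candidate for deletion."""
        ordered = sorted(dates)
        keep = set(ordered[-daily:])
        by_week, by_month = {}, {}
        for d in ordered:
            wk = d.isocalendar()
            by_week[(wk[0], wk[1])] = d          # latest snapshot in each ISO week
            by_month[(d.year, d.month)] = d      # latest snapshot in each month
        keep.update(sorted(by_week.values())[-weekly:])
        keep.update(sorted(by_month.values())[-monthly:])
        return keep

    dates = [datetime.date(2009, 8, 1) + datetime.timedelta(days=i) for i in range(18)]
    print(sorted(backups_to_keep(dates, daily=5, weekly=2, monthly=1)))

Anything not in the returned set would be handed to something like the pruning script sketched earlier in the thread.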
From: Adam G. <mai...@we...> - 2009-08-19 11:58:30
David wrote: > On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell<les...@gm...> wrote: >> Why not just exclude the _TOPDIR_ - or the mount point if this is on its >> own filesystem? > Because most of the interesting files on the backup server (at least > in my case) are the files being backed up. I'm a lot more interested > in being able to quickly find those files than random stuff under > /etc, /usr, etc. Yes, and this is something I'd like to have in backuppc (please find a file on any host, in any backup number, with the string abc in its filename). This isn't possible without using the standard tools like find, and waiting for it to traverse all the directories and backups etc.. (well, you could use grep on the logfiles to find it, which would probably be faster)... >> There's not a good way to figure out which files might be in all of your >> backups and thus not help space-wise when you remove any instance(s) of >> it. But the per-host, per-run stats where you can see the rate of new >> files being picked up and how much they compress is very helpful. > Thanks for this info. At least with per-host stats, it's easier to > narrow down where to run du if I need to, instead of over the entire > backup partition. > > A couple of random questions: > > 1. How well does BackupPC work when you manually make changes to the > pool behind its back? (like removing a host, or some of the host's > history, via the command line). Can you make it "resync/repair" its > database? Removing hosts or individual backups doesn't affect the pool, and in my experience, this works just fine. Although I would advise against doing it, simply because you never know exactly what might get stuffed up.... I've had a remote client rename about 10G of images, so I simply did a cp -al from the previous full backup into the current partial (aborted full) backup, and then continued the full backup. It then noticed all the old filenames were gone, found the new filenames were already downloaded (hardlinked really), and continued on nicely. I've also deleted individual files (vmware disk image files, dvd images, etc) and not had a problem. Of course, if you are going to do things like that, you should try and use the tools that have recently been written to help do this properly. > 2) Is there a recommended approach for "backing up" BackupPC databases? > In case they go corrupt and so on. Or is a simple rsync safe? Stop backuppc, umount the partition, and use dd to copy to another partition; or else use RAID1 with three members, stop backuppc, umount, remove a member, and you have your backup. Rsync *should* work fine for smaller pools/numbers of files, as long as you have lots of RAM on both ends.... Eventually, you will get a pool size (number of files) where it will stop working... > 3) Is it possible to use BackupPC's logic on the command-line, with a > bunch of command-line arguments, without setting up config files? No, not really. > That would be awesome for scripting and so on, for people who want to > use just parts of its logic (like the pooled system for instance), > rather than the entire backup system. I tend to prefer that kind of > "unix tool" design. You really sound like a programmer <EG> (yes I have read the rest of your post already)... After configuring backuppc, there are some things you can do to basically cancel out all the automated features of backuppc and just use its pieces manually.
Though I think if you actually used backuppc normally first, you would be unlikely to want to do this. >> Of course, but you do it by starting with a smaller number of runs than >> you expect to be able to hold. Then after you see that the space >> consumed is staying stable you can adjust the amount of history to keep. > > Ah right. I think this is a fundamental difference in approach. With > the backup systems I've used before, space usage is going to keep > growing forever, until you take steps to fix it. Either manually, or > by some kind of scripting, and so far I haven't added scripting, so I > rely on du to know where to manually recover space. > > Basically, I was using rdiff-backup for along time. That tool keeps > all the history, until you run it with a command-line argument to > prune the oldest revisions. You specify in advance how many incremental and full backups you want, what period you want to keep them on, etc. Then backuppc *can* automatically prune the relevant backups to keep what you have asked for. One specific point is that you can keep your daily (incremental) backups for the past month, then every second one for two months, and all fulls (weekly) for the past 6 months, every 4th full for the past two years, etc... > And also, I don't see a great need to pro-actively recover space most > of the time. The large majority of servers/users/etc have a relatively > small amount of change. So it's kind of cool to be able to get *any* > of the earlier daily snapshots, for the last few years. I never recover space on any of my backuppc servers either, but sometimes I increase the number of backups I want to keep :) Yes, some things are cool, but they are rarely useful... However, I have one customer whose backuppc server keeps *every* backup it has ever completed, and that has been running for over 3 years now. > Although ironically, the servers with the largest amount of churn (and > harddrive usage on backup server), are the ones you'd actually want to > keep old versions for (like yearlies, monthlies, etc). But with > rdiff-backup, that isn't really possible without some major repo > surgery :-). You end up throwing away all the oldest versions when > space runs low. Which is the problem with those tools. Sometimes you want to keep the backup from 7 years ago, but you don't really need every daily backup for the past 7 years! This is where backuppc is quite helpful... > Also, I'm influenced by revision control tools, like git/svn/etc. I > don't like to throw away old versions, unless it's really necessary. When it is necessary, do you want to always throw away the oldest version though ? > And, if you have a lot of harddrive space on the backup server, then > may as well actually make use of it, to store as many versions as > possible. And then only remove oldest versions where needed. Again, you might not want to remove the oldest, you might want to remove some of the in between backups... > The above backup philosophy (based partly on rdiff-backup limitations) > has served me well so far, but I guess I need to unlearn some of it, > particularly if I want to use a hardlink-based backup system. Or just get more disk space... > If rsync is used, then what is the difference between an incremental > and a full backup? Basically, the full will read every file on the client and backuppc server, and compare checksums. The incremental will skip this full checksum comparison. > ie, do "full" backups copy all the data over (if using rsync), or > just the changed files? 
No, both full and incremental will only transfer the modified portions of the modified files (if using rsync). > And, what kind of disadvantage is there if you only do (rsync-based) > incrementals and don't ever make full backups? In the older versions (which my above client started with, and this is the config I started with), an incremental backup would compare the remote client with the last *full* backup, so over time, you needed to transfer more and more data over the network. In current versions, you can backup compared to the last incremental of a lower level (not sure how many levels you can get, but you can do [0,1,0,0,2,1,1,3,2,2,4,3,3,5,4,4,6] etc.. or whatever you like... not sure how many entries can be included there. After working out how this affected backuppc (along with the huge amount of extra work to "fill in" the backups in the web interface), I just configured full backups every 3 days. The only real difference between a full and incremental is the amount of IO load and CPU load on the client (and backuppc server), and hence the time it takes to complete a backup. You really should schedule a regular full backup anyway. Also, another reason for regular full backups is so you don't need to keep every full backup, you can drop every second (or every fourth etc) backup to recover space... > My angle is that Linux sysadmins have certain tools they like to use, > and saying they can't use them effectively due to the backup > architecture is kind of problematic. It isn't that they can't be used... they are just slow, and there are more efficient methods to obtain the same information. I could use find or grep or du on my massive maildir's, but they suck and there are other methods to get some of the answers I need, other times, I have to use du/find/etc... > Probably I need to think more about using a more traditional scheme > (keep a fixed number of backups, X daily, Y weekly, Z monthly, etc), > instead of "keep versions forever, until you need to start recovering > harddrive space". You can still keep versions forever, just set the keepcnt values to very high values... 15 years, or 50 years, etc... The difference is with backuppc you have more flexibility on *which* backups you remove to recover space... Consider the common case of a growing log file, you backup every day, and the file is rotated each month. So, you have 30 versions of the same file, yet you don't really need 29 of them since all the data is included in the last/30th one... etc.. lots of examples I'm sure you can think of :) > But the problem I see is this: > > (From BackupPC docs) > > "Therefore, every file in the pool will have at least 2 hard links > (one for the pool file and one for the backup file below > __TOPDIR__/pc). Identical files from different backups or PCs will all > be linked to the same file. When old backups are deleted, some files > in the pool might only have one link. BackupPC_nightly checks the > entire pool and removes all files that have only a single link, > thereby recovering the storage for that file." > > Therefore, if you want to keep tonnes of history (like, every day for > the past 3 years), for a server with lots of files, then it sounds > like you need to actually have a huge number of filesystem entries. Yes, but is that a problem? With 5 hosts being backed up, I have 401 full backups, and 3303 incremental backups, using 36TB of storage prior to pooling and compression. (ie, if we didn't have hardlinks or compression). 
We have approx 1.9M unique files in the pool using only 680GB of disk space. I'm not sure how to calculate the actual number of inodes used... (df -i doesn't seem to work as we are using reiserfs, I'm sure you would get major issues doing this on ext2/3 etc..) > I think if I wanted to use BackupPC, and still be able to use du and > friends effectively, I'd need to do some combination of: > > 1) Use incrementals for most of the backups, to limit the number of > hardlinks created, as Les Mikesell described. > > 2) Stop trying to keep history for every single day for years (rather > keep 1 for the last X days, last Y weeks, Z months, etc). or just be more patient with how long those tools take to run, and realise that they might stop working one day if your pool/etc gets too big... > This would also mean having to spend less time managing space. > Although at the moment it only comes up every few weeks/months, and > had been pretty fast with du & xdiskusage, at least until I switched > over from rdiff-backup to a "make a hardlink snapshot every day" > process :-(. or just get more disk space :) > And furthermore, hardlink-based storage does cause ambiguous du > output, even if the time it took to run wasn't an issue. Which is > another thing about hardlink-based backups which annoys me (compared > to when I was using rdiff-backup), and one of the reasons why I'm > currently running my own very hackish "de-duping" script on our backup > server. Or is it that you don't know the right tool for this job which annoys you (a little sarcasm :)... > Nice that BackupPC maintains these stats separately. Although kind of > annoying (imo), that you have to go through it's frontend to see this > info, rather than being able to tell from standard linux commands (for > scripting purposes and so on). As far as I know, the format of the files this information is stored in is well documented, and as such you could write scripts to your hearts content to read/parse this simple text files, and get any information you desire... > And also it bothers me that those kind of stats can potentially go out > of synch with the harddrive (maybe you delete part of the pool by > mistake). Ummm, don't make mistakes :) or if you do, fix the stats... > Is there a way to make BackupPC "repair" it's database, by re-scanning > it's pool? Or some kind of recommended procedure for fixing problems > like this? I am pretty sure there is no such tools... you either live with it until the relevant backups are purged, or you manually stuff around, potentially making the problem even worse (ie, messing it up in a way that you don't know you have messed it up, as opposed to knowing it is wrong). >> As a side note are you letting available space dictate you retention >> policy? It sounds like you don't want to fund the retention policiy >> you've specified otherwise you wouldn't be out of disk space. Buy >> more disk or reduce your retention numbers for backups. > And since we have a fairly large backup server (compared to the > servers being backed up), I let the older backups build up for a while > to take advantage of the space, and then free a chunk of space > manually when the scripts email me about space issues. > > But now I can't "free a chunk of space manually" that easily any more, > since "du" doesn't work :-(. rm -rf TopDir/pc/host/nnn where nnn is a random incr backup number or a full backup which no remaining incr relies on it seems to work pretty well. 
Though I'd advise adjusting the values in the config file and letting backuppc purge the backups itself. > Well, the good news is that nobody here seems to care about the > backups much, until the moment they're needed. The fact we have them > at all is kind of a bonus D:. At least I'm starting to get the boss > (we're a pretty small company) on my side. Just that nobody besides > myself has time to work on things like this. Once you lose all the data, everybody will have plenty of time :) You can't afford not to have good backups! (But hey, *we* all know that....) One other thing that should be considered: the point of using backuppc is that lots of other people use it, and have checked that there are no bugs etc in it. As such, we are somewhat certain that we will get back the correct data as long as we treat it correctly (don't fiddle with its storage behind its back)... Home-grown scripts/programs can be hugely rewarding/etc, but you will never get the same reliability/certainty about the software. Of course, you also have to write all the improvements yourself, instead of just downloading the new version that someone else was nice enough to write for you :) > PS: Random question: Does backuppc have tools for making offsite, > offline backups? Like copying a subset of the recent BackupPC backups > over to a set of external drives (in encrypted format) and then taking > the drives home or something like that. Yes, you can archive backups... One of my customers plugs in an esata drive, crontab runs a script to mount the drive, create the tar files of the most recent backups onto a staging (internal raid array) area, delete the files from the external disk, and then copy the new tar files onto the esata, and finally delete the files from the staging area... Lots of checks/etc to make sure we are doing the correct things, and alerts (or OK's) are reported back to the monitoring system as needed. > Or alternately, are there recommended tools for this? I made a script > for this, but want to see how people here usually handle this. This is where custom scripts/plugins are best utilised. A single program can't determine the possible needs of every user.... :) I hope the above information is useful to you; please note it is just my wordy opinion, and probably hardly worth the electrons used to display it. Please recycle them thoughtfully... Regards, Adam
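A rough sketch of the kind of offsite routine Adam describes: tar the newest backup of each host into a staging area, encrypt it, and move it to an external drive. The paths, passphrase file, and the tar+gpg combination are assumptions for illustration; this is neither Adam's script nor BackupPC's archive-host feature.

    import os, shutil, subprocess

    STAGING = "/var/backups/staging"
    EXTERNAL = "/mnt/esata"                     # already-mounted external drive
    PASSFILE = "/root/offsite.passphrase"       # hypothetical symmetric key file

    def archive_host(host_dir, name):
        tarball = os.path.join(STAGING, name + ".tar.gz")
        encrypted = tarball + ".gpg"
        # 1. tar up the latest backup generation (dirs assumed date-named)
        latest = sorted(os.listdir(host_dir))[-1]
        subprocess.check_call(["tar", "-czf", tarball, "-C", host_dir, latest])
        # 2. symmetric AES encryption so the drive can safely go offsite
        subprocess.check_call(["gpg", "--batch", "--yes", "--symmetric",
                               "--cipher-algo", "AES256",
                               "--passphrase-file", PASSFILE,
                               "-o", encrypted, tarball])
        os.unlink(tarball)
        # 3. move the encrypted archive onto the external drive
        shutil.move(encrypted, os.path.join(EXTERNAL, os.path.basename(encrypted)))

    for host in os.listdir("/backups"):
        path = os.path.join("/backups", host)
        if os.path.isdir(path):
            archive_host(path, host)

A real version would add the mount/unmount handling, free-space checks, and monitoring alerts Adam mentions.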
From: Jim W. <pr...@gm...> - 2009-08-19 12:18:49
On 8/19/09, David <wiz...@gm...> wrote: > 2) Is there a recommended approache for "backing up" BackupPC databases? > In case they go corrupt and so on. Or is a simple rsync safe? ... > PS: Random question: Does backuppc have tools for making offsite, > offline backups? Like copying a subset of the recent BackupPC backups > over to a set of external drives (in encrypted format) and then taking > the drives home or something like that. Hi again David - HashBackup could be used to backup your BackupPC server. Basically, you'd just hook an external USB drive to your server and take a backup. I've tested it with directories containing millions of files with 32000 hard links to each file on a Linux box with only 1GB of memory, so it scales very well and doesn't have rsync's memory problems. The backup is AES encrypted, and with a 3-4 line config file containing userids and passwords, you can send your encrypted backup offsite to Amazon S3, FTP servers, or remote ssh accounts. The beta site is: http://sites.google.com/site/hashbackup Jim |
From: Jim L. <tr...@ol...> - 2009-08-19 13:57:45
David wrote:
> And like I said before, this isn't a BackupPC-specific complaint, more a general problem with hardlink-based backup systems (as opposed to rdiffs, or various other schemes). So I'm checking how sysadmins typically handle these kinds of issues.

Typically, we exclude such application-specific trees from updatedb and other tree-traversal processes. Another best practice is to put the pool on its own filesystem/storage.

> Anyway, hopefully the above will give you a better idea of my angle on this. I'm not trying to criticize BackupPC, but rather figure out what kind of backup scheme is going to work here (and be easy to admin/diagnose/hack/etc), whether it is BackupPC, or something else (that may or may not use hardlinks).

I didn't think open-source backup systems that pool storage even existed before I found BackupPC, so if you're concerned about storage pooling, you only have to worry about BackupPC. Only commercial solutions do that (i.e. Connected Dataprotector), and that is one of the reasons I installed BackupPC last week and am doing my best to commit to it, since it can result in a huge savings win if you're backing up more than 5 machines with identical OSes/configurations.

-- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/
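On many Linux distributions, the exclusion Jim mentions is a one-line change to PRUNEPATHS in /etc/updatedb.conf. The exact file and variable name can vary by distribution and locate implementation, so treat this sketch as an assumption to verify locally:

    import re

    CONF = "/etc/updatedb.conf"
    EXCLUDE = "/backups"          # example mount point for the backup pool

    with open(CONF) as f:
        text = f.read()

    # Append the backup mount point to PRUNEPATHS if it is not already listed.
    m = re.search(r'^PRUNEPATHS="([^"]*)"', text, re.MULTILINE)
    if m and EXCLUDE not in m.group(1).split():
        new_line = 'PRUNEPATHS="%s %s"' % (m.group(1), EXCLUDE)
        text = text[:m.start()] + new_line + text[m.end():]
        with open(CONF, "w") as f:
            f.write(text)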
From: Les M. <les...@gm...> - 2009-08-19 14:02:10
David wrote: > > I haven't actually used BackupPC yet, mainly read through it's docs, > and trying to judge how well it and it's storage system would work in > our environment. Why not set up a test machine? It is trivial to install, especially if you use the ubuntu package or the one from the epel repository on RHEL or Centos. > Also, I'm not too experienced with backup "best practices", > methodologies, etc. Still learning, and seeing what works best. Backuppc is very configurable (browse though the docs and note all the settings you can change) but the defaults are pretty good so you can get reasonable results without changing much. >> Why not just exclude the _TOPDIR_ - or the mount point if this is on its >> own filesystem? >> > > Because most of the interesting files on the backup server (at least > in my case), are the files being backed up. I'm a lot more interested > in being able to quickly find those files, than random stuff under > /etc, /usr, etc. Backuppc provides a web interface for easy browsing, so if you know where something was on the original target you can find it easily. It does mangle the filenames and compress the contents so it is harder - but not impossible to work directly with the filesystem. Where it is appropriate, you can assign 'owners' of the target hosts so they can control and access them directly so you don't have to be involved. > 1. How well does BackupPC work when you manually make changes to the > pool behind it's back? (like removing a host, or some of the host's > history, via the command line). Can you make it "resync/repair" it's > database? Forcing a 'full' run will fix about anything. There are some tricks to keep the stats right - and I think someone on the list has a script to do things cleanly. But, drastic measures like that are rarely necessary because you can control expiration on a per-host basis and normally it takes care of itself. > 2) Is there a recommended approache for "backing up" BackupPC databases? > > In case they go corrupt and so on. Or is a simple rsync safe? This is a big issue. Up to a certain size (depending mostly on the number of files and amount of RAM you have), rsync -H will work, but there are limits. Image copies of the partition will always work. Personally I like to keep the archive small enough to fit on a single disk (so 2 TB or less these days) and raid-mirror to a swappable drive. > 3) Is it possible to use BackupPC's logic on the command-line, with a > bunch of command-line arguments, without setting up config files? It does have command line tools. But they are less convenient than letting the system work as designed. > That would be awesome for scripting and so on, for people who want to > use just parts of it's logic (like the pooled system for instance), > rather than the entire backup system. I tend to prefer that kind of > "unix tool" design. It's all in perl. If you want to change something you might as well do it in the base script... > Ah right. I think this is a fundamental difference in approach. With > the backup systems I've used before, space usage is going to keep > growing forever, until you take steps to fix it. Either manually, or > by some kind of scripting, and so far I haven't added scripting, so I > rely on du to know where to manually recover space. Expiration is designed in and tunable - per host. > And, if you have a lot of harddrive space on the backup server, then > may as well actually make use of it, to store as many versions as > possible. 
And then only remove oldest versions where needed. > > The above backup philosophy (based partly on rdiff-backup limitations) > has served me well so far, but I guess I need to unlearn some of it, > particularly if I want to use a hardlink-based backup system. There is also an 'archive host' concept to generate a fairly standard tar archive out of the backup for one or more of your targets - or you can do it with the command line tool. For really long term storage that is a better approach since you can restore it without any special programs - but you lose the space-sharing storage. > AA couple of questions, pardon my noobiness: > > If rsync is used, then what is the difference between an incremental > and a full backup? > > ie, do "full" backups copy all the data over (if using rsync), or > just the changed files? Fulls add the --ignore-times option to the run and re-reads everything on the target for a block-checksum comparison in addition to rebuilding the backup tree completely. > And, what kind of disadvantage is there if you only do (rsync-based) > incrementals and don't ever make full backups? Unless you do incremental 'levels', each incremental is based on the previous full so you end up copying more and more each run. > My angle is that Linux sysadmins have certain tools they like to use, > and saying they can't use them effectively due to the backup > architecture is kind of problematic. You get over that quickly when you have a system that takes care of itself. > 2) Stop trying to keep history for every single day for years (rather > keep 1 for the last X days, last Y weeks, Z months, etc). You can do an 'exponential' series to keep some old copies but they get farther apart as they get older. but it is better to get the things that need to be kept forever into some sort of version control system so backing up the current version of its repository lets you reconstruct the past. Then let the rest expire. > And also it bothers me that those kind of stats can potentially go out > of synch with the harddrive (maybe you delete part of the pool by > mistake). > > Is there a way to make BackupPC "repair" it's database, by re-scanning > it's pool? Or some kind of recommended procedure for fixing problems > like this? I think this happens nightly. > PS: Random question: Does backuppc have tools for making offsite, > offline backups? Like copying a subset of the recent BackupPC backups > over to a set of external drives (in encrypted format) and then taking > the drives home or something like that. > > Or alternately, are there recommended tools for this? I made a script > for this, but want to see how people here usually handle this. Image copies always work, rsync sometimes works. Even better is to just run another independent instance remotely and let it take care of itself. -- Les Mikesell les...@gm... |
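For a home-grown rsync snapshot scheme, the full-versus-incremental distinction Les describes corresponds roughly to how much re-verification you ask rsync to do on files that look unchanged. The sketch below is not BackupPC's transfer code (it uses its own rsync implementation in Perl); host names and paths are examples, and --checksum is used here as the verification knob analogous to the --ignore-times behaviour Les mentions.

    import datetime, os, subprocess

    def snapshot(host, src, dest_root, prev_snapshot=None, full=False):
        """Pull one hardlink snapshot of host:src with rsync. With --link-dest,
        files unchanged since the previous snapshot become hardlinks to it
        rather than new copies."""
        dest = os.path.join(dest_root, host, datetime.date.today().isoformat())
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        cmd = ["rsync", "-a", "--delete", "--numeric-ids"]
        if prev_snapshot:
            cmd.append("--link-dest=" + prev_snapshot)
        if full:
            # "full"-style run: verify even apparently unchanged files by
            # checksum instead of trusting size and mtime alone
            cmd.append("--checksum")
        cmd += [host + ":" + src.rstrip("/") + "/", dest + "/"]
        subprocess.check_call(cmd)
        return dest

    # snapshot("server1", "/srv/data", "/backups",
    #          prev_snapshot="/backups/server1/2009-08-18", full=False)

The trade-off is exactly the one discussed in the thread: the checksum pass catches files whose timestamps lie, at the cost of reading everything on the client again.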
From: David <wiz...@gm...> - 2009-08-20 08:52:06
|
Thanks for the replies so far :-) They were very informative. About BackupPC itself, I'm still evaluating whether or not to actually use it, but I'm starting to decide against it. Here are my reasons: 1) We're not backing up a lot of machines with a huge amount of duplicate data between machines. Just about every server and user's data is different. The common stuff between servers (/usr/, etc) isn't that large (the vast majority of storage is for unique data). For user machines, we back up their user folders, not the entire C: drive. Pooling common files from different machines isn't a priority. 2) A big problem is user dbx files, 2 GB and the like - I don't want to store multiple copies of those. Actually a reverse diff-based approach works a lot better imo. My current backup system lets me define which kind of "history storage" system to use for backups. rdiff-backup (for most places, where it works), and hardlinks, for servers which cause problems with rdiff-backup (although that led to my current problems with du & locate, which I'm currently researching). I might add more "history storage" systems if I find something more appropriate later (eg HashBackup or gibak), or write my own. I lose that kind of flexibility if I change to most "fully integrated, prepackaged" backup systems (like BackupPC and most others), as opposed to command-line tools which you can script and mix & match to get a backup system that works best for your setup. Which is also why I asked earlier about the ability to mix and match parts of BackupPC separately :-) Yeah, there are downsides to home-brewed stuff, and I prefer to use premade stuff most of the time. But when none of the existing stuff matches my needs, I won't hesitate to throw together scripts that do it better (for my needs), by scripting command-line tools, or writing new tools and then calling them. That's the unixy way :-). You can read more here: http://en.wikipedia.org/wiki/Unix_philosophy#Mike_Gancarz:_The_UNIX_Philosophy 3) The incremental/full system of BackupPC is bothersome. I don't want to copy over full servers later, after the initial rsync (or if I do, relatively infrequently, like once a month). I actually do want most of the backups to be incremental (ie, how rsync does it in hardlink snapshot-like schemes). But a lot can change during incrementals, and dealing with those multiple incremental levels seems kind of annoying, although I'm sure there are good reasons for them. 4) Possibly lots of redundant config in the text files. This is very minor in the grand scheme of things, but it's a pet peeve of mine with backup systems I've seen in general. Every single server or user backup etc has the complete backup details in the config files, even if they are 99% similar. This violates the "DRY" (Don't Repeat Yourself) programming principle (yeah, I treat backup configuration as a programming exercise :-) ). Like if you have 40 servers, then it looks like you need to define all the details for all servers, rather than just defining the parts that changed per-server. In my own config, adding a new server to the backup config is as simple as adding one line like this to my server backups config file: bkp('192.168.0.2 complete backup') # Router This basically adds the complete specification for a backup to a list of backups to be run (after all the backup configs are loaded into memory, and filtered according to the command-line arguments passed to the main backup script, which itself is run daily from cron).
Earlier in the config file for server backups, there is a Python class definition where you define the details, kind of like a template. And those classes can also inherit from other Python classes, to customize a few details, or take advantage of other Python programming constructs. Also, passwords are stored separately, in a secure text file, using a ~/.pgpass-like format that supports wildcards for individual fields (for those of you familiar with PostgreSQL). eg: rsync:192.168.0.2::root:rrbackups:gib5Gryn (gib5Gryn is a password I just generated with apg, for this example) Although, these types of config files are more oriented toward people who prefer to edit text files directly (eg, programmers like myself :-) ) and understand how classes, inheritance, and other programming-related things work, rather than going through a web frontend. And adding a templating-type system can introduce more complexity by itself. Web frontends like BackupPC's are probably a lot more usable in general though, especially for non-programmers :-) David. |
From: Les M. <les...@gm...> - 2009-08-20 12:48:57
|
David wrote: > [... full quote of the previous message snipped ...]
So far the only thing your evaluation is actually right about is that BackupPC isn't great at handling large files that have small changes in each run - although if they are compressible it may still be a win compared to other approaches. It would save everyone a lot of time if you just tried it instead of guessing about the way it works and assuming it is wrong. (For example, the config files only have per-host differences and inherit all other values from the master, and when you add a new host in the web interface you can tell it to copy an existing host config so you don't even have to retype that part.) And everything is just Perl snippets that you can hand-edit if you prefer. -- Les Mikesell les...@gm... |
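For illustration, a per-host override in the pc/ directory only needs to name the values that differ from the master config; everything else is inherited. A minimal sketch (the hostname, shares, and install path are made up for the example):

    # pc/router.pl - anything not set here comes from the master config.pl
    $Conf{XferMethod}     = 'rsync';
    $Conf{RsyncShareName} = ['/etc', '/home'];
    $Conf{FullKeepCnt}    = 2;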
From: Carl W. S. <ch...@re...> - 2009-08-20 13:17:22
|
On 08/20 10:51 , David wrote: > 3) The incremental/full system of BackupPC is bothersome. They're not really 'full' backups in the traditional sense when you use rsync as the transport mechanism. The 'incremental levels' are a one-option configuration setting, and there's an example provided in the config file. > 4) Possibly lots of redundant config in the text files Nope. You only set what you need changing in the per-host config file. BackupPC isn't appropriate for all backup situations, but it works pretty darned well. Try it. Read through config.pl and it will all make much more sense. -- Carl Soderstrom Systems Administrator Real-Time Enterprises www.real-time.com |
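For reference, the 'one-option' setting Carl mentions is $Conf{IncrLevels}. A hedged example (values illustrative; the shipped config.pl carries a commented example with the exact semantics) that makes each day's incremental diff against the previous incremental instead of always against the last full:

    $Conf{IncrPeriod} = 0.97;                  # roughly daily incrementals
    $Conf{IncrLevels} = [1, 2, 3, 4, 5, 6];    # multi-level incrementals between fulls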
From: John R. <rou...@re...> - 2009-08-20 16:38:18
|
On Thu, Aug 20, 2009 at 08:17:10AM -0500, Carl Wilhelm Soderstrom wrote: > On 08/20 10:51 , David wrote: > > 4) Possibly lots of redundant config in the text files > > nope. you only set what you need changing in the per-host config file. Well not quite. It's getting better with $Conf{RsyncArgsExtra} for example. I don't have to copy the whole $Conf{RsyncArgs} stanza into my pc/hostname.pl file. If I want to append to or remove a particular entry from $Conf{RsyncShareName} in config.pl, I have to manually copy the current definition into pc/hostname.pl because I can't directly affect that definition. If I add a new entry to the config.pl copy, it doesn't propagate to the other hosts. Now that being said, I generate the config files as part of the CM system used to manage my systems. It would certainly be possible to eliminate the redundancy to a large extent by using filepp, make etc. -- -- rouilj John Rouillard System Administrator Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111 |
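One way to get that kind of reuse without filepp or make is a small generator that writes minimal pc/*.pl override files from shared recipes, along the lines of what John describes doing from his CM system. A rough, untested sketch - the group names, shares, and the /etc/backuppc/pc path are all invented for the example:

    #!/usr/bin/perl
    # gen-host-configs.pl - emit per-host override files from shared group recipes
    use strict;
    use warnings;

    my %recipe = (
        webserver => { shares => ['/etc', '/var/www'],       fulls => 4 },
        dbserver  => { shares => ['/etc', '/var/lib/pgsql'], fulls => 8 },
    );
    my %hosts = ( web01 => 'webserver', web02 => 'webserver', db01 => 'dbserver' );

    for my $host (sort keys %hosts) {
        my $r = $recipe{ $hosts{$host} };
        open my $fh, '>', "/etc/backuppc/pc/$host.pl" or die "$host: $!";
        print $fh "# generated from group '$hosts{$host}' - do not edit by hand\n";
        print $fh "\$Conf{XferMethod} = 'rsync';\n";
        print $fh "\$Conf{FullKeepCnt} = $r->{fulls};\n";
        print $fh "\$Conf{RsyncShareName} = ["
                . join(', ', map { "'$_'" } @{ $r->{shares} }) . "];\n";
        close $fh;
    }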
From: David <wiz...@gm...> - 2009-08-20 18:07:46
|
By the way, sorry for the tone of my previous mail. I realized afterwards that I came across as condescending. I think I get a bit too obsessed with an idea or mindset sometimes. Thanks for bearing with my noob questions and attitude. And I do need to study BackupPC more before making ignorant assumptions :-( For my immediate problem, I'm probably going to switch back to rdiff-backup, but give it the --no-hard-links option, to reduce memory usage. I totally missed that before. And I seriously don't need to preserve /usr/bin/ etc hardlinks, or I can generate a list of files to hardlink together (after restoration) separately. Longer term, probably switching over to BackupPC is better. I'll probably start by migrating some of the backups in the near future. Still not too sure about the DBX mail files; I'll have to consider that further. Maybe it's not as big an issue as I think it's going to be, especially if I'm not keeping every single daily version of DBX files for the past X years. David. |
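For the 'generate a list of files to hardlink together' idea, a rough, untested sketch of the sort of helper meant here (the output format is just one possibility): it walks a tree and prints one line per group of paths that share an inode, so the links could be recreated after a restore.

    #!/usr/bin/perl
    # list-hardlink-groups.pl [dir] - print groups of paths that share an inode
    use strict;
    use warnings;
    use File::Find;

    my %group;    # "device:inode" => [paths]
    find({ no_chdir => 1, wanted => sub {
        my ($dev, $ino, $mode, $nlink) = lstat($_);
        return unless defined $nlink && -f _ && $nlink > 1;
        push @{ $group{"$dev:$ino"} }, $_;
    } }, $ARGV[0] // '.');

    # Only groups where more than one link falls inside the scanned tree matter.
    for my $paths (values %group) {
        print join("\t", @$paths), "\n" if @$paths > 1;
    }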
From: Jim L. <tr...@ol...> - 2009-08-20 20:15:39
|
John Rouillard wrote: > Well not quite. It's getting better with $Conf{RsyncArgsExtra} for > example. I don't have to copy the whole $Conf{RsyncArgs} stanza into > my pc/hostname.pl file. I haven't ever touched/created a hostname.pl file -- I've done everything in the GUI with "override" checked for a particular host's configuration. It creates very neat and tidy hostname.pl files for me, with only the options I've overridden. -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |
From: John R. <rou...@re...> - 2009-08-20 21:55:46
|
On Thu, Aug 20, 2009 at 03:14:49PM -0500, Jim Leonard wrote: > John Rouillard wrote: > > Well not quite. It's getting better with $Conf{RsyncArgsExtra} for > > example. I don't have to copy the whole $Conf{RsyncArgs} stanza into > > my pc/hostname.pl file. > > I haven't ever touched/created a hostname.pl file -- I've done > everything in the GUI with "override" checked for a particular host's > configuration. It creates very neat and tidy hostname.pl files for me, That doesn't scale well when you are running a couple of hundred hosts across three different backup servers and you have standard backup recipes for particular services on those hosts. When you change the services so that new backups have to be added (i.e. change recipes), or you move services and the configs have to change, it's a lot easier to have a single set of hostname.pl files to distribute to all the backup servers. Doing it this way means not going "oops, you mean we didn't have an off-site backup of that filesystem?" and makes auditing on a regular basis (say weekly) possible. -- -- rouilj John Rouillard System Administrator Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111 |
From: Les M. <les...@gm...> - 2009-08-20 22:33:50
|
John Rouillard wrote: > On Thu, Aug 20, 2009 at 03:14:49PM -0500, Jim Leonard wrote: >> John Rouillard wrote: >>> Well not quite. It's getting better with $Conf{RsyncArgsExtra} for >>> example. I don't have to copy the whole $Conf{RsyncArgs} stanza into >>> my pc/hostname.pl file. >> I haven't ever touched/created a hostname.pl file -- I've done >> everything in the GUI with "override" checked for a particular host's >> configuration. It creates very neat and tidy hostname.pl files for me, > > That doesn't scale well when you are running a couple of hundred hosts > across three different backup servers and you have standard backup > recipes for particular services on those hosts. Unless you get 'owners' to go with the hosts... > When you change the services so that new backups have to be added > (i.e. change recipes), or you move services and the configs have to > change, it's a lot easier to have a single set of hostname.pl files to > distribute to all the backup servers. Doing it this way means not > going "oops, you mean we didn't have an off-site backup of that > filesystem?" and makes auditing on a regular basis (say weekly) > possible. I think it would be a little nicer if there were another layer of inheritance - like a group config file that could be evaluated between the master and per-host configs - so you could control settings that are common to several machines in one place. But auditing should probably be done against the archive filesystem instead of the configs. -- Les Mikesell les...@gm... |
From: Jim L. <tr...@ol...> - 2009-08-21 00:49:55
|
John Rouillard wrote: > That doesn't scale well when you are running a couple of hundred hosts > across three different backup servers and you have standard backup > recipies for particular services on those hosts. Using the GUI doesn't scale well, no, but it is not a requirement to include a copy of the entire config.pl for every host. If you want to do so, nothing is stopping you, but it's not required, which was my point. -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |
From: Michael S. <ms...@ch...> - 2009-08-20 14:47:36
|
> Thanks for the replies so far :-) They were very informative. > About BackupPC itself, I'm still evaluating whether or not to actually > use it, but I'm starting to decide against it. Here are my reasons: Not that I'm trying to sway your opinion either way, but since the majority of your analysis, though detailed, is steeped in ignorance, you're projecting the impression of somebody who has trouble changing paradigms. Not that there's really anything wrong with that; not everybody thinks flexibly, and it's not always useful to do so. I wouldn't use BackupPC to back up my Oracle data, since stepping outside Oracle's backup and recovery paradigm is generally a bad idea. On the other hand, I can't imagine inventing my own overly-complicated system for backing up Outlook Express files unless I really had nothing better to do and there was some kind of biblical passage commanding me not to purchase a cheap 500G drive whenever necessary and stop being a pain in the ass. While I appreciate your brief-albeit-misplaced-and-weirdly-patronizing lecture on the Unix philosophy, I'd recommend starting with your Backup and Recovery goals and priorities. I'd suggest that manual space management and diddling around with low-level tools probably shouldn't be at the top of your list, since for many people in the US, it only takes a few hours of their time to equal the cost of a terabyte of storage. Your mileage may vary. I'll briefly outline my own priorities for a backup system: 1) It must be reliable 2) Files to recover must be less than 24 hours out of date 3) Recovery must be simple 4) It must take very little time and effort to maintain #1 implies a great deal, including sanity checking, notifications, awareness of free space, and so on. Having done a few bare-metal restores and considerably more registry and spot file recovery, I can say without question, and as a professional programmer, that I really do not want to have to worry about writing and maintaining all that by myself. |
From: dan <dan...@gm...> - 2009-08-23 03:25:11
|
Unfortunately, every backup option you have has some limitations or imperfections. Hardlinks have their pros and cons. Really, there are only a few ways of doing incremental managed backups: hardlinks, diff files, diff file lists, and SQL. Hardlinks are nice because they are inexpensive: looking at the directory contents of a backup that uses hard links requires no extra overhead. Diff files and diff file lists (the first being where a diff is taken of each individual file and only the changes are stored, a diff file list being where only the files that have changed are stored) require an algorithm to recurse the other directories that hold the real data and overlay the backup on the previous one. The only option that is more efficient than hardlinks would really be storing files in SQL and also storing an MD5, then linking the rows in SQL. Very similar to a hardlink, but instead it's just a row pointer. This would be many times faster than doing hardlinks in a filesystem, because SQL selects in a hierarchy based on significant data. It would be like BackupPC only having one host with one backup on it when you are looking at the web interface; all the other hosts and backups etc are already excluded. SQL file storage for BackupPC has been discussed extensively on this list, and suffice it to say that opinions are very split, and for good reason. SQL (MySQL specifically, but it applies to all) is much, much better at some tasks than a traditional filesystem (searching for data! - orders of magnitude faster), but a filesystem is also much, much better at simply storing files. Some hybrid could take the pros of each, such as storing all of the pointer data in MySQL and storing the actual files under their MD5 names on a filesystem: simply MD5 a file, push the MD5 off to MySQL with the host and backup date, filename, and file path, and write the file to the filesystem. Incremental backups would MD5 a file and search the database for the MD5; if found, write a pointer to that entry, and if not, write a new entry with the MD5 of the file, the hostname, file path and file name, and the backup number (or date). All the files would just be stored under their MD5 name. Recovering the files would be less transparent, but would only require an SQL query to pull the list of files based on hostname and backup number and then pull those files, renamed, into a zip or tar file. |
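A very rough sketch of the hybrid described above - hash the file, record a pointer row in a database, and store the content once under its digest. Everything here is untested and illustrative: the table layout, paths, and use of SQLite are assumptions, and (as later replies note) you would want a stronger digest than MD5 alone in practice:

    #!/usr/bin/perl
    # pool-file.pl <host> <backupnum> <file> - hybrid SQL-index / filesystem-pool sketch
    use strict;
    use warnings;
    use DBI;
    use Digest::MD5;
    use File::Copy qw(copy);

    my ($host, $backupnum, $file) = @ARGV;
    my $pooldir = '/var/backups/pool';     # content-addressed file store (assumed path)
    my $dbh = DBI->connect('dbi:SQLite:dbname=/var/backups/index.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS files
              (host TEXT, backupnum INTEGER, path TEXT, digest TEXT)');

    # Hash the file contents; the digest becomes the pool filename.
    open my $fh, '<', $file or die "$file: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    # Store the content only if this digest has not been seen before.
    copy($file, "$pooldir/$digest") unless -e "$pooldir/$digest";

    # Record the pointer row; a restore selects by host + backupnum and copies back.
    $dbh->do('INSERT INTO files (host, backupnum, path, digest) VALUES (?, ?, ?, ?)',
             undef, $host, $backupnum, $file, $digest);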
From: Michael S. <ms...@ch...> - 2009-08-23 05:50:08
|
> Unfortunately, every backup option you have has some limitations or > imperfections. Speaking of imperfections, your scheme doesn't take into account that multiple files can (and will) have the same MD5 hashes. > The only option that is more efficient than hardlinks would really be > storing files in SQL and also storing an MD5, then linking the rows in > SQL. One wonders what measure of efficiency you're using, but frankly, I fail to believe that there are only two ways of storing matching files as few times as possible. Hashed pointers to a pseudo-directory system come to mind immediately, a scheme that's in use in at least one system. At any rate, what problem are we trying to address? |
From: dan <dan...@gm...> - 2009-08-23 16:30:27
|
Speed. BackupPC is constrained by I/O performance; a bottleneck on the system is that the storage volume must be a single filesystem due to hardlinks. It has been measured a number of times on this mailing list that I/O is the major bottleneck for BackupPC. Getting faster hardware certainly helps, but the reliance on a single filesystem for all data is a bottleneck for performance, as well as an irritation when upgrading storage, as you either need to add additional raid arrays (since expanding a raid is not generally an option) or just use JBOD with LVM or something. Not ideal. My solution is to break the backup scheme into smaller chunks and have a number of BackupPC servers handling a set number of clients. The issues here are complexity, as I need to admin a number of servers, and loss of the file de-duping. In my organization, like many others, each client will have absolutely identical files. 4 backup machines means that a massive amount of data is duplicated 4 times PLUS whatever redundancy is in the raid. A hybrid platform can use the filesystem's strengths and a database's strengths and not have most of the weaknesses. My example was a simplistic one. Sure, MD5 can have some collisions, so either do MD5+SHA1 or just do SHA2. You would need to store a few more pieces of data, but I think it would be hard to argue against MySQL being many orders of magnitude faster at finding data than a filesystem, just like it is hard to argue against a filesystem being many times faster at simply storing files, and even faster at storing large files. Other benefits of the hybrid system are that the files can be on different volumes than the database. In fact, because you store the file's location on disk in the database, you could store files on many different disks, with no issues with hardlinks. Because of this, you could put two BackupPC machines together in a cluster, and each instance of BackupPC would look at the same database (or replicated data in its own database) and be able to do online replication of the file store on other servers. They could automatically duplicate these files on their own local file store, and because there are not millions of hardlinks to worry about, rsync can actually be useful in syncing up file stores to other BackupPC machines. Sure, you will still have a lot of files, but a lot fewer files for rsync to track. rsync can handle a lot of files. With BackupPC, rsync actually has to track every instance of every file from each host and each backup number, plus the pool. Without the hardlink pooling, rsync would only have to see each file once. |
From: Jim L. <tr...@ol...> - 2009-08-24 00:57:50
|
dan wrote: > Speed. BackupPC is constrained by I/O performance; a bottleneck on > the system is that the storage volume must be a single filesystem due to > hardlinks. Then use a better filesystem. I run BackupPC on an OpenSolaris system that uses ZFS as the storage pool, and I/O is the *last* of my worries (since the box is an older machine with only a single processor, CPU usage is my main worry, as File::RsyncP is not as efficient as binary rsync). > that I/O is the major bottleneck for BackupPC. Getting faster hardware > certainly helps but the reliance on a single filesystem for all data is > a bottleneck for performance as well as an irritation when upgrading > storage as you either need to add additional raid arrays (as expanding a > raid is not generally an option) or just use JBOD with LVM or > something. Like I said, use a more appropriate filesystem. Use ZFS, JFS, or XFS (or Reiser), but not ext2/3, as those are jokes when it comes to performance. > My solution is to break the backup scheme into smaller chunks and have a > number of BackupPC servers handling a set number of clients. The issues > here are complexity as I need to admin a number of servers and loss of > the file de-duping. In my organization like many others, each client > will have absolutely identical files. 4 backup machines means that a > massive amount of data is duplicated 4 times PLUS whatever redundancy is > in the raid. Keep in mind that BackupPC has a limited scope -- small to medium-sized organizations. If you have over 100 clients to back up, it is expected that you will run multiple BackupPC servers. If you have more than 500+ clients to back up, it is expected that you will invest in a commercial solution designed for that kind of enterprise. > Other benefits of the hybrid system are that the files can be on > different volumes than the database. In fact, because you store the > file's location on disk in the database, you could store files on many > different disks, with no issues with hardlinks. If this is your point, then it's somewhat valid in that you are arguing for a system where the storage is modular. There's nothing wrong with that, but that's not the scope of BackupPC. BackupPC's core strength, one that no other open-source backup solution has, is pooling of like data, and that is the reason I've implemented it. If you want a system where the back-end storage is modular, choose Amanda or Bacula. -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |
From: Jim L. <tr...@ol...> - 2009-08-24 01:17:28
|
Jim Leonard wrote: > (since the box is an older machine with only a single processor, CPU > usage is my main worry as File::RsyncP is not as efficient as binary rsync). Actually, since BackupPC_dump does a lot more than just emulate rsync, this was not a fair statement. However, it is a fair statement to complain that BackupPC_dump is not multi-threaded, which would really help on multi-CPU systems (the copy could be one thread, the comparison another, the compression a third, etc.). Hopefully that's on the development roadmap? -- Jim Leonard (tr...@ol...) http://www.oldskool.org/ Help our electronic games project: http://www.mobygames.com/ Or check out some trippy MindCandy at http://www.mindcandydvd.com/ A child borne of the home computer wars: http://trixter.wordpress.com/ |