From: Jeffrey J. K. <bac...@ko...> - 2009-08-30 05:08:37
dan wrote at about 21:24:56 -0600 on Saturday, August 22, 2009:
> Unfortunately, every backup option you have has some limitations or
> imperfections. Hardlinks have their pros and cons. Really, there are
> only a few ways of doing incremental managed backups: hardlinks, diff
> files, diff file lists, and SQL. Hardlinks are nice because they are
> inexpensive: looking at the directory contents of a backup that uses
> hardlinks adds no overhead, since each link is an ordinary directory
> entry. Diff files and diff file lists (a diff file stores only the
> changes to each individual file, while a diff file list stores only
> the files that have changed) require an algorithm that recurses
> through the other directories holding the real data and overlays each
> backup on the previous one.
>
> The only option that is more efficient than hardlinks would really be
> storing files in SQL along with an MD5, then linking the rows in SQL.
> That is very similar to a hardlink, but instead it's just a row
> pointer. It would be many times faster than doing hardlinks in a
> filesystem, because SQL selects in a hierarchy based on significant
> data. It would be like BackupPC only having one host with one backup
> on it when you are looking at the web interface; all the other hosts
> and backups are already excluded.
>
> SQL file storage for BackupPC has been discussed extensively on this
> list, and suffice it to say that opinions are very split, for good
> reason. SQL (MySQL specifically, but this applies to all of them) is
> much better at some tasks than a traditional filesystem (searching
> for data: orders of magnitude faster), but a filesystem is much
> better at simply storing files. A hybrid could take the pros of each,
> such as storing all of the pointer data in MySQL and storing the
> actual files under their MD5 names on a filesystem: simply MD5 a
> file, push the MD5 off to MySQL with the host, backup date, filename,
> and file path, and write the file to the filesystem. Incremental
> backups would MD5 a file and search the database for that MD5; if
> found, write a pointer to the existing entry, and if not, write a new
> entry with the file's MD5, the hostname, the file path and name, and
> the backup number (or date). All the files would just be stored under
> their MD5 names. Recovering files would be less transparent, but
> would only require an SQL query to pull the list of files for a given
> hostname and backup number, and then pull those files, renamed, into
> a zip or tar file.

That is exactly the hybrid that I have been advocating... But as you
mentioned, some like it and some don't... (Sketches of that scheme,
and of the du problems David raises, are at the bottom of this
message.)

> On Mon, Aug 17, 2009 at 5:52 AM, David <wiz...@gm...> wrote:
>
> > Hi there.
> >
> > Firstly, this isn't a backuppc-specific question, but it is of
> > relevance to backuppc users (due to the BackupPC architecture), so
> > there might be people here with insight on the subject (or maybe
> > someone can point me to a more relevant project or mailing list).
> >
> > My problem is as follows... with backup systems based on complete
> > hardlink-based snapshots, you often end up with a large number of
> > hardlinks: at least one per server file, per backup generation.
> >
> > Now, this is fine most of the time... but there is a problem case
> > that comes up because of this.
> >
> > If the servers you're backing up themselves have a huge number of
> > files (hundreds of thousands, or even millions), you end up making
> > a huge number of hardlinks on your backup server for each backup
> > generation.
> >
> > Although inefficient in some ways (using up a large number of
> > directory entries in the filesystem), this can work pretty nicely.
> >
> > Where the real problem comes in is if admins want to use 'updatedb'
> > or 'du' on the Linux system. updatedb builds a *huge* database and
> > uses up tonnes of cpu & ram (so I usually disable it). And 'du' can
> > take days to run and produce multi-GB output files.
> >
> > Here's a question for backuppc users (and people who use hardlink
> > snapshot-based backups in general)... when your backup server,
> > which has millions of hardlinks on it, is running low on space,
> > how do you correct this?
> >
> > The most obvious thing is to find which host's backups are taking
> > up the most space, and then remove some of the older generations.
> >
> > Normally the simplest method is to run a tool like 'du', and then
> > perhaps view the output in xdiskusage. (One interesting thing
> > about 'du' is that it's clever about hardlinks, so it doesn't
> > count the disk usage twice. I think it must keep an in-memory
> > table of visited inodes that have a link count of 2 or greater.)
> >
> > However, with a gazillion hardlinks, du takes forever to run and
> > produces massive output. In my case, about 3-4 days and a 4-5 GB
> > output file.
> >
> > My current setup is a basic hardlink snapshot-based backup scheme,
> > but backuppc (due to its pool structure, where hosts have
> > generations of hardlink snapshot dirs) would have the same
> > problems.
> >
> > How do people solve the above problem?
> >
> > (I also imagine that running "du" to check disk usage of backuppc
> > data is complicated by the backuppc pool, but at least you can
> > exclude the pool from the "du" scan to get more usable results.)
> >
> > My current fix is an ugly hack, where I go through my snapshot
> > backup generations (from oldest to newest) and remove all
> > redundant hardlinks (i.e. links that point to the same inodes as
> > the corresponding entries in the next-most-recent generation).
> > That info goes into a compressed text file, from which the links
> > could be restored later. Then I compare the next two most recent
> > generations, and so on.
> >
> > But yeah, that's a very ugly hack... I want to do it better and
> > not re-invent the wheel. I'm sure this kind of problem has been
> > solved before.
> >
> > fwiw, I was using rdiff-backup before. It's very du-friendly,
> > since only the differences between backup generations are stored
> > (rather than a large number of hardlinks). But I had to stop using
> > it, because with servers that have a huge number of files it uses
> > a huge amount of memory and cpu, and takes a really long time. The
> > mailing list wasn't very helpful with trying to fix this, so I had
> > to change to something new in order to keep running backups (with
> > history). That's when I changed over to a hardlink snapshot
> > approach, which has the other problems detailed above. My current
> > hack (removing all redundant hardlinks and empty dir structures)
> > is kind of similar to rdiff-backup, but coming from the other
> > direction.
> >
> > Thanks in advance for ideas and advice.
> >
> > David.
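
For what it's worth, here is a minimal sketch of the MD5-plus-SQL
hybrid dan describes, in Python, with sqlite3 standing in for MySQL.
Everything in it (the "entries" table, the pool and catalog paths, the
function names) is invented for illustration; this is not BackupPC's
actual schema or layout.

#!/usr/bin/env python3
# Sketch of the hybrid scheme: pointer data in SQL, file bodies stored
# on the filesystem under their MD5 names. Names and layout are
# hypothetical.
import hashlib
import os
import shutil
import sqlite3

POOL = "/var/backups/pool"        # file bodies, named by MD5 digest
DB = "/var/backups/catalog.db"    # pointer data (sqlite3 as stand-in)

def md5sum(path):
    """MD5 a file in chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_file(db, host, backup_num, path):
    """MD5 the file; if that digest is already pooled, only record a
    pointer row, otherwise copy the file body into the pool first."""
    digest = md5sum(path)
    pooled = os.path.join(POOL, digest)
    seen = db.execute("SELECT 1 FROM entries WHERE md5 = ? LIMIT 1",
                      (digest,)).fetchone()
    if seen is None and not os.path.exists(pooled):
        shutil.copy2(path, pooled)        # first copy: store the data
    db.execute("INSERT INTO entries (md5, host, backup_num, path) "
               "VALUES (?, ?, ?, ?)", (digest, host, backup_num, path))
    db.commit()

def restore_list(db, host, backup_num):
    """Recovery: one SQL query yields (original path, pooled file)
    pairs, ready to be renamed into a tar or zip archive."""
    rows = db.execute("SELECT path, md5 FROM entries "
                      "WHERE host = ? AND backup_num = ?",
                      (host, backup_num))
    return [(p, os.path.join(POOL, m)) for p, m in rows]

if __name__ == "__main__":
    os.makedirs(POOL, exist_ok=True)
    db = sqlite3.connect(DB)
    db.execute("CREATE TABLE IF NOT EXISTS entries ("
               "md5 TEXT, host TEXT, backup_num INTEGER, path TEXT)")
    backup_file(db, "myhost", 1, "/etc/hostname")
    print(restore_list(db, "myhost", 1))

Indexes on md5 and on (host, backup_num) would be the first things to
add, to get the orders-of-magnitude-faster searches dan mentions.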
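
On David's du question: his guess about the in-memory table is
essentially how du avoids double-counting. A sketch of the idea,
assuming Linux (it relies on st_blocks) and nothing beyond the Python
standard library:

# Avoid counting hardlinked data twice: remember (device, inode) for
# anything with more than one link, and skip inodes already seen.
# This table is also exactly why memory use grows with the number of
# multiply-linked inodes on a backup server.
import os

def disk_usage(root):
    seen = set()                  # (st_dev, st_ino) of counted inodes
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue          # file vanished mid-scan
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue      # hardlink to already-counted data
                seen.add(key)
            total += st.st_blocks * 512   # allocated bytes, like du
    return total

With millions of multiply-linked inodes, the seen set alone gets very
large, which is consistent with the runtimes and RAM use David
describes.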
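
And David's pruning hack (dropping links that duplicate the
next-most-recent generation, after recording them so they could be
re-linked later) could look roughly like this. The side-by-side
generation-directory layout and the gzipped manifest format are my own
assumptions, and the empty-directory cleanup he mentions is omitted:

# Walk an older generation; wherever a path points at the same inode
# as the same relative path in the next-most-recent generation, record
# it in a gzipped manifest and unlink it. Restoring means re-linking
# each manifest entry from the newer generation. Both generations are
# assumed to live on one filesystem.
import gzip
import os

def prune_generation(old_gen, newer_gen, manifest_path):
    with gzip.open(manifest_path, "wt") as manifest:
        for dirpath, _dirnames, filenames in os.walk(old_gen):
            rel = os.path.relpath(dirpath, old_gen)
            for name in filenames:
                old_file = os.path.join(dirpath, name)
                new_file = os.path.join(newer_gen, rel, name)
                try:
                    same = (os.lstat(old_file).st_ino ==
                            os.lstat(new_file).st_ino)
                except OSError:
                    continue         # path absent in newer generation
                if same:
                    entry = os.path.normpath(os.path.join(rel, name))
                    manifest.write(entry + "\n")
                    os.unlink(old_file)   # redundant link, drop it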