
Backup solution with par2

Hugo Rune
2004-02-03
2004-02-04
  • Hugo Rune

    Hugo Rune - 2004-02-03

    Hi

    I am thinking about ways to back up large amounts of data (several hard drives).

    With the classical backup solution you need roughly as much space for the backup as for the actual data (minus compression). So to back up four 120 GB hard disks regularly you would need four other 120 GB disks (or lots of CDs, or DVDs, or a tape writer).
    I know some people like to burn everything to CD constantly, but I can't get used to that.

    Now with PAR files there seems to be an alternative: by repeatedly grabbing a few files from each disk and generating a PAR set for them on a separate disk (and repeating this for all files), one would essentially need only one or two backup disks with PAR data to be safe against most failures. (These disks would have to be separated from the actual PC, by network or USB drive.)

    I know that this is essentially what a RAID system already does. However, I think a PAR-based solution would offer several advantages over a RAID-based one:
    (please correct me if I am wrong on any of these points)

    - With a RAID system all the disks have to be in the same PC (same controller?). Definitely no network drives and no USB drives possible, whether for the original data or the backup.

    - RAID is inflexible: you need exactly one disk that is at least as large as the largest data disk. If it is larger, then that space is lost; if it is smaller, then no RAID.
    No chance to use two disks for extra redundancy, no way to use only half a disk to achieve less redundancy at cheaper cost.

    - RAID is block-based, PAR is file-based. With RAID you cannot simply restore a few files after a crash or corruption; it's always the whole disk, and you need a new blank disk to restore to.

    - RAID always runs in the background. If files get corrupted, the RAID system could automatically overwrite the backup with the corrupted data.
    A PAR backup tool would only run on request (daily/weekly).

    - more points on request :)

    So, what would be necessary for a PAR backup system? Just a few scripts:
    - to index the data that should be backed up,
    - to grab a few files from each disk and run them through par2cmdline (a whole disk is too large to PAR, so it has to be done in small chunks),
    - to check which files have been changed/added/deleted and update the corresponding PAR files (incremental backups),
    - and, and, and...
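    A rough sketch of what the chunk-and-par script could look like (the drive inventories, paths, and chunk size here are all made up for illustration; the only real interface assumed is par2cmdline's `par2 create -r<percent>` syntax):

```python
import os

# Hypothetical inventories of the data disks; a real tool would get
# these from the indexing step mentioned above.
drives = {
    "/mnt/disk1": ["a.avi", "b.avi", "c.txt"],
    "/mnt/disk2": ["d.iso", "e.mp3"],
}

CHUNK = 2          # files per PAR2 set (a whole disk is too large)
REDUNDANCY = 10    # percent of recovery data per set

def par2_commands(drives, backup_dir="/mnt/backup"):
    """Build one par2cmdline invocation per chunk of files."""
    cmds = []
    set_no = 0
    for mount, files in sorted(drives.items()):
        for i in range(0, len(files), CHUNK):
            chunk = [os.path.join(mount, f) for f in files[i:i + CHUNK]]
            out = os.path.join(backup_dir, "set%04d.par2" % set_no)
            cmds.append(["par2", "create", "-r%d" % REDUNDANCY, out] + chunk)
            set_no += 1
    return cmds

for cmd in par2_commands(drives):
    print(" ".join(cmd))
```

    The commands are only generated, not executed; an actual tool would run them and record which files went into which set.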

    Has anybody thought of such a system already? Has anybody written something similar?

    If not, I think I will try to write it (which means a first alpha will be ready by late 2006 :-)

    Would anybody beside me even be interested in such a tool?

    Would anybody want to code for such a tool?

    Or is my idea total crap and only a fool would think of it?

    Please tell me your opinions!

    • Peter C

      Peter C - 2004-02-03

      I don't think you should be trying to use PAR2 as a means of doing a backup; you should be using PAR2 as a means of protecting your backup.

      i.e. You create PAR2 files from the data files you want to back up, and then you copy both the data files and the new PAR2 files to your backup media.

      If you store nothing but PAR2 files on your backup media, then you would need to use redundancy settings of over 100% (which would result in extremely long times to create the PAR2 files).

      • Hugo Rune

        Hugo Rune - 2004-02-03

        The point is this:
        if I have several hard drives, say four 120 GB disks, then to back them all up by normal means I would need 480 GB.
        With PAR2 I would just need a little over 120 GB, maybe 160.
        It is unlikely that more than one disk at a time would fail, so with 160 GB of parity data I could still restore all the original data.
        Of course for really important stuff that is not a safe assumption. Irreplaceable data should be burned to CD several times and stored far away from the PC.

        But usenet downloads, for example, are not exactly irreplaceable.
        Still, losing them because of a hard drive failure can be very annoying, and hard drives do fail sooner or later.

        So, while I may not be willing to invest the money for four additional hard drives just to back up downloads, I would gladly buy one more to keep the others moderately safe.

        Or do you think the time to create the PAR2 files would be too long for this?
        I don't want to PAR the whole drives at once, just one or a few files from each drive together at a time.

        • Peter C

          Peter C - 2004-02-04

          OK, your use of the term backup is slightly misleading then (as that implies protection against loss of all files on all drives).

          If you want to ensure that the drive containing PAR2 files will have sufficient information to fully reconstruct the data from one lost drive, you will need to do the following:

          1) Ensure that each PAR2 set uses an equal number of files from each drive.

          2) Ensure that each PAR2 set is for a set of files that are all similar in size.

          The first of these ensures that the total amount of PAR2 data will not be significantly greater than the size of the files on any of the drives.

          The second allows you to use larger block sizes without suffering a significant loss in efficiency.

          The real problem you will have is in managing the PAR2 files. For that you will need some software that can track which files have been modified on each drive and decide how to allocate the files from the drives to PAR2 files.
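          A minimal sketch of an allocator that follows both rules above, assuming the per-drive inventories (file names and sizes) come from some earlier indexing step; all names and sizes here are invented:

```python
# Hypothetical (name, size-in-MB) inventories per drive.
drives = {
    "disk1": [("a", 700), ("b", 100), ("c", 650)],
    "disk2": [("d", 120), ("e", 690)],
    "disk3": [("f", 710), ("g", 95), ("h", 640)],
}

def group_by_size_rank(drives):
    """Sort each drive's files by size, then take the k-th largest
    file from every drive to form the k-th PAR2 set.  Each set gets
    one file per drive (rule 1), and the files in a set are close in
    size (rule 2), so a large block size wastes little space."""
    ranked = {d: sorted(fs, key=lambda f: -f[1]) for d, fs in drives.items()}
    depth = max(len(fs) for fs in ranked.values())
    sets = []
    for k in range(depth):
        members = [(d, fs[k]) for d, fs in sorted(ranked.items()) if k < len(fs)]
        sets.append(members)
    return sets

for k, s in enumerate(group_by_size_rank(drives)):
    print(k, s)
```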

          • Hugo Rune

            Hugo Rune - 2004-02-04

            Thanks for your ideas. As you said, the main problem will be to assign files to PAR2 files and keep track of them. I had already thought about this for some time, and I came up with 1), but did not think of 2) until now. This really complicates things even more.

            I have to make an algorithm that assigns files to groups with approximately the same total size, the same individual sizes, and probably some more constraints. And all this in a way such that the addition of new files or the deletion of old files on the disks does not require total re-encoding of all PAR data because the assigned groups have changed. Hmmm...

            Looks like some sort of packing problem, which would be NP-hard, so only a heuristic approach would work. There are semi-good heuristic solutions to many packing problems, but even heuristics might run into trouble with the mass of files that are on modern drives.
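            For reference, the classic first-fit decreasing heuristic for this kind of packing problem is only a few lines (the sizes and capacity below are arbitrary illustration values):

```python
def first_fit_decreasing(sizes, capacity):
    """Classic FFD bin-packing heuristic: place each item (largest
    first) into the first bin it fits in, opening a new bin if none
    fits.  Known to stay within roughly 22% of the optimal bin count,
    which would be plenty for grouping files into PAR2 sets."""
    bins = []  # each bin is a list of item sizes
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

# Group file sizes (in MB) into PAR2 sets of at most 1000 MB each.
print(first_fit_decreasing([700, 650, 300, 250, 100, 95], 1000))
```

            The harder requirement, keeping groups stable when files come and go, is not addressed by FFD and would need extra bookkeeping on top.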

            I could tar groups of files together, then split the tarballs into equal-sized parts and always run par2 on one part from each drive. But that would kill most of the advantages of the file-based approach.
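            The splitting half of that tar-based variant is trivial to sketch (the par2 run over the resulting parts is omitted, and the byte stream here just stands in for a tar archive):

```python
def split_parts(data, part_size):
    """Split a byte string (e.g. a tar stream) into fixed-size parts;
    only the last part may be shorter.  One part from each drive's
    tarball would then go into one PAR2 set."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

parts = split_parts(b"x" * 2500, 1000)
print([len(p) for p in parts])
```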

            The more I think about it, the more problems turn up. Still, no reason to give up yet.
            Maybe I will still get a brilliant idea for an algorithm that could solve all this. Or maybe someone else will :)

    • Thomas Harold

      Thomas Harold - 2004-02-04

      Interesting idea (basically, a delayed-write RAID5), but probably not worth the effort. In order to recover from any one of the disks failing, you'll need at least that disk's amount of PAR2 data. So even if you have a 75+75+120+80+150 GB collection of disks, you're going to have to create 150+ GB worth of PAR2 data to cover the situation where the 150 GB disk fails.
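      That arithmetic as a two-line sketch, using the disk sizes from above:

```python
# To survive the failure of any single drive, the parity pool must be
# at least as large as the largest drive; compare that with the cost
# of a full mirror of everything.
disks = [75, 75, 120, 80, 150]  # sizes in GB
print(max(disks))  # minimum PAR2 data needed
print(sum(disks))  # storage needed for a full mirror instead
```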

      The other problem is that a PAR set requires that all files in the set get reset back to the original data. So if you have a single corrupted file, but some other files in the set were changed legitimately, doing a PAR repair is going to "fix" files that shouldn't be fixed.

      Oh, and RAID really isn't meant as a backup system. The job of RAID is to let you create really big disks out of smaller disks, and to give you protection against single-disk failures.

      Frankly, if you want to store 1 GB of data securely, you need 2-3 GB of storage space (or more if you want generational backups). There's really no way around that.

      The lowest cost is a single disk with periodic mirroring to a second drive. The advantages are that disk access/read/write will be quick and you can recover from accidental corruptions/deletions (if detected prior to the next mirroring). The downside is that data can be lost between scheduled mirrorings. Ratio is 2:1.

      Next best is a 1 GB drive hooked to a 1.5 GB drive, where you're still doing scheduled mirroring, but deleted files get moved to a trash folder, which allows you to recover up to 0.5 GB of older data (or you're doing a full/incremental backup approach). Ratio is 2.5:1.

      RAID5 for 1 GB of storage can be done with, say, 6x200 MB drives, with an external 1.5 GB drive for mirroring. Ratio is 2.7:1. Adding a 200 MB hot-spare drive takes the ratio to 2.9:1.

      For data that you really care about, you're talking RAID5 with hot-spare, along with multiple external storage drives; so 7x200 MB plus three 1.5 GB drives, for a ratio of 5.9:1.

      Now -- what would interest me is software that backs up to removable media (tape/DVD) and uses my GPG public key to encrypt the backups, as well as putting PAR2 data on the media so that I can recover from damaged media when doing a restore. Basically, placing some GPG'd archive files in the root of the media, with a PAR2 set calculated to recover/repair damaged .gpg files. Right now, I have to do that process by hand (splitting the archives into 4 GB sets and then PAR'ing).
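      That manual process can be sketched as a command generator (the key ID, archive names, output paths, and redundancy level are all made up; only gpg's `--encrypt --recipient --output` options and par2cmdline's `create -r` syntax are assumed):

```python
def media_backup_commands(archives, recipient="0xDEADBEEF", redundancy=15):
    """Encrypt each archive to a GPG public key, then compute one
    PAR2 set over the resulting .gpg files, ready to burn together
    onto the removable media."""
    cmds = []
    gpg_files = []
    for a in archives:
        out = a + ".gpg"
        cmds.append(["gpg", "--encrypt", "--recipient", recipient,
                     "--output", out, a])
        gpg_files.append(out)
    # One recovery set covering all encrypted archives on the media.
    cmds.append(["par2", "create", "-r%d" % redundancy,
                 "recovery.par2"] + gpg_files)
    return cmds

for c in media_backup_commands(["part1.tar", "part2.tar"]):
    print(" ".join(c))
```

      A real tool would also verify the burned media by running `par2 verify` against it afterwards.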

      • Hugo Rune

        Hugo Rune - 2004-02-04

        > In order to recover any one of the disks failing, you'll need at least that amount of PAR2 data to recover.
        > So even if you have a 75+75+120+80+150 collection of disks, you're going to have to create 150+ worth of PAR2 data to cover the situation where the 150 disk fails.

        In that case I'd just use the 150 GB disk for my PAR2 data, and add two more 80 GB disks for my real data :)
        And still, 150 GB of additional storage is better than 75+75+120+80+150.

        As I said, I would not use this for important stuff. To back something up reliably one needs the ratios you described, but most of my files are not worth that much. Given enough time, I could download/rip them all again. Still, I'd rather avoid that.

        So a way to decrease the chances of data loss (caused by a single-disk failure, accidental deletion, or data corruption) with only a low ratio like 0.25:1 would be a real improvement for me, even if it would not mean total security.

        Of course you may be right about the effort - it does seem like a lot of work, and I don't know yet if the result justifies it.

        Maybe I should get a DVD burner and use your approach. But I really hate CD-like storage; I just tend to lose the little things en masse :)

