
Repair damaged files *in place*

2006-11-05
2013-05-23
  • Nobody/Anonymous

    When repairing a bad file, it looks like the damaged file is renamed, and then a new file is created with the correct data.  Do any of the PAR2 clients try to repair damaged files directly, without creating (even temporarily) a duplicate file?

    If recovering a file basically implies that a second copy of the file needs to be made on the drive, this becomes problematic for very large files.  For example, imagine a 10GB file with 3MB blocks where 2 blocks are bad.  Why do we need to create a 2nd 10GB file on the drive to complete the repair?

    It seems like it should be possible to seek around the existing file and rewrite just the bad blocks directly... wouldn't this be significantly faster too?  I'm not an expert in the Reed-Solomon algorithm, so I don't know if there's something about how the algorithm works that would preclude this...
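
    To make the idea concrete, here is a minimal sketch (Python, purely hypothetical, and not how any existing PAR2 client works) of the kind of repair I'm imagining: assume the damaged file is already the expected length, the indexes of the bad blocks are known from verification, and the reconstructed bytes are already in memory.

        import os

        def overwrite_bad_blocks(path, block_size, repaired_blocks):
            # repaired_blocks maps PAR2 block index -> reconstructed bytes.
            # Only the damaged blocks are written; the rest of the file is
            # never copied or rewritten.
            with open(path, "r+b") as f:
                for index, data in sorted(repaired_blocks.items()):
                    f.seek(index * block_size)
                    f.write(data)
                f.flush()
                os.fsync(f.fileno())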

    Thanks,
    --Jeremy

     
    • S. Jamison Wilde

      One of the main issues with this is that the data can be out of place, not just corrupt.  In particular, many newsreaders do not add padding for missing parts when they join the pieces of each article.  This is why in some newsreaders you will see that one file is smaller than all of the others that aren't incomplete.  The problem is that, in order for the rest of the file not to become unusable past the point where the piece is missing, the par2 algorithm has to go byte by byte looking for the next recognizable block until it finds one, and then finish processing the file from there.

      Doing in-place processing in this situation would require copying out all of the data past the corruption anyway.  While on average that would be, say, half the file size in extra space, it adds a lot of complexity and could actually increase file IO overall.  With a normal par2 repair you are looking at roughly 3x (maybe 4x) the data size in IO: 1x to scan it, 1x to read while repairing, and 1x to write the data.  In the most likely in-place scenario, you would need to read the data once, copy out the parts after the corruption, then read the good part plus the copied-out parts, and so on... The copy-out is extra file IO.

      It may be technically possible to do all of that in place (extend the file to its expected length, process backwards to find the last block of valid data, and copy that data into its correct location further along in the file), but it would require essentially reading the file backwards, which is not ideal on most hardware.  You would basically incur a seek operation for every sector read, which would destroy drive performance, IMO.
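
      As a rough sketch of that backwards pass (purely illustrative, in Python, and assuming a single missing piece of known size at a known offset), note that every chunk is a read at one position followed by a write at a later one, i.e. constant seeking:

          def shift_tail_in_place(path, gap_offset, gap_length, expected_size):
              # The file is short because gap_length bytes are missing at
              # gap_offset.  Grow the file, then move everything after the gap
              # towards the end, copying backwards so nothing is overwritten
              # before it has been read.  The stale gap_length bytes left at
              # gap_offset must then be reconstructed from recovery blocks.
              CHUNK = 1 << 20  # 1 MiB copy chunks (arbitrary)
              with open(path, "r+b") as f:
                  old_size = f.seek(0, 2)        # current (short) length
                  f.truncate(expected_size)      # extend with zeros
                  remaining = old_size - gap_offset
                  while remaining > 0:
                      n = min(CHUNK, remaining)
                      src = gap_offset + remaining - n
                      f.seek(src)
                      chunk = f.read(n)
                      f.seek(src + gap_length)
                      f.write(chunk)
                      remaining -= n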

       
      • Nobody/Anonymous

        Interesting technical challenge.

        I still believe the feature is worth looking into.  I'd like to see the day when transferring 10GB of data doesn't require splitting the source into 100 different files just so that it can be efficiently reconstructed on the other side.

        I had been wondering about what the average newsreader would do when asked to 'assemble incompletes'.  Thanks for raising this point.  Padding would seem like the right thing to do but that's not really our place to say.  We would have to deal with either case for the technology to be useful.

        The trade-off between disk IO and disk space is definitely important.  Even if repair in place would add IOs, I think it's better to be able to repair that 10GB file than to have to start clearing space on the drive to make room for the repair to even happen.

        Back to the point you made about padding.  Padding or no padding, it seems all the same to me.  Why would you have to copy out all of the data that was out of place, rather than just inserting the reconstructed data into the file?  The filesystem will allow you to insert data into a file without physically moving any data on the hard drive, so converting a non-padded file into a padded file is a very fast operation.  Better still, it's not as though we would have to insert null padding first and then fill it in; we would only do the insert once the real data was known.  While this is guaranteed to add at least one fragment to the resulting file, making a copy of a large file could cause significantly worse fragmentation if the drive is filling up.
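
        To illustrate what I mean by asking the filesystem to make room instead of copying data: at least some filesystems do expose exactly this kind of operation.  The sketch below is only an illustration of the idea (not something Quickpar does today); it assumes a Linux filesystem such as ext4 or XFS that supports the FALLOC_FL_INSERT_RANGE mode of fallocate(2), and offsets/lengths that are multiples of the filesystem block size.

            import ctypes, ctypes.util, os

            # From <linux/falloc.h>: shift the file's extents in the allocation
            # metadata to open up a zero-filled range, without rewriting the
            # data that follows it.  Filesystem support is assumed.
            FALLOC_FL_INSERT_RANGE = 0x20

            _libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
            _libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                                        ctypes.c_longlong, ctypes.c_longlong]

            def insert_gap(path, offset, length):
                # offset and length must be multiples of the filesystem block size
                fd = os.open(path, os.O_RDWR)
                try:
                    if _libc.fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, length) != 0:
                        err = ctypes.get_errno()
                        raise OSError(err, os.strerror(err))
                finally:
                    os.close(fd)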

        I just watched PAR2 run a repair on about 7GB of data I had just downloaded, split across 69 files.  17 of the files were damaged, with a total of 22 damaged blocks between them.  All of the damaged files were 99,999,999 bytes instead of the expected 100,000,000 bytes.  The first step of the repair created 17 new files, each 100,000,000 bytes long.  This step took no observable time, since no actual data was written; the files were simply allocated to be 100MB.  A simplistic implementation of in-place repair could just delete any corrupt blocks (if they existed) from the source files and insert the correct number of bytes into the file, with the same performance characteristic -- i.e. no writes to the actual data, just changes to the allocation tables.  At that point recovery could continue in place with a common algorithm, without ever worrying about non-padded files.

        In the example above, I had 2,625 blocks of 2,688,000 bytes each.  The current algorithm completely rewrote the 17 bad files, at a cost of 100MB each.  If only the bad blocks had been overwritten, we would only have had to write about 58MB to the drive instead of 1.7GB, and total disk usage would have remained constant instead of spiking temporarily by 1.7GB.
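
        (For anyone checking the arithmetic, the comparison in a couple of lines of Python:)

            block_size = 2688000            # bytes per PAR2 block in this set
            bad_blocks = 22                 # damaged blocks across the 17 files
            bad_files  = 17
            file_size  = 100000000          # bytes per file

            print(bad_files * file_size)    # 1,700,000,000 bytes rewritten today (~1.7GB)
            print(bad_blocks * block_size)  #    59,136,000 bytes if only bad blocks (~58MB)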

        As long as PAR2 'blocks' correspond with contiguous sections of data in a file, I think this should be achievable without adding significant complexity, and should in all cases decrease the number of IOs required for repair.

         
    • Nobody/Anonymous

      Maybe that would be technically possible, but then Quickpar would need knowledge of the filesystem built in.  Because QP is not created specifically for any one OS, it has to rely on standard APIs.  Creating new files and copying data are basic operations handled by the OS.  I can imagine that not every filesystem supports manipulating allocation tables that way to 'insert' bytes into a file.

      Also keep in mind that splitting files is not only for recovery purposes.  In the case of Usenet posting it is also needed for uploading with multiple threads, and multiple files are useful as 'checkpoints'.  I'd pull my hair out if a 10 GB single-file upload went wrong somewhere along the line and I had to start all over.

      To be honest, if you're working with single 10GB files, you really shouldn't need to free up disk space. Get a bigger hard drive.

       
    • Nobody/Anonymous

      Oh, one additional comment about decreasing disk IO for repair.

      I believe, after a little testing with a 700 MB data set, that the files which Quickpar has to re-generate really have *no* impact on repair time. You're waiting because of CPU time and reading valid data, not because of IO on these new files.

      I split a single 700 MB ISO image into 50 MB RAR files (13 parts).
      Then I created a PAR2 recovery set that allows recovery of at least 2 parts (let's say 15%).

      First, I mangled a single RAR file so that one block was damaged (one byte missing); I just deleted a single byte.  Repairing takes 27 seconds on my system.

      Now, after repairing, just delete an entire file.  Repairing now takes 57 seconds.
      With two files missing, repair time rises to over 2 minutes, of which 1m 48s is CPU time!

      But here's the key point: copying a single 50 MB file (on the same drive) takes less than 2 seconds!

      I think this shows several things:
      - In both cases, about 25 seconds is spent reading the other intact blocks. This is linear time, depending on the size of the recovery set. This hurts with a 10GB recovery set, but there's not much we can do.
      - In case one entire file is missing, about 30 seconds are spent on reconstructing the data from parity information.
      - Repair time grows very quickly if more data blocks are missing/corrupt. This is CPU-intensive.
      - In both cases only 2, maybe 3, seconds are spent on file IO to the re-created file(s).

      By the way, if disk space is a potential problem, then how do you extract the archives?

      I'll shut up now.

       
    • Nobody/Anonymous

      A little thought experiment: say I've got a file taking up 1000 blocks on the filesystem, and I then delete every _even_ byte from it.  How many blocks will the file now take up on the filesystem?  For most (nearly all?) filesystems the answer will be (about) 500; in other words, it would get copied/moved to a new set of filesystem blocks.  Now take the "in place" idea: instead of occupying just 500 blocks, this "in place" edited file would take up 1000 blocks, and the filesystem would "somehow" have to track those blocks as an on/off pattern of data and garbage bytes...  Rather complicated and inefficient.  In other words, even if you designed a specialized par2_repair_in_place filesystem you probably wouldn't win much...

      Other reasons _not_ to do "in place" repair: if your par2 client misbehaves because of bad RAM, a bad par2 set, etc., you'd lose your (possibly correct) source files if they don't get backed up to .1 files.

       
    • Nobody/Anonymous

      It's good to see discussion on the topic!

      I saw a few points made:

        - OS support for random file access

           I think this is a great point -- I'm not sure if PAR2 would work on tape-drive-type media today (my guess is not, due to its non-sequential, read-intensive nature), but an in-place repair definitely would not work.  In any case, this feature would not exclude the current mode of operation, so on an unnamed OS that could not support it, the feature could simply be disabled.

         - Splitting files prevents things from 'going wrong somewhere'

           I'm not sure why this would be true.  In the case of Usenet, downloading incompletes is of course a requirement, particularly with PAR2.  We can tell from the 'PAR1 was better' crowd that downloading incompletes is obviously harder than it sounds, but I'm not sure if this is the right place to bring it up.  One could argue that if files were NOT split, people would be forced to download incompletes and then finally we would be rid of all the 'PAR1 is better' folks.

         - You should always have 2x the disk space for the file you downloaded, a.k.a buy a bigger hard drive

           Not sure this is helpful input.  We're talking about improving the algorithm to consume less temporary disk space.  This is clearly a Good Thing, particularly if it is possible with the same or better performance characteristics.  Please also note the earlier point about the excess drive fragmentation caused by the temporary spike in disk utilization.

        - There is no impact on disk IO performance

           I think the experiment is flawed.  Deleting an entire file isn't really testing the performance metric we are interested in.  Take a 1GB file and split it into 10 100MB files, then create a 1000-block repair set for both the single file and the 10-file set.  Defragment your hard drive.  Then create exactly one one-byte error in the 1GB file and one in just one of the 100MB files, and measure the difference in repair time.  That would get us a lot closer to seeing the performance difference, I believe.  I would expect the 1GB file to take longer to repair, specifically because of the extra time spent recreating the file.

           In any case, the point is that if we can achieve EQUAL or better performance, the feature is worth adding for the benefit of not needing all the temporary space.  My opinion is that the feature would actually increase performance, but since I haven't rolled up my sleeves, implemented it, and profiled it, I'll shut up on that.

          - You need 2x the space to extract the file anyway

           This may be true in some cases.  But in other cases, for example HDTV content as .ts files, the time to RAR and UNRAR is prohibitive and the byte savings are too small to bother with a second round of compression, since the MPEG4 content is already highly compressed.  I.e. not every file that is run through PAR2 is split and compressed, and we shouldn't assume it is.

          - PAR2 could damage the source files

           I believe the default QP options do not keep backups of the files that are repaired, so this is as much an issue today as it would be with this feature.  As always, the feature could be turned off anyway.  Personally I think this is a non-issue in both cases, since you could never recover a file with the wrong PAR set (if you were looking at the wrong file, every block would look 'wrong').

      I can't follow the last point that was made about 500 vs. 1000 blocks.  If you have a 1000-block file that you want to repair, the repaired file will take 1000 blocks.  I think it's possible you have misunderstood the proposed feature.  To follow the example you gave correctly:

         Original source file: 1000 blocks
         Delete every other byte: 500 blocks
        
         Current recovery method:
            1. Existing data: 500 blocks + recovery
            2. Allocate 1000 blocks on the drive
            3. Recover into new file using existing data and recovery
            Max Used Blocks during Recovery: 1500 + recovery

         Proposed recovery method:
            1. Existing data: 500 blocks + recovery
            2. Insert missing blocks in place (+500 blocks)
            3. Recover into old file using existing data and recovery
            Max Used Blocks during Recovery: 1000 + recovery

         You can see that the maximum number of blocks used by the current method will always be higher than with the proposed method; higher by exactly the amount of existing data on the drive when recovery starts.  This is actually a high price to pay in order to recover a file.  Obviously not *too* high a price, since PAR2 is widely used, but all the same.
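
         A toy calculation of that peak-usage comparison (Python, made-up block counts; 'existing' is however much of the damaged data is already on disk):

             def peak_blocks(total, existing, recovery, in_place):
                 # Peak filesystem blocks in use during repair, ignoring
                 # slack space and metadata.
                 if in_place:
                     return total + recovery           # file patched where it sits
                 return existing + total + recovery    # old data + freshly written copy

             total, existing, recovery = 1000, 500, 150   # hypothetical numbers
             print(peak_blocks(total, existing, recovery, in_place=False))  # 1650
             print(peak_blocks(total, existing, recovery, in_place=True))   # 1150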

      In summary:

         - The existing method may be more compatible with some unnamed OS's.  Agreed that the existing method should never be phased out.

         - The proposed method could damage source files.  I believe the existing method with the default QP settings has exactly the same likelihood of damaging source files.  As above, the option to repair into a new file and keep backups of the broken files should never be phased out.

      Just so everyone understands my motivations, one of the reasons I brought this up is after reading this in the 'AboutPAR2' page:

      "PAR files can be generated from a single source file without the need to split it. On UseNet, this removes the need to use RAR or any other file splitter. Please note however, that due to the limitations of some newsreaders (which do not permit the download of incomplete files), it is advisable to use RAR to split very large files."

      We should be able to agree that if PAR2 requires 1x the file size in disk space to perform a repair, the statement "this removes the need to use RAR or any other file splitter" is not 100% true.

      Please consider all the use cases of an in-place repair before assuming that no one could use it.  All the video-smiths out there wrangling 10GB, even 100GB, files on a daily basis and looking for a way to protect their investment could benefit greatly from this feature.  Open your minds to the coming of HDTV content: at a bit rate of about 18Mbps we're talking 8GB/hr of data, on drives that we want to store hundreds of hours of material on.  For shipping these files around, PAR2 is a great way to repair any damage, but not without this feature.

      Once you get used to the idea of recovering uncompressed, unsplit video files, the really progressive feature is having the ability to recover the file in real-time... while playing it.

       
    • Nobody/Anonymous

      > - Splitting files prevents things from 'going wrong somewhere'
      >
      >I'm not sure why this would be true. In the case of Usenet, downloading incompletes
      >is of course a requirement, particularly with PAR2. We can tell from the 'PAR1 was
      >better' crowd that downloading incompletes is obviously harder than it sounds, but
      >I'm not sure if this is the right place to bring it up. One could argue that if files
      >were NOT split, people would be forced to download incompletes and then finally we
      >would be rid of all the 'PAR1 is better' folks.

      OK, will you tell everyone to set aside several gigabytes of disk space for their newsreader/grabber program as temp space to decode and re-assemble those unsplit DVDs... ;)

      > - PAR2 could damage the source files
      >
      >I believe the default QP options do not save the backups of the files that are
      >repaired, so this is as much as an issue today as it would be with this feature. As
      >always, this feature could be turned off anyway. Personally I think this is a
      >non-issue in both cases, since you could never recover a file with the wrong PAR set
      >(if you were looking at the wrong file, every block would look 'wrong').

      - A poster creates a bad RAR set and creates a par2 set for it, notices the bad RAR set and re-RARs the file(s), but forgets to create a new par2 set and posts the RAR and par2 sets.  The par2 set now repairs to the bad RAR set.  (I _know_ this has happened.)
      - The poster hit a known (unfixable) bug in the par2 algorithm and created a bad par2 set (one that IIRC will mess up perfectly fine source files).  Read the forums/mailing-list archives/Quickpar forums to find multiple mentions of this bug.
      - The poster has bad RAM or other hardware/software problems that cause a bad par2 set to be created, one that damages source files when used to repair.
        <sarcasm>Hey, if we wanted to optimize stuff a bit, we could also strip out Quickpar's check for bad memory and par2cmdline's "verify after repair".</sarcasm>
        Way too many corrupt files are already floating around the internet for me to just trust that a download and repair is 100% OK; if I had the time, I'd compare every video/audio file against a known good checksum from the creator of that file, or run it through a very strict decoder to check for damaged data...

      >The last point that was made about 500 vs. 1000 blocks I can't understand. If you
      >have a 1000 block file that you want to repair, the repaired file will take 1000
      >blocks. I think it's possible you have misunderstood the proposed feature.

      No, you seem to have missed the word "filesystem".
      Tell me one filesystem where a cluster/block _in the middle_ of a file is allowed to contain fewer data bytes than the filesystem's cluster/block size.

      Using the (theoretical) "edit in place" filesystem and my example of editing out
      every even byte of the source file:

      > Proposed recovery method:
      >1. Existing data: 500 blocks + recovery
      Nope: 1000 blocks/clusters + recovery, plus at least 1 bit per byte of overhead to indicate whether it's a data byte or a garbage byte (for simplicity's sake I'll assume 1 block/cluster can only hold data from _1_ file, or we'd get even more overhead [1]).

      >2. Insert missing blocks in place (+500 blocks)
      +0 blocks on an "edit in place" filesystem

      >3. Recover into old file using existing data and recovery
      >Max Used Blocks during Recovery: 1000 + recovery

      Perhaps such a filesystem could have a tiny bit of use for par2, but for other
      purposes it would be a nightmare because of the increased disk and cpu overhead,
      since you're essentially switching over from a block/cluster based filesystem
      to managing it on the byte level.   (which is about as nuts as making a par2 set
      with a blocksize of 10 bytes...)

      [1] Though some filesystems, such as Reiser IIRC, use the free space at the end of a file's last block/cluster to hold very small files, so that, for example, a 10-byte file won't take up an extra disk block/cluster...

       
    • S. Jamison Wilde

      PAR2 is obviously used extensively by the newsreader community, which probably makes up the vast majority of its users.  I think this has been the driving force behind its features and functionality.

      With that in mind, one reason that has been mentioned for splitting files has nothing to do with you GETTING the file, but everything to do with POSTING it.  Many news servers cap a single connection at a certain speed, and by using multiple connections to the server you get much better performance when posting.  Also, in general the files posted on news servers are not the original source of those files; many were originally acquired through FTP and/or IRC (DCC), and in those cases there may very well be good reasons to split the files up... In many cases this is also historical, since PAR/PAR2 has only been around for a few years and split files were used before that to reduce the amount of bandwidth needed to 'fix' a bad file.

      I certainly agree that in-place repair would be useful on a 100GB HD file in the limited situation where the file is complete in size but has corrupted data internally.  In any scenario where the file is smaller than the original size, I just can't see it being efficient given the added complexity.  As there isn't much active development going on, and since the source tree won't compile on Windows anymore because of the POSIX threading that was added, I doubt a major change like this will be happening.

      It is doubtful that people working in high-end video with uncompressed HD/SDI etc. are going to have problems with temp disk space, so I think it is a pretty manufactured scenario.  At home I have 2 TB of disk storage and I am always running out too, falling behind on burning and cleaning up, etc.  I don't necessarily think that in-place repair is going to suddenly make all that go away when I can download 40GB a day.

       

       
