
Unrecoverable files while recovering from a broken data disk

Help
2022-08-22
2022-08-23
  • Kevin Mychal M. Ong

    So this is the first time I've lost a disk in a 5-disk array (with 1 parity drive). The recovery process is currently ongoing and I'm seeing multiple unrecoverable files, though most are recovered.

    Here's what I have in the fix.log so far:

    https://pastebin.com/VTiiS06K

    They're mostly media files, which is not that big of a deal, but some are PDF files. I want to understand why not everything is recoverable, because I know another drive failing in the future is inevitable and I don't want to lose more files along the way, of course.

     
    • David

      David - 2022-08-22

      SnapRAID is great and my first contribution to Andrea was almost nine years ago, but SR isn't real-time.

      Here's how I understand SR to work, very simplified, so if I'm wrong, someone let me know.

      Here's an array with 4 drives and a parity drive with 5 sectors each

      1 2 3 4 5
      x x x x y
      x x x x y
      x x x x y
      x x x x y
      x x x x y

      Let's say drive 4 dies.

      1 2 3 4 5
      x x x - y
      x x x - y
      x x x - y
      x x x - y
      x x x - y

      SR can rebuild drive 4 by combining the parity with the other three drives.

      But SR isn't real-time. Let's say you have the same array.

      1 2 3 4 5
      x x x x y
      x x x x y
      x x x x y
      x x x x y
      x x x x y

      But you delete a few files.

      1 2 3 4 5
      x x - x y
      x x x x y
      x x x x y
      x x x x y
      - x x x y

      Now you lose drive 4 before you sync.

      1 2 3 4 5
      x x - - y
      x x x - y
      x x x - y
      x x x - y
      - x x - y

      You can see that you are missing two blocks that the single parity drive needs. You won't be able to recover the corresponding missing files on drive 4.

      The good news, well semi-good news, is SR will tell you which files you deleted. If you're able to acquire the missing files then you can rebuild the lost files as long as you don't sync. You can continually rebuild files and it won't affect anything.
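      A tiny Python sketch of the same situation (made-up values, not SnapRAID's actual code; real parity covers big blocks of bytes, not single words):

      ```python
      from functools import reduce
      from operator import xor

      # One parity "row": a block from each of four data drives at sync time.
      # Values are made up for illustration.
      row = {"d1": 0b1011, "d2": 0b0110, "d3": 0b1100, "d4": 0b0101}
      parity = reduce(xor, row.values())   # computed at the last sync

      # A file on d2 is deleted, then d4 fails before the next sync.
      # Rebuilding d4 needs d1 XOR d2 XOR d3 XOR parity, but d2's original
      # content is gone: two unknowns, one equation, so d4 is unrecoverable.

      # If you can find an exact copy of the deleted d2 block, recovery works
      # again, as long as you haven't synced in the meantime:
      restored_d2 = 0b0110
      rebuilt_d4 = parity ^ row["d1"] ^ restored_d2 ^ row["d3"]
      assert rebuilt_d4 == row["d4"]
      ```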

      It's good practice to have at least two parity drives; this example uses one, but there's another reason. Rebuilding puts a lot of stress on the drives because they run at 100% for many hours, sometimes over a day. If another drive is already close to failing, putting it under that amount of stress for that long can cause it to fail too.

       

      Last edit: David 2022-08-22
      • Kevin Mychal M. Ong

        I kind of understand what you're saying, but not 100%. In your example, files got deleted from disks 1 and 3, and disk 4 was lost before a sync happened. Are those deleted files still part of what the parity drive needs to recover the actual files on disk 4, even though they aren't technically the same files?

        When you say "continually rebuild the files" manually, what do you mean? As in I get a copy from backup and just copy that file over to the new drive?

        With two parity drives, is this still a problem?

         
        • David

          David - 2022-08-22

          Right. Recovery combines the remaining drives with the parity drive to work out the missing value. Single parity works like this.

          Same four-drive array with a single parity drive. Again, very simplified. Sometimes I don't explain things in the best way, so if any of this is confusing, let me know and I'll try to explain better.

          1 2 3 4
          1 0 0 0
          1 1 1 1
          0 0 1 0
          0 0 1 1
          1 0 1 0

          Parity looks at the data and computes a value with XOR.

          1 2 3 4 P
          1 0 0 0 1
          1 1 1 1 0
          0 0 1 0 1
          0 0 1 1 0
          1 0 1 0 0

          So let's say drive four drops.

          1 2 3 4 P
          1 0 0 - 1   Parity is 1 & the other sectors compute to 1, so drive 4 must be a 0
          1 1 1 - 0   Parity is 0 & the other sectors compute to 1, so drive 4 must be a 1
          0 0 1 - 1   Parity is 1 & the other sectors compute to 1, so drive 4 must be a 0
          0 0 1 - 0   Parity is 0 & the other sectors compute to 1, so drive 4 must be a 1
          1 0 1 - 0   Parity is 0 & the other sectors compute to 0, so drive 4 must be a 0
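          The five rows above, checked with Python's `^` operator (real SnapRAID parity works on large blocks of bytes, not single bits, but the idea is the same):

          ```python
          # The five parity rows from the example, one bit per drive.
          rows = [  # d1, d2, d3, d4
              (1, 0, 0, 0),
              (1, 1, 1, 1),
              (0, 0, 1, 0),
              (0, 0, 1, 1),
              (1, 0, 1, 0),
          ]
          parity = [d1 ^ d2 ^ d3 ^ d4 for d1, d2, d3, d4 in rows]
          print(parity)  # [1, 0, 1, 0, 0] -- matches the P column above

          # Drive 4 drops: XOR the survivors with parity to get it back.
          rebuilt_d4 = [d1 ^ d2 ^ d3 ^ p
                        for (d1, d2, d3, _), p in zip(rows, parity)]
          print(rebuilt_d4)  # [0, 1, 0, 1, 0] -- the original drive-4 column
          ```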

          The entire array is used to rebuild lost files. Instead of thinking of a lost drive, think of it as a collection of files. If other files on the array are missing and they were used to compute parity, then you won't be able to rebuild those files.

          Here is another example. There are 4 drives: the first two are 8TB, the second two are 4TB, and the parity drive is 8TB.

          1 2 3 4 P
          x x - - y   Just the first two are used to create parity because the others are too small
          x x - - y   Just the first two are used to create parity because the others are too small
          x x - - y   Just the first two are used to create parity because the others are too small
          x x x x y   All drives are used to create parity
          x x x x y   All drives are used to create parity
          x x x x y   All drives are used to create parity

          With this example, if you lost drive four you would need the first three drives and parity to rebuild it. But here is what is weird. Let's say drives one, three, and four were lost. You could still recover the top 50% of drive one, because those rows only involve drive two and the parity drive, and you still have both. Rather than thinking about whole drives, it's easier, in my opinion, to think of SR as working on files.
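          The mixed-size behavior can be sketched like this (drive names and block values are made up; this is an illustration, not SnapRAID's actual layout code):

          ```python
          from functools import reduce
          from operator import xor

          # Hypothetical per-drive block lists: d1/d2 are "big" drives
          # (6 blocks), d3/d4 are "small" drives (3 blocks).
          drives = {
              "d1": [3, 1, 4, 1, 5, 9],
              "d2": [2, 6, 5, 3, 5, 8],
              "d3": [9, 7, 9],
              "d4": [3, 2, 3],
          }

          def row_members(row):
              """Drives long enough to contribute a block to this row."""
              return [d for d, blocks in drives.items() if row < len(blocks)]

          # Parity for each row only covers the participating drives.
          parity = [reduce(xor, (drives[d][r] for d in row_members(r)))
                    for r in range(6)]

          print(row_members(0))  # ['d1', 'd2', 'd3', 'd4'] -- all participate
          print(row_members(5))  # ['d1', 'd2'] -- small drives have no block

          # Lose d4: its row 0 is rebuilt from that row's participants only.
          rebuilt = parity[0] ^ drives["d1"][0] ^ drives["d2"][0] ^ drives["d3"][0]
          assert rebuilt == drives["d4"][0]
          ```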

          When you rebuild a drive, or perhaps more precisely recover its files, you may not be able to recover a file if the matching files on the other drives are missing. But SR will tell you which files are needed for the recovery.

          I think of SR as lining the drives up vertically, with parity created straight across. This isn't technically correct, but it's how I visualize it.

          Here is an array with three 8TB drives, two 6TB drives, a 4TB drive, and an 8TB parity drive. Each line is one TB.

          1 2 3 4 5 6 P
          x x x - - - y   Only the 8TB drives are used to create parity
          x x x - - - y   Only the 8TB drives are used to create parity
          x x x x x - y   The 8TB and 6TB drives are used for parity
          x x x x x - y   The 8TB and 6TB drives are used for parity
          x x x x x x y   All drives are used for parity
          x x x x x x y   All drives are used for parity
          x x x x x x y   All drives are used for parity
          x x x x x x y   All drives are used for parity

          To get to a file level, let's say file1.txt on drive 1, file2.txt on drive 2, file3.txt on drive 3, and file4.txt on drive 4 are used to create the parity for those files. file2.txt is deleted and, before the next sync, drive 4 drops. Recovery looks and says "Hmm, I only have 3 of the files and I need 4. But I know the file I need is file2.txt." SR will tell you that file2.txt is needed for recovery. If you have an exact copy of file2.txt, you can copy it back to drive 2 and then run recovery again. You can do this as often as you like, as long as you can find copies of the missing files and don't sync the new data.

          With two parity drives and the same example, you could recover file4.txt because you would have the other two files and the two parities. With RAID it's easy to think in terms of drives, but with SR I find it easier to think of a collection of files rather than entire drives.

           
          • Kevin Mychal M. Ong

            Ok, I think I got it now. What you just explained is definitely the cause of my issue, because a lot of SRT files were replaced during the 3 days the drive was in a failed state. That sucks. Is there a way to tell the mergerFS/SnapRAID combo to immediately (and automatically) stop all access to the array when a single drive fails?

            In all honesty then, is RAID better than SnapRAID? I had multiple instances of single drive failures on my Synology NAS (13 drives on SHR-1) and was able to recover the drive 100% even though the array was used in degraded mode for a few weeks. I thought it was the same with SnapRAID, which is why I didn't worry about replacing the drive immediately. Had I replaced it immediately after the first CurrentPendingSector alert I got on 7/21, I wouldn't have this problem now.

             
            • David

              David - 2022-08-23

              No idea. I run manual syncs under Windows. SR should refuse to sync if a drive is missing.

              Personally, I like SR over hardware RAID. I've used dedicated hardware RAID where the drives' firmware had to match, and DAS/NAS systems where drives could be mixed. I still have two Drobos and I think they're great. While RAID is more convenient, I prefer SR. With SR there is no funky formatting, and if the absolute worst happens, I still have readable data on each of my drives, so I can simply pop them out and have my data. If the machine dies, I can replace it with any hardware running Windows and still access the drives. It may not be as convenient, but I can move the drives between machines, use one in a dock, or replace one and then have a backup. Everyone's use case is different, but I really like SR.

               
              • Kevin Mychal M. Ong

                Yeah, it did refuse to sync when the drive totally failed. However, my other programs still had access to the other drives. So those programs (Bazarr specifically) updated some SRT files, which put my array into the state you're describing (did I understand correctly?). This went on for around 3 to 4 days before I finally replaced the drive and tried recovering the files.

                Synology is software RAID; SHR-1 is simply RAID5 in the background with minor improvements. What I don't understand is how software RAID can run in degraded mode (single disk failure) and still recover 100% of the data upon drive replacement. Does the parity in RAID5 not work the same way as in SR?

                 
                • David

                  David - 2022-08-23

                  If you have a program that is updating or replacing files, that is effectively the same as a delete as far as SR's parity is concerned. Dedicated RAID hardware and software constantly update the parity on every write. That's how they are able to keep working in a degraded state while other programs update files.
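                  To sketch the difference (made-up values, plain single-parity XOR only, not any vendor's actual implementation):

                  ```python
                  # One parity row across three data drives; values are invented.
                  old_block, new_block = 0b1010, 0b0110
                  peer1, peer2 = 0b0011, 0b1100   # blocks on the other data drives

                  parity = old_block ^ peer1 ^ peer2   # parity at the last consistent point

                  # Real-time RAID folds every write into parity immediately:
                  # P_new = P_old XOR old_data XOR new_data.
                  parity ^= old_block ^ new_block

                  # Parity still matches the live data, so a lost drive stays recoverable:
                  assert new_block == parity ^ peer1 ^ peer2

                  # SR instead leaves parity at its last-sync value. After the same write,
                  # rebuilding a lost *peer* with the stale parity gives back wrong data:
                  stale_parity = old_block ^ peer1 ^ peer2
                  assert stale_parity ^ new_block ^ peer2 != peer1
                  ```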

                  I've used dedicated hardware RAIDs before and, as I mentioned, I still use a couple of Drobos. I'm perfectly happy with them. SR is generally designed for files that don't change after being added. I'd recommend adding another parity drive, but SR may not be the best solution for you. If you have a program that is constantly changing files in the background, then a real-time RAID solution may be best for you.

                   
  • Kevin Mychal M. Ong

    FWIW, this is the order of events:

    • 7/21/2022 @ 4:53 PM - one drive (d3) started getting CurrentPendingSector SMART errors but I just let them be (I know, my bad)
    • 8/17/2022 @ 5:00 AM - last successful Snapraid runner run (touch, diff, sync, scrub, in that order)
    • 8/18 @ 3:35 AM - the same drive got a FailedHealthCheck error when running SMART
    • 8/18 @ 5:00 AM - 8/21 @ 5:00 AM - all daily Snapraid runner runs failed because they had a hard time reading the already-failing drive
    • 8/21 night time - I finally replaced the drive with a working one and modified /etc/fstab accordingly to reflect the replacement drive
    • 8/22 @ 5:00 AM - since I forgot to turn off the Snapraid runner at this point, it still did a touch and a diff (which wrote to the existing content files) but did not push through with a sync and scrub because it detected that all files from the old drive were missing (delete threshold of 250). Here's the last email notif I got today: https://gist.github.com/kevindd992002/36bdff4d60efe13d4295c3612e5c0a4d

    So between 8/17 and the night of 8/21, everything was still accessing the mergerFS pool (which includes the bad drive). So files were added and removed on those dates, but there was no more successful SnapRAID sync (which I think is actually a good thing). Here's a copy of the snapraid.log file:

    https://www.dropbox.com/s/gqk8mcukmax9k08/snapraid.log?dl=0

    Here's the complete fix.log:

    https://1drv.ms/u/s!AhDXcRksNyfes3HwSuF25vUjEvEO?e=rA768q

    For some reason, I see unrecoverable events there on other drives (d2 and d5) even though the snapraid fix command was targeted only at d3. Is this normal?

     

    Last edit: Kevin Mychal M. Ong 2022-08-22
  • Kevin Mychal M. Ong

    @amadvance do you have any ideas here? The last run of SnapRAID on August 17 was a success, so I'm not sure what's happening here.

     
