
Found errors during Scrub - multiple fix needed, file corrupt

Help
therealjmc
2023-09-14
2023-10-10
  • therealjmc

    therealjmc - 2023-09-14

    Hi,

    I've been using snapraid for several years and on a new build I've encountered a strange behaviour:

    • Running scrub detected errors in multiple files for which I have CRC checksums
    • Running the fix command recovered the errors, but the CRC for 2 of the files didn't match
    • Running the same fix command again fixed both files a second time, and the CRC for 1 of the 2 files still didn't match
    • Running fix a third time does nothing: it reports nothing to do
    • Deleted the file and ran fix missing: the file was recovered, but the CRC does not match
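
The CRC comparison used in the steps above can be sketched as follows. This is a minimal sketch, assuming the stored checksums are standard CRC-32 values (as computed by zlib) written as hex strings; the thread does not name the tool that produced them, and the function names `crc32_of_file` and `matches_known_crc` are hypothetical.

```python
import zlib


def crc32_of_file(path, chunk_size=1024 * 1024):
    """Return the CRC-32 of a file as an 8-digit uppercase hex string.

    The file is read in chunks so that large backup images do not
    have to fit in memory.
    """
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running value as its second argument.
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08X}"


def matches_known_crc(path, expected_hex):
    """Compare a file's CRC-32 with a known-good hex value (assumed format)."""
    return crc32_of_file(path) == expected_hex.strip().upper()
```

Running `matches_known_crc` on the file after each fix run, and again on the copy from backup, would show at which step the content stops matching.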

    Since I had the files on hand on the other hardware, I decided to just copy the file over again. Roughly 2 weeks later the same files were damaged again during a scrub (I still have to figure out why; it's an SSD drive pool used for image backups), with the same behaviour: 2 runs of fix were needed, the same 2 files were fixed in the 2nd run, and the same file did not match its CRC after the 2nd run.

    I don't get it? What am I doing wrong?

    Edit: After the first error and copying the file over, I did a 100% scrub that completed without any errors. So snapraid should have the right hashes for the files? I'm using 2 data and 2 parity disks, and since these are image backups the data is rather static (diff shows 0 changes).

    Edit2:
    I've deleted the file with the CRC error and recovered it; it is recovered with the wrong CRC?! Copying the correct file over and syncing does not change that behaviour.

     

    Last edit: therealjmc 2023-09-14
  • rubylaser

    rubylaser - 2023-09-19

    Since this is a new build, have you run a memtest on the RAM to confirm that it's working properly? Faulty memory can cause all sorts of errors like these.

     
  • therealjmc

    therealjmc - 2023-10-08

    Yes, I've run multiple checks. Since it is always the same files, I suspect one of the SSDs. However, what is giving me a hard time is that snapraid can't fix the errors even in a build with 2 data and 2 parity disks. It makes no sense.

    It happened again today with the same files. I'm running the CRC check now and giving fix another try. But my bet is I'll have to copy the files over again.

     
  • UhClem

    UhClem - 2023-10-09

    You should include the exact error messages that SnapRAID is reporting (use copy/paste).

    And, when you write:

    Yes, I've run multiple checks

    Exactly what tests, and for how long?

     
    • therealjmc

      therealjmc - 2023-10-09

      The problem is that snapraid doesn't report any error. However, checking the CRC of the restored files against a previous (known good) CRC checksum shows they are bad. Re-creating the CRC from my backup and verifying confirms it: the restored files are bad. Copying them over from backup and verifying the CRC -> everything OK. So I'm pretty sure the files ARE damaged even after a restore.

      785210 errors
      785210 recovered errors
      0 unrecoverable errors
      Everything OK

      I've run memtest86+ for about 2 days, like always on a new server.

      Since it is always the same files, I tend to think it's the SSD. Copying over the whole file will write the whole file anew instead of just fixing individual blocks.
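
To see which parts of a restored file still differ from the known-good backup copy, the two files can be compared block by block. The following is a minimal sketch, not something taken from SnapRAID itself: the function name `differing_blocks` is hypothetical, and the 256 KiB block size is an assumption that should be changed to whatever block size the array is actually configured with.

```python
def differing_blocks(path_a, path_b, block_size=256 * 1024):
    """Return the indices of fixed-size blocks whose bytes differ.

    block_size is an assumed value; a block that exists in only one
    of the two files (different lengths) also counts as differing.
    """
    bad = []
    index = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if not block_a and not block_b:
                break  # both files exhausted
            if block_a != block_b:
                bad.append(index)
            index += 1
    return bad
```

Calling `differing_blocks(restored_path, backup_path)` after a fix run gives the list of blocks that are still wrong, which can then be checked against the positions reported during scrub.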

       
  • UhClem

    UhClem - 2023-10-10

    Are you saying that, from the beginning of this mess, there have never been any SnapRAID error messages of the form "Data error ..." or "Parity error ..."?

    Sometimes, memory errors are elusive and only occur in combination with heavy CPU use (such as SR hash and/or parity calcs); these can often be uncovered using Prime95, either Torture test or Blend test.

     
    • therealjmc

      therealjmc - 2023-10-10

      Error messages do show up during scrub, but only after several weeks. Running a 100% scrub now won't show any error; the errors start to appear after 4-6 weeks and always affect the same few files. Every single time. Only a few files are hit, while other files in the directory are CRC okay. So it seems like a disk issue to me. The Prime95 torture test was done to verify cooling before the build went live.

      For example
      error:5653915:d2:Images/DiskBackup.tib: Data error at position 42426, diff bits 63/128
      msg:error: Data error in file 'Y:/PoolPart.c98762be-3a07-4fca-ad12-220a39671c8f/Images/DiskBackup.tib' at position '42426', diff bits 63/128
      and another 10 MB of the same.
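
With a log of that size, a short script can group the data errors by file and collect the reported positions. This is a minimal sketch that assumes every relevant line follows exactly the `error:...` format quoted above; the name `summarize_data_errors` is hypothetical, and the meaning of the first numeric field is not stated in the thread, so it is captured but not interpreted.

```python
import re
from collections import defaultdict

# Pattern modelled on the quoted line only (assumed to hold for the whole log):
# error:<number>:<disk>:<path>: Data error at position <n>, diff bits <a>/<b>
DATA_ERROR = re.compile(
    r"^error:(?P<index>\d+):(?P<disk>[^:]+):(?P<path>.*): "
    r"Data error at position (?P<position>\d+), "
    r"diff bits (?P<diff>\d+)/(?P<total>\d+)$"
)


def summarize_data_errors(lines):
    """Map (disk, path) to the list of positions reported for that file.

    Lines that do not match the pattern, such as the 'msg:error:'
    duplicates, are skipped.
    """
    summary = defaultdict(list)
    for line in lines:
        match = DATA_ERROR.match(line.strip())
        if match:
            key = (match["disk"], match["path"])
            summary[key].append(int(match["position"]))
    return dict(summary)
```

Feeding the log file's lines into `summarize_data_errors` shows at a glance whether the errors really are confined to the same few files and whether the same positions come back each time.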

       
