SnapRAID / Discussion / Help: Found errors during Scrub

therealjmc - 2023-09-14

Hi,

I've been using SnapRAID for several years, and on a new build I've encountered some strange behaviour:

Running scrub detected errors in multiple files that I have CRC checksums of.

Running the fix command recovered the errors, but the CRC for 2 of the files didn't match.

Running the same fix command again fixed both files again, but the CRC for 1 of the 2 files still didn't match.

Running fix a third time does nothing; there is nothing left to do.

I deleted the file and ran fix missing: the file was recovered, but the CRC does not match.

Since I still had the files on the other hardware, I decided to just copy the file over again. Roughly 2 weeks later the same files were flagged as damaged during a scrub (I still have to figure out why; it's an SSD drive pool used for image backups), with the same behaviour: 2 runs of fix were needed, the same 2 files were fixed again in the 2nd run, and the same file failed the CRC check after the 2nd run.

I don't get it. What am I doing wrong?

Edit: After the first error and copying the file over, I did a 100% scrub that completed without any errors. So SnapRAID should have the right hashes for the files? I'm using 2 data and 2 parity disks, and since it's image backups the data is rather static (diff shows 0 changes).

Edit2:
I've deleted the file with the CRC error and recovered it; it is recovered with the wrong CRC?! Copying the correct file over and syncing does not change that behaviour.
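
For anyone wanting to repeat this kind of check, here is a minimal sketch of comparing a file against a known-good checksum. It assumes the checksums are plain CRC32 values written as 8 hex digits (the post does not say which CRC tool was used), and `file_crc32` and `matches_known_good` are hypothetical helper names, not part of SnapRAID:

```python
import zlib


def file_crc32(path, chunk_size=1024 * 1024):
    """Return the CRC32 of a file as an 8-digit uppercase hex string.

    The file is read in chunks so large backup images do not have to
    fit in memory.
    """
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running value as its second argument.
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08X}"


def matches_known_good(path, expected_hex):
    """Compare a file's CRC32 with a previously recorded hex value."""
    return file_crc32(path) == expected_hex.strip().upper()
```

Running this on the restored file and on the copy from the other machine shows whether the two really differ, independent of what SnapRAID reports.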

Last edit: therealjmc 2023-09-14

rubylaser - 2023-09-19

Since this is a new build, have you run a memtest on the RAM to confirm that it's working properly? Faulty memory can cause all sorts of errors like these.

therealjmc - 2023-10-08

Yes, I've run multiple checks. Since it is always the same files, I suspect one of the SSDs. What is giving me a hard time, however, is that SnapRAID can't fix the errors even in a build with 2 data and 2 parity disks. It makes no sense.

It happened again today with the same files. I'm running the CRC check now and giving fix another try, but my bet is I'll have to copy the files over again.

UhClem - 2023-10-09

You should include the exact error messages that SnapRAID is reporting (use copy/paste).

And, when you write:

"Yes, I've run multiple checks"

Exactly what tests, and for how long?

- therealjmc - 2023-10-09
  
  The problem is that SnapRAID doesn't report any error. However, checking the CRC of the restored files against a previous (known good) CRC checksum shows they are bad. Re-creating the CRC from my backup and verifying confirms it: the restored files are bad. Copying them over from backup and verifying the CRC -> everything OK. So I'm pretty sure the files ARE damaged even after a restore.
  
  785210 errors
  785210 recovered errors
  0 unrecoverable errors
  Everything OK
  
  I've run memtest86+ for about 2 days like always on a new server.
  
  Since it is always the same files, I tend to think it's the SSD. Copying over the whole file will write the whole file anew instead of just fixing individual blocks.
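
To see which parts of a restored file actually differ from the backup copy, rather than only that the CRC is wrong, a chunk-by-chunk comparison can help. The following is a rough sketch only: `differing_chunks` is a hypothetical helper, and the 256 KiB chunk size is an assumption chosen to mirror what is believed to be SnapRAID's default block size, so adjust it to the `blocksize` in your own config:

```python
def differing_chunks(path_a, path_b, chunk_size=256 * 1024):
    """Return the indices of fixed-size chunks that differ between two files.

    A chunk that exists in only one file (different lengths) also counts
    as differing.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

If the differing chunks cluster in the same places after every restore, that points at specific regions of the file (or of the underlying disk) rather than at random corruption.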
  

UhClem - 2023-10-10

Are you saying that, from the beginning of this mess, there have never been SnapRAID error messages of the form "Data error ..." or "Parity error ..."?

Sometimes, memory errors are elusive and only occur in combination with heavy CPU use (such as SR hash and/or parity calcs); these can often be uncovered using Prime95, either Torture test or Blend test.

- therealjmc - 2023-10-10
  
  Error messages do show up during scrub, but only after several weeks. Running a 100% scrub now won't show any error; the errors start to appear after 4-6 weeks and always affect the same few files. Every single time. Only a few files are affected, while other files in the directory pass the CRC check. So it seems like a disk issue to me. Prime95 torture was run to verify cooling before the build went live.
  
  For example:
  error:5653915:d2:Images/DiskBackup.tib: Data error at position 42426, diff bits 63/128
  msg:error: Data error in file 'Y:/PoolPart.c98762be-3a07-4fca-ad12-220a39671c8f/Images/DiskBackup.tib' at position '42426', diff bits 63/128
  and another 10 MB of the same.
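
With 10 MB of such lines, a per-file summary is easier to reason about than the raw log. Below is a minimal sketch that parses lines shaped like the `error:` example above and counts data errors per disk and file. The regular expression is derived only from that single sample line, the meaning of the first numeric field is not confirmed here (it is simply captured as `index`), and `summarize` is a hypothetical helper name:

```python
import re
from collections import Counter

# Pattern modelled on the one sample line quoted in this thread.
ERROR_RE = re.compile(
    r"^error:(?P<index>\d+):(?P<disk>[^:]+):(?P<path>.+): "
    r"Data error at position (?P<pos>\d+), "
    r"diff bits (?P<diff>\d+)/(?P<total>\d+)$"
)


def summarize(lines):
    """Count data errors per (disk, path) and collect their positions."""
    per_file = Counter()
    positions = {}
    for line in lines:
        m = ERROR_RE.match(line.strip())
        if not m:
            continue  # skip "msg:" lines and anything else
        key = (m["disk"], m["path"])
        per_file[key] += 1
        positions.setdefault(key, []).append(int(m["pos"]))
    return per_file, positions
```

Comparing the positions from one scrub with those from the next (4-6 weeks later) would show whether the same blocks keep going bad, which would support the suspicion that one SSD is at fault.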
  

Found errors during Scrub - multiple fix needed, file corrupt