I've been using SnapRAID for several years, and on a new build I've run into some strange behaviour:
Running scrub detected errors in multiple files that I have CRC checksums for.
Running the fix command recovered the errors, but the CRC for 2 of the files didn't match.
Running the same fix command again fixed both files again, but the CRC for 1 of the 2 files still didn't match.
Running fix a third time does nothing; it reports nothing to do.
I deleted the file and ran fix missing: the file was recovered, but its CRC does not match.
Since I had the files on hand on the other hardware, I decided to just copy the file over again. Roughly 2 weeks later the same files were damaged during a scrub (I still have to figure out why; it's an SSD drive pool used for image backups), and I saw the same behaviour: 2 runs of fix were needed, the same 2 files were fixed in the 2nd run, and the same file did not match its CRC after the 2nd run.
I don't get it. What am I doing wrong?
Edit: After the first error and copying the file over, I did a 100% scrub that went smoothly without any errors, so SnapRAID should have the right hashes for the files. I'm using 2 data and 2 parity disks, and since these are image backups the data is rather static (diff shows 0 changes).
Edit2:
I've deleted the file with the CRC error and recovered it; it is recovered with the wrong CRC. Copying the correct file over and syncing does not change that behaviour.
Last edit: therealjmc 2023-09-14
Since this is a new build, have you run a memtest on the RAM to confirm that it's working properly? Faulty memory can cause all sorts of errors like these.
Yes, I've run multiple checks. Since it's always the same file, I suspect one of the SSDs. What's giving me a hard time, however, is that SnapRAID can't fix the errors even in a build with 2 data and 2 parity disks. It makes no sense.
It happened again today with the same files. I'm currently running the CRC check and giving fix another try, but my bet is I'll have to copy the files over again.
The problem is that SnapRAID doesn't report any error. However, checking the CRC of the restored files against a previous (known good) CRC checksum shows that they are bad. Re-creating the CRC from my backup and verifying against it gives the same result: the restored files are bad. Copying them over from the backup and verifying the CRC: everything OK. So I'm pretty sure the files ARE damaged even after a restore.
785210 errors
785210 recovered errors
0 unrecoverable errors
Everything OK
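For readers who want to repeat the CRC comparison described above, here is a minimal sketch in Python. It assumes the known-good checksums are plain CRC-32 values written as hex strings; the thread does not say which tool or CRC variant was actually used, so `file_crc32` and `matches_known_good` are hypothetical helper names, not part of SnapRAID.

```python
import zlib


def file_crc32(path, chunk_size=1024 * 1024):
    """Return the CRC-32 of a file as an 8-digit uppercase hex string.

    The file is read in chunks so that large backup images do not
    have to fit in memory.
    """
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running value as its second argument.
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08X}"


def matches_known_good(path, expected_hex):
    """Compare a file's CRC-32 against a stored hex value (assumed format)."""
    return file_crc32(path) == expected_hex.strip().upper()
```

Running `matches_known_good` once on the restored file and once on the copy from the backup would reproduce the check described in this post, provided the stored value really is a standard CRC-32.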
I've run memtest86+ for about 2 days, as I always do on a new server.
Since it's always the same files, I tend to think it's the SSD. Copying the whole file over will write the whole file anew instead of just fixing individual blocks.
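Since a known-good copy exists on the backup, one way to see which blocks of a restored file are still wrong is to compare the two copies block by block. The sketch below is only an illustration: `differing_blocks` is a hypothetical helper, and the 256 KiB default is an assumption standing in for whatever block size the array is actually configured with.

```python
def differing_blocks(path_a, path_b, block_size=256 * 1024):
    """Return the indices of the blocks whose bytes differ between two files.

    block_size is an assumed value; set it to the block size the array
    actually uses so the indices are comparable with its reports.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                # Also catches a length mismatch in the final block.
                bad.append(index)
            index += 1
    return bad
```

If the list of differing blocks stays the same across repeated fix runs, that would point at a fixed location rather than random corruption, though it cannot say by itself whether the disk, the parity, or the stored hashes are at fault.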
Are you saying that, from the beginning of this mess, there have never been any SnapRAID error messages of the form "Data error ..." or "Parity error ..."?
Sometimes, memory errors are elusive and only occur in combination with heavy CPU use (such as SR hash and/or parity calcs); these can often be uncovered using Prime95, either Torture test or Blend test.
There are error messages, found during scrub after several weeks. Running a 100% scrub now won't show any errors; the errors start to appear after 4-6 weeks and always affect the same few files. Every single time. Only a few files, while other files in the directory pass the CRC check. So it seems like a disk issue to me. A Prime95 torture test was run to verify cooling before the build went live.
For example:
error:5653915:d2:Images/DiskBackup.tib: Data error at position 42426, diff bits 63/128
msg:error: Data error in file 'Y:/PoolPart.c98762be-3a07-4fca-ad12-220a39671c8f/Images/DiskBackup.tib' at position '42426', diff bits 63/128
and another 10 MB of it.
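With a log that large, it can help to tally the errors per file before reading individual lines. The following sketch parses only lines shaped like the `error:` sample quoted above; the field names (`block`, `disk`, `path`, `pos`) are guesses based on that single sample rather than on a documented log format, and `tally_data_errors` is a hypothetical helper.

```python
import re
from collections import Counter

# Pattern derived from the one sample line in this thread; the meaning
# of each field is an assumption, not taken from documentation.
LINE_RE = re.compile(
    r"^error:(?P<block>\d+):(?P<disk>[^:]+):(?P<path>.+): "
    r"Data error at position (?P<pos>\d+)"
)


def tally_data_errors(lines):
    """Count 'Data error' log lines per (disk, path) pair."""
    per_file = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            per_file[(m["disk"], m["path"])] += 1
    return per_file
```

Passing an open log file to `tally_data_errors` and printing `most_common()` on the result would show whether the errors really concentrate on the same few files on one disk.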
You should include the exact error messages that SnapRAID is reporting (use copy/paste).
And, when you write:
Exactly what tests, and for how long?