'snapraid sync' - 'Data error in file'; memtest86+ and prime95 tests clean, SMART stats good.

I recently upgraded the 2 parity drives to 8TB eSATA drives. I copied the existing parity files and updated the snapraid config, which seemed to work fine. But then a scrub showed 'Data error in parity' errors in the parity files at a bunch of different block locations. I had not paid much attention before, and am not sure if I ever had any errors before the move of the parity files to the new drives.

I decided to delete all the content and parity files and run 'snapraid sync -h' to rebuild everything from scratch. During the initial sync, which took almost 2 days on about 12TB of data (9 mixed data drives in a mergerfs pool), I received 'Data change at file' / 'WARNING! Unexpected data modification of a file without parity!' errors for around 30 files. Most were large 30GB+ video files that would not have been touched or altered during the sync.

I then ran 'snapraid sync' after the initial 'snapraid sync -h' and again received 'Data change at file' / 'WARNING! Unexpected data modification of a file without parity!' for all the same files as before, plus 'Data error in file' for some additional files that didn't have errors on the first run. None of the files in question are duplicates; they do not exist anywhere else on the snapraid volume.

SMART stats look good for all drives. I have 4x 2-port PCIe ASM1062 SATA controller cards in the system, but the errors are for files spread across different drives connected to both the onboard and add-on controllers. I have just run an overnight 17hr memtest86+ of 8 full passes with no errors. I've also run the prime95 small FFTs test for over 4 hrs from a bootable Hiren's USB stick, and CPU core temps stay below 57°C. I'm now running a prime95 blend test and don't expect to see any issues. I've been running OpenMediaVault on this system for years with no stability issues whatsoever.
The system will run for months until I have a reason to reboot it manually, and I've never had an issue accessing any of my data. Not sure where to go from here.

As I mentioned, I did not pay close enough attention to snapraid in the past, until now that I'm moving more critical data to it. Previously, snapraid would run sync and scheduled scrub jobs set in the OMV GUI. I don't recall seeing errors when running a manual sync from the GUI in the past, but I can't be 100% sure; I would only occasionally look at the snapraid logs through the GUI. I have also recovered a full drive and individual files with snapraid in the past without issue.

I've just recently been educating myself on how snapraid works and how to use it effectively. I'm now running the snapraid commands from the shell, which is why I am aware of the issue, but I'm not sure where to go from here, as I cannot get a clean sync and I cannot find a hardware issue.
Samples:

snapraid sync -h
Data change at file '/srv/dev-disk-by-label-2TBblack01/Storage/Backup/8700k_full_b1_s1_v2.tib' at position '286007'
WARNING! Unexpected data modification of a file without parity!
snapraid sync
Data change at file '/srv/dev-disk-by-label-2TBblack01/Storage/Backup/8700k_full_b1_s1_v2.tib' at position '286007'
WARNING! Unexpected data modification of a file without parity!
Try removing the file from the array and rerun the 'sync' command!
Data error in file '/srv/dev-disk-by-label-1TBblack02/Storage/path obscured.mp4' at position '2734', diff bits 65/128
System specs:
OpenMediaVault 5.3.10-1
SnapRAID plugin 3.7.7 (snapraid v11.3)
Motherboard: Gigabyte GA-EP45-UD3P
CPU: Core 2 Quad Q9550
Memory: OCZ2N800SR4GK 4x2GB sticks
PCIe SATA cards: 4x StarTech PEXESAT3221 2-port SATA cards with ASM1062 controller
Parity drives: 2x Fantom GF3B8000EU (8TB drives connected via eSATA)
Data drives: 9 drives mixed 1TB, 2TB, 4TB, 6TB... mostly WD, one Hitachi; can provide full details if needed.
Last edit: Nathan 2020-04-26
Ran extended SMART tests on all drives last night, all passed.
Now getting many 'Data error in file' errors when I run a sync; I got over 147 in about a minute, so I just cancelled the sync. No data has changed, and I'm not sure what is going on. I have not found a single hardware issue. See output below.
snapraid diff
Loading state from /srv/dev-disk-by-label-1TBblack01/snapraid.content...
Mismatching CRC in '/srv/dev-disk-by-label-1TBblack01/snapraid.content'
This content file is damaged! Use an alternate copy.
I renamed the bad content file:
mv snapraid.content damaged-snapraid.content
snapraid diff now ran successfully
snapraid sync
Loading state from /srv/dev-disk-by-label-1TBblack01/snapraid.content...
WARNING! Content file '/srv/dev-disk-by-label-1TBblack01/snapraid.content' not found, trying with another copy...
Loading state from /srv/dev-disk-by-label-2TBblack01/snapraid.content...
Mismatching CRC in '/srv/dev-disk-by-label-2TBblack01/snapraid.content'
This content file is damaged! Use an alternate copy.
Renamed the bad content file again:
mv snapraid.content damaged-snapraid.content
Resolved, I think. The issue is that running 'snapraid sync -h' for the first run causes tons of errors, which then remain permanent in the content / parity files. Running plain 'snapraid sync' runs clean. More details:

I renamed my content files and parity files so I could set up a test snapraid configuration that wouldn't take so long to sync. I set up 2 data disks and 1 parity disk, all containing the content file, since you need a minimum of 3 content files, and put just 1 video file on each data disk. Ran 'snapraid sync -h': same result but worse, tons more errors, over 900 'Data change at file' / 'WARNING! Unexpected data modification of a file without parity!'. Then ran 'snapraid sync': tons of 'Data error in file' again.

So I wiped those content and parity files again, and this time ran 'snapraid sync' for the first run instead of with -h, and it ran clean! 'snapraid diff' then showed all files equal as expected, and running 'snapraid sync' again resulted in nothing to do and was instantaneous, as expected.

I've been pulling my hair out for about a week over this, and the whole issue was because I was running 'snapraid sync -h'! I thought that was the best way to do it, by hashing everything twice? Why does it throw so many errors?

In case whoever is reading this didn't read through all my previous posts: I ran a full 8 passes with memtest86+, over 8 hours total of prime95 small FFTs and blend tests, and extended SMART tests on all drives, and found no issues with anything; plus the system has been rock solid stable for years. I'll know if it is fully resolved once I get a successful sync on all my real data.
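For reference, a minimal snapraid.conf for a 2-data / 1-parity test array like the one described might look like this (the mount-point labels here are hypothetical placeholders, not the actual disks):

```
parity /srv/dev-disk-by-label-parity01/snapraid.parity

# at least three copies of the content file, as noted above:
# one on the parity disk and one on each data disk
content /srv/dev-disk-by-label-parity01/snapraid.content
content /srv/dev-disk-by-label-data01/snapraid.content
content /srv/dev-disk-by-label-data02/snapraid.content

data d1 /srv/dev-disk-by-label-data01/
data d2 /srv/dev-disk-by-label-data02/
```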
Last edit: Nathan 2020-04-24
Issue not resolved. The only reason 'snapraid sync' works, I'm assuming, is because there is no verification of the hashes when writing parity. As soon as I run 'snapraid scrub -p new', tons of errors show up. I believe this is an issue with snapraid, not the hardware, as all my stress tests have shown there is nothing wrong; see my previous posts for details. If anything, I'm more convinced of the stability of the system at this point. As I stated, the data in question can be copied and accessed without any issue. Can someone shed some light on this?
Looks very much like a workload-related hardware issue.
When you do an initial sync without -h, you see no problem until you scrub.
The reason is that snapraid has never seen these files before and therefore has no hashes to compare them to.
But when you later scrub, snapraid discovers that some files are different compared to the hashes from the initial sync.
When you do 'sync -h', all data is read twice, and snapraid finds that some files differ on the second read compared to the first.
How likely does it seem that you have encountered multiple bugs in snapraid's core functionality, not found by anyone else, but still easily reproduced?
I can pretty much guarantee that if you do this:
1. Delete the content and parity files.
2. Run 'snapraid sync -h' until you encounter the first error.
3. Delete the content and parity files.
4. Run 'snapraid sync -h' until you encounter the first error.
Then you will find that the errors in 2 and 4 are different, and you will have proven that these are random errors, which cannot be explained by bugs in snapraid.
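The comparison in steps 2 and 4 can be scripted by saving each run's output and diffing the reported block positions; a sketch using made-up log excerpts in place of the real snapraid output (file names and positions are hypothetical):

```shell
# Work in a scratch directory with made-up log excerpts standing in
# for the saved output of the two 'snapraid sync -h' runs.
cd "$(mktemp -d)"
cat > run1.log <<'EOF'
Data change at file '/srv/d1/a.mkv' at position '286007'
Data change at file '/srv/d1/a.mkv' at position '286101'
EOF
cat > run2.log <<'EOF'
Data change at file '/srv/d1/a.mkv' at position '286007'
Data change at file '/srv/d1/a.mkv' at position '291550'
EOF
# Pull out just the reported block positions, sorted for comm(1)
grep -o "at position '[0-9]*'" run1.log | sort > run1.pos
grep -o "at position '[0-9]*'" run2.log | sort > run2.pos
# Suppress positions common to both runs; anything printed was flagged
# in only one run, i.e. the errors do not repeat at the same blocks.
comm -3 run1.pos run2.pos
```

If the two runs flagged identical positions, this prints nothing; scattered, non-repeating positions are what you would expect from random corruption rather than a snapraid bug.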
My main suspect would be the SATA controllers.
It could also be a motherboard problem, over-heating problem or unstable power supply related to the combined workload.
Since you have not had this problem in the past, a good starting point would be to think back what hardware changes / additions you made before the problems started.
It would also be nice if you used the Enter key once in a while; it would make your posts much easier to read.
Thanks for the reply. I've since set up a test machine with newer hardware and brought over 2 of the SATA controllers from the problem machine. 'snapraid sync -h' worked perfectly on the test machine; no errors when comparing hashes.
On the problem machine:
Installed OMV 5 from scratch, with the snapraid plugin.
Moved the same drives used on the snapraid test machine over to the problem machine.
Deleted content and parity files, ran 'snapraid sync -h': errors for almost every block of both video files in the array.
Ran a second run, deleting the content and parity files before it.
2245 errors first run, 1750 errors second run.
The errors don't align exactly between runs, but with this many errors I find it odd that I have no other symptoms outside of snapraid, and that nothing comes up in the memtest86+ and prime95 stress tests.
I've attached the output of both 'snapraid sync -h' runs.
I think this part of the FAQ is close enough even if the symptom is different:
http://www.snapraid.it/faq#panic
Did some more testing. On the problem machine:
Reverted back to what I ran for years, as a test:
Fresh install of OpenMediaVault 3.0.94, updated to 3.0.99 in the update manager.
Installed snapraid plugin 3.7.3 (which is snapraid v11.1).
'snapraid sync -h' runs with no errors; 'snapraid scrub -p new' runs with no errors, all hashes verified successfully.
Then a fresh install of OpenMediaVault 4.1.35-1 with snapraid plugin v3.7.7 (snapraid 11.3): 'snapraid sync -h' gives tons of errors again.
The content and parity files were deleted between tests, data remained the same.
Is it likely the latest version of snapraid 11.3 is not working with this hardware?
It seems I cannot run the latest version on this hardware, but this type of application is what old hardware is good for...
Let me know what I can do to provide any debugging info to help get to the bottom of this.
Other notes:
Plugin 3.7.3 (snapraid 11.1) was the latest snapraid plugin available for OMV v3. I tried to manually upgrade snapraid to 11.3 while still on OMV 3, but I would get the following:

root@OMVtest3:/home/snapraid-11.3# ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... no
checking whether make supports nested variables... no
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for gcc... no
checking for cc... no
checking for cl.exe... no
configure: error: in `/home/snapraid-11.3':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details

I'm no linux expert and wasn't sure what else to do, so I just went to OMV v4 to continue the test with snapraid 11.3.
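That configure error just means no C compiler is installed. On a Debian-based OMV box, the usual way to get a build toolchain (assuming the package repos are reachable) would be:

```shell
apt-get update
apt-get install build-essential   # pulls in gcc, make, and libc headers
cd /home/snapraid-11.3
./configure && make && make install
```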
Note - Running OMV 3.0.99 with snapraid plugin 3.7.3 (snapraid v11.1) results in no errors.
OMV 4.1.35-1, snapraid plugin 3.7.7 (v11.3) = thousands of errors
My latest test:
OMV 5.4.2-1, no OMV snapraid plugin, manual install of snapraid 11.1 - results in a few errors
I was hoping 11.1 would work on OMV 5 for now.
Upgrading is what broke it. See previous post for test output.
OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
A combination of OMV 4 or 5 with snapraid 11.3 = thousands of errors with just 2 test files.
OMV 4 or 5 with snapraid 11.1 = few errors, much better than v11.3 but still broken.
Test was 'snapraid sync -h' followed by 'snapraid scrub -p new'.
2 video files (1 on each data drive).
Content and parity files deleted between test runs; data untouched between tests.
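Spelled out, one full test cycle as described amounts to the following (mount-point paths are hypothetical; the rm lines remove only SnapRAID's state files, never the data):

```shell
# start from a clean slate: remove content and parity files
rm /srv/dev-disk-by-label-data01/snapraid.content
rm /srv/dev-disk-by-label-data02/snapraid.content
rm /srv/dev-disk-by-label-parity01/snapraid.content
rm /srv/dev-disk-by-label-parity01/snapraid.parity

snapraid sync -h        # build parity, hashing all data twice
snapraid scrub -p new   # re-read and verify the blocks just synced
```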
Anyone know what might be going on?
System specs:
OpenMediaVault 5.3.10-1
SnapRAID plugin 3.7.7 (snapraid v11.3)
Motherboard: Gigabyte GA-EP45-UD3P
CPU: Core 2 Quad Q9550
Memory: OCZ2N800SR4GK 4x2GB sticks
PCIe SATA cards: 4x StarTech PEXESAT3221 2-port SATA cards with ASM1062 controller
Power supply: OCZ750FTY - Fatal1ty Gaming Series 750 Watt 80+ Bronze
Parity drives: 2x Fantom GF3B8000EU (8TB drives connected via eSATA)
Data drives: 9 drives mixed 1TB, 2TB, 4TB, 6TB... mostly WD, one Hitachi; can provide full details if needed.
I think I've proven this is a software / hardware compatibility issue and not a problem with the actual hardware. See my previous post for test details. OMV 3 (based on Debian Jessie) with snapraid 11.1 works with NO ERRORS; upgrading causes thousands of errors.
I just synced my entire array of 12TB of data again, then scrubbed 5% of the array with no errors.
If I was on a newer version of OMV with snapraid 11.3, I would have seen tens of thousands of errors scrubbing 5%.
OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
A combination of OMV 4 or 5 with snapraid 11.3 = thousands of errors with just 2 test files.
OMV 4 or 5 with snapraid 11.1 = few errors, much better than v11.3 but still broken.
Yes, I agree. On the surface this is what it looks like:
A problem with OMV 4/5 that is observable when using snapraid 11.1 and much more frequent when using snapraid 11.3.
Logically it could of course be the other way around, but it seems more intuitive that snapraid would be the trigger.
I guess Andrea would be the best person to figure out if it is possible to rule out snapraid as root cause or not.
He usually keeps an eye on the forum so perhaps he can comment. But personally I would try to get some feedback on the issue in the OMV forum.
Did you do some memory tests?
SnapRAID is quite stressful on memory and disks (DMA), and newer versions of the Linux kernel may improve performance (disk/memory), which could trigger the problem.
Last edit: Walter Tuppa 2020-04-28
I know there is a lot to read through in my previous posts; memory testing was the first thing I did.
17hr memtest86+ test, 8 full passes - no errors
prime95 small FFTs 4hrs, prime95 blend test 4hrs - no errors
OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
A combination of OMV 4 or 5 with snapraid 11.3 = thousands of errors with just 2 test files.
OMV 4 or 5 with snapraid 11.1 = few errors, much better than v11.3 but still broken.
The issue is specific to my hardware layout; tested on newer hardware, there is no issue, see system specs below. Also remember, I moved 2 of the SATA controllers from production to the newer-hardware testbed and used the same drives throughout all tests on both machines. I would guess the issue could probably be replicated with the same CPU / chipset as my problem machine.
I've proven the issue can be predictably replicated just by upgrading the software from OMV 3 w/snapraid 11.1 to OMV 4 or 5 w/snapraid 11.3. OMV 3 w/snapraid 11.1 works with no errors; I've put my production data back on that for now.
Ok, now I'm stumped, because I thought I would reproduce the same errors on another, almost identical board and CPU I dug up, but I do not. This would make me jump to a possible hardware issue that was not revealed in any stress test, but it still doesn't explain why snapraid runs perfectly clean on OMV 3 w/snapraid 11.1, and has for years, until I upgraded.
Again, this is what I'm seeing on the problem machine:
OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
Fresh install of OMV 4 or 5 with snapraid plugin (v11.3) = thousands of errors with just 2 test files.
OMV 4 or 5 with snapraid 11.1 (manual snapraid install, no plugin) = few errors, much better than v11.3 but still broken. snapraid 11.1 only works if paired with OMV 3.
I mentioned I dug up a nearly identical Gigabyte P45 chipset board and Core 2 Quad CPU. The only real differences between the motherboards:
The problem machine has 2x PCIe 2.0 slots and 2x gigabit Realtek r8169 NICs.
The test board only has 1 PCIe 2.0 slot and 1 onboard NIC.
The CPUs are identical.
Both boards are running the latest BIOS. The BIOSes are basically identical and configured the same.
The same 3 drives were used for all testing.
Also tested the RAM from the problem machine in the testbed, without snapraid errors.
You would think at this point that this sounds like bad hardware on the problem machine, narrowed down to the CPU or board (I removed all external SATA controllers for the latest tests). But remember: OMV 3 / snapraid 11.1 scrubs the entire 12TB of data without errors on the problem machine, while almost every block is an error when running OMV 4 or 5 w/snapraid 11.3 with just 2 small test files.
Let's just say I'm done with this hardware. I've since moved my data over to a Dell R320 (ECC RAM) with an LSI SAS9207-8e HBA and an HP MSA60 storage shelf. All works fine. I will be testing a 24-bay NetApp storage shelf with a 6Gbps Dell controller once I get the parts. This is much better than my previous setup.
Problem machine specs:
Motherboard - GA-EP45-UD3P rev 1.6 (P45 Express chipset, ICH10R)
CPU - Core 2 Quad Q9550
2nd test machine specs:
Motherboard - GA-EP45-UD3R rev 1.1 (P45 Express chipset, ICH10R)
CPU - Core 2 Quad Q9550
With OMV 4/5 you change much more than only SnapRAID, e.g. the Linux kernel, all programs, the environment...
Maybe one of these changes is the problem (most likely the kernel).
Have you tested OMV 3 with SnapRAID 11.3?