
snapraid sync results in multiple 'Data error in file', SMART stats good, memtest86+ and prime95 tests clean

Help
Nathan
2020-04-22
2020-05-03
  • Nathan

    Nathan - 2020-04-22

    'snapraid sync' - 'Data error in file', memtest86+ and prime95 tests clean, SMART stats good. I recently upgraded the 2 parity drives to 8TB ESATA drives. I copied the existing parity files and updated the snapraid config, which seemed to work fine. But then a scrub showed 'Data error in parity' errors in the parity files at a bunch of different block locations. I had not paid much attention before and am not sure if I ever had any errors before the move of the parity files to the new drives. I decided to delete all the content and parity files and run 'snapraid sync -h' to rebuild everything from scratch. I received errors for around 30 files of 'Data change at file', 'WARNING! Unexpected data modification of a file without parity!' during the initial sync, which took almost 2 days on about 12TB of data (9 mixed data drives in mergerfs pool). Most errors were for large 30GB+ video files that would not have been touched / altered during the sync process. Then ran 'snapraid sync' after the initial 'snapraid sync -h' and again received 'Data change at file', 'WARNING! Unexpected data modification of a file without parity!' for all the same files as before, plus now 'Data error in file' for some additional files that didn't have errors on the first run. None of the files in question are duplicates; they do not exist anywhere else on the snapraid volume. SMART stats look good for all drives. I have 4x 2 port PCIE ASM1062 sata controller cards in the system, but the errors are for files mixed across different drives connected to both the onboard and add-on controller cards. I have just run an overnight 17hr memtest86+ of 8 full passes with no errors. I've just run the prime95 small FFTs test for over 4 hrs, from a bootable Hirens USB stick, and CPU core temps stay below 57°C. I'm now running a prime95 blend test. I don't expect to see any issues. I've been running openmediavault on this system for years with no stability issues whatsoever.
The system will run for months until I have a reason to reboot it manually. I've never had an issue accessing any of my data. Not sure where to go from here. As I mentioned, I did not pay close enough attention to snapraid in the past, until now that I'm moving more critical data to it. In the past, snapraid would run sync and scheduled scrub jobs that were set in the OMV gui. I don't recall seeing errors whenever running a manual sync from the gui in the past, but I can't be 100% sure. I would occasionally look at the snapraid logs through the gui. I have also recovered a full drive and individual files with snapraid in the past without issue. I've just recently been educating myself more on how snapraid works and how to use it effectively. I'm now running the snapraid commands from the shell, which is why I am aware of the issue, but I'm not sure where to go from here, as I cannot get a clean sync, and I cannot find a hardware issue.

    samples
    snapraid sync -h

    Data change at file '/srv/dev-disk-by-label-2TBblack01/Storage/Backup/8700k_full_b1_s1_v2.tib' at position '286007'
    WARNING! Unexpected data modification of a file without parity!
    

    snapraid sync

    Data change at file '/srv/dev-disk-by-label-2TBblack01/Storage/Backup/8700k_full_b1_s1_v2.tib' at position '286007'
    WARNING! Unexpected data modification of a file without parity!
    Try removing the file from the array and rerun the 'sync' command!
    
    Data error in file '/srv/dev-disk-by-label-1TBblack02/Storage/path obscured.mp4' at position '2734', diff bits 65/128
    

    snapraid.conf

    # this file was automatically generated from
    # openmediavault Arrakis 4.1.35-1
    # and 'openmediavault-snapraid' 3.7.7
    
    block_size 256
    autosave 0
    #####################################################################
    # OMV-Name: 1TBblack02  Drive Label: 1TBblack02
    content /srv/dev-disk-by-label-1TBblack02/snapraid.content
    disk 1TBblack02 /srv/dev-disk-by-label-1TBblack02
    
    #####################################################################
    # OMV-Name: WD2TBblackFAEX01  Drive Label: WD2TBblackFAEX01
    content /srv/dev-disk-by-label-WD2TBblackFAEX01/snapraid.content
    disk WD2TBblackFAEX01 /srv/dev-disk-by-label-WD2TBblackFAEX01
    
    #####################################################################
    # OMV-Name: 2TBblack01  Drive Label: 2TBblack01
    content /srv/dev-disk-by-label-2TBblack01/snapraid.content
    disk 2TBblack01 /srv/dev-disk-by-label-2TBblack01
    
    #####################################################################
    # OMV-Name: 1TBblack01  Drive Label: 1TBblack01
    content /srv/dev-disk-by-label-1TBblack01/snapraid.content
    disk 1TBblack01 /srv/dev-disk-by-label-1TBblack01
    
    #####################################################################
    # OMV-Name: 2TBgreen01  Drive Label: 2TBgreen01
    disk 2TBgreen01 /srv/dev-disk-by-label-2TBgreen01
    
    #####################################################################
    # OMV-Name: 3TBred01  Drive Label: 3TBred01
    disk 3TBred01 /srv/dev-disk-by-label-3TBred01
    
    #####################################################################
    # OMV-Name: 4TBPurple  Drive Label: 4TBPurple
    content /srv/dev-disk-by-label-4TBPurple/snapraid.content
    disk 4TBPurple /srv/dev-disk-by-label-4TBPurple
    
    #####################################################################
    # OMV-Name: 6TBHitachi  Drive Label: 6TBHitachi
    disk 6TBHitachi /srv/dev-disk-by-label-6TBHitachi
    
    #####################################################################
    # OMV-Name: 2TBblack03  Drive Label: 2TBblack03
    disk 2TBblack03 /srv/dev-disk-by-label-2TBblack03
    
    #####################################################################
    # OMV-Name: 8TBParity01  Drive Label: 8TBParity01
    parity /srv/dev-disk-by-label-8TBParity01/snapraid.parity
    
    #####################################################################
    # OMV-Name: 8TBParity02  Drive Label: 8TBParity02
    2-parity /srv/dev-disk-by-label-8TBParity02/snapraid.2-parity
    
    exclude /snapraid.conf*
    exclude *.unrecoverable
    exclude lost+found/
    exclude aquota.user
    exclude aquota.group
    exclude /tmp/
    exclude .content
    exclude *.bak
    

    System specs:
    Openmediavault 5.3.10-1
    Snapraid plugin 3.7.7 (snapraid V11.3)
    Motherboard: Gigabyte GA-EP45-UD3P
    CPU: Core2quad q9550
    Memory: OCZ2N800SR4GK 4x2GB sticks
    PCIE Sata Cards: 4x startech PEXESAT3221 2 port sata cards with ASM1062 controller
    Parity Drives: 2x Fantom GF3B8000EU (8TB drives connected via ESATA)
    Data Drives: 9 drives mixed 1TB, 2TB, 4TB, 6TB... mostly all WD, one Hitachi, can provide full details if needed.

     

    Last edit: Nathan 2020-04-26
  • Nathan

    Nathan - 2020-04-24

    Ran extended SMART tests on all drives last night, all passed.
    Now getting many 'Data error in file' messages when I run a sync. I got over 147 in about a minute, so I just cancelled the sync. No data has changed, not sure what is going on. I have not found a single hardware issue. See below output.

    snapraid diff
    Loading state from /srv/dev-disk-by-label-1TBblack01/snapraid.content...
    Mismatching CRC in '/srv/dev-disk-by-label-1TBblack01/snapraid.content'
    This content file is damaged! Use an alternate copy.

    I renamed the bad content file
    mv snapraid.content damaged-snapraid.content
    snapraid diff now ran successfully

    snapraid sync
    Loading state from /srv/dev-disk-by-label-1TBblack01/snapraid.content...
    WARNING! Content file '/srv/dev-disk-by-label-1TBblack01/snapraid.content' not found, trying with another copy...
    Loading state from /srv/dev-disk-by-label-2TBblack01/snapraid.content...
    Mismatching CRC in '/srv/dev-disk-by-label-2TBblack01/snapraid.content'
    This content file is damaged! Use an alternate copy.

    renamed bad content file again
    mv snapraid.content damaged-snapraid.content

    snapraid sync
    Self test...
    Loading state from /srv/dev-disk-by-label-1TBblack01/snapraid.content...
    WARNING! Content file '/srv/dev-disk-by-label-1TBblack01/snapraid.content' not found, trying with another copy...
    Loading state from /srv/dev-disk-by-label-2TBblack01/snapraid.content...
    WARNING! Content file '/srv/dev-disk-by-label-2TBblack01/snapraid.content' not found, trying with another copy...
    Loading state from /srv/dev-disk-by-label-4TBPurple/snapraid.content...
    Scanning disk 1TBblack01...
    Scanning disk 1TBblack02...
    Scanning disk 2TBblack01...
    Scanning disk 2TBblack03...
    Scanning disk 2TBgreen01...
    Scanning disk 3TBred01...
    Scanning disk 4TBPurple...
    Scanning disk 6TBHitachi...
    Scanning disk WD2TBblackFAEX01...
    Using 945 MiB of memory for the file-system.
    Initializing...
    Resizing...
    Saving state to /srv/dev-disk-by-label-1TBblack01/snapraid.content...
    Saving state to /srv/dev-disk-by-label-2TBblack01/snapraid.content...
    Saving state to /srv/dev-disk-by-label-4TBPurple/snapraid.content...
    Saving state to /srv/dev-disk-by-label-6TBHitachi/snapraid.content...
    Saving state to /srv/dev-disk-by-label-WD2TBblackFAEX01/snapraid.content...
    Verifying /srv/dev-disk-by-label-1TBblack01/snapraid.content...
    Verifying /srv/dev-disk-by-label-2TBblack01/snapraid.content...
    Verifying /srv/dev-disk-by-label-4TBPurple/snapraid.content...
    Verifying /srv/dev-disk-by-label-6TBHitachi/snapraid.content...
    Verifying /srv/dev-disk-by-label-WD2TBblackFAEX01/snapraid.content...
    Syncing...
    Using 88 MiB of memory for 32 cached blocks.
    Data error in file '/srv/dev-disk-by-label-4TBPurple/Storage/Pictures/Save/13967975.JPEG' at position '0', diff bits 57/128
    Data error in file '/srv/dev-disk-by-label-2TBblack01/VirtualMachines/Debian x64/Debian 7 64-bit.vmdk' at position '548', diff bits 71/128
    Data error in file '/srv/dev-disk-by-label-2TBblack01/VirtualMachines/Debian x64/Debian 7 64-bit.vmdk' at position '570', diff bits 63/128
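Since two of the content copies above failed their CRC check, it can help to see at a glance which copies exist and whether they differ. A minimal sketch, assuming GNU coreutils `sha256sum`; `content_files` and `check_content_copies` are made-up helper names, not snapraid commands. Treating identical checksums as the healthy case is an assumption of this sketch, not something this thread confirms.

```shell
# Print the path given on every "content" line of a snapraid.conf.
# Paths containing spaces are not handled by this simple sketch.
content_files() {
    awk '$1 == "content" { print $2 }' "$1"
}

# Checksum each content copy so differing or missing copies stand out.
check_content_copies() {
    content_files "$1" | while IFS= read -r f; do
        if [ -f "$f" ]; then
            sha256sum "$f"
        else
            echo "MISSING  $f"
        fi
    done
}
```

Usage would be something like `check_content_copies /etc/snapraid.conf`, where the config path is whatever your installation actually uses.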
    
     
  • Nathan

    Nathan - 2020-04-24

    Resolved, I think. The issue is that running snapraid sync -h for the first run causes tons of errors which remain permanently in the content / parity files. Running snapraid sync runs clean. More details here: OK, so I renamed my content files and parity files so I could set up a test snapraid configuration that wouldn't take so long to sync. I set up 2 data disks and 1 parity disk, all containing the content file since you need a minimum of 3 content files. I only put 1 video file on each data disk. Ran snapraid sync -h, same result but worse: tons more errors, over 900 'Data change at file', 'WARNING! Unexpected data modification of a file without parity!'. Then ran snapraid sync, tons of 'Data error in file' again. So I wiped those content and parity files again, and this time ran snapraid sync for the first run instead of with -h. It ran clean! Then snapraid diff showed all files equal as expected. snapraid sync again resulted in nothing to do and was instantaneous as expected. I've been pulling my hair out for about a week over this and the whole issue was because I was running snapraid sync -h! I thought that was the best way to do it, by hashing everything twice? Why does it throw so many errors? If whoever is reading this didn't read through all my previous posts: I ran a full 8 passes with memtest86+, over 8 hours total of prime95 small FFTs and blend tests, and extended SMART tests on all drives, and found no issues with anything, plus the system has been rock solid stable for years. I'll know if it is fully resolved once I get a successful sync on all my real data.

     

    Last edit: Nathan 2020-04-24
  • Nathan

    Nathan - 2020-04-26

    Issue not resolved. I'm assuming the only reason snapraid sync works is that there is no verification of the hashes when writing parity. As soon as I run snapraid scrub -p new, tons of errors show up. I believe this is an issue with snapraid, not the hardware, as all my stress tests have shown there is nothing wrong; look at my previous posts for details. If anything, I'm more convinced of the stability of the system at this point. As I stated, the data in question can be copied and accessed without any issue. Can someone shed some light on this issue?

     
  • Leifi Plomeros

    Leifi Plomeros - 2020-04-26

    Looks very much like a workload related hardware issue.

    When you do an initial sync without -h you see no problem until you scrub.

    The reason for this is that snapraid has never seen these files before and therefore has no hashes to compare them to.

    But when you later scrub snapraid discovers that some files are different compared to the hashes from the initial sync.

    When you do sync -h all data is read twice and snapraid finds that some files are different when read the second time compared to the first time.
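The read-twice comparison described above can be approximated outside snapraid. A minimal sketch, assuming GNU coreutils `sha256sum`; `read_twice_check` is a made-up helper name. The second read may be served from the page cache unless the cache is dropped between reads, so a clean result here does not prove the disk and controller path is healthy.

```shell
# Hash a file twice and report whether both reads returned identical data.
# read_twice_check is a hypothetical helper, not part of snapraid.
read_twice_check() {
    file=$1
    first=$(sha256sum < "$file" | cut -d' ' -f1)
    # On a real test, drop the page cache here (root only), otherwise the
    # second read may come from RAM instead of the disk/controller:
    #   sync; echo 3 > /proc/sys/vm/drop_caches
    second=$(sha256sum < "$file" | cut -d' ' -f1)
    if [ "$first" = "$second" ]; then
        echo "OK $file"
    else
        echo "MISMATCH $file"
        return 1
    fi
}
```

Running it in a loop over one of the large video files would show whether the mismatches also appear without snapraid in the picture.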

    How likely does it seem that you have encountered multiple bugs related to snapraid's core functionality, not found by anyone else, but still easily reproduced?

    I can pretty much guarantee that if you do this:

    1. Delete the content and parity files
    2. Run snapraid sync -h until you encounter the first error.
    3. Delete the content and parity files
    4. Run snapraid sync -h until you encounter the first error.

    Then you will find that the errors in 2 and 4 are different and that you have proven that it is random errors which cannot be explained by bugs in snapraid.
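The comparison of steps 2 and 4 can be done mechanically if the output of each run is saved to a file. A rough sketch, assuming the log lines look like the 'Data change at file' and 'Data error in file' samples quoted in this thread; `extract_errors` and `compare_runs` are made-up helper names.

```shell
# Print one "path @position" line per error found in a saved snapraid log.
extract_errors() {
    LC_ALL=C sed -E -n "s/^ *Data (change at|error in) file '(.*)' at position '([0-9]+)'.*/\2 @\3/p" "$1" | LC_ALL=C sort -u
}

# Count how many error locations are unique to each run or shared by both.
compare_runs() {
    run1=$(mktemp)
    run2=$(mktemp)
    extract_errors "$1" > "$run1"
    extract_errors "$2" > "$run2"
    echo "only in run 1: $(LC_ALL=C comm -23 "$run1" "$run2" | wc -l | tr -d ' ')"
    echo "only in run 2: $(LC_ALL=C comm -13 "$run1" "$run2" | wc -l | tr -d ' ')"
    echo "in both runs: $(LC_ALL=C comm -12 "$run1" "$run2" | wc -l | tr -d ' ')"
    rm -f "$run1" "$run2"
}
```

For example, save each run with `snapraid sync -h > run1.log 2>&1` and then call `compare_runs run1.log run2.log`. Mostly disjoint error locations would point toward random corruption rather than a deterministic bug.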

    My main suspect would be the SATA controllers.

    It could also be a motherboard problem, over-heating problem or unstable power supply related to the combined workload.

    Since you have not had this problem in the past, a good starting point would be to think back what hardware changes / additions you made before the problems started.

    It would also be nice if you tried to use the enter-key once in a while. It would make reading your posts much easier.

     
  • Nathan

    Nathan - 2020-04-26

    Thanks for the reply. I've since set up a test machine of newer hardware and brought over 2 of the sata controllers from the problem machine. snapraid sync -h worked perfectly on the test machine, no errors when comparing hashes.

    On Problem Machine

    1. Installed OMV 5 from scratch, with the snapraid plugin, on the problem machine.
    2. Moved same drives used in snapraid test machine over to problem machine.
    3. Deleted content and parity files, ran snapraid sync -h returned errors for almost every block of both video files in the array.
    4. Ran a second run, deleted content and parity files before second run.

    2245 errors first run, 1750 errors second run.
    The errors don't align exactly between runs, but I find it odd: with this many errors, I would expect to have other symptoms outside of snapraid, or for something to come up in the stress tests with memtest86+ and prime95.
    I've attached output of both snapraid sync -h runs.

     
  • Leifi Plomeros

    Leifi Plomeros - 2020-04-26

    I think this part of the FAQ is close enough even if the symptom is different:
    http://www.snapraid.it/faq#panic

     
  • Nathan

    Nathan - 2020-04-27

    Did some more testing.
    On Problem Machine:
    Reverted back to what I ran for years as a test.

    1. Fresh install of OpenMediaVault 3.0.94, updated to 3.0.99 in update manager
    2. Installed snapraid plugin 3.7.3 (which is snapraid v11.1)

    snapraid sync -h runs with no errors
    snapraid scrub -p new runs with no errors, all hashes verified successfully.


    1. Fresh install of OpenMediaVault 4.1.35-1, with snapraid plugin v3.7.7 (snapraid 11.3)
      snapraid sync -h tons of errors again.

    The content and parity files were deleted between tests, data remained the same.

    Is it likely the latest version of snapraid 11.3 is not working with this hardware?
    It seems I cannot run the latest version on this hardware, but this type of application is what the old hardware is good for...
    Let me know what I can do to provide any debugging info to help get to the bottom of this.

    Other notes:
    Plugin 3.7.3 (snapraid 11.1) was the latest snapraid plugin available for OMV v3. I tried to manually upgrade snapraid to 11.3 while still on OMV 3, but I would get the following.

    root@OMVtest3:/home/snapraid-11.3# ./configure
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    checking for a thread-safe mkdir -p... /bin/mkdir -p
    checking for gawk... no
    checking for mawk... mawk
    checking whether make sets $(MAKE)... no
    checking whether make supports nested variables... no
    checking build system type... x86_64-unknown-linux-gnu
    checking host system type... x86_64-unknown-linux-gnu
    checking for gcc... no
    checking for cc... no
    checking for cl.exe... no
    configure: error: in `/home/snapraid-11.3':
    configure: error: no acceptable C compiler found in $PATH
    See `config.log' for more details
    

    Then tried

    apt-get install build-essential
    Package build-essential is not available, but is referred to by another package.
    This may mean that the package is missing, has been obsoleted, or
    is only available from another source
    
    E: Package 'build-essential' has no installation candidate
    

    I'm no linux expert, wasn't sure what else to do, so I just went to OMV v4 to continue the test with snapraid 11.3.

     

    Last edit: Nathan 2020-04-27
  • Nathan

    Nathan - 2020-04-27

    Note - Running OMV 3.0.99 with snapraid plugin 3.7.3 (snapraid v11.1) results in no errors.
    OMV 4.1.35-1, snapraid plugin 3.7.7 (v11.3) = thousands of errors

    My latest test:
    OMV 5.4.2-1, no OMV snapraid plugin, manual install of snapraid 11.1 - results in a few errors
    I was hoping 11.1 would work on OMV 5 for now.

    see output:
    snapraid sync -h

    root@openmediavault:/srv/dev-disk-by-label-500GBParity# snapraid sync -h
    Self test...
    Loading state from /srv/dev-disk-by-label-250GB/snapraid.content...
    WARNING! Content file '/srv/dev-disk-by-label-250GB/snapraid.content' not found, trying with another copy...
    Loading state from /srv/dev-disk-by-label-500GB/snapraid.content...
    WARNING! Content file '/srv/dev-disk-by-label-500GB/snapraid.content' not found, trying with another copy...
    Loading state from /srv/dev-disk-by-label-500GBParity/snapraid.content...
    No content file found. Assuming empty.
    Scanning disk 250GB...
    Scanning disk 500GB...
    Using 0 MiB of memory for the FileSystem.
    Initializing...
    Hashing...
    100% completed, 5449 MB accessed in 0:00
    Everything OK
    Resizing...
    Saving state to /srv/dev-disk-by-label-250GB/snapraid.content...
    Saving state to /srv/dev-disk-by-label-500GB/snapraid.content...
    Saving state to /srv/dev-disk-by-label-500GBParity/snapraid.content...
    Verifying /srv/dev-disk-by-label-250GB/snapraid.content...
    Verifying /srv/dev-disk-by-label-500GB/snapraid.content...
    Verifying /srv/dev-disk-by-label-500GBParity/snapraid.content...
    Syncing...
    Using 24 MiB of memory for 32 blocks of IO cache.
    Data change at file '/srv/dev-disk-by-label-250GB/test/Video-01.mp4' at position '6129'
    WARNING! Unexpected data modification of a file without parity!
    Try removing the file from the array and rerun the 'sync' command!
    Data change at file '/srv/dev-disk-by-label-250GB/test/Video-01.mp4' at position '10177'
    WARNING! Unexpected data modification of a file without parity!
    Try removing the file from the array and rerun the 'sync' command!
    Data change at file '/srv/dev-disk-by-label-250GB/test/Video-01.mp4' at position '13657'
    WARNING! Unexpected data modification of a file without parity!
    Try removing the file from the array and rerun the 'sync' command!
    100% completed, 5449 MB accessed in 0:00
    
      250GB 31% | ******************
      500GB 13% | ********
     parity 48% | *****************************
       raid  2% | *
       hash  4% | **
      sched  0% |
       misc  0% |
                |______________________________________________________________
                               wait time (total, less is better)
    
    
           3 file errors
           0 io errors
           0 data errors
    WARNING! Unexpected file errors!
    Saving state to /srv/dev-disk-by-label-250GB/snapraid.content...
    Saving state to /srv/dev-disk-by-label-500GB/snapraid.content...
    Saving state to /srv/dev-disk-by-label-500GBParity/snapraid.content...
    Verifying /srv/dev-disk-by-label-250GB/snapraid.content...
    Verifying /srv/dev-disk-by-label-500GB/snapraid.content...
    Verifying /srv/dev-disk-by-label-500GBParity/snapraid.content...
    root@openmediavault:/srv/dev-disk-by-label-500GBParity#
    

    snapraid scrub -p new

    root@openmediavault:/srv/dev-disk-by-label-500GBParity#
    root@openmediavault:/srv/dev-disk-by-label-500GBParity# snapraid scrub -p new
    Self test...
    Loading state from /srv/dev-disk-by-label-250GB/snapraid.content...
    Using 0 MiB of memory for the FileSystem.
    Initializing...
    Scrubbing...
    Using 32 MiB of memory for 32 blocks of IO cache.
    Data error in parity 'parity' at position '1707', diff bits 1/2097152
    100% completed, 5448 MB accessed in 0:00
    
      250GB  0% |
      500GB  0% |
     parity 86% | ****************************************************
       raid  6% | ****
       hash  5% | ***
      sched  0% |
       misc  0% |
                |______________________________________________________________
                               wait time (total, less is better)
    
    
           0 file errors
           0 io errors
           1 data errors
    DANGER! Unexpected data errors! The failing blocks are now marked as bad!
    Use 'snapraid status' to list the bad blocks.
    Use 'snapraid -e fix' to recover.
    Saving state to /srv/dev-disk-by-label-250GB/snapraid.content...
    Saving state to /srv/dev-disk-by-label-500GB/snapraid.content...
    Saving state to /srv/dev-disk-by-label-500GBParity/snapraid.content...
    Verifying /srv/dev-disk-by-label-250GB/snapraid.content...
    Verifying /srv/dev-disk-by-label-500GB/snapraid.content...
    Verifying /srv/dev-disk-by-label-500GBParity/snapraid.content...
    root@openmediavault:/srv/dev-disk-by-label-500GBParity#
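A note on reading the 'diff bits' figures in these outputs. The file errors report a denominator of 128, which matches a 128-bit hash; roughly half the bits differing is what any change in the data would produce, so it says little about how much data changed. The parity error reports 1/2097152, and 2097152 is the number of bits in a 256 KiB block, so a single bit differed in that parity block. The arithmetic, as a small sketch: the 256 KiB block size is taken from the block_size line in the production config earlier in the thread and inferred for this test array, and treating 'position' as a count of whole blocks is an assumption.

```shell
# Bits in one block for a block_size given in KiB (as in snapraid.conf).
bits_per_block() {
    echo $(( $1 * 1024 * 8 ))
}

# Byte offset of a block position, assuming position counts whole blocks.
block_offset() {
    echo $(( $1 * $2 * 1024 ))
}

bits_per_block 256        # prints 2097152
block_offset 1707 256     # prints 447479808
```

Under those assumptions, the flipped bit sits somewhere in the 256 KiB of parity starting about 427 MiB into the parity file.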
    
     
  • Nathan

    Nathan - 2020-04-27

    Upgrading is what broke it. See previous post for test output.
    OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
    A combination of OMV 4 or 5 with snapraid 11.3 = thousands of errors with just 2 test files.
    OMV 4 or 5 with snapraid 11.1 = few errors, much better than v11.3 but still broken.

    test was snapraid sync -h followed by snapraid scrub -p new
    2 video files (1 on each data drive)
    content and parity files deleted between test runs, data untouched between tests
    Anyone know what might be going on?
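For anyone wanting to repeat the test procedure above on their own hardware, it could be scripted roughly as follows. This is only a sketch for a throwaway test array, since it deletes the content and parity files named in the config. `array_state_files` and `run_trial` are made-up helper names, and the `SNAPRAID` variable is an illustrative override for dry runs; the snapraid options used (`-c`, `sync -h`, `scrub -p new`) are the standard ones.

```shell
# Command to run; override (e.g. SNAPRAID=echo) for a dry run.
SNAPRAID=${SNAPRAID:-snapraid}

# List the content and parity files named in a snapraid.conf.
# Simple sketch: paths with spaces or split parity lists are not handled.
array_state_files() {
    awk '$1 == "content" || $1 ~ /^([2-6]-)?parity$/ { print $2 }' "$1"
}

# One trial on a THROWAWAY test array: wipe state, sync -h, scrub new blocks.
run_trial() {
    conf=$1
    tag=$2
    array_state_files "$conf" | while IFS= read -r f; do
        rm -f -- "$f"
    done
    "$SNAPRAID" -c "$conf" sync -h > "sync-$tag.log" 2>&1
    "$SNAPRAID" -c "$conf" scrub -p new > "scrub-$tag.log" 2>&1
    echo "trial $tag: logs in sync-$tag.log and scrub-$tag.log"
}
```

Calling `run_trial test.conf a` and then `run_trial test.conf b` leaves two pairs of logs that can be compared between runs or between OS versions.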

    System specs:
    Openmediavault 5.3.10-1
    Snapraid plugin 3.7.7 (snapraid V11.3)
    Motherboard: Gigabyte GA-EP45-UD3P
    CPU: Core2quad q9550
    Memory: OCZ2N800SR4GK 4x2GB sticks
    PCIE Sata Cards: 4x startech PEXESAT3221 2 port sata cards with ASM1062 controller
    Power Supply: OCZ750FTY - Fatal1ty Gaming Series 750 Watt 80+ Bronze
    Parity Drives: 2x Fantom GF3B8000EU (8TB drives connected via ESATA)
    Data Drives: 9 drives mixed 1TB, 2TB, 4TB, 6TB... mostly all WD, one Hitachi, can provide full details if needed.

     
  • Nathan

    Nathan - 2020-04-27

    I think I've proved this is a software / hardware compatibility issue and not a problem with the actual hardware. See previous post with test details.
    OMV 3 (based on Debian Jessie) with snapraid 11.1 works with NO ERRORS, Upgrading causes thousands of errors

    I just synced my entire array of 12TB of data again, then scrubbed 5% of the array with no errors.
    If I was on a newer version of OMV with snapraid 11.3, I would have seen tens of thousands of errors scrubbing 5%.

     
  • Leifi Plomeros

    Leifi Plomeros - 2020-04-28

    Yes, I agree.

    OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
    A combination of OMV 4 or 5 with snapraid 11.3 = thousands of errors with just 2 test files.
    OMV 4 or 5 with snapraid 11.1 = few errors, much better than v11.3 but still broken.

    On the surface this is what it looks like:
    Problem with OMV 4/5 being observable when using snapraid 11.1 and much more frequent when using snapraid 11.3.

    Logically it could of course be the other way around but it seems more intuitive that snapraid would be the trigger.

    I guess Andrea would be the best person to figure out if it is possible to rule out snapraid as root cause or not.

    He usually keeps an eye on the forum so perhaps he can comment. But personally I would try to get some feedback on the issue in the OMV forum.

     
  • Walter Tuppa

    Walter Tuppa - 2020-04-28

    did you do some memory tests?
    SnapRAID puts quite a lot of stress on memory and disks (DMA). And newer versions of the Linux kernel may improve performance (disk/memory), which could trigger the problem.

     

    Last edit: Walter Tuppa 2020-04-28
  • Nathan

    Nathan - 2020-04-28

    I know there is a lot to read through in previous posts, that was the first thing I did.
    17hr, 8 full pass test with memtest86+ - no errors
    prime95 small FFTs 4hrs, prime95 blend test 4hrs - no errors

    OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
    A combination of OMV 4 or 5 with snapraid 11.3 = thousands of errors with just 2 test files.
    OMV 4 or 5 with snapraid 11.1 = few errors, much better than v11.3 but still broken.

    Issue is specific to my hardware layout, tested on newer hardware and there is no issue, see system specs below. Also remember, I moved 2 of the sata controllers from production to the newer hardware testbed and used the same drives throughout all tests on both machines. I would guess the issue could probably be replicated with same CPU / chipset as my problem machine.

    I've proven the issue can be predictably replicated just by upgrading the software from OMV 3 w/snapraid 11.1 to OMV 4 or 5 w/snapraid 11.3
    OMV 3 w/snapraid 11.1 works with no errors, I've put my production data back on that for now.

    System specs:
    Openmediavault 5.3.10-1
    Snapraid plugin 3.7.7 (snapraid V11.3)
    Motherboard: Gigabyte GA-EP45-UD3P
    CPU: Core2quad q9550
    Memory: OCZ2N800SR4GK 4x2GB sticks
    PCIE Sata Cards: 4x startech PEXESAT3221 2 port sata cards with ASM1062 controller
    Power Supply: OCZ750FTY - Fatal1ty Gaming Series 750 Watt 80+ Bronze
    Parity Drives: 2x Fantom GF3B8000EU (8TB drives connected via ESATA)
    Data Drives: 9 drives mixed 1TB, 2TB, 4TB, 6TB... mostly all WD, one Hitachi, can provide full details if needed.

     
  • Nathan

    Nathan - 2020-05-03

    Ok, now I'm stumped because I thought I would reproduce the same errors on another, almost identical board and cpu I dug up, but I do not. This would make me jump to a possible hardware issue that was not revealed in any stress tests, but it still doesn't explain why snapraid runs perfectly clean in OMV3 w/snapraid 11.1, and has for years until I upgraded.

    Again this is what I'm seeing on the problem machine.
    OMV 3 with snapraid 11.1 = NO ERRORS (tested multiple times to make sure)
    Fresh install of OMV 4 or 5 with snapraid plugin (v11.3) = thousands of errors with just 2 test files.
    OMV 4 or 5 with snapraid 11.1(manual snapraid install, no plugin) = few errors, much better than v11.3 but still broken.
    snapraid 11.1 only works if paired with OMV3.

    I mentioned I dug up a nearly identical Gigabyte P45 chipset board and core2quad cpu. The only real difference between the motherboards is

    1. The problem machine has 2x PCIE 2.0 slots and 2x 1Gb Realtek r8169 NICs.
    2. The test board only has 1 PCIE 2.0 slot and 1 onboard NIC.
    3. The CPUs are identical.
    4. Both boards are running the latest BIOS. The BIOSes are basically identical and configured the same.
    5. Same 3 drives used for all testing.
    6. Also tested RAM from problem machine in testbed without snapraid errors.

    You would think at this point this sounds like bad hardware on the problem machine, narrowed down to CPU or board, (removed all external SATA controllers for latest tests), but remember, OMV3 / snapraid 11.1 scrubs entire 12TB of data without errors on the problem machine. Almost every block is an error when running OMV4 or 5 w/snapraid 11.3 using just 2 small test files.

    Let's just say I'm done with the hardware. I've since moved my data over to a dell R320 (ECC RAM) with an LSI SAS9207-8e HBA adapter, HP MSA60 storage shelf. All works fine. Will be testing 24 bay netapp storage shelf with 6Gbps dell controller once I get the parts. This is much better than my previous setup.

    Problem machine specs:
    Motherboard - GA-EP45-UD3P rev 1.6 (P45 express chipset, ICH10R)
    CPU - Core2quad q9550

    2nd test machine specs:
    Motherboard - GA-EP45-UD3R rev 1.1 (P45 express chipset, ICH10R)
    CPU - Core2quad q9550

     
  • Walter Tuppa

    Walter Tuppa - 2020-05-03

    With OMV4/5 you change much more than only SnapRAID, e.g. Linux kernel, all programs, environment, ...
    Maybe one of these changes is the problem (most likely the kernel).
    Have you tested OMV3 with SnapRAID 11.3?
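Because the kernel changes along with the OMV release, it may help to record both the kernel and the snapraid version next to each test result. A small sketch; `record_env` is a made-up helper name.

```shell
# Print the kernel release and, if installed, the snapraid version.
record_env() {
    echo "kernel: $(uname -r)"
    if command -v snapraid >/dev/null 2>&1; then
        echo "snapraid: $(snapraid --version 2>&1 | head -n 1)"
    else
        echo "snapraid: not found in PATH"
    fi
}
```

Appending its output to each saved log (for example `record_env >> sync-a.log`, with a log name of your choosing) makes it easier to line up error counts against kernel versions later.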

     
