
Unexpected file errors during scrub, but no blocks marked bad

2026-02-21 to 2026-02-23
  • Mitchell Deoudes

    Apologies if this is a stupid question, but I've been using snapraid forever with almost no errors. Mostly I've seen the Excel-touches-file-contents-without-updating-the-timestamp problem, and occasionally the zero-file-size problem for things like mailboxes that get emptied. So, practically no genuine file errors.

    I have an overnight script that runs a sync and then a scrub of 8% of the array. Last night it gave the following "WARNING! Unexpected file errors!" message, but snapraid didn't report which specific files or blocks were affected.

    I know that snapraid can sometimes report errors if the server is under load while it's scrubbing. In this case, I'm 99% sure that no files in the array were being altered during the scrub, and the load was probably no higher than usual at 7am.

    Meanwhile, "status" reports "no error detected", and "diff" reports "no differences". I'm hesitant to run "fix" without knowing which files snapraid thinks it's going to fix, though I suspect that "fix" will also report that there are no errors.

    Is there any way to get additional insight into the unexpected-file-error warning? Or should I just assume it was a transient read glitch during the scrub, run "fix", and then forget about it if there's nothing to be fixed?

    2026-02-20 07:41:46,946 [RESULT] 
    2026-02-20 07:41:46,947 [RESULT]        d2  9% | *****
    2026-02-20 07:41:46,952 [OUTERR] WARNING! Unexpected file errors!
    2026-02-20 07:41:46,952 [RESULT]        d3  5% | ***
    2026-02-20 07:41:46,954 [RESULT]        d4  9% | *****
    2026-02-20 07:41:46,954 [RESULT]    parity  0% |
    2026-02-20 07:41:46,954 [RESULT]  2-parity  0% |
    2026-02-20 07:41:46,956 [RESULT]      raid 38% | **********************
    2026-02-20 07:41:46,956 [RESULT]      hash 23% | *************
    2026-02-20 07:41:46,956 [RESULT]     sched 12% | *******
    2026-02-20 07:41:46,956 [RESULT]      misc  0% |
    2026-02-20 07:41:46,956 [RESULT]               |____________________________________________________________
    2026-02-20 07:41:46,956 [RESULT]                             wait time (total, less is better)
    2026-02-20 07:41:46,956 [RESULT] 
    2026-02-20 07:41:46,957 [RESULT] 
    2026-02-20 07:41:46,957 [RESULT]        2 file errors
    2026-02-20 07:41:46,957 [RESULT]        0 io errors
    2026-02-20 07:41:46,957 [RESULT]        0 data errors
    2026-02-20 07:41:51,925 [RESULT] Saving state to /var/snapraid.content...
    2026-02-20 07:41:51,925 [RESULT] Saving state to /mnt/W/snapraid.content...
    2026-02-20 07:41:51,926 [RESULT] Saving state to /mnt/X/snapraid.content...
    2026-02-20 07:41:51,926 [RESULT] Saving state to /mnt/Y/snapraid.content...
    2026-02-20 07:42:33,460 [RESULT] Verifying...
    2026-02-20 07:42:42,642 [RESULT] Verified /var/snapraid.content in 9 seconds
    2026-02-20 07:42:42,792 [RESULT] Verified /mnt/W/snapraid.content in 9 seconds
    2026-02-20 07:42:46,880 [RESULT] Verified /mnt/X/snapraid.content in 13 seconds
    2026-02-20 07:42:50,165 [RESULT] Verified /mnt/Y/snapraid.content in 16 seconds
    2026-02-20 07:42:51,291 [ERROR ] Command 'snapraid /usr/bin/snapraid' returned non-zero exit status 1.
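
    One way to dig deeper is SnapRAID's -l/--log option, which writes a far more detailed log than the console summary. Whether it will name the files behind this particular counter is an assumption, but re-running a small scrub with a log file and searching it for error records is cheap to try. A sketch (log path and the 3% plan are arbitrary choices, not the poster's setup):

    ```shell
    #!/bin/sh
    # Sketch: re-run a small scrub with a detailed log file (-l), then search
    # it for error records. Log path and the 3% plan are arbitrary choices.
    SNAPRAID="${SNAPRAID:-snapraid}"
    LOG="${LOG:-/tmp/snapraid-scrub.log}"

    scrub_with_log() {
        # Skip quietly if snapraid is not installed (keeps the sketch dry-runnable).
        command -v "$SNAPRAID" >/dev/null 2>&1 || { echo "snapraid not found"; return 0; }
        "$SNAPRAID" -l "$LOG" -p 3 scrub
        # Look for the records behind the "file errors" counter.
        grep -i "error" "$LOG"
    }

    scrub_with_log
    ```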
    

    Additional data point: "smart" reports one error on each data disk. But that doesn't seem like a lot, given that these disks have been running for years. In fact, it's the kind of thing I'd expect if there had been a power glitch during the scrub, with no actual disk data errors.

    SnapRAID SMART report:
    
       Temp  Power   Error   FP Size
          C OnDays   Count        TB  Serial              Device    Disk
     -----------------------------------------------------------------------
         36   1167       1   4% 14.0  9LK953SG            /dev/sdb  d2
         37   1882       1   4% 16.0  2CGLP20P            /dev/sdc  d3
         35   1163       1   4% 14.0  9LJ52LKG            /dev/sdd  d4
         38   1891       0   4% 16.0  2CGLKGSP            /dev/sde  parity
         37   1887       0   4% 16.0  2CGM7WKP            /dev/sda  2-parity
         33   2219       -  SSD  0.1  CVDA441500EF1207GN  /dev/sdf  -
    

    Last edit: Mitchell Deoudes 2026-02-21
  • Mitchell Deoudes

    Yep - as expected, running "snapraid fix -e" resulted in nothing to do:

    Self test...
    Loading state from /var/snapraid.content...
    Searching disk d2...
    Searching disk d3...
    Searching disk d4...
    Selecting...
    Using 3212 MiB of memory for the file-system.
    Initializing...
    Selecting...
    Fixing...
    Nothing to do
    Everything OK
    

    So while I'm still not happy about seeing "WARNING! Unexpected file errors!", I'm not sure there's any more I can do to figure out what happened.

    I suppose I could run a full "snapraid check", but that would likely take a couple of days of running flat-out on my ancient file server, and my current overnight script will scrub the entire array in about two weeks on its own anyway. So it's easier to just let it do its thing.

  • Mitchell Deoudes

    Aha - so I found the original source of the problem: an Excel file had produced an "Unexpected data modification of a file without parity!" warning in a sync run several days ago, which I had missed. I'm assuming the file contents were changed without the mtime being updated when Excel opened it for reading, so it no longer matched another copy of the same file elsewhere in the array. The sync run says as much:

    This file was detected as a copy of another file with the same name, size,
    and timestamp, but the file data isn't matching the assumed copy.
    If this is a false positive, and the files are expected to be different,
    you can 'sync' anyway using 'snapraid --force-nocopy sync'
    

    But this kind of error (a file error, as opposed to a block error?) is apparently not recorded anywhere, or at least it isn't reported by "diff" or "status". So my overnight script was running "diff", seeing no differences, skipping "sync", and going straight to "scrub".

    Because "sync" is the only command that produces any real information about this type of error, and I'd missed it the first time it was reported, the overnight script had been scrubbing in this state for several days. I guess it was only when the scrub reached the file in question that it detected something was wrong. (It reported "2 file errors", which I assume means the two mismatched copies of one xls file.)
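
    That failure mode can be sketched as a script. This is an assumed reconstruction of the nightly job, not the actual script; it relies on "diff" exiting with code 2 when a sync is required (documented behavior in recent SnapRAID versions) and on -p setting the scrub percentage:

    ```shell
    #!/bin/sh
    # Hypothetical reconstruction of the diff-gated nightly job. Because the
    # copy mismatch is invisible to "diff", the gate skips "sync", so the
    # warning is never printed again until "scrub" reaches the file.
    SNAPRAID="${SNAPRAID:-snapraid}"

    nightly_gated() {
        # Skip quietly if snapraid is not installed (keeps the sketch dry-runnable).
        command -v "$SNAPRAID" >/dev/null 2>&1 || { echo "snapraid not found"; return 0; }
        "$SNAPRAID" diff
        # "diff" exits 2 when a sync is required, 0 when nothing looks changed.
        if [ $? -eq 2 ]; then
            "$SNAPRAID" sync
        fi
        # Scrub 8% of the array (-p sets the scrub plan percentage).
        "$SNAPRAID" -p 8 scrub
    }

    nightly_gated
    ```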

    I ran "snapraid --force-nocopy sync" to fix the issue. Oddly, it didn't report any added or changed files. I would have thought that if it was essentially being told to treat two matched files as separate entities, it would consider one of them an addition?

    Selecting...
    Syncing...
    50%, 0 MB          
    
           d2 46% | ***************************
           d3 14% | ********
           d4 26% | ***************
       parity  0% | 
     2-parity  0% | 
         raid  3% | **
         hash  0% | 
        sched 50% | *****************************
         misc  0% | 
                  |____________________________________________________________
                                wait time (total, less is better)
    
    Everything OK
    

    Anyway, at this point, I'm assuming tonight's overnight scrub will complete without errors.

    And the moral of the story is to always read your logs carefully, and that for certain types of errors, like this one, you shouldn't expect diff/status/scrub to tell you anything useful. (Well, "status" did report that the array was not fully synced, which could have been a hint.) "sync" is the only command that really knows what's going on.
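
    One concrete takeaway, sketched below on the assumption that a sync with nothing to do is cheap: drop any diff-based gating and run "sync" unconditionally every night, so this class of warning always gets a chance to print.

    ```shell
    #!/bin/sh
    # Sketch of a revised nightly job: sync unconditionally (a no-op sync is
    # cheap, and "sync" is the only command that surfaces the copy-mismatch
    # warning), then scrub only if the sync succeeded.
    SNAPRAID="${SNAPRAID:-snapraid}"

    nightly_sync_first() {
        # Skip quietly if snapraid is not installed (keeps the sketch dry-runnable).
        command -v "$SNAPRAID" >/dev/null 2>&1 || { echo "snapraid not found"; return 0; }
        "$SNAPRAID" sync || { echo "sync failed; skipping scrub"; return 1; }
        "$SNAPRAID" -p 8 scrub
    }

    nightly_sync_first
    ```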


