fix/check with --filter

  • Quaraxkad

    Quaraxkad - 2021-05-15

    I occasionally need to restore accidentally deleted files, and since they were deleted from a network share they are not in the Recycle Bin, so I'll run:
    snapraid fix -m --filter "missing filename.ext"
    With my large 47-disk array (incl. 4 parity), this takes a very long time even to restore a 1KB file. The fix itself happens very quickly; it's the time before the restoration begins that's the issue. It runs through "Searching disk n", "Filtering", "Scanning disk n", etc., and then after the "Fixing"/"Checking" message it appears to stop. There's no disk activity; memory usage is high (as in capacity) but there doesn't seem to be much memory activity (as in reads/writes); and there's very little CPU usage, with snapraid.exe averaging 12.46%. This lasts for about 2 hours, and I frequently enter the filter pattern incorrectly, so it just says "Nothing to do" after all that wait! What exactly is SnapRAID doing during this period?

    Just some brainstorming: perhaps not coincidentally, I'm using a 4-core 8-thread CPU, which would mean a single thread (assuming both threads of a core are in use at the time) would max out at 12.5% total CPU usage (yes, I'm overgeneralizing here). The rest of the system is idle; there's no single maxed-out core or thread during this time. However, it could be one single-threaded task, or set of tasks, that's rapidly switching between cores. In that scenario I could see how no single core would ever appear to be at max usage on the graph, because it's only pegged for a fraction of a second before activity moves to a different core.

    In Process Explorer, I can view the activity of the single thread that snapraid is running in. I'm currently doing a check; it's been about an hour, and it's at 10 trillion cycles and 150k context switches (which I think is basically when a software thread switches between hardware cores/threads?).

    So... what's SnapRAID doing that takes so long during filtered checks and syncs? Is it something that could be multi-threaded for better performance?

     

    Last edit: Quaraxkad 2021-05-15
  • Leifi Plomeros

    Leifi Plomeros - 2021-05-16

    I think that if you look closer this is basically what happens:
    Block 0 to 10,000,000 = Nothing to do; long freeze while evaluating.
    Block 10,000,001 = Do check or fix for the tiny file matching the filter.
    Block 10,000,002 to 20,000,000 = Nothing to do; long freeze while evaluating.
    Done, everything OK.

    When restoring an entire disk this is insignificant since it accounts for less than a percent of the total restore time.

    But when you want to restore something specific, it becomes very noticeable due to the relatively small time spent actually restoring the file compared to the time when nothing happens.
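
    A minimal C sketch of the pattern in the trace above (hypothetical names, not taken from SnapRAID's sources): every block in the array is visited and pays a small evaluation cost before the filter skips it, so the "frozen" time scales with the number of blocks in the array rather than with the size of the file being restored.

        /* Hypothetical illustration only; SnapRAID's real block walk differs. */
        #include <stdbool.h>
        #include <stdio.h>

        #define BLOCK_MAX 20000000L

        struct block { bool matches_filter; };

        /* Per-block bookkeeping, paid for every block in the array. */
        static void evaluate_block(const struct block *b) { (void)b; }

        /* The actual check/restore work, done only for filtered-in blocks. */
        static void check_or_fix_block(const struct block *b) { (void)b; }

        int main(void)
        {
            static struct block b;              /* stand-in for a huge block array */

            for (long i = 0; i < BLOCK_MAX; ++i) {
                evaluate_block(&b);             /* cost incurred even for excluded blocks */
                if (!b.matches_filter)
                    continue;                   /* "Nothing to do" for this block */
                check_or_fix_block(&b);         /* only the tiny filtered file */
            }
            printf("Everything OK\n");
            return 0;
        }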

    In my personal array, the total combined delay time from "fix in progress" until completion is 8 minutes, which is not really a big deal considering how rarely I want to restore something specific.

    For a typical SnapRAID user this delay is probably much shorter, since it is directly related to the size of the array (or, more specifically, the number of blocks in the array).

    But in your case, where it takes hours, it might be worth mentioning it to @amadvance.

     

    Last edit: Leifi Plomeros 2021-05-16
  • UhClem

    UhClem - 2021-05-20

    I've made two simple improvements/speedups to check/fix and filtering; they especially help with the case described here.

    The first reduces by about 40% the "frozen" time described above. Almost all of that "frozen" time is spent in calls to the routine file_post() [in check.c]. (It is called from state_check_process() for every data block.) That saving benefits every check/fix regardless of filtering.
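
    Purely as an illustration of how per-block work can be turned into per-file work (hypothetical names; the actual patch was sent privately and is not shown here), one could defer the post-processing until the block walk leaves a file:

        /* Hypothetical sketch, not SnapRAID's file_post()/state_check_process(). */
        #include <stddef.h>

        struct file_state { int id; };

        /* Expensive per-file bookkeeping. */
        static void per_file_post(struct file_state *f) { (void)f; }

        void walk_blocks(struct file_state **block_owner, size_t block_count)
        {
            struct file_state *prev = NULL;

            for (size_t i = 0; i < block_count; ++i) {
                struct file_state *cur = block_owner[i];
                /* Before: per-file work done once per block.
                 * After: only when the owning file changes. */
                if (prev != NULL && cur != prev)
                    per_file_post(prev);
                prev = cur;
            }
            if (prev != NULL)
                per_file_post(prev);
        }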

    The second change only affects check/fix when filtering (of any flavor) is invoked. But the effect can be dramatic, since it imposes an implicit -S and -B to reduce the range of blocks processed to the minimal range encompassing the filter-matching blocks. Hence, for the case described (of a single file), there will be NO frozen time (typically SnapRAID assigns a file to a contiguous range of blocks; there can be fragmentation but it is almost always small). Note that if you do -f File1 -f File2, and File1 is at the beginning (blk 1) of the array, and File2 is at the end of the array, you will incur the full frozen time (well, 60% of it [see previous paragraph]).
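
    A hedged sketch of the range-reduction idea (hypothetical names, not the code that was submitted): scan the filter results once for the lowest and highest matching block, then walk only that range, exactly as if -S and -B had been passed on the command line.

        /* Hypothetical helper; the real implementation lives in SnapRAID's sources. */
        #include <stdbool.h>
        #include <stddef.h>

        void filtered_range(const bool *matches, size_t block_count,
                            size_t *start, size_t *count)
        {
            size_t lo = block_count, hi = 0;

            for (size_t i = 0; i < block_count; ++i) {
                if (!matches[i])
                    continue;
                if (i < lo) lo = i;
                if (i > hi) hi = i;
            }

            if (lo == block_count) {   /* nothing matched: "Nothing to do" */
                *start = 0;
                *count = 0;
            } else {
                *start = lo;           /* implicit -S */
                *count = hi - lo + 1;  /* implicit -B */
            }
        }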

    I've sent the details to @amadvance

     
  • Andrea Mazzoleni

    Thanks @uhclem, I'm going to look at it.

     
  • Andrea Mazzoleni

    Hi,

    I've uploaded a beta version including the optimization suggested by @UhClem, and some more.

    They are here: http://beta.snapraid.it/

    Ciao,
    Andrea

     
  • Leifi Plomeros

    Leifi Plomeros - 2021-05-24

    I just tested the check function on a specific file and the difference was like night and day.
    ~30 sec to read the content file / prepare, ~45 sec to check the file, and poof, done :-)

    Thank you both @amadvance and @uhclem

     
  • Leifi Plomeros

    Leifi Plomeros - 2021-05-24

    But... uhm... The sync function acted really weird...

    5 added files totalling ~10 GB were initially estimated at 1h40m, and the data disks were being read in sequence instead of in parallel according to disk activity in Task Manager (full-speed reads from the disks at 100-150 MB/s, but only one disk at a time).

    And then suddenly, after a few minutes, everything became normal, with all disks reading and writing in parallel, and the sync completed in a few minutes.

     
  • UhClem

    UhClem - 2021-05-24

    I just tested the check function on a specific file and the difference was like night and day.

    And, it's even better! Andrea's variation on my suggestion (#2) is more general and effectively eliminates the "frozen" time completely, regardless of the amount of filtering. (I.e., my second example of -f File1 -f File2 (at begin/end) will also be fast.)
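
    A rough sketch of what "eliminating the frozen time regardless of filtering" might look like (hypothetical names; the beta's actual code is Andrea's and may well differ): walk only a precomputed list of matching blocks, so the distance between File1 and File2 in the array costs nothing.

        /* Hypothetical illustration, not the beta's implementation. */
        #include <stddef.h>

        void walk_filtered(const size_t *matching_blocks, size_t match_count,
                           void (*process)(size_t block))
        {
            /* Only blocks selected by the filter are visited; blocks in between
             * are never touched, so their number no longer matters. */
            for (size_t i = 0; i < match_count; ++i)
                process(matching_blocks[i]);
        }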

    As for your "weirdness" report, that's why it's called a Beta :) My guess is that, by also applying the generalized implementation to sync (& scrub), which, unlike check & fix, utilize multi-threaded I/O, some complication arose ... I expect we'll see a follow-up Beta ...

     
  • Andrea Mazzoleni

    Hi Leifi,

    The thing you report is unexpected. In my tests everything looks fine.

    Could you please make another try? The sync and scrub behavior is not expected to change, besides a minimal speed increase.

    Thanks,
    Andrea

     
    • Leifi Plomeros

      Leifi Plomeros - 2021-05-25

      I've tried a few more times without being able to repeat it and I'm starting to think that it may have been a really odd hardware problem in the file server.

      The system disk (SSD) has been reporting lots and lots of bad sectors recently, together with some general freezing issues.

      Even though it's difficult to imagine exactly how, I guess the most likely explanation is nonetheless a hardware problem.

      Please ignore it for now.

       
  • UhClem

    UhClem - 2021-05-26

    Very interesting, and coincidental, glitch, Leifi. The symptoms you observed/reported really fooled me!

    My apologies to Andrea--for "blaming" the new improvement.

     
    • Leifi Plomeros

      Leifi Plomeros - 2021-05-29

      Yes, I don't think I will ever see something as weird as that again.

      It was a very small addition of 5 new files, and everything appeared normal (scanning, saving, verifying) until the point where the sync actually began. I was literally looking at it when it happened, since I was curious to see the changes to the progress indicator line.

      At first it was only a few characters long: MB processed (and maybe also speed). It remained like that for maybe 15-45 seconds, but then changed to the full line with all details, and an ETA of 1h40m showed up, which made me suspect that there was a problem with one of the data disks or parity disks.

      So I went into the Performance tab in Task Manager and saw the bizarre behaviour of only a single data disk being read, though at a good speed of 100-150 MB/s, only to be replaced by read activity on another single disk (also at good speed), then repeated on a few other disks, with zero activity on the parity disks, until suddenly everything became normal with high activity on more or less all disks.

      There were no indications of general slowness or freezing of any sort. The progress line in SnapRAID was updated normally every second, and all GUI parts of Windows were perfectly fluid and responsive.

      The wait time chart presented by SnapRAID after the completed sync indicated roughly 17% each on 5 disks (the same number as the added files) and the rest on raid, hash and sched (usually almost all wait time occurs in raid, hash and sched).

      In hindsight, the only explanation I can think of is that at some very low level the parallel file reading is initiated sequentially, and that some Windows or device-driver process that was trying to comply was having serious performance problems due to the failing SSD or some other unknown OS or hardware error.

      I have since done about 10 syncs and there is absolutely no trace of any similar issue occurring again.

      So, I am perfectly convinced that there was no issue at all with the new beta version of SnapRAID, and I only wish I had bought a lottery ticket instead of trying beta software at the time of the event :)

      And yes, I'm also sorry for the confusion caused by this.

       
