
Sanity check 6d+3p array - how many p disks can fail

Help
kyle--
2023-03-11
2023-12-01
  • kyle--

    kyle-- - 2023-03-11

    So regarding the parity maths to recreate data blocks...

    With a 6d+3p array, this config can sustain up to 3 d-disk failures without data loss. ✅
    A 3 d-disk failure would be rebuilt by recalculating the blocks of the missing disks, using the remaining 3 d disks and the 3 p disks for the calculation.

    Does the same logic apply to p disks? How many p & d disks can fail concurrently before there aren't enough p disks left to restore the d disks?

    If I understand the parity maths correctly, for 6d+3p the answer would be: 2p and 1d could fail without impacting the ability to recover the d disks?

    Fact check:

    Concurrent 1p disk failure and 0d disk failures - data recovery possible - failure of 1 more p disk is tolerated.
    Concurrent 2p disk failures and 0d disk failures - data recovery possible - failure of 0 more p disks is tolerated.
    Concurrent 3p disk failures and 0d disk failures - data recovery impossible but no data loss.
    Concurrent 2p disk failures and 1d disk failure - data recovery possible - failure of 0 more disks is tolerated?
    Concurrent 1p disk failure and 1d disk failure - data recovery possible - failure of 1 more disk is tolerated?
    Concurrent 1p disk failure and 2d disk failures - data recovery possible - failure of 0 more disks is tolerated?
    Concurrent 1p disk failure and 3d disk failures - data recovery impossible.
    

    Is that correct?
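
    A minimal sketch for checking these scenarios mechanically, assuming the rule that later replies in this thread confirm (with 3 parity disks, any combination of up to 3 missing disks, data or parity, is fully recoverable). The function and names below are illustrative only, not SnapRAID code:

    ```python
    # Toy check of the fact-check list above, assuming the rule confirmed
    # later in this thread: with 3 parity disks, any combination of up to
    # 3 missing disks (data or parity) is fully recoverable.

    N_PARITY = 3

    def recovery_status(p_failed, d_failed, n_parity=N_PARITY):
        lost = p_failed + d_failed
        if lost > n_parity:
            return "not recoverable (data loss if any data disk is missing)"
        spare = n_parity - lost
        return f"recoverable, {spare} more failure(s) tolerated"

    for p_failed, d_failed in [(1, 0), (2, 0), (3, 0), (2, 1), (1, 1), (1, 2), (1, 3)]:
        print(f"{p_failed}p + {d_failed}d failed: {recovery_status(p_failed, d_failed)}")
    ```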

     
  • Bill McClain

    Bill McClain - 2023-03-11

    That is how I understand it.

    If 3p fail you can recreate them from your data disks, but are unprotected against data disk failure until that completes.
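
    To make the parity-rebuild direction concrete, here is a minimal sketch of the single-parity (RAID5-style XOR) case: a lost parity block is recomputed from the surviving data blocks, and a single lost data block can be recovered by XOR-ing the parity with the remaining data. SnapRAID's higher parity levels use more involved Galois-field arithmetic, but the flow is analogous; this is illustrative code, not SnapRAID internals:

    ```python
    # Single-parity (XOR) illustration: recreate a lost parity block from the
    # data blocks, and recover one lost data block from parity + remaining data.

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-sized byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data_blocks = [bytes([i]) * 8 for i in range(1, 7)]   # six toy data blocks
    parity = xor_blocks(data_blocks)

    # Case 1: the parity disk fails; recompute parity from the data disks.
    rebuilt_parity = xor_blocks(data_blocks)
    assert rebuilt_parity == parity

    # Case 2: one data disk fails; recover it from parity + surviving data.
    lost_index = 2
    survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
    recovered = xor_blocks(survivors + [parity])
    assert recovered == data_blocks[lost_index]
    ```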

     
  • kyle--

    kyle-- - 2023-03-12

    Thanks for your answer Bill!

    Bill wrote:

    If 3p fail you can recreate them from your data disks


    Yeah - this is essentially the starting point prior to using/implementing snapraid 👍

    I'd love to hear from others including @amadvance, @uhclem or @mrleifi.

    It would be awesome to see a check ✅ or ❌ next to the fact check list to see how well I understand things.

    e.g.

    🔲 Concurrent 1p disk failure and 0d disk failures - data recovery ⁉ possible - failure of 1 more p disk is tolerated.
    🔲 Concurrent 2p disk failures and 0d disk failures - data recovery possible - failure of 0 more p disks is tolerated.
    ✅ Concurrent 3p disk failures and 0d disk failures - data recovery impossible but no data loss.
    🔲 Concurrent 2p disk failures and 1d disk failure - data recovery possible - failure of 0 more disks is tolerated?
    🔲 Concurrent 1p disk failure and 1d disk failure - data recovery possible - failure of 1 more disk is tolerated?
    🔲 Concurrent 1p disk failure and 2d disk failures - data recovery possible - failure of 0 more disks is tolerated?
    🔲 Concurrent 1p disk failure and 3d disk failures - data recovery impossible.
    
     
  • kyle--

    kyle-- - 2023-03-18

    bump

     
  • kyle--

    kyle-- - 2023-11-26

    2023-Q4 bump - still interested to hear from the author / main contributors / main users of the project on this one.

     
  • UhClem

    UhClem - 2023-11-27

    Hi Kyle,

    [ This is NOT an "I told you so ..." but ... ] :(:(
    From Link

    1. Give serious consideration to creating a small SR "Lab array" for testing, experimentation, performance tuning, etc. E.g., each of my /dev/sdX1 (for the full # [D+P] of drives) is 8GiB. You will never miss that amount of space from your Production Array; and you will be thankful to have a "safe place to play".

    But, regarding your question here, my understanding (albeit, superficial; I'm not a Mu-Alpha-THeta guy) of RAID (in general, not just SR) parity levels is that the # of parities determines the max # of drive (any combo P/D) failures you can sustain/recover_from. Note that, during recovery, all (valid/working) disks (not just P's) contribute value to the recovery calculations. Also, the more disks you are recovering, the more complex the calculations, and the longer the recovery time--and, hence, the more stress on the working drives, and the increased likelihood of that (Oh shit!) N+1th failure. (Not to be fatalistic, just statistical reality.)
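
    A small enumeration consistent with this rule for a 6d+3p array (purely combinatorial, no real parity math): any combination of up to 3 failed drives, whatever the P/D mix, is recoverable.

    ```python
    # Count the failure combinations of a 6d+3p array and classify each one
    # under the rule above: recoverable iff at most 3 drives (any mix) failed.

    from itertools import combinations

    disks = [f"d{i}" for i in range(1, 7)] + [f"p{i}" for i in range(1, 4)]
    N_PARITY = 3

    recoverable = sum(1 for k in range(1, len(disks) + 1)
                        for _ in combinations(disks, k) if k <= N_PARITY)
    fatal = sum(1 for k in range(1, len(disks) + 1)
                  for _ in combinations(disks, k) if k > N_PARITY)

    print(recoverable)  # 129 combinations (9 + 36 + 84) are recoverable
    print(fatal)        # every larger combination loses at least one data disk
    ```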

     
    • kyle--

      kyle-- - 2023-11-29

      Greetings on this timeline @uhclem ! It made me smile that you referenced my 2017 snapraid "origins" thread #1 on the forums here. Your insights and validation were very useful back then in setting up my data storage. I'm sure others have read that thread and been helped in some way. Thank you again for that!

      It's fascinating how time goes by and some things change and some things don't ⌚✈

      Having re-read my SR "origins" thread, I might now take your suggestion on faster storage for the primary content file - do you do this today? If so, how? (Since you mention it can be outside the array - a symlink perhaps?) (I would guess the primary file is the first one in the cfg?)

      You're right about the "lab array" - it's a good idea that I haven't had the need to invest in yet. I could create a dedicated KVM for this and actually prove each scenario as ✅ or ❌. I'll add it to my backlog of projects. In related news, today I read the snapraid code changes between 11.6 ... 12.0 and I plan to read more to further improve my understanding of the SR internals. In further related news, I also re-read this thread recently, which was a good memory jog on internals.

      the # of parities determines the max # of drive (any combo P/D) failures you can sustain/recover_from.

      Yes, that follows my understanding and the example scenarios I was trying to articulate. Thanks for sharing your view/understanding. wink to the Mu-Alpha reference 😉

      the more disks you are recovering, the more complex the calculations, and the longer the recovery time--and, hence, the more stress on the working drives, and the increased likelihood of that (Oh shit!) N+1th failure. (Not to be fatalistic, just statistical reality.)

      Yes, that level of detail isn't really in the man/faq. Most users probably don't care too much about that level of detail... but some users do! I certainly like understanding the internals of things 😊

      I'd still appreciate it if @amadvance and @mrleifi could validate the example scenarios table in post 3, and your "# parities determines the max..." statement as and when they have time.

      Wishing you many more years of happy data storing!

       
  • Leifi Plomeros

    Leifi Plomeros - 2023-11-29

    A fully synced array, with 3 parity disks, can always be fully restored from any combination of 3 or less missing / broken disks.

    If more than 3 disks are lost, then none of the missing data can be restored.

    Examples:
    DDDDDDPPP = 6 data disks + 3 parity disks, all present, nothing to do.
    DXDXDXPPP = OK, restore 3 data disks using 3 parity.
    DDXDDXPXP = OK, restore 2 data disks using 2 parity + rebuild 1 parity.
    DDDDDDXXX = OK, rebuild 3 parity.
    DDDDDXXXX = NOK, 1 data disk permanently lost.
    DXXXXDPPP = NOK, 4 data disks permanently lost.

    X = missing disk
    D = data disk
    P = Parity
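
    The same rule can be checked mechanically against layout strings like the ones above. A tiny illustrative sketch (the comparison is against the array's total parity count of 3, not the number of surviving P's in the string):

    ```python
    # Classify layout strings such as "DXDXDXPPP": recovery is possible iff
    # the number of X's (missing disks, data or parity) does not exceed the
    # array's total parity count, 3 for this 6d+3p example.

    TOTAL_PARITY = 3

    def classify(layout, total_parity=TOTAL_PARITY):
        missing = layout.count("X")
        return "OK" if missing <= total_parity else "NOK"

    for layout in ["DDDDDDPPP", "DXDXDXPPP", "DDXDDXPXP",
                   "DDDDDDXXX", "DDDDDXXXX", "DXXXXDPPP"]:
        print(layout, "->", classify(layout))
    ```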

     
    • kyle--

      kyle-- - 2023-11-30

      Thank you @mrleifi, very helpful and illustrative examples. It really helps clear up my understanding.

      That last example scenario would suck!

       
    • kyle--

      kyle-- - 2023-11-30

      Please can you do an example of how many P disks can become X where no data is lost but recovery becomes impossible? i.e. for a 6d+3p array, how many missing P disks can be sustained before a fix of D disks is no longer possible, assuming all D disks are OK?

       
      • Leifi Plomeros

        Leifi Plomeros - 2023-12-01

        There is no such limit.

        Think of it like this: Any parity disk can replace any data disk.

        Regardless of whether the lost disks are parity or data disks, you can make a full recovery, as long as the total number of lost disks is not greater than the total number of parity disks.

        You can have up to 6 parity disks and up to ~250 data disks.

        In any situation where you have lost more disks in total than the number of parity disks, all surviving parity is useless.

        It can't be used to restore data and it can't be used to rebuild any parity.

        You effectively need to build a new array from the surviving data (but you can still use the original content file to verify that the surviving data is not corrupted).
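
        On that last point, the idea of verifying surviving data against previously recorded checksums can be sketched generically as below (this does not reproduce SnapRAID's actual content file format or hash choice):

        ```python
        # Illustrative only: verify surviving files against previously recorded
        # hashes, the same idea as checking surviving data with the content file.

        import hashlib
        from pathlib import Path

        def file_digest(path):
            h = hashlib.blake2b()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        def find_corrupted(known_hashes):
            """known_hashes: {path: expected_hex_digest}; returns mismatching paths."""
            return [p for p, expected in known_hashes.items()
                    if file_digest(Path(p)) != expected]
        ```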

         
