Just what is the effective theoretical error rate of SnapRAID? How far does it take us beyond the typical 1 × 10^-14 figure one finds on disk drives these days?
SnapRAID maintains a 128-bit hash for every block (default 256KiB) of data, so the error rate is infinitesimal. For any practical purpose, after SnapRAID has computed the hashes for data, you can assume the error rate is zero for that data as long as the hash checks out.
Of course, this does not apply when the hash is not checked, or for any errors that crept in before SnapRAID computed the hash.
Last edit: jwill42 2014-09-18
Thanks!
But I was really hoping to get a quantification of this error rate. Surely there must be a way to calculate - even if it would take too long to test :)
If you want a number, why don't you compute it?
I think he was asking people to speculate on an accurate way to estimate it. ("Surely there must be a way to calculate - even if it would take too long to test"). Presumably, if he knew exactly how the math worked, he wouldn't have posted the original question.
On 9/18/2014 10:15 PM, jwill42 wrote:
I understand that. The thing is, I just handed him the information he needed to compute it. If he cannot even figure out how to compute it when he is spoon fed the information, then what the heck is he going to do with the number once he has it? Answer: nothing useful.
Which is why I already told him that for practical purposes, he can consider the error rate to be zero.
Sorry. This is not my field. Came here seeking expertise I didn't have, perhaps in an attempt to link it to things I do know.
Don't know brain surgery either. Should I wiki "self help" there too?
Sure, I suggest search terms "self lobotomy"
:-)
Damn! More power tools to buy.....
Seriously. I would like to have an "X errors in 10^XX" figure specifically for SnapRAID. From what I've read, this question is not as easy as it looks.
I think you're looking for "Arguments". This department is clearly "Abuse".
But seriously... Judging by casual reading of this forum over the past year, my guess is that too many of the potential sources of error are unquantifiable. Human error being the largest, natch. There have also been > 0 cases of "something mysterious" that was either never solved, or written off to hardware errors. And there's always the lurking possibility of a bug in the code, which is also unquantifiable - but probably larger than the hash collision numbers jwill42 gave, and smaller than "unlikely".

You might be able to come up with a list of possible failure modes, though - even if you weren't able to put numbers to them. (For instance, the aforementioned hardware errors. Which, come to think of it, you could put numbers to - if you can get read-error rates for HDDs, multi-bit error rates for ECC memory, etc.)
Far more practical would be a list of Things to Make Sure You Don't Do.
On 9/18/2014 10:42 PM, Reciprocate wrote:
It's true we're well into "getting more than I bargained for"... That said, how many good things begin with "don't you dare ask that question?"
I admire the work this product represents, and the generosity of its owner.
Still want to know that error figure.....
"Libraries using 128-bit checksums should expect 1 collision once they hit 16 quintillion documents."
http://burtleburtle.net/bob/hash/spooky.html
Clearly, hash collisions are just not going to be the major driver of errors.
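For anyone who wants to sanity-check the quoted figure, the standard birthday-bound approximation is a few lines of arithmetic. This is a rough sketch in Python, nothing SnapRAID-specific; the variable names are mine:

```python
import math

HASH_BITS = 128
SPACE = 2.0 ** HASH_BITS  # count of distinct 128-bit hash values

def collision_probability(n: float) -> float:
    """Birthday-bound approximation: P(any collision among n random hashes)."""
    return 1.0 - math.exp(-n * n / (2.0 * SPACE))

# Odds of any collision among ~16 quintillion hashed blocks
p = collision_probability(1.6e19)
print(f"P(collision) with 1.6e19 blocks: {p:.3f}")     # roughly 0.31

# Blocks needed before a collision becomes a coin flip
n_half = math.sqrt(2.0 * math.log(2.0) * SPACE)
print(f"blocks for 50% collision odds: {n_half:.3e}")  # about 2.2e19
```

So "one collision around 16 quintillion documents" is the right order of magnitude, and real archives are nowhere near that many blocks.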
On 9/19/2014 5:44 PM, Leifi Plomeros wrote:
No, but it should give a very good indication regarding how infinitesimal the chance is that randomly incorrect data would slip by undetected.
How many failed hash checks would you expect to surround a random false positive hash check? I'm pretty sure the answer is more than enough to realize something is severely broken rather than to trust that single false positive.
If discovered problems are not addressed by the operator, then the chance of losing all data is 100% given enough time.
You nailed it. Sad thing is I remember finding this page (Doh!)
He goes further to claim that 2^-72 is as far as he can test.
http://www.easycalculation.com/exponential-power.php
Since primitive people (like myself) can only do powers of 10, that works out to 2.117582368135751e-22 - and it's likely much, much better.
Thanks eagle eye! This was a big help.
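The power-of-two to power-of-ten conversion is a one-liner if you have Python handy:

```python
# Converting the quoted power-of-two bound into a power of ten
rate = 2.0 ** -72
print(rate)  # 2.117582368135751e-22, matching the figure above
```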
In my opinion, the biggest concern with data integrity for SnapRAID is the fact that all of the checksums are read into memory from the content file when SnapRAID starts a sync, then saved from memory back to content files when SnapRAID has completed a sync.
The content files have a CRC32C checksum to verify their integrity, so if a content file is loaded and it fails the CRC check then SnapRAID knows not to trust it. In that case, presumably another content file could be loaded until one is found that passes the CRC check.
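The load-with-fallback idea could look something like this sketch. Plain CRC32 from zlib stands in for SnapRAID's CRC32C, the trailing-4-byte layout is just an assumption (not the real content-file format), and the function name is mine:

```python
import struct
import zlib

def load_first_valid(paths):
    """Return the body of the first content-file copy whose stored
    checksum matches; skip copies that are unreadable or corrupted."""
    for path in paths:
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except OSError:
            continue
        if len(raw) < 4:
            continue
        # Assumed layout: body followed by a little-endian 32-bit checksum
        body, stored = raw[:-4], struct.unpack("<I", raw[-4:])[0]
        if zlib.crc32(body) & 0xFFFFFFFF == stored:
            return body
    raise RuntimeError("no content file passed its checksum")
```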
But my concern is with the time the checksums are in RAM. If you don't have ECC RAM, then cosmic rays or whatever that cause a random bit flip in RAM could mess up one or more of your checksums. Then the incorrect checksum would be saved to all the content files, complete with CRC, and SnapRAID would think everything is all right. Until it went to verify the block(s) of data against the corrupted checksum, and would give a false positive - flagging good data as corrupt.
It should be possible to modify SnapRAID to keep checksums on each individual hash in RAM (sort of a poor man's ECC RAM), at the cost of increasing memory usage and the size of the content file (and slight cost in additional CPU usage).
Last edit: jwill42 2014-09-19
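A minimal sketch of that "poor man's ECC" idea - all names and structure here are illustrative, not SnapRAID's actual code:

```python
import zlib

class GuardedHash:
    """Pair each in-memory block hash with its own small checksum, so a
    bit flip in RAM is caught before the hash is trusted or saved."""

    def __init__(self, block_hash: bytes):
        self.block_hash = block_hash
        self.guard = zlib.crc32(block_hash)  # cheap per-hash guard value

    def get(self) -> bytes:
        # Re-check the guard every time the hash is read back out of RAM.
        if zlib.crc32(self.block_hash) != self.guard:
            raise RuntimeError("hash corrupted in RAM; recompute before saving")
        return self.block_hash
```

The cost is exactly what the comment above predicts: a few extra bytes per hash and one extra CRC per access.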
Isn't there a bigger risk that the data block contains a bit flip before the checksum is calculated, resulting in an incorrect checksum?
I realize that the time spent in RAM is significantly shorter for the data block, but still, we are talking about 128 bits vs 2,097,152 bits per block?
On the other hand you may have very large content files and perform many syncs with only tiny amounts of data to sync, offsetting the above argument... so maybe you are right about the risk estimation after all :)
Edit: Wouldn't a simpler approach be to have an option to compare sync-unrelated checksums of the new content file with the old content file before deleting the old content file?
Last edit: Leifi Plomeros 2014-09-20
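For reference, the bit counts being compared above work out to:

```python
block_bits = 256 * 1024 * 8     # bits in one 256 KiB data block
hash_bits = 128                 # bits in its hash
print(block_bits)               # 2097152
print(block_bits // hash_bits)  # 16384: the block exposes ~16k times more bits
```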
Isn't there a bigger risk that the data block contains a bit flip before the checksum is calculated, resulting in an incorrect checksum?

I do not think that is quantifiable, and anyway, it is not treatable.
Wouldn't a simpler approach be to have an option to compare sync-unrelated checksums of the new content file with the old content file before deleting the old content file?

Not as effective.
I would like any feature which freed one from ECC memory requirements, even though I read through the many ECC recommendations on this forum and acted accordingly...thinking "if it needs it- it needs it".
But the selection of gear which does not and never will support ECC memory is far more comprehensive...especially when you are counting watts or building to cost. So the ECC requirement just limits the potential snapraid user base.
Understand dispensing with ECC just may not be possible. Still worth investigating.....
There really aren't that many backup/raid systems which "just work". So it would be a shame if this clean solution was denied to those lacking commensurate PC resources. Can tell you this....an old thinkpad (plenty of them out there) chewing on a huge drive is pretty laughable - until the damn thing works. I doubt there are many software packages out there capable of actually doing "something" on such modest platforms.
Unfortunately, we must be concerned with this:
"But my concern is with the time the checksums are in RAM. If you don't have ECC RAM, then cosmic rays or whatever that cause a random bit flip in RAM could mess up one or more of your checksums. Then the incorrect checksum would be saved to all the content files, complete with CRC, and SnapRAID would think everything is all right."
If you walk into any university's dorms today, you'll find lots of kids trying to do big things with no money. The schools and parents do their best to help - but it's never enough. Sure there's an awful lot of distracted energy going into video games and the like....but that's not all of it. And those young ones often have no fear even in places where they should.
You'll never find backups or raid in college dorms...even when the machine is tied up for two weeks working through a term project. And you can bet none of that crap they own runs ECC memory. It's always whatever they can wheedle or trade for. Dodgy stuff....fun to watch the antics until exercise ends in tears.
So Snapraid gets disqualified out of the gate. Might as well be "an available CPU on the moon". And these are probably a group which would most benefit from this Snapraid "cheat".
Recognize this was never Snapraid's target market. Doesn't mean it wouldn't be very useful to the "wild men". But for the lack of ECC memory....
I can't see why it's so important to have ECC RAM. The content file is just metadata. If it's corrupted, just delete it and re-run snapraid sync. For important data, you can create a 2nd set of checksums using 3rd party tools, to ensure no "bit rot" occurred during sync.

You are correct of course, and some of the tools out there attempt to make this easy: http://corz.org/windows/software/checksum/
But I'll never get other family members to do something like this. Run a batch file - maybe. Hell, it's hard enough in these parts to get them to copy things off a laptop :)
Most of the folks on this forum seem to own new or exotic hardware - and that's great. But I still hope there will always be a version of snapraid which runs on junk .... Removing that recommended ECC memory would really help.
Think how many parents are running around with irreplaceable photos on a laptop. Mind boggling. No overs for them if their luck runs out....and most of those photos never get printed.
Give these people a choice between a "new and improved" camera vs some unsexy backup hardware - which is gonna win?