Just what is the effective theoretical error rate of SnapRAID? How far does it take us beyond the typical 1 × 10^-14 figure one finds on disk drives these days?
SnapRAID maintains a 128-bit hash for every block (default 256KiB) of data, so the error rate is infinitesimal. For any practical purpose, after SnapRAID has computed the hashes for data, you can assume the error rate is zero for that data as long as the hash checks out.
Of course, this does not apply when the hash is not checked, or for any errors that crept in before SnapRAID computed the hash.
Last edit: jwill42 2014-09-18
Thanks!
But I was really hoping to get a quantification of this error rate. Surely there must be a way to calculate - even if it would take too long to test :)
If you want a number, why don't you compute it?
I think he was asking people to speculate on an accurate way to estimate it. ("Surely there must be a way to calculate - even if it would take too long to test"). Presumably, if he knew exactly how the math worked, he wouldn't have posted the original question.
On 9/18/2014 10:15 PM, jwill42 wrote:
I understand that. The thing is, I just handed him the information he needed to compute it. If he cannot even figure out how to compute it when he is spoon fed the information, then what the heck is he going to do with the number once he has it? Answer: nothing useful.
Which is why I already told him that for practical purposes, he can consider the error rate to be zero.
Sorry. This is not my field. Came here seeking expertise I didn't have, perhaps in an attempt to link it to things I do know.
Don't know brain surgery either. Should I wiki "self help" there too?
Sure, I suggest search terms "self lobotomy"
:-)
Damn! More power tools to buy.....
Seriously. I would like to have an "X errors in 10^XX" figure specifically for SnapRAID. From what I've read, this question is not as easy as it looks.
I think you're looking for "Arguments". This department is clearly "Abuse".
But seriously... Judging by casual reading of this forum over the past year, my guess is that too many of the potential sources of error are unquantifiable. Human error being the largest, natch. There have also been > 0 cases of "something mysterious" that was either never solved, or written off to hardware errors. And there's always the lurking possibility of a bug in the code, which is also unquantifiable - but probably larger than the hash collision numbers jwill42 gave, and smaller than "unlikely".

You might be able to come up with a list of possible failure modes, though - even if you weren't able to put numbers to them. (For instance, the aforementioned hardware errors. Which, come to think of it, you could put numbers to - if you can get read-error rates for HDDs, multi-bit error rates for ECC memory, etc.)
Far more practical would be a list of Things to Make Sure You Don't Do.
On 9/18/2014 10:42 PM, Reciprocate wrote:
It's true we're well into "getting more than I bargained for"... That said, how many good things begin with "don't you dare ask that question?"
I admire the work this product represents, and the generosity of its owner.
Still want to know that error figure.....
"Libraries using 128-bit checksums should expect 1 collision once they hit 16 quintillion documents."
http://burtleburtle.net/bob/hash/spooky.html
Clearly, hash collisions are just not going to be the major driver of errors.
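For anyone who wants to sanity-check the quoted figure, the standard birthday-bound approximation is a few lines of arithmetic. This is a rough sketch in Python, nothing SnapRAID-specific; the variable names are mine:

```python
import math

HASH_BITS = 128
SPACE = 2.0 ** HASH_BITS  # count of distinct 128-bit hash values

def collision_probability(n: float) -> float:
    """Birthday-bound approximation: P(any collision among n random hashes)."""
    return 1.0 - math.exp(-n * n / (2.0 * SPACE))

# Odds of any collision among ~16 quintillion hashed blocks
p = collision_probability(1.6e19)
print(f"P(collision) with 1.6e19 blocks: {p:.3f}")     # roughly 0.31

# Blocks needed before a collision becomes a coin flip
n_half = math.sqrt(2.0 * math.log(2.0) * SPACE)
print(f"blocks for 50% collision odds: {n_half:.3e}")  # about 2.2e19
```

So "one collision around 16 quintillion documents" is the right order of magnitude, and real archives are nowhere near that many blocks.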
On 9/19/2014 5:44 PM, Leifi Plomeros wrote:
No, but it should give a very good indication regarding how infinitesimal the chance is that randomly incorrect data would slip by undetected.
How many failed hash checks would you expect to surround a random false positive hash check? I'm pretty sure the answer is more than enough to realize something is severely broken rather than to trust that single false positive.
If discovered problems are not addressed by the operator, then the chance of losing all data is 100% given enough time.
You nailed it. Sad thing is I remember finding this page (Doh!)
He goes further to claim that 2^-72 is as far as he can test.
http://www.easycalculation.com/exponential-power.php
Since primitive people (like myself) can only do powers of 10, that works out to 2.117582368135751e-22 - and it's likely much, much better.
Thanks eagle eye! This was a big help.
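The power-of-two to power-of-ten conversion is a one-liner if you have Python handy:

```python
# Converting the quoted power-of-two bound into a power of ten
rate = 2.0 ** -72
print(rate)  # 2.117582368135751e-22, matching the figure above
```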
In my opinion, the biggest concern with data integrity for SnapRAID is the fact that all of the checksums are read into memory from the content file when SnapRAID starts a sync, then saved from memory back to content files when SnapRAID has completed a sync.
The content files have a CRC32C checksum to verify their integrity, so if a content file is loaded and it fails the CRC check then SnapRAID knows not to trust it. In that case, presumably another content file could be loaded until one is found that passes the CRC check.
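The load-with-fallback idea could look something like this sketch. Plain CRC32 from zlib stands in for SnapRAID's CRC32C, the trailing-4-byte layout is just an assumption (not the real content-file format), and the function name is mine:

```python
import struct
import zlib

def load_first_valid(paths):
    """Return the body of the first content-file copy whose stored
    checksum matches; skip copies that are unreadable or corrupted."""
    for path in paths:
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except OSError:
            continue
        if len(raw) < 4:
            continue
        # Assumed layout: body followed by a little-endian 32-bit checksum
        body, stored = raw[:-4], struct.unpack("<I", raw[-4:])[0]
        if zlib.crc32(body) & 0xFFFFFFFF == stored:
            return body
    raise RuntimeError("no content file passed its checksum")
```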
But my concern is with the time the checksums are in RAM. If you don't have ECC RAM, then cosmic rays or whatever that cause a random bit flip in RAM could mess up one or more of your checksums. Then the incorrect checksum would be saved to all the content files, complete with CRC, and SnapRAID would think everything is all right. Until it went to verify the block(s) of data against the corrupted checksum, and would give a false positive - flagging good data as corrupt.
It should be possible to modify SnapRAID to keep checksums on each individual hash in RAM (sort of a poor man's ECC RAM), at the cost of increasing memory usage and the size of the content file (and slight cost in additional CPU usage).
Last edit: jwill42 2014-09-19
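A minimal sketch of that "poor man's ECC" idea - all names and structure here are illustrative, not SnapRAID's actual code:

```python
import zlib

class GuardedHash:
    """Pair each in-memory block hash with its own small checksum, so a
    bit flip in RAM is caught before the hash is trusted or saved."""

    def __init__(self, block_hash: bytes):
        self.block_hash = block_hash
        self.guard = zlib.crc32(block_hash)  # cheap per-hash guard value

    def get(self) -> bytes:
        # Re-check the guard every time the hash is read back out of RAM.
        if zlib.crc32(self.block_hash) != self.guard:
            raise RuntimeError("hash corrupted in RAM; recompute before saving")
        return self.block_hash
```

The cost is exactly what the comment above predicts: a few extra bytes per hash and one extra CRC per access.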
Isn't there a bigger risk that the data block contains a bit flip before the checksum is calculated, resulting in an incorrect checksum?
I realize that the time spent in RAM is significantly shorter for the data block, but still, we are talking about 128 bits vs 2,097,152 bits per block?
On the other hand you may have very large content files and perform many syncs with only tiny amounts of data to sync, offsetting the above argument... so maybe you are right about the risk estimation after all :)
Edit: Wouldn't a simpler approach be to have an option to compare sync-unrelated checksums of the new content file with the old content file before deleting the old content file?
Last edit: Leifi Plomeros 2014-09-20
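For reference, the bit counts being compared above work out to:

```python
block_bits = 256 * 1024 * 8     # bits in one 256 KiB data block
hash_bits = 128                 # bits in its hash
print(block_bits)               # 2097152
print(block_bits // hash_bits)  # 16384: the block exposes ~16k times more bits
```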
Isn't there a bigger risk that the data block contains a bit flip before the checksum is calculated, resulting in an incorrect checksum?

I do not think that is quantifiable, and anyway, it is not treatable.
Wouldn't a simpler approach be to have an option to compare sync-unrelated checksums of the new content file with the old content file before deleting the old content file?

Not as effective.
I would like any feature which freed one from ECC memory requirements, even though I read through the many ECC recommendations on this forum and acted accordingly...thinking "if it needs it- it needs it".
But the selection of gear which does not and never will support ECC memory is far more comprehensive...especially when you are counting watts or building to cost. So the ECC requirement just limits the potential snapraid user base.
Understand dispensing with ECC just may not be possible. Still worth investigating.....
There really aren't that many backup/raid systems which "just work". So it would be a shame if this clean solution was denied to those lacking commensurate PC resources. Can tell you this....an old thinkpad (plenty of them out there) chewing on a huge drive is pretty laughable - until the damn thing works. I doubt there are many software packages out there capable of actually doing "something" on such modest platforms.
Unfortunately, we must be concerned with this:
"But my concern is with the time the checksums are in RAM. If you don't have ECC RAM, then cosmic rays or whatever that cause a random bit flip in RAM could mess up one or more of your checksums. Then the incorrect checksum would be saved to all the content files, complete with CRC, and SnapRAID would think everything is all right."
If you walk into any university's dorms today, you'll find lots of kids trying to do big things with no money. The schools and parents do their best to help - but it's never enough. Sure there's an awful lot of distracted energy going into video games and the like....but that's not all of it. And those young ones often have no fear even in places where they should.
You'll never find backups or raid in college dorms...even when the machine is tied up for two weeks working through a term project. And you can bet none of that crap they own runs ECC memory. It's always whatever they can wheedle or trade for. Dodgy stuff....fun to watch the antics until exercise ends in tears.
So Snapraid gets disqualified out of the gate. Might as well be "an available CPU on the moon". And these are probably a group which would most benefit from this Snapraid "cheat".
Recognize this was never Snapraid's target market. Doesn't mean it wouldn't be very useful to the "wild men". But for the lack of ECC memory....
I can't see why it's so important to have ECC RAM. The content file is just metadata. If it's corrupted, just delete it and re-run snapraid sync. For important data, you can create a 2nd set of checksums using 3rd party tools, to ensure no "bit rot" occurred during sync.

You are correct of course, and some of the tools out there attempt to make this easy: http://corz.org/windows/software/checksum/
But I'll never get other family members to do something like this. Run a batch file - maybe. Hell, it's hard enough in these parts to get them to copy things off a laptop :)
Most of the folks on this forum seem to own new or exotic hardware - and that's great. But I still hope there will always be a version of snapraid which runs on junk .... Removing that recommended ECC memory would really help.
Think how many parents are running around with irreplaceable photos on a laptop. Mind boggling. No overs for them if their luck runs out....and most of those photos never get printed.
Give these people a choice between a "new and improved" camera vs some unsexy backup hardware - which is gonna win?