Last night's scheduled scrub reported a single file with a bad block. When I saw it this morning, I manually ran scrub -p 0 to make sure it wasn't just a one-time drive read error. It still showed up as a data error on the file.
This file happened to be a seeding torrent, so I ran a re-check in the torrent client and the file was good (verified as complete by its own hash). I made a backup copy of the file off the server, then ran fix -e. SnapRAID reported the file as "recovered". I checked it again in the torrent client, and there was now a bad block in the recovered file. SnapRAID "fixed" it into a corrupt state.
It's not a big deal in this instance because I had made a backup of the known-good original, so I didn't lose anything. But I'm not sure what actually happened here. How can I really be sure whether a file is corrupt when it's reported as having a data error? Was the actual error on all 4 parity drives, or in every content file? How should I proceed if/when this happens again?
Sounds quite "critical" if it's not related to external factors.
SnapRAID version?
Was the torrent file completely downloaded at the time of last sync?
"4 parity drives", how many data drives do you have?
(Content files can't fix errors, only report them.)
/X
The file was complete on the last sync. In fact, the client on the server doesn't do downloads at all. All torrents on the server are permanent seed-only.
I have 22 data drives (expandable to 44) plus 4 parity drives, with one content file on the system drive and one on every data drive.
Using the latest SnapRAID, 8.0. I should have tested with the previous release as well, but I didn't think of it at the time.
EDIT: I re-ran the fix with logging enabled. I don't know if this is of any use...
error:3136380:b6:Seeds/File.mkv: Data error at position 2461
entry:0:block:known:bad:b6:Seeds/File.mkv:2461:
fixed:3136380:b6:Seeds/File.mkv: Fixed data error at position 2461
status:recovered:b6:Seeds/File.mkv
Last edit: Quaraxkad 2015-04-30
Perhaps a one-time read error occurred when SnapRAID synced the file?
Perhaps a bit flip happened in RAM before the hash was written to the content file?
If it happens again on a different disk you should probably suspect bad RAM.
If it happens again on the same disk you should probably suspect bad cable or bad disk.
If the RAM hash table had a bit-flip, then when SnapRAID tried to restore from parity, the restored block(s) would not match the hash. But apparently the restored blocks did match the hash (or SR would have printed an error).
So a transient silent read error seems like the most likely explanation. I would guess it happened somewhere between the HDD and RAM. As you say, it could be a noise spike on a cable. Or it could be a glitch in the HBA.
Do you know if "last night's scheduled scrub" was the first time that this particular file had been scrubbed since it was first synced?
The reason I ask is that the only consistent explanation I can think of is a silent read error when SnapRAID first synced the file and computed the parity. That error was transient, and now the file is back to reading "normally". Unfortunately, when SnapRAID tries to verify the checksum, all SR knows is that the file is now different from when it was read during the sync. So it uses parity (which was computed with the same read error as the checksum) to restore the file to the state it was in when SR last read it.
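To make that failure mode concrete, here is a toy Python model (my own illustration, not SnapRAID's actual code: SHA-256 stands in for SnapRAID's block hash, and a plain copy of the block stands in for real multi-disk parity). Because the hash and the parity are both derived from the same corrupted read, the later fix faithfully "recovers" the corruption:

    import hashlib

    good = bytearray(b"the block as it actually sits on disk")
    bad = bytearray(good)
    bad[5] ^= 0x01              # transient bit flip during the sync-time read

    # sync: the hash and the parity are both computed from the SAME bad read
    stored_hash = hashlib.sha256(bad).digest()
    parity = bytes(bad)         # toy stand-in for the real parity computation

    # scrub: the disk now returns the good data, which no longer matches
    assert hashlib.sha256(good).digest() != stored_hash   # reported as a data error

    # fix: reconstruct from parity and verify against the stored hash; it
    # matches, so the block is reported "recovered", but what comes back
    # is the corrupted data from sync time
    restored = parity
    assert hashlib.sha256(restored).digest() == stored_hash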
That file is almost a month old, and my nightly scrub is -p 3, so it takes about a month for a full scrub... So I'm guessing that last night could have been the first time that file was scrubbed.
That explanation definitely makes sense... I wonder if there's something Andrea could add to SnapRAID to detect that kind of thing?
Well, by nature it cannot be reliably detected by SnapRAID, since the best SnapRAID can do is verify that the data is the same (or different) from the first time that SnapRAID read the data.
I suppose there could be an option during sync to read new data multiple times, and if it changes, skip or retry. But that would considerably slow down a sync operation.
I just remembered that SnapRAID does have a --pre-hash option that causes SnapRAID sync to read the data twice, computing just the hash on the first read. This might have detected the problem you had.
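As an illustration only (my own simplification, not SnapRAID's implementation), the double-read idea behind --pre-hash, and the retry suggestion above, boils down to something like this: hash the data on a first pass, read it again before computing parity, and refuse to commit if the two reads disagree:

    import hashlib

    def read_block(f, offset, size):
        f.seek(offset)
        return f.read(size)

    def sync_block_with_prehash(f, offset, size):
        """Sketch: accept a block only if two separate reads agree.
        In practice the two passes must be far enough apart that the
        second read is not just served from the OS cache."""
        first_hash = hashlib.sha256(read_block(f, offset, size)).digest()
        data = read_block(f, offset, size)
        if hashlib.sha256(data).digest() != first_hash:
            raise IOError("block changed between reads: possible silent read error")
        return data   # this copy is now safe to hash and add to parity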
"...if it has a latent hardware problem, it's possible to encounter silent errors what cannot be detected because the data is not yet hashed."
Maybe there should be a --post-hash option to validate the parity and content after they are written, to ensure the quality of what actually lands on the disk (see the sketch after this post). This would have caught the infamous Samsung HD firmware bug where data was silently not written to disk, and any other memory-to-disk bugs, which the --pre-hash option will not catch (AFAIK). I'm not sure, though, whether flushing the write cache also clears the read cache and forces an actual disk read.
/X
PS: How is the integrity of the parity file protected?
Last edit: xad 2015-05-01
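Purely as a hypothetical sketch of the --post-hash idea (no such SnapRAID option exists; the function below is invented for illustration): write, flush, read back, and compare hashes. As noted above, the weak point is the OS read cache; without a way to bypass it (O_DIRECT or similar), the read-back may never touch the disk:

    import hashlib, os

    def write_with_readback(path, data):
        """Hypothetical post-hash check: write data, then read it back
        and compare hashes to catch silent write errors."""
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # push the write out of the OS write cache
        with open(path, "rb") as f:   # NOTE: may still be served from the read cache
            readback = f.read()
        if hashlib.sha256(readback).digest() != hashlib.sha256(data).digest():
            raise IOError("read-back mismatch: data was not written correctly")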
The parity is not hashed, so it cannot be directly verified. Of course, it is indirectly verified when parity is used to reconstruct data and then the hash is checked on the reconstructed data.
The content file has a checksum that is verified when it is read into memory. If it fails, SR can always try another content file. Of course, this does not protect against bit-flips in the hash table while it is in RAM. ECC RAM can help with that.
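Conceptually (my own sketch of the idea; the file layout here is invented, not SnapRAID's on-disk format), verify-on-load across multiple content file copies looks like:

    import hashlib

    def load_first_good_copy(copy_paths):
        """Try each copy of the content file in turn and return the
        first whose trailing checksum verifies."""
        for path in copy_paths:
            with open(path, "rb") as f:
                blob = f.read()
            payload, stored = blob[:-32], blob[-32:]   # assume a 32-byte trailer
            if hashlib.sha256(payload).digest() == stored:
                return payload
        raise IOError("every content file copy failed its checksum")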
"Of course, it is indirectly verified when parity is used to reconstruct data and then the hash is checked on the reconstructed data."
But are the block hashes validated after restore? And if the parity is bad, then you have corrupted the original file, and at best the hash can inform you but not correct it. Or what am I missing?!
/X
If you gave a specific example of what you are concerned about and EXACTLY where the hypothetical error is introduced, then we might have a productive discussion. But if it is not the kind of error that Quaraxkad encountered, you should probably start another thread rather than hijacking this one.
Quaraxkad, with that big a setup I assume you run ECC memory, is that correct? (This is to narrow down the possible causes of your incident, based on probability.)
EDIT: I've been looking at the code a bit. Could you run a byte-compare between the two files to let us know how much data differs (one bit, one byte, a block of data...)?
/X
Last edit: xad 2015-05-01
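For anyone wanting to run such a comparison: on Windows, fc /b lists differing bytes, and cmp -l does the same on Unix. A minimal Python equivalent (the file names below are placeholders):

    def diff_offsets(path_a, path_b, chunk=1 << 20):
        """Return the byte offsets at which two equal-sized files differ."""
        diffs = []
        offset = 0
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a, b = fa.read(chunk), fb.read(chunk)
                if not a and not b:
                    break
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append(offset + i)
                offset += len(a)
        return diffs

    print(diff_offsets("File.good.mkv", "File.recovered.mkv"))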
I did compare the files and the difference was minor, but I didn't check to see exactly how many bytes were affected.
EDIT: It's one byte difference.
Last edit: Quaraxkad 2015-05-01
Quaraxkad: Do you still have the exact error messages from the scrub and previous commands?
I do. What messages did you want to see? I posted the relevant lines from the verbose fix log above.
The scrub that found the error:
Data error in file 'c:/!mountpoints/b6/Seeds/File.mkv' at position '2461'
The status check after scrub:
They are from block 3136380 to 3136380, specifically at blocks: 3136380
Sorry, but a day after Jessie's initial swift conclusion, I agree (though at least now I know why, and not only in theory).
/X
I have something similar: a false positive (not sure if it's hardware or non-ECC RAM). But I know it's a false positive since I have a separate hash and a PAR file, which both confirm the file has no errors. I also used "-e fix" and confirmed the recovered file is corrupted while the original file is fine.
How do I get SnapRAID to rehash and recreate the parity for those files with false-positive errors, and hence clear the scrub errors?
"Touch" the file to update its dates. The next sync should see the dates have changed, assume the file was changed intentionally, and update the parity information. I like FileDateChanger, just drag+drop the file onto it and click the Change button.
Got it, makes sense. Thanks!