I'm currently running SnapRAID v6.1 on Linux-i686 (CentOS 5.10).
I've been running SnapRAID for years without any problems. Nothing has changed in my hardware setup. No drives have been added or removed recently. The server doesn't seem to be having any hardware issues. All drives went through a full fsck fairly recently (~2 months ago) without any problems. All drives also run through weekly cycles of S.M.A.R.T. long and short scans without any errors. None of the drives in the system are full, and I did a 100% scrub of the SnapRAID array about a month ago without problems.
--
The Problem:
This morning the script that runs a nightly sync returned an error when running a diff to see if a sync was needed. This is the output of the diff command:
$ snapraid diff
Self test...
Loading state from /FlexRaid/DRU1/sr-content...
Decoding error in '/FlexRaid/DRU1/sr-content' at offset 1143771534
The file CRC is correct!
Internal inconsistency for used hole!
I do have a total of 3 content files on 3 different disks in my setup but all 3 of them report the exact same error. This is what the file looks like on the disk:
-rw------- 1 root root 1386301917 Jun 12 10:12 sr-content
It seems that running any SnapRAID command (sync, diff, etc.) reports the same issue. I did find other people reporting a similar issue who were told to run snapraid with "test-rewrite", but even that gives the same result as the diff command above.
I figure worst case I can just wipe all the parity data and re-sync the whole system, but that takes forever (48+ hours the last time I had to do it, I think), so I'd like to avoid that if possible.
You do need to rebuild.
Your content files all have (the same) corrupted data element, likely due to a single memory glitch (during the previous content-modifying snapraid run).
Are you using ECC memory? (I bet not)
You might also want to do a thorough system memory checkout.
Gary, look on the bright side -- if you had lost a data drive yesterday, it would not have been recoverable.
This emphasizes just how critical the content file is to array protection. Having N copies does no good if they can all be similarly corrupt.
Andrea: suggestion--rename the previous (about to be replaced [near end of state_write()]) "content" file to "content.bak". With an option to disable the enhancement (and its protection) for users tight on disk space.
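To make the "content.bak" suggestion concrete, here is a minimal Python sketch of the idea. It is not SnapRAID code: the function name `replace_keeping_backup` and the `.tmp` suffix are hypothetical, and only the `.bak` naming comes from the suggestion above.

```python
import os


def replace_keeping_backup(path, new_bytes):
    """Write new_bytes to path, keeping the previous version as path + '.bak'.

    Hypothetical helper for illustration only; not part of SnapRAID.
    """
    tmp = path + ".tmp"  # assumed temporary name
    with open(tmp, "wb") as f:
        f.write(new_bytes)
        f.flush()
        os.fsync(f.fileno())  # push the new copy to disk before any rename
    if os.path.exists(path):
        # The previous copy survives under the .bak name.
        os.replace(path, path + ".bak")
    os.replace(tmp, path)
```

Note that between the two renames there is a short window in which only the `.bak` and `.tmp` files exist, so a real implementation would need to handle finding the file under either name at startup.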
Hi Gary,
A hardware glitch seems the most likely cause of this. If you can upload your content file somewhere, I can have a look at it and confirm what the problem is, or rule that out.
For now, though, the best option for you seems to be recreating the parity.
What Clem Luser suggests is a good idea. It would be better if SnapRAID automatically kept backup copies of the content file.
Something to add in the next version.
Ciao,
Andrea
Well, I ran 5 iterations of memtester on my server and all 5 passed without errors. I know that doesn't mean errors can't still happen, but it doesn't look like my memory is just popping up errors all the time.
I've upgraded to SnapRAID v6.2, formatted my parity drives (I run 2 of them), deleted the 3 content files, and kicked off a fresh sync. ETA to completion as of this writing is ~30 hours.
On the subject of SnapRAID keeping a copy of the previous version of the content files in the future: if I had had a manual copy of the last known good content file, would I have been able to just copy it into place, run a sync or whatever, and have everything continue working as usual? And if the parity data and the content files are not linked or matched, so that a backup content file can be used, could I have just deleted the bad content files and had SnapRAID create new ones while continuing to use the existing parity data, thus avoiding a 30+ hour initial sync to generate the parity data again? The assumption here is that after running the sync command, any difference between the parity data and the restored (or even newly created) content files would get corrected and the system would be back in a protected state.
Hi Gary,
No. An old copy of the content file would still require a full scrub to detect the unsynced parity.
But it would have the advantage of keeping all the previously computed hashes.
To be able to use an old content file, SnapRAID would have to keep track of when each parity block was written, and recompute everything that is more recent than the content file used.
For now, I can only make it recognize that the content file is older than the parity file and force a full sync if that happens.
Ciao,
Andrea
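The "content file older than the parity file" check mentioned above could look roughly like the following Python sketch. The post does not say how SnapRAID would detect this, so comparing file modification times is purely an assumption here, and the function name `content_is_stale` is hypothetical.

```python
import os


def content_is_stale(content_path, parity_path):
    """Return True if the content file was last modified before the parity file.

    Assumption for illustration: modification times are a usable proxy for
    "older than". This is not how SnapRAID is known to decide.
    """
    return os.path.getmtime(content_path) < os.path.getmtime(parity_path)
```

A caller could use such a check to refuse a restored backup content file and ask for a full sync instead. Timestamps can be changed by copying or restoring files, so this is only a heuristic.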
I have the same issue: the content file is broken, and I have it on 4 HDDs. I deleted them one by one. What happened? Now I have to wait 14 hours for a full sync :(
I had messages like this:
Loading state from F:/array/content.lst...
Decoding error in 'F:/array/content.lst' at offset 1168222618
Mismatching CRC in 'F:/array/content.lst'
This content file is damaged! Use an alternate copy.
Last edit: mradu 2015-10-26
Sounds like this is a pretty bad issue. Does a backup of an old content file help at all? I have mine all backed up every day. It would just be nice to know in case this happens.
Yes, I run a sync every day, but I don't always check the logs. I think the problem is 2-3 days old at most. Anyhow, no HDD was broken, but if I had lost one with the content files in this state, it seems no recovery would have been possible :(. To me this is a software issue, not a hardware issue.
I have 4 content files (on 4 HDDs)
2 parity drives
So it's a little odd for all 4 HDDs to have an issue and break every content file, or for the 2 parity drives to hold wrong data at the same time compared with the content list.
What am I missing here? :(
LE: I just checked, and the first issue was on 24 Oct; today is the 26th. Somehow that is the date I started the Backblaze cloud backup.
The content.lst files only keep info about files that are processed by SnapRAID and not ignored, no?
Last edit: mradu 2015-10-26
My assumption is a scheduled shutdown while the content.lst files were being closed.
Hi mradu,
In theory, a shutdown, or even a kernel crash, should not cause this kind of problem. SnapRAID always writes new full copies of the content files, and only when finished does it rename them over the old ones. So, you should get either the new one or the old one.
The most likely cause is a RAM error. If the data written by SnapRAID is bad, this can remain undetected until you try to use that content file.
For the upcoming 9.0 version, I've added a new CRC check of the new content files, just before they are renamed over the old ones. This should prevent this kind of issue.
Ciao,
Andrea
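The write-then-rename pattern with a CRC check before the rename, as described above, can be sketched in Python as follows. This only illustrates the general technique and is not SnapRAID's implementation: the function name `write_verified`, the `.tmp` suffix, and the use of `zlib.crc32` are all assumptions.

```python
import os
import zlib


def write_verified(path, data):
    """Write data to a temp file, verify its CRC by re-reading, then rename.

    Hypothetical helper for illustration only; not part of SnapRAID.
    """
    expected = zlib.crc32(data)
    tmp = path + ".tmp"  # assumed temporary name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Re-read what was written and compare checksums before touching the old file.
    with open(tmp, "rb") as f:
        actual = zlib.crc32(f.read())
    if actual != expected:
        os.remove(tmp)
        raise IOError("CRC mismatch after write; keeping the old file")
    # Only a verified copy ever replaces the old file.
    os.replace(tmp, path)
```

If the check fails, the old file is left untouched. The re-read may be served from the operating system's cache rather than from the disk, and a checksum computed over data that was already corrupted in memory will still match, so this narrows the window for bad content files rather than closing it.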
And when this error happens, what must be done? A (re)sync?
Yes. If you don't have any other content file, a resync is the only option.
In that case, even an older content file could be useful. Even if it's no longer in sync with the parity, it's surely better than nothing.
Anyway, with the upcoming SnapRAID 9.0 I really hope to have found a definitive solution for this case.
The content files are now written independently from different threads, and with a post-write CRC verification. So, the condition of ending up with all the content files having a bad CRC should really not happen anymore.
Ciao,
Andrea