OK, so Murphy's Law kicked in on me and I now have 2 drives throwing errors in my array of 4. I realise that with only 1 parity disk my options of a full recovery are probably slim or non-existent, but wanted to throw out some options to see if I can minimise my losses.
Obviously the first step is to copy the 2 dying drives to new ones to stop any further loss and to have only fully functional disks for the (attempted) recovery. For this I was going to use ddrescue as both the old and new drives have exactly the same number of logical sectors. The new drives are the 4k physical sector ones, so by using ddrescue I know the partition alignment will be off, but I can correct that later, once I have recovered what I can.
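For what it's worth, the ddrescue sequence I have in mind looks roughly like this (device names are placeholders for the actual old and new drives, and the map file name is just an example): a fast first pass that skips the bad areas, then a retry pass over whatever is left, both sharing the same map file so no progress is lost.

  # First pass: copy everything readable, skip the slow scraping of bad areas
  ddrescue -f -n /dev/sdOLD /dev/sdNEW rescue-disk1.map
  # Second pass: go back and retry the remaining bad areas a few times
  ddrescue -f -r3 /dev/sdOLD /dev/sdNEW rescue-disk1.map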
Next my plan was to run "check -a -d disk" against each disk in turn to see which files it thinks are in need of recovery. But I have a question here. I did a dry-run of this on one of my good disks, disk2, and at completion it threw out 15 "Error stating dir" errors, all of which referenced directories on one of the dying disks, disk1. Why would this be, as I thought "check -a -d disk" would only verify the contents of that disk against the content file? Could that be some artifact of having the 4 mounted drives re-mounted under a single directory using mergerfs? I won't have the pool mounted while I am doing the recovery.
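Concretely, the per-disk check I'm talking about is something like the following (the disk name comes from my config, and the log file name is just an example):

  # Audit only the files on the named data disk against the content file
  snapraid check -a -d disk1 -l check-disk1.log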
Next, I was going to save off a copy of one of the content files, which I'm hoping I can use later to see how much data really is lost.
Now it's time to run the "fix" across the array. If I'm understanding correctly how the process works, I should only really lose data where there is corruption on both of the failed disks within the same recovery/parity calculation. Where there is loss on only one of the disks, it should recover correctly.
Here's the point at which I was planning on using the "saved" content file to recheck each disk, to verify which files had been recovered correctly and which are really lost.
Obviously at the end of the process, I need to "sync" to make sure I'm covered going forward.
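Put together, the plan above would look something like this (the content file path is only an example; the real one comes from my snapraid.conf):

  # Keep a copy of the current content file before touching anything
  cp /var/snapraid/snapraid.content /root/snapraid.content.before-fix
  # Attempt the recovery, logging everything for later review
  snapraid fix -l fix.log
  # Re-verify the data on the disks against the stored hashes
  snapraid check -a
  # Once happy, bring parity back in line with what is now on the disks
  snapraid sync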
Before any of the errors started, the array was fully synced, and since then nothing has been added to or deleted from what is protected. There are other directories on the disks, excluded from the array, that will have been modified, but I don't really care about those; hence them being excluded.
And if it makes any difference, all this is in CentOS 6.8.
Anything else I should try/consider with this?
Cheers.
I think you are overcomplicating it.
1. Recover the data from the dying disks to new disks as well as possible.
2. Update config to point to the new disks
3. Run Snapraid diff to get an estimation of how much needs to be recovered
4. Run Snapraid fix -U -l logfile.txt to make snapraid attempt to fix everything it can
5. Run Snapraid diff again to see how much is still missing after the fix.
The fix option -U is needed to make snapraid accept that two disks have changed UUID.
Make sure that the tool you use to recover files from the dying disks preserves timestamps with sub-second precision, to avoid problems recognizing the recovered files.
You can run fix and diff multiple times if you run into trouble. The content file will not change until you run snapraid sync. But it is of course still a good idea to make a backup of the content file before you begin.
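As commands, the whole sequence is roughly (the log file name is just an example):

  # estimate how much needs to be recovered
  snapraid diff
  # attempt to fix everything, accepting the changed UUIDs
  snapraid fix -U -l logfile.txt
  # see how much is still missing after the fix
  snapraid diff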
Are the drives just showing errors that aren't incrementing any counters, or are they showing pre-fail SMART counts?
For SMART that's attributes 5, 187, 188, 197 and 198 (based on data released by Backblaze).
I ask this because in my father-in-law's array we have a disk that showed 4 errors at 3 days old (not seen at the time, as it was in an external caddy). This drive has now been running for over 1,300 days with no issues.
I had a disk that started counting SMART 187 and it died within days.
Move your data onto new disks if you have them. Stress test the old ones to see if they do fail.
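If you want to check just those attributes, something like this will do it (the attribute names are the usual smartctl labels; they can vary a little between vendors):

  smartctl -A /dev/sdX | grep -E 'Reallocated_Sector|Reported_Uncorrect|Command_Timeout|Current_Pending|Offline_Uncorrectable'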
Last edit: Steve Miles 2016-08-07
I guess Murphy's boots were a few sizes bigger than I expected. After powering down the server to add one of the replacement drives, it now doesn't power up. The fans spin for a few seconds and then it shuts down; a second later it repeats the process, again and again. After a few basic checks on the PSU, clearing CMOS, etc., it looks like my MB or CPU cr@pped out. Time to build a new server.
@Leifi
I guess I was trying to see at each stage exactly what was still correct and what needed to be fixed. But again, a couple of questions:
From my original post, do you have any idea why a "check -a -d disk2" would throw the "Error stating dir" errors that all referenced disk1?
You say to run a "diff" to see what needs to be recovered. I thought that only compared file name, size, date/time, and inode (anything else?) to see differences. I didn't think it hashed the file, so how would it know something was corrupt if all the other information was the same?
@Steve
Up until a few days ago, I would not have thought they were really dying either. Here's a brief history:
Originally all the disks were attached to an IBM M105 (??) flashed to look like an LSI 9211 running in IT mode. This worked well for quite a while. Then I started to see random errors on 2 of the disks, which forced me to do a umount/mount to clear them, as they were formatted as XFS. None of these errors increased any of the SMART counts, or led to any corruption of data.
This went on for a while and I came to the conclusion that it must be the controller slowly dying. So I replaced it with a genuine LSI 9211. Unfortunately that didn't seem to make any difference. Still many "soft" errors that caused Linux to drop/reassign the LUN, but nothing in SMART or data issues.
After realising that it was only 2 drives that ever had these errors, I thought (well, wildly guessed) that somehow they might be slightly out of spec, and the LSI controller was super-sensitive to this. So I moved the 2 "problem children" from the LSI to the onboard SATA ports.
Well, this appeared to work. No more errors for over 6 months. But over the last couple of weeks, these errors have started to creep back, but only on one of the disks. Initially, it was the same symptoms, just a "soft" error that wasn't reported to SMART. But then that disk threw a bunch of "Current pending sector" and "Offline uncorrectable" errors. This was followed very quickly by another disk, which wasn't one of my previous "problem" ones, throwing the same SMART errors.
Hence my double disk failure I'm trying to fix. Once I get my server back up and running. :-)
Cheers.
I'm afraid I can only guess, same as you, that the "Error stating dir" messages are related to mergerfs.
Regarding diff vs check.
Yes, you are correct that diff will only look at missing files and the metadata of existing files, and that it will not alert you to any file corruption.
But why do you want to be informed about that before you run fix?
Fix will verify the correctness of all data and attempt to fix any corrupted, modified or missing files it encounters.
Corrupt files that can't be fixed will be renamed to filename.broken or something similar.
So after fix is complete, snapraid diff will give you a detailed list of which files are bad or missing and still need recovery.
At this point you can manually attempt to recover some of the broken or missing files, and if successful you can attempt the fix again with those files added back to the array.
Snapraid fix will never update the file hashes in the content file with bad information from a broken file, which is the reason why you can run fix again and again until everything is OK.
Edit: It would also be easier to understand the situation if you clarified how many SMART errors there are. If we are talking about a few thousand, then there is a good chance that you will recover 99.9% with ddrescue alone and then 100% with snapraid.
If you have several million bad sectors, there is a high risk that ddrescue will be very unsuccessful, and if that is the case for both disks then it is pretty much game over.
In either case, the -n option in ddrescue, which makes it skip bad areas and attempt to save everything that is still OK, seems like a very good option to begin with.
Last edit: Leifi Plomeros 2016-08-08
I'm only talking about a handful of Uncorrectable/Pending sectors. Unless it's changed, it was 16 on one drive and 8 on the other. But once these start, there's usually the possibility of a huge cascade, hence saving off everything quickly and then pushing the drives through all the stress testing I can. After all, if they don't fail any more, I have 2 more disks I can use. :-)
What concerns me more is the level of "soft" errors I get, where Linux reports an error and then usually drops and re-assigns the LUN, which then floods my logs with errors until I can un-mount and re-mount the drive, and also drops that drive from my pool.
Cheers.
I don't think you need to be very worried about data loss in that case.
Get the server up and running again.
Manually copy all files from the disks with bad sectors, or use ddrescue (just make sure that whatever tool you use handles timestamps with sub-second precision).
Edit config so that it points to the new disks.
Run diff followed by fix or check, whichever seems most appropriate.
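For the config step, only the paths in the disk lines need to change; keep the original disk names so snapraid can match the data up. A sketch with placeholder mount points (which lines you touch depends on which disks were replaced):

  # snapraid.conf excerpt - names and mount points below are examples
  disk disk1 /mnt/newdrive1/
  disk disk3 /mnt/newdrive3/
  # all other disk/content/parity lines stay exactly as they were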