I'm running 32-bit Windows with seven 1TB drives: 5 data and 2 parity. Everything seems OK in normal use. I can add files and run sync successfully. But when I scrub, I get errors on all drives, and it tells me to run -e fix. When I do that, I usually get all files recovered, or all but one. The one marked unrecoverable is a video and seems to play OK. When I run -e fix, it says to scrub to remove the errors, and when I do that, I get unexpected errors on all data and parity drives, and it tells me to run -e fix ... rinse, repeat, forever.
What am I seeing? Bad drives? Bad memory? Bad controller card or cables? The actual data seems to be fine. The drives are pretty old, and one has a single 197 SMART error, but it has had that error for 15 months.
Any suggestions for a methodical approach to figuring out what is happening?
Is it the same files that report errors every time you run scrub?
There are dozens of them, and I don't write them down, so I don't know. Is there any way to save the list after it's run? I know I could pipe to a file, but it takes hours to run a scrub.
I ran a -e fix several hours ago. It claimed to recover all but one file. I'm now running a 100% scrub, and it's finding lots of errors. Does that make sense?
Last edit: stelser 2015-06-05
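One way to answer the "same files every time?" question is to save the output of each scrub to a file and compare the two lists afterwards. Below is a minimal Python sketch of that comparison. The `error:<block>:<disk>:<path>` line layout is a made-up stand-in, not SnapRAID's actual log format, and the function names are hypothetical; the parsing would need adjusting to whatever the saved output really looks like.

```python
def error_files(log_text):
    """Collect (disk, path) pairs from lines in the ASSUMED layout
    error:<block>:<disk>:<path> (hypothetical, not SnapRAID's real format)."""
    files = set()
    for line in log_text.splitlines():
        fields = line.strip().split(":", 3)
        if len(fields) == 4 and fields[0] == "error":
            _tag, _block, disk, path = fields
            files.add((disk, path))
    return files


def compare_runs(first_log, second_log):
    """Split the affected files into: seen in both runs, only the first, only the second."""
    first = error_files(first_log)
    second = error_files(second_log)
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


# Illustrative data only; these are not lines from a real scrub.
run1 = "error:10:d1:video/a.mkv\nerror:11:d1:video/a.mkv\nerror:90:d3:video/c.mkv\n"
run2 = "error:10:d1:video/a.mkv\nerror:42:d2:video/b.mkv\n"

result = compare_runs(run1, run2)
for group, entries in result.items():
    print(group, entries)
```

If most entries land in "both", the same files are failing repeatedly; if the sets barely overlap, the errors are moving around between runs.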
stelser, as you are in "debugging" mode and have scrubbed everything, have you tried running "check"? (It shouldn't be different, but as it is different code you never know.)
/X
I ran a 100% scrub of everything older than 1 day. It said 0 file errors, 0 io errors and 38050 data errors.
Then I started check. Within minutes it had found 3 recoverable files. I reran check twice more and it found the same 3 files within moments. Is there any point in running a complete check?
I'd just finished a complete -e fix before starting the 100% scrub.
My problem is I don't understand what a data error is, as compared to a file or io error. If a file needs to be recovered, why don't I see any file errors?
Next I ran a sync. It ran normally. It reported no changed files and nothing to do.
So what does this all mean?
Last edit: stelser 2015-06-05
In v8.0 for "scrub" (if I have understood the code correctly):
"File errors" are all errors that are not "io errors" or "data errors".
"Data errors" are silent errors, i.e. hash or parity errors, with the caveat that "it's a silent error only if we are dealing with synced blocks".
Silent errors in file data are reported as "error".
Silent errors in parity data are reported as "fatal".
/X
Could I be seeing a failing parity drive? AFAICT, all my actual data is fine. What kind of test can I run to detect the cause of this problem?
I have two difficulties tracking this down. The first is that most of my drives are on a RAID controller, and smartmon can't see through that controller to the drives to do SMART testing.
The second is that I'm disabled and can't stand up or bend over to make physical changes to the computers (swap drives and cables, etc.). I have to hire help or get a nontechnical friend to do it, so I want the most info possible with the least effort for any physical debugging needed.
I can remotely control the computer from my hospital bed for virtual testing (running programs, etc.)
Status test reports thousands of errors.
Smartmon can't see parity1 drive.
Smartmon says parity2 drive has 1 error type 197 and 1 error type 198, no errors type 5. It has reported that same thing for more than a year.
What would you test next?
Hmm, I hope someone else might chip in with a simple answer to you (Andrea, Jessie?)
Otherwise, without the more detailed log from the "-l" option (or at least the beginning of it), it's hard for me to be of much help. With the log I could at least try to trace the logic in the code. It would, though, expose filenames and paths, which could take some effort to anonymize.
Btw, have you had this issue from the beginning or did it start after some time of running correctly or after a specific snapraid version upgrade (with improved detection of problems in hardware)?
/X
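The anonymizing step mentioned above could be scripted. Here is a rough Python sketch that replaces each path with a short salted hash, so the same file still gets the same token on every line and a reader can follow one file through the log. The `error:<block>:<disk>:<path>` layout is an assumption made for illustration only and is not taken from a real SnapRAID log; the function names and the `salt` parameter are hypothetical.

```python
import hashlib


def anonymize_line(line, salt="change-me"):
    """Replace the path in an ASSUMED error:<block>:<disk>:<path> line with a stable token.
    Lines that do not match the assumed layout are returned unchanged."""
    stripped = line.rstrip("\n")
    fields = stripped.split(":", 3)
    if len(fields) != 4 or fields[0] != "error":
        return stripped
    tag, block, disk, path = fields
    # Same path + same salt always yields the same token.
    digest = hashlib.sha256((salt + path).encode("utf-8")).hexdigest()[:12]
    return f"{tag}:{block}:{disk}:file-{digest}"


def anonymize_log(text, salt="change-me"):
    """Anonymize every line of a log held in memory."""
    return "\n".join(anonymize_line(line, salt) for line in text.splitlines())


# Illustrative data only.
sample = "error:10:d1:video/holiday.mkv\nerror:11:d1:video/holiday.mkv\nsome other line\n"
cleaned = anonymize_log(sample)
print(cleaned)
```

Choosing a private salt keeps someone from guessing common filenames by hashing them. Any line outside the assumed layout passes through untouched, so it would still need a manual check before posting.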
It started after a year or so of use. Can you at least tell me how it's supposed to work? From the messages it gives me, I assumed one should run -e fix, then run scrub to make the errors disappear. But it isn't clear why simply running the fix isn't sufficient. IOW, why is the scrub needed after fixing?
Can you confirm that running -e fix, followed by a 100% scrub should have fixed it?
I have a feeling that something in your configuration (hardware/software/settings/something somewhere) changed (or failed?) just before these errors started to occur, either with or without your knowledge or intent. It doesn't seem like this is something that could ever happen on a long-running healthy array.
Last edit: Quaraxkad 2015-06-06
stelser, as far as I know the "-e" filter is there to fix only blocks found in error, i.e. fix with -e after a 100% scrub should fix all errors. BUT given all the errors you have, I would not fix anything before knowing better what is actually in error (but that's me).
/X
I'd appreciate a suggestion on where to go from here. I can test each drive with smartmon or chkdsk if that makes sense, but I was hoping I could narrow things down a bit from the messages snapraid gives. Isn't there some error check I can do that will tell me what snapraid thinks the problem is? I don't mind doing tests that take a long time if that's the best approach. None of the data is critical, but I'd like to keep what I can, and so far no data seems to be lost, yet.
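One way to narrow things down from the messages alone is to count the reported errors per disk in a saved log. A count concentrated on one disk would point at that drive or its cable, while an even spread across all disks might point at something shared, such as memory, the controller, or power. This is only a rough Python sketch: the `error:<block>:<disk>:<path>` layout is an invented placeholder rather than SnapRAID's documented format, and `errors_per_disk` is a hypothetical helper.

```python
from collections import Counter


def errors_per_disk(log_text):
    """Count error lines per disk name, ASSUMING lines look like
    error:<block>:<disk>:<path> (hypothetical layout)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.strip().split(":", 3)
        if len(fields) == 4 and fields[0] == "error":
            counts[fields[2]] += 1
    return counts


# Illustrative data only.
sample = (
    "error:10:d1:video/a.mkv\n"
    "error:11:d1:video/a.mkv\n"
    "error:42:d2:video/b.mkv\n"
    "unrelated line\n"
)
counts = errors_per_disk(sample)
for disk, n in counts.most_common():
    print(disk, n)
```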
If you are receiving a SMART error 197 which has a norm-ed value less than 100 then you almost certainly have a bad drive which needs immediate replacement. I would recommend you copy all data off that drive immediately and replace it.
Smartmon gives me an error count of 1 on 197 on that drive, which is my second parity drive. It's been giving me that same count for 2 years, unchanged. For the first year, it was used as a data drive, not in snapraid, and for the last year it has been used as the parity drive. For at least 10-11 months, there have been no errors reported from snapraid. I know that a 197 error is not good, but are my symptoms here consistent with a failing parity drive? If so, is there any way to confirm that's the problem other than pulling it and rebuilding with a single parity drive?
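The two posts above are talking about different numbers for the same attribute: the normalized value (the "norm-ed" figure) and the raw count (the "1"). A small Python sketch that pulls both out of one row of `smartctl -A` attribute-table output may make the distinction clearer. The sample row is illustrative, not the poster's actual output, and `parse_smart_row` is a hypothetical helper that assumes the usual ten-column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE).

```python
def parse_smart_row(row):
    """Split one attribute row from `smartctl -A` into named fields.
    Assumes the standard ten-column table layout."""
    parts = row.split()
    if len(parts) < 10:
        raise ValueError("not a complete attribute row")
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),       # normalized value
        "worst": int(parts[4]),       # lowest normalized value seen so far
        "thresh": int(parts[5]),      # vendor failure threshold
        "raw": " ".join(parts[9:]),   # raw count, kept as text
    }


# Illustrative row only; the numbers are made up for the example.
row = "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1"
attr = parse_smart_row(row)
print(attr["name"], "normalized:", attr["value"], "threshold:", attr["thresh"], "raw:", attr["raw"])
```

In this made-up row the raw count is 1 while the normalized value is still 100, which is the kind of situation where the two descriptions above can both be true at once.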
The computer is rebooting itself under heavy load, like full scrub, so I suspect I have a hardware issue of some sort.
I've done hours of memory testing, chkdsk and smartmon testing. No errors show up. When I add files to the array and sync, I see no errors. SR recognizes the changes correctly and seems to be happy after the sync. But a status lists thousands of errors by number. Is there any way to use one of the error numbers from status to determine if the corresponding file really has an error? Would you run another -e fix? How can I tell whether there are really errors in the data files? Any suggestions at all? Should I move the array to another computer? Swap the HD controller card? Replace drives?
Have you run chkdsk?
Yes, I ran chkdsk on all drives. No errors on data disks or parity disks.
I'm now running -e fix. It varies between saying it needs 10 hours and 27 hours to finish. It is successfully recovering files. I try running the recovered files and they seem fine.
Yes, I ran chkdsk on all drives. No errors on data disks or parity disks.
I'm now running -e fix. It varies between saying it needs 10 hours and 27 hours to finish. It is successfully recovering files. I try running the recovered files and they seem fine.
I didn't hit post twice?
I didn't hit post twice?
Oops, I did that time. My system is slow.
What should I expect to see after this -e fix if I run status? Should it tell me I need to run a scrub, or should it be clean?
The -e fix is 21% complete, with another 10.5 hours to go. It has recovered 17 video files and found one to be unrecoverable. The unrecoverable seems to be perfectly ok if renamed and played. The recovered files seem to be fine, also.
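As a rough sanity check on those progress figures: if the fix keeps running at a constant rate (an assumption, and the swinging estimates reported earlier suggest the rate is not steady), 21% done with 10.5 hours left implies about 13.3 hours in total, of which about 2.8 have already passed. A tiny Python sketch of that arithmetic, with a hypothetical helper name:

```python
def estimate_hours(fraction_done, hours_remaining):
    """Return (elapsed, total) hours, ASSUMING a constant processing rate."""
    if not 0 < fraction_done < 1:
        raise ValueError("fraction_done must be between 0 and 1")
    total = hours_remaining / (1.0 - fraction_done)
    elapsed = total * fraction_done
    return elapsed, total


elapsed, total = estimate_hours(0.21, 10.5)
print(f"elapsed ~{elapsed:.1f} h, total ~{total:.1f} h")  # elapsed ~2.8 h, total ~13.3 h
```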