From: Tim G. <tj...@so...> - 2012-08-31 20:51:29
Attachments:
smartctl.txt
Hi,

I got an e-mail from smartd telling me that one of my drives has some issues. I logged into the server and ran:

smartctl -a /dev/ada30

and got a ton of output, including 5 "error" sections. But I can't tell how severe these errors are. I'm including the output below. Can someone please help me interpret this output?

--
Tim Gustafson
tj...@so...
831-459-5354
Baskin Engineering, Room 313A
From: Gabriele P. <gp...@di...> - 2012-08-31 22:46:37
Tim,

On 08/31/2012 10:43 PM, Tim Gustafson wrote:
> smartctl -a /dev/ada30
>
> and got a ton of output, including 5 "error" sections. But, I can't
> tell how severe these errors are. I'm including the output below.
> Can someone please help me interpret this output?

The health check result is "passed" and the short test completed without error, so don't worry ~

Some more hints:
http://sourceforge.net/apps/trac/smartmontools/wiki/Howto_ReadSmartctlReports_ATA_new

HTH
Gabriele
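The overall-health verdict Gabriele points at can also be checked from a script. A minimal sketch, assuming the health section has been saved to a file first; the sample text below is illustrative, not Tim's actual report:

```shell
# On the real machine you would save the section with (requires root):
#   smartctl -H /dev/ada30 > /tmp/health.txt
# Here an illustrative sample stands in for a live disk.
cat > /tmp/health.txt <<'EOF'
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
EOF

# Extract just the verdict word (PASSED or FAILED!) for scripting.
awk -F': ' '/overall-health/ {print $2}' /tmp/health.txt
```

A monitoring job could alert whenever the printed verdict is anything other than PASSED.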
From: Tim G. <tj...@so...> - 2012-08-31 22:52:42
> health check result is "passed" and
> short test completed without error,
> so don't worry ~
>
> Some more hints..
>
> http://sourceforge.net/apps/trac/smartmontools/wiki/Howto_ReadSmartctlReports_ATA_new

Thanks! I appreciate your taking the time, and I'll read up on that document.

--
Tim Gustafson
tj...@so...
831-459-5354
Baskin Engineering, Room 313A
From: Dan L. <da...@ob...> - 2012-09-01 07:20:20
On 09/01/12 00:52, Tim Gustafson wrote:
>> health check result is "passed" and
>> short test completed without error,
>> so don't worry ~

Well, I will offer a somewhat different interpretation.

Attribute 5 shows that there have been three unreadable sectors in the past (already resolved by reallocation). And they are not "manufacturing errors": the latest UNC sector was found at lifetime 5353 hours.

I can compare it with my own FreeBSD box with a Hitachi disk (although a different model), which has no reallocated sectors and no recorded errors at lifetime 13672 hours.

Tim's disk encountered several problems in the past, as recorded in the error log and attributes. I'm not saying it is "time to panic" in any way, but unless they were caused by a known event (a fall, a brownout or so), he should take a dim view of his disk.

Of course, interpretation of the output depends not only on the values shown, but also on the overall life experience and "paranoia level" of the administrator.

In short: the facts are the same, but I'm not as credulous as Gabriele in interpreting them. You need to make your own decision, Tim ...

Dan
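The attribute Dan is reading can be pulled out of a saved attribute table mechanically. A sketch with an illustrative two-row slice of `smartctl -A` output whose raw values mirror what Dan describes (three reallocated sectors), not Tim's exact report:

```shell
# Illustrative slice of a `smartctl -A /dev/ada30` attribute table.
cat > /tmp/attrs.txt <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
EOF

# The RAW_VALUE (last column) of attribute 5 is the count Dan refers to;
# a growing raw value over time is the real warning sign.
awk '$1 == 5 {print $NF}' /tmp/attrs.txt
```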
From: Gabriele P. <gp...@di...> - 2012-09-01 12:06:34
On 09/01/2012 08:43 AM, Dan Lukes wrote:
> On 09/01/12 00:52, Tim Gustafson:
>>> health check result is "passed" and
>>> short test completed without error,
>>> so don't worry ~
> Well, I will offer somewhat interpretation.

thanks!

> Attribute [5] show that there has been three unreadable sectors in the
> past (already solved by relocation). And they are not "manufacturing
> errors" - latest UNC sector has been found at lifetime 5353h
>
> unless they has been caused by known event (fall, brownout or
> so) he should take a dim view of it's disc.

About getting a deeper view of the disk's condition: as you have only run a short test, I recommend starting a long test to check the whole disk.

"Auto Offline Data Collection is Enabled", but "Offline data collection activity was suspended by an interrupting command from host".

It announces a very long "Total time to complete Offline data collection" of 37566 seconds ~ 10.5 hours, which can also take longer if the disk is in heavy use.

Is it possible to unmount the disk to check it in captive mode?

> In short - facts are same, but I'm not as credulous as Gabrielle during
> it's interpretation.

It was intended as the first entry of a discussion. Thanks for picking it up! :-)

> You need to make your own decision, Tim ...

I would like to see some showcases about exploring a disk's condition in the wiki. That would surely be helpful for many other smartmontools users. Tell us here about your next steps and results if you like:
https://sourceforge.net/apps/trac/smartmontools/wiki/Howto_ReadSmartctlReports_ATA_542.1

You need to be logged in at SourceForge to edit the page.

All the best and cheers!
Gabriele
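Gabriele's ~10.5-hour figure comes straight from the reported 37566 seconds; the conversion can be sketched as:

```shell
# Convert the "Total time to complete Offline data collection"
# figure from the smartctl report into hours and minutes.
secs=37566
printf '%d hours, %d minutes\n' "$((secs / 3600))" "$((secs % 3600 / 60))"
```

On a busy disk the long self-test can run well past this estimate, since it yields to regular I/O.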
From: Dan L. <da...@ob...> - 2012-09-01 15:40:05
On 09/01/12 13:54, Gabriele Pohl wrote:
> to start a long test to check the whole disk.
> It announces a very long "Total time to complete
> Offline data collection" of 37566 seconds ~ 10.5 hours,
> which can also be last longer if the disk is in heavy use.
>
> Is it possible to umount the disk to check in captive mode?

But not on FreeBSD. It has no concept of "unlimited time to wait for a result". The device driver requires that the device respond to a command in time, and the timeout is not configurable from the application level. A test in captive mode results in a timeout and detach of the device from the system. An attempt to reattach may trigger a reset of the device, so the test will be interrupted.

As there is no big difference between the duration of a test in captive mode and a test in standard mode (on an idle disk), the "timeout problem" is not a big issue; just don't use captive mode on FreeBSD.

> I would like to see some show cases about exploring
> the disks condition in the wiki. That would be for sure
> helpful for many other smartmontools users.

I have some doubts. Analysis of data from one disk model has very limited applicability to another disk model, even worse across different vendors. Moreover, one-shot data has very limited usability at all; it is the progression over a longer time that is substantial.

> Tell here about your next steps and results if you like
> https://sourceforge.net/apps/trac/smartmontools/wiki/Howto_ReadSmartctlReports_ATA_542.1
>
> You need to be logged in at sourceforge to edit the page.

Unfortunately, my English is not good enough for a public page.

> SMART overall-health state
> ..missing an explanation..

It is the disk's own interpretation of its health state. The most common algorithm seen is: if no cooked value of a "pre-failure warning" type attribute is below its threshold, then the overall state is "PASS". But the true algorithm is vendor specific. It's just the vendor's opinion about the overall health state of the disk.

Dan
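Since Dan's advice boils down to "run the long test in standard (non-captive) mode", one follow-up is letting smartd schedule it periodically. A hypothetical smartd.conf entry for the device in this thread (the schedule and file location, typically /usr/local/etc/smartd.conf on FreeBSD, are assumptions for illustration):

```
# Monitor all SMART attributes (-a) and start a long self-test every
# Saturday between 03:00 and 04:00.  The -s regex fields are
# Type/Month/DayOfMonth/DayOfWeek/Hour; "L" selects the long test.
/dev/ada30 -a -s L/../../6/03
```

The test runs in the background, so the pool stays online; it just competes with regular I/O and may take longer than the quoted estimate.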
From: Tim G. <tj...@so...> - 2012-09-04 16:06:55
> As there is no big difference in duration of test in captive mode and
> test in standard mode (on idle disk) the "timeout problem" is not big
> issue - just don't use captive mode on FreeBSD.

The disk is not idle, and bringing it out of service is a bit problematic because it's a member drive of a 45-disk zpool, so I'd have to bring the whole zpool off-line to test it in any sort of offline or idle state.

It sounds like you're saying that I can run the test while the drive is "hot", but the test might take longer. I'm OK with that. But I'm not clear on what test you want me to run. Would it be possible for you to send me the command line you'd like to see the output of?

For what it's worth, this zpool ran a "zpool scrub" this past weekend without any errors, so I'm fairly confident that this may have been a one-time thing. I don't know when the failure started being reported because I didn't configure the box originally, and it wasn't until the other day that I reconfigured it to e-mail me notifications. Before that, it was only reporting bits to /var/log/messages, and I only have a week or so worth of those lying around at any given time.

The box has been on-line for about a year, but we're just getting around to storing data on it now, so it sat mostly idle for 9 months or so.

--
Tim Gustafson
tj...@so...
831-459-5354
Baskin Engineering, Room 313A
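The e-mail notification Tim mentions setting up is a one-line smartd.conf directive. A sketch, with a placeholder address (Tim's actual configuration is not shown in the thread):

```
# Monitor all attributes (-a) and mail warnings to the given address
# (-m).  "-M test" sends a single test message at smartd startup, so
# you can confirm delivery works before a real failure happens.
/dev/ada30 -a -m admin@example.com -M test
```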
From: Dan L. <da...@ob...> - 2012-09-05 15:51:52
Tim Gustafson wrote:
> It sounds like you're saying that I can run
> the test while the drive is "hot", but the test might take longer.

Exactly. Also, a running test can slightly affect the speed of the running system.

> But, I'm not clear on what test you want me to run.

-t long

Dan
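Once `smartctl -t long /dev/ada30` has been started and has finished (progress is visible via `smartctl -l selftest /dev/ada30`), the result appears in the self-test log. A sketch of checking a saved copy of that log; the two entries below are illustrative, not Tim's real log:

```shell
# Illustrative `smartctl -l selftest` output saved to a file.
cat > /tmp/selftest.txt <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5400         -
# 2  Short offline       Completed without error       00%      5353         -
EOF

# Count entries that completed cleanly; a "Completed: read failure"
# line (with an LBA in the last column) would warrant concern.
grep -c 'Completed without error' /tmp/selftest.txt
```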