Thread: [Smartmontools-support]Ignore "Overall Health Self-Assessment", other questions

[Smartmontools-support]Ignore "Overall Health Self-Assessment", other questions

From: John V. <jvi...@di...> - 2002-12-20 11:17:09

Attachments: YK0YKF13008_smartctl-a.txt VNP210B2G7VVEB_smartctl-a_1.txt VNP210B2G7VVEB_smartctl-a_2.txt

Hi.

I'm kind of getting the impression that the drive's own logging
of attribute values isn't necessarily entirely reliable if
one wants to predict failures, and so attribute values should
be logged separately.

A couple of cases:


I've had a Deskstar 75GXP which gave a failing "Technical Result Code"
(700010DB) from IBM's DFT tools, but which still gives:

SMART overall-health self-assessment test result: PASSED

Which seems to render the "overall-health self-assessment"
a bit pointless, from an /end-user's/ point-of-view.
(smartctl -a output attached in YK0YKF13008_smartctl-a.txt)

Am I missing something ?


Also, I have a 60GXP drive where the Raw_Read_Error_Rate
value wanders up and down between 94 and 100.  It has given
a "worst" value for this parameter of 95.  However,
the "cooked" value is currently 100 again,
and the "worst" value has gone back up to 100. (!?!)

I'd have somewhat expected the "worst" value to stay at 94.

Here's a log fragment (omitting temp changes):
Dec 19 04:16:19 [...] 1 Raw_Read_Error_Rate changed from 100 to 97
Dec 19 14:02:28 [...] 1 Raw_Read_Error_Rate changed from 97 to 99
Dec 19 14:42:53 [...] 2 Throughput_Performance changed from 100 to 145
Dec 19 14:42:53 [...] 8 Seek_Time_Performance changed from 100 to 138
Dec 19 16:03:44 [...] 1 Raw_Read_Error_Rate changed from 99 to 100
Dec 19 21:37:14 [...] 1 Raw_Read_Error_Rate changed from 100 to 99
Dec 19 21:47:21 [...] 1 Raw_Read_Error_Rate changed from 99 to 95
Dec 19 22:07:33 [...] 1 Raw_Read_Error_Rate changed from 95 to 94
Dec 19 23:08:11 [...] 2 Throughput_Performance changed from 145 to 100
Dec 19 23:58:43 [...] 1 Raw_Read_Error_Rate changed from 94 to 98
Dec 20 00:08:50 [...] 2 Throughput_Performance changed from 100 to 146
Dec 20 00:29:02 [...] 1 Raw_Read_Error_Rate changed from 98 to 99
Dec 20 00:39:09 [...] 1 Raw_Read_Error_Rate changed from 99 to 100

A couple of logs for the drive attached:
Dec 18 10:46 VNP210B2G7VVEB_smartctl-a_1.txt
Dec 20 10:47 VNP210B2G7VVEB_smartctl-a_2.txt

Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.
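
A minimal sketch of logging the values separately, assuming smartd log
lines shaped like the fragment above.  The function name observed_worst
and the regular expression are illustrative, not part of smartmontools.
It tracks the lowest normalized value seen per attribute, independent of
whatever the drive itself reports as "worst".

```python
import re

# Matches lines shaped like the log fragment, e.g.
# "Dec 19 22:07:33 [...] 1 Raw_Read_Error_Rate changed from 95 to 94"
# (the pattern is an assumption based on that fragment only).
LINE_RE = re.compile(r"(\d+) (\w+) changed from (\d+) to (\d+)\s*$")


def observed_worst(lines):
    """Return {attribute_name: lowest normalized value seen in the log}."""
    worst = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an attribute-change line
        name = m.group(2)
        lowest = min(int(m.group(3)), int(m.group(4)))
        worst[name] = min(worst.get(name, lowest), lowest)
    return worst


LOG = """\
Dec 19 21:47:21 [...] 1 Raw_Read_Error_Rate changed from 99 to 95
Dec 19 22:07:33 [...] 1 Raw_Read_Error_Rate changed from 95 to 94
Dec 19 23:08:11 [...] 2 Throughput_Performance changed from 145 to 100
Dec 19 23:58:43 [...] 1 Raw_Read_Error_Rate changed from 94 to 98
Dec 20 00:39:09 [...] 1 Raw_Read_Error_Rate changed from 99 to 100
""".splitlines()

print(observed_worst(LOG))
```

Keeping this minimum outside the drive means it cannot drift back up to
100 the way the drive's own "worst" field did.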

Incidentally, maybe there's a case for printing the date in smartctl
reports.


Regards,

John.

Re: [Smartmontools-support]Ignore "Overall Health Self-Assessment", other questions

From: Bruce A. <ba...@gr...> - 2002-12-21 20:20:08

Hi John,

> I'm kind of getting the impression that the drive's own logging of
> attribute values isn't necessarily entirely reliable if one wants to
> predict failures, and so attribute values should be logged separately.

Do I understand correctly?  You are referring to the fact that on some
drives, the "WORST" value sometimes increases, whereas if it *really*
represented the worst value, then it would only get smaller, never larger?

> I've had a Deskstar 75GXP which gave a failing "Technical Result Code"
> (700010DB) from IBM's DFT tools, but which still gives:
> 
> SMART overall-health self-assessment test result: PASSED

Interesting.  Do you have any idea what the code: 700010DB means?

Unfortunately, the SMART status is the only diagnostic specified by the
ATA-5 standard that the disk gives us back.  The IBM utility clearly knows
more about the disk, since it's using features of the disk that are not
part of the ATA-5 spec.

> Which seems to render the "overall-health self-assessment" a bit
> pointless, from an /end-user's/ point-of-view. (smartctl -a output
> attached in YK0YKF13008_smartctl-a.txt)

I agree that on your disks, something is wrong if the disk knows that it
is failing (from the IBM DFT tools) but the firmware still reports that
the SMART status is OK.  Unfortunately in this case, what's wrong is with
the firmware of the disk reporting good smart status, not with smartctl.

I have been using smartctl quite extensively on some new IBM GXP180 disks
(we have just gotten 300 of these) and the SMART status does sometimes
report that the disk is failing (and indeed, it is!)

> Am I missing something ?

Unfortunately I don't think so.  Just keep in mind that the IBM utility is
clearly using data from the disk which is not part of the ATA-5 standard
(and which IBM has not documented for the rest of us).  So it's not
surprising that it can do a better job of predicting disk failures.

It is surprising that the disk does not return failing smart status for
this case.

> Also, I have a 60GXP drive where the Raw_Read_Error_Rate value wanders
> up and down between 94 and 100.  It has given a "worst" value for this
> parameter of 95.  However, the "cooked" value is currently
> 100 again, and the "worst" value has gone back up to 100. (!?!)
> 
> I'd have somewhat expected the "worst" value to stay at 94.

Me too.

Unfortunately, once again, the disk is ATA-5.  Starting with ATA-4 the
SMART spec removed all interpretation from the SMART data.  Prior to that,
the current, worst, and threshold values were all part of the spec. But
for ATA-5 they are all "vendor specific".

So although smartctl reports the values of these fields, in fact the
meaning that they have is purely historical. Strictly speaking, the ATA-5
spec does not require that they have any standard meaning.

I have a constructive suggestion.  Get a copy of the IBM OEM manual for
your disk drive (~200 pages) which you can find on the web.  They document
quite clearly what the different attributes etc mean.  

> Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.

Clear enough.

> Incidentally, maybe there's a case for printing the date in smartctl
> reports.

I agree.  This is easy to do.  Look for it in 5.1.x
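
Until then, a wrapper can stamp saved reports.  A minimal sketch, assuming
the report text has already been captured (for example from "smartctl -a").
The name stamp_report and the header wording are made up for illustration
and are not smartctl output.

```python
from datetime import datetime


def stamp_report(report_text, now=None):
    """Prefix a saved report with the local time at which it was captured."""
    now = now or datetime.now()
    header = "Report captured: " + now.strftime("%Y-%m-%d %H:%M:%S")
    return header + "\n" + report_text


# Sample line taken from the report quoted earlier in the thread.
sample = "SMART overall-health self-assessment test result: PASSED\n"
print(stamp_report(sample, datetime(2002, 12, 20, 10, 47, 0)))
```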

I also think it might be sensible for smartd to print when the "worst"
value or the "threshold" value changes.  I'll try and add this in to 5.1.x
sometime in early January.  Or perhaps someone else will volunteer to do
this.
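
A rough sketch of what such a check could look like, assuming two attribute
snapshots have already been parsed into dictionaries.  This is not smartd
code: report_changes, the dictionary layout, and the message wording are
hypothetical, and apart from the "worst" values of 95 and 100 mentioned
above, the numbers are illustrative.

```python
# Fields whose changes should be reported, per the suggestion above.
FIELDS = ("worst", "threshold")


def report_changes(previous, current):
    """Return one message per attribute field that differs between snapshots."""
    messages = []
    for name, new in current.items():
        old = previous.get(name)
        if old is None:
            continue  # attribute was not present in the earlier snapshot
        for field in FIELDS:
            if old[field] != new[field]:
                messages.append(
                    f"{name} {field} changed from {old[field]} to {new[field]}"
                )
    return messages


# Illustrative snapshots; only the worst values (95, then 100) come from
# the thread, the "value" and "threshold" numbers are placeholders.
before = {"Raw_Read_Error_Rate": {"value": 97, "worst": 95, "threshold": 60}}
after = {"Raw_Read_Error_Rate": {"value": 100, "worst": 100, "threshold": 60}}

for message in report_changes(before, after):
    print(message)
```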

I'll add both of these to the "to-do" list.

Cheers,
	Bruce

Re: [Smartmontools-support]Ignore "Overall Health Self-Assessment", other questions

From: John V. <jvi...@di...> - 2002-12-24 09:28:44

Hi Bruce,

Bruce Allen wrote:
  > John Vickers wrote:
  >>I'm kind of getting the impression that the drive's own logging of
  >>attribute values isn't necessarily entirely reliable if one wants to
  >>predict failures, and so attribute values should be logged separately.
  >
  > Do I understand correctly?  You are referring to the fact that on some
  > drives, the "WORST" value sometimes increases, whereas if it *really*
  > represented the worst value, then it would only get smaller, never larger?

Yup.

  >>I've had a Deskstar 75GXP which gave a failing "Technical Result Code"
  >>(700010DB) from IBM's DFT tools, but which still gives:
  >
  >>SMART overall-health self-assessment test result: PASSED
  >
  > Interesting.  Do you have any idea what the code: 700010DB means?

No.  I suspect that if I did I wouldn't be allowed to tell anyone.

The drive was making Drr...Dr...Drrr..Drr.. ?recalibrate? noises,
and repeatedly giving hard read errors.

I've sent it off for a warranty replacement now.  After DFT
told me to send it back, I noticed that the overall health status was
still reported good, so I took it out of its Windows box, plugged it into
a Linux box, and ran "smartctl -a" on it, to see if that shed any light
on the problem.

  > Unfortunately, the SMART status is the only diagnostic specified by the
  > ATA-5 standard that the disk gives us back.  The IBM utility clearly knows
  > more about the disk, since it's using features of the disk that are not
  > part of the ATA-5 spec.

Yeah.  I gather that even if you're buying quite large volumes of drives
direct from the manufacturer they're unwilling to say much about the
instrumentation & logging in the drives.

If I'd had a logic analyser handy I'd've investigated IBM's DFT
further.

  >>Which seems to render the "overall-health self-assessment" a bit
  >>pointless, from an /end-user's/ point-of-view. (smartctl -a output
  >>attached in YK0YKF13008_smartctl-a.txt)
  >
  > I agree that on your disks, something is wrong if the disk knows that it
  > is failing (from the IBM DFT tools) but the firmware still reports that
  > the SMART status is OK.

  > Unfortunately in this case, what's wrong is with
  > the firmware of the disk reporting good smart status, not with smartctl.

Sure.  Absolutely.  I'm not criticising smartmontools for this behaviour.

  > I have been using smartctl quite extensively on some new IBM GXP180 disks
  > (we have just gotten 300 of these) and the SMART status does sometimes
  > report that the disk is failing (and indeed, it is!)


  >>Am I missing something ?
  >
  > Unfortunately I don't think so.  Just keep in mind that the IBM utility is
  > clearly using data from the disk which is not part of the ATA-5 standard
  > (and which IBM has not documented for the rest of us).  So it's not
  > surprising that it can do a better job of predicting disk failures.

  > It is surprising that the disk does not return failing smart status for
  > this case.

Indeedy.

  >>Also, I have a 60GXP
That should be "120GXP"
  >>drive where the Raw_Read_Error_Rate value wanders
  >>up and down between 94 and 100.  It has given a "worst" value for this
  >>parameter of 95.  However, the "cooked" value is currently
  >>100 again, and the "worst" value has gone back up to 100. (!?!)
  >>
  >>I'd have somewhat expected the "worst" value to stay at 94.
  >
  > Me too.
  >
  > Unfortunately, once again, the disk is ATA-5.  Starting with ATA-4 the
  > SMART spec removed all interpretation from the SMART data.  Prior to that,
  > the current, worst, and threshold values were all part of the spec. But
  > for ATA-5 they are all "vendor specific".
  >
  > So although smartctl reports the values of these fields, in fact the
  > meaning that they have is purely historical. Strictly speaking, the ATA-5
  > spec does not require that they have any standard meaning.

Heh.  I think consumer organisations should get more involved in electronic
standards.  There are quite a few which affect end users' interests, but most are
designed with end-user interests as collateral damage.  Whatever.

  > I have a constructive suggestion.  Get a copy of the IBM OEM manual for
  > your disk drive (~200 pages) which you can find on the web.  They document
  > quite clearly what the different attributes etc mean.

Hmm.  Neither the manuals nor the ATA4/5 documents I have (T13 1153D r18 & 1321D r3)
seemed to have any mention of the "worst" values.  I'll have another look.


  >>Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.
  >
  > Clear enough.

OTOH, IBM could be using some partial ordering relation other than the
usual arithmetic total ordering we would naturally expect...


  >>Incidentally, maybe there's a case for printing the date in smartctl
  >>reports.
  >
  > I agree.  This is easy to do.  Look for it in 5.1.x
  >
  > I also think it might be sensible for smartd to print when the "worst"
  > value or the "threshold" value changes.  I'll try and add this in to 5.1.x
  > sometime in early January.  Or perhaps someone else will volunteer to do
  > this.

I might even hack it myself.

  > I'll add both of these to the "to-do" list.

Regards,

John.

Re: [Smartmontools-support]Ignore "Overall Health Self-Assessment", other questions

From: Bruce A. <ba...@gr...> - 2002-12-25 05:38:57

Hi John,

>   > I have a constructive suggestion.  Get a copy of the IBM OEM manual for
>   > your disk drive (~200 pages) which you can find on the web.  They document
>   > quite clearly what the different attributes etc mean.
> 
> Hmm.  Neither the manuals nor the ATA4/5 documents I have (T13 1153D r18 & 1321D r3)
> seemed to have any mention of the "worst" values.  I'll have another look.

IBM does something very nice -- they document all the SMART behavior in
the publicly available OEM product manuals.  It's around a 200 page
manual. Let me know if you can't find it.  A pointer to a similar manual
is given in the smartctl man page.

(Based on the existence of these nice manuals, I actually consider IBM to
be "good guys" when it comes to sharing information.)

Cheers,
	Bruce