Hi Bruce,
Bruce Allen wrote:
> John Vickers wrote:
>>I'm kind of getting the impression that the drive's own logging of
>>attribute values isn't necessarily entirely reliable if one wants to
>>predict failures, and so attribute values should be logged separately.
>
> Do I understand correctly? You are referring to the fact that on some
> drives, the "WORST" value sometimes increases, whereas if it *really*
> represented the worst value, then it would only get smaller, never larger?
Yup.
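A rough sketch of the check I have in mind, in Python. It is purely illustrative: the function name is my own, and the readings are made-up samples modelled on the 95 -> 100 jump I describe further down, not a real log.

```python
def worst_regressions(worst_readings):
    """Return (index, previous, current) for every step where the
    logged WORST value went *up*, which a true "worst so far" never should."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(
            zip(worst_readings, worst_readings[1:]), start=1
        )
        if cur > prev
    ]


# Hypothetical WORST readings taken from successive, separately kept logs.
readings = [100, 95, 95, 100]
print(worst_regressions(readings))
```

Anything this returns is a point where the drive's own "worst" bookkeeping can't be taken at face value, which is the argument for keeping the history outside the drive.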
>>I've had a Deskstar 75GXP which gave a failing "Technical Result Code",
>>(700010DB) from IBM's DFT tools, but which still gives:
>
>>SMART overall-health self-assessment test result: PASSED
>
> Interesting. Do you have any idea what the code: 700010DB means?
No. I suspect that if I did I wouldn't be allowed to tell anyone.
The drive was making Drr...Dr...Drrr..Drr.. ?recalibrate? noises,
and repeatedly giving hard read errors.
I've sent it off for a warranty replacement now. After DFT
told me to send it back, I noticed that the overall health status was
still reported good, so I took it out of its Windows box, plugged it into
a Linux box, and ran "smartctl -a " on it, to see if that shed any light
on the problem.
> Unfortunately, the SMART status is the only diagnostic specified by the
> ATA-5 standard that the disk gives us back. The IBM utility clearly knows
> more about the disk, since it's using features of the disk that are not
> part of the ATA-5 spec.
Yeah. I gather that even if you're buying quite large volumes of drives
direct from the manufacturer they're unwilling to say much about the
instrumentation & logging in the drives.
If I'd had a logic analyser handy I'd've investigated IBM's DFT
further.
>>Which seems to render the "overall-health self-assessment" a bit
>>pointless, from an /end-user's/ point-of-view. (smartctl -a output
>>attached in YK0YKF13008_smartctl-a.txt)
>
> I agree that on your disks, something is wrong if the disk knows that it
> is failing (from the IBM DFT tools) but the firmware still reports that
> the SMART status is OK.
> Unfortunately in this case, what's wrong is with
> the firmware of the disk reporting good smart status, not with smartctl.
Sure. Absolutely. I'm not criticising smartmontools for this behaviour.
> I have been using smartctl quite extensively on some new IBM GXP180 disks
> (we have just gotten 300 of these) and the SMART status does sometimes
> report that the disk is failing (and indeed, it is!)
>>Am I missing something ?
>
> Unfortunately I don't think so. Just keep in mind that the IBM utility is
> clearly using data from the disk which is not part of the ATA-5 standard
> (and which IBM has not documented for the rest of us). So it's not
> surprising that it can do a better job of predicting disk failures.
> It is surprising that the disk does not return failing smart status for
> this case.
Indeedy.
>>Also, I have a 60GXP
That should be "120GXP"
>>drive where the Raw_Read_Error_Rate value wanders
>>up and down between 94 and 100. It has given a "worst" value for this
>>parameter of 95. However, the "cooked" value is currently
>>100 again, and the "worst" value has gone back up to 100. (!?!)
>>
>>I'd have somewhat expected the "worst" value to stay at 94.
>
> Me too.
>
> Unfortunately, once again, the disk is ATA-5. Starting with ATA-4 the
> SMART spec removed all interpretation from the SMART data. Prior to that,
> the current, worst, and threshold values were all part of the spec. But
> for ATA-5 they are all "vendor specific".
>
> So although smartctl reports the values of these fields, in fact the
> meaning that they have is purely historical. Strictly speaking, the ATA-5
> spec does not require that they have any standard meaning.
Heh. I think consumer organisations should get more involved in electronic
standards. There are quite a few which affect end users' interests, but most are
designed with those interests as collateral damage. Whatever.
> I have a constructive suggestion. Get a copy of the IBM OEM manual for
> your disk drive (~200 pages) which you can find on the web. They document
> quite clearly what the different attributes etc mean.
Hmm. Neither the manuals nor the ATA4/5 documents I have (T13 1153D r18 & 1321D r3)
seemed to have any mention of the "worst" values. I'll have another look.
> Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.
>
> Clear enough.
OTOH, IBM could be using some partial ordering relation other than the
usual arithmetic total ordering we would naturally expect...
>>Incidentally, maybe there's a case for printing the date in smartctl
>>reports.
>
> I agree. This is easy to do. Look for it in 5.1.x
>
> I also think it might be sensible for smartd to print when the "worst"
> value or the "threshold" value changes. I'll try and add this in to 5.1.x
> sometime in early January. Or perhaps someone else will volunteer to do
> this.
I might even hack it myself.
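Something along these lines, perhaps. This is only a Python sketch of the idea, not how smartd does or would do it: the snapshot layout, the function name, the threshold of 60 and the timestamp are all assumptions of mine for illustration.

```python
from datetime import datetime


def report_changes(old, new, when):
    """Compare two {attribute_name: (value, worst, thresh)} snapshots and
    return one dated line per attribute whose WORST or THRESH changed."""
    lines = []
    # Only attributes present in both snapshots can be compared.
    for name in sorted(old.keys() & new.keys()):
        _, old_worst, old_thresh = old[name]
        _, new_worst, new_thresh = new[name]
        stamp = f"{when:%Y-%m-%d %H:%M}"
        if new_worst != old_worst:
            lines.append(f"{stamp} {name}: worst {old_worst} -> {new_worst}")
        if new_thresh != old_thresh:
            lines.append(f"{stamp} {name}: thresh {old_thresh} -> {new_thresh}")
    return lines


# Hypothetical snapshots; the threshold (60) and the date are made up.
before = {"Raw_Read_Error_Rate": (100, 95, 60)}
after = {"Raw_Read_Error_Rate": (100, 100, 60)}
for line in report_changes(before, after, datetime(2003, 1, 1, 12, 0)):
    print(line)
```

Putting the date on each line also covers the point about dating the reports, since a change in "worst" is only useful if you know when it happened.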
> I'll add both of these to the "to-do" list.
Regards,
John.