From: John V. <jvi...@di...> - 2002-12-20 11:17:09
Hi.

I'm kind of getting the impression that the drive's own logging of
attribute values isn't necessarily entirely reliable if one wants to
predict failures, and so attribute values should be logged separately.

A couple of cases:

I've had a Deskstar 75GXP which gave a failing "Technical Result Code"
(700010DB) from IBM's DFT tools, but which still gives:

SMART overall-health self-assessment test result: PASSED

Which seems to render the "overall-health self-assessment" a bit
pointless, from an /end-user's/ point-of-view. (smartctl -a output
attached in YK0YKF13008_smartctl-a.txt)

Am I missing something?

Also, I have a 60GXP drive where the Raw_Read_Error_Rate value wanders
up and down between 94 and 100. It has given a "worst" value for this
parameter of 95. However, the current "cooked" value is currently
100 again, and the "worst" value has gone back up to 100. (!?!)

I'd have somewhat expected the "worst" value to stay at 94.

Here's a log fragment (omitting temp changes):

Dec 19 04:16:19 [...] 1 Raw_Read_Error_Rate changed from 100 to 97
Dec 19 14:02:28 [...] 1 Raw_Read_Error_Rate changed from 97 to 99
Dec 19 14:42:53 [...] 2 Throughput_Performance changed from 100 to 145
Dec 19 14:42:53 [...] 8 Seek_Time_Performance changed from 100 to 138
Dec 19 16:03:44 [...] 1 Raw_Read_Error_Rate changed from 99 to 100
Dec 19 21:37:14 [...] 1 Raw_Read_Error_Rate changed from 100 to 99
Dec 19 21:47:21 [...] 1 Raw_Read_Error_Rate changed from 99 to 95
Dec 19 22:07:33 [...] 1 Raw_Read_Error_Rate changed from 95 to 94
Dec 19 23:08:11 [...] 2 Throughput_Performance changed from 145 to 100
Dec 19 23:58:43 [...] 1 Raw_Read_Error_Rate changed from 94 to 98
Dec 20 00:08:50 [...] 2 Throughput_Performance changed from 100 to 146
Dec 20 00:29:02 [...] 1 Raw_Read_Error_Rate changed from 98 to 99
Dec 20 00:39:09 [...] 1 Raw_Read_Error_Rate changed from 99 to 100

A couple of logs for the drive attached:

Dec 18 10:46 VNP210B2G7VVEB_smartctl-a_1.txt
Dec 20 10:47 VNP210B2G7VVEB_smartctl-a_2.txt

Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.

Incidentally, maybe there's a case for printing the date in smartctl
reports.

Regards,

John.
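A minimal sketch of what logging the values separately could look like, assuming log entries shaped like the fragment in this message. The `track_worst` helper and the regular expression are illustrative only and are not part of smartmontools; the sketch keeps its own running minimum per attribute, so it does not depend on the drive's "worst" field.

```python
import re

# Matches entries like "1 Raw_Read_Error_Rate changed from 100 to 97".
# The line layout is an assumption taken from the log fragment in this message.
CHANGE_RE = re.compile(r"(\d+)\s+(\w+) changed from (\d+) to (\d+)")


def track_worst(lines):
    """Return {attribute name: lowest normalized value seen in the log}."""
    worst = {}
    for line in lines:
        match = CHANGE_RE.search(line)
        if match is None:
            continue  # not an attribute-change entry
        _attr_id, name, old, new = match.groups()
        lowest = min(int(old), int(new))
        worst[name] = min(worst.get(name, lowest), lowest)
    return worst


log = [
    "Dec 19 21:47:21 [...] 1 Raw_Read_Error_Rate changed from 99 to 95",
    "Dec 19 22:07:33 [...] 1 Raw_Read_Error_Rate changed from 95 to 94",
    "Dec 19 23:58:43 [...] 1 Raw_Read_Error_Rate changed from 94 to 98",
]
print(track_worst(log))  # {'Raw_Read_Error_Rate': 94}
```

Because the minimum is computed from the logged transitions, it stays at 94 even if the drive later reports a higher "worst" value.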
From: Bruce A. <ba...@gr...> - 2002-12-21 20:20:08
Hi John,

> I'm kind of getting the impression that the drive's own logging of
> attribute values isn't necessarily entirely reliable if one wants to
> predict failures, and so attribute values should be logged separately.

Do I understand correctly? You are referring to the fact that on some
drives, the "WORST" value sometimes increases, whereas if it *really*
represented the worst value, then it would only get smaller, never larger?

> I've had a Deskstar 75GXP which gave a failing "Technical Result Code"
> (700010DB) from IBM's DFT tools, but which still gives:
>
> SMART overall-health self-assessment test result: PASSED

Interesting. Do you have any idea what the code 700010DB means?

Unfortunately, the SMART status is the only diagnostic specified by the
ATA-5 standard that the disk gives us back. The IBM utility clearly knows
more about the disk, since it's using features of the disk that are not
part of the ATA-5 spec.

> Which seems to render the "overall-health self-assessment" a bit
> pointless, from an /end-user's/ point-of-view. (smartctl -a output
> attached in YK0YKF13008_smartctl-a.txt)

I agree that on your disks, something is wrong if the disk knows that it
is failing (from the IBM DFT tools) but the firmware still reports that
the SMART status is OK. Unfortunately in this case, what's wrong is with
the firmware of the disk reporting good SMART status, not with smartctl.

I have been using smartctl quite extensively on some new IBM GXP180 disks
(we have just gotten 300 of these) and the SMART status does sometimes
report that the disk is failing (and indeed, it is!)

> Am I missing something?

Unfortunately I don't think so. Just keep in mind that the IBM utility is
clearly using data from the disk which is not part of the ATA-5 standard
(and which IBM has not documented for the rest of us). So it's not
surprising that it can do a better job of predicting disk failures.
It is surprising that the disk does not return failing SMART status for
this case.

> Also, I have a 60GXP drive where the Raw_Read_Error_Rate value wanders
> up and down between 94 and 100. It has given a "worst" value for this
> parameter of 95. However, the current "cooked" value is currently
> 100 again, and the "worst" value has gone back up to 100. (!?!)
>
> I'd have somewhat expected the "worst" value to stay at 94.

Me too.

Unfortunately, once again, the disk is ATA-5. Starting with ATA-4 the
SMART spec removed all interpretation from the SMART data. Prior to that,
the current, worst, and threshold values were all part of the spec. But
for ATA-5 they are all "vendor specific".

So although smartctl reports the values of these fields, in fact the
meaning that they have is purely historical. Strictly speaking, the ATA-5
spec does not require that they have any standard meaning.

I have a constructive suggestion. Get a copy of the IBM OEM manual for
your disk drive (~200 pages) which you can find on the web. They document
quite clearly what the different attributes etc. mean.

> Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.

Clear enough.

> Incidentally, maybe there's a case for printing the date in smartctl
> reports.

I agree. This is easy to do. Look for it in 5.1.x.

I also think it might be sensible for smartd to print when the "worst"
value or the "threshold" value changes. I'll try and add this in to 5.1.x
sometime in early January. Or perhaps someone else will volunteer to do
this.

I'll add both of these to the "to-do" list.

Cheers,

Bruce
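A rough sketch of the kind of change reporting proposed in this message, written in Python for brevity and not taken from smartd. The `Attr` tuple, the `diff_snapshots` helper, and the threshold of 60 are hypothetical; only the "worst" values 95 and 100 come from this thread.

```python
from typing import Dict, List, NamedTuple


class Attr(NamedTuple):
    """One attribute's normalized fields, as read at a single point in time."""
    value: int
    worst: int
    thresh: int


def diff_snapshots(old: Dict[str, Attr], new: Dict[str, Attr]) -> List[str]:
    """Report attributes whose "worst" or "threshold" field changed."""
    messages = []
    for name, cur in new.items():
        prev = old.get(name)
        if prev is None:
            continue  # attribute not present in the earlier snapshot
        if cur.worst != prev.worst:
            # A true worst value should only ever decrease.
            note = " (increased)" if cur.worst > prev.worst else ""
            messages.append(
                f"{name} worst changed from {prev.worst} to {cur.worst}{note}"
            )
        if cur.thresh != prev.thresh:
            messages.append(
                f"{name} threshold changed from {prev.thresh} to {cur.thresh}"
            )
    return messages


# Threshold 60 is a made-up placeholder, not a value reported in this thread.
before = {"Raw_Read_Error_Rate": Attr(value=95, worst=95, thresh=60)}
after = {"Raw_Read_Error_Rate": Attr(value=100, worst=100, thresh=60)}
for message in diff_snapshots(before, after):
    print(message)  # Raw_Read_Error_Rate worst changed from 95 to 100 (increased)
```

Flagging an increase separately seems worthwhile, since that is exactly the behaviour that made the drive's own "worst" field look unreliable here.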
From: John V. <jvi...@di...> - 2002-12-24 09:28:44
Hi Bruce,

Bruce Allen wrote:
> John Vickers wrote:
>>I'm kind of getting the impression that the drive's own logging of
>>attribute values isn't necessarily entirely reliable if one wants to
>>predict failures, and so attribute values should be logged separately.
>
> Do I understand correctly? You are referring to the fact that on some
> drives, the "WORST" value sometimes increases, whereas if it *really*
> represented the worst value, then it would only get smaller, never larger?

Yup.

>>I've had a Deskstar 75GXP which gave a failing "Technical Result Code"
>>(700010DB) from IBM's DFT tools, but which still gives:
>
>>SMART overall-health self-assessment test result: PASSED
>
> Interesting. Do you have any idea what the code 700010DB means?

No. I suspect that if I did I wouldn't be allowed to tell anyone.

The drive was making Drr...Dr...Drrr..Drr.. ?recalibrate? noises,
and repeatedly giving hard read errors. I've sent it off for
a warranty replacement now.

After DFT told me to send it back, I noticed that the overall health
status was still reported good, so I took it out of its windows box,
plugged it into a linux box, and ran "smartctl -a " on it, to see if
that shed any light on the problem.

> Unfortunately, the SMART status is the only diagnostic specified by the
> ATA-5 standard that the disk gives us back. The IBM utility clearly knows
> more about the disk, since it's using features of the disk that are not
> part of the ATA-5 spec.

Yeah. I gather that even if you're buying quite large volumes of drives
direct from the manufacturer they're unwilling to say much about
the instrumentation & logging in the drives.

If I'd had a logic analyser handy I'd've investigated IBM's DFT further.

>>Which seems to render the "overall-health self-assessment" a bit
>>pointless, from an /end-user's/ point-of-view. (smartctl -a output
>>attached in YK0YKF13008_smartctl-a.txt)
>
> I agree that on your disks, something is wrong if the disk knows that it
> is failing (from the IBM DFT tools) but the firmware still reports that
> the SMART status is OK.
> Unfortunately in this case, what's wrong is with
> the firmware of the disk reporting good smart status, not with smartctl.

Sure. Absolutely. I'm not criticising smartmontools for this behaviour.

> I have been using smartctl quite extensively on some new IBM GXP180 disks
> (we have just gotten 300 of these) and the SMART status does sometimes
> report that the disk is failing (and indeed, it is!)

>>Am I missing something?
>
> Unfortunately I don't think so. Just keep in mind that the IBM utility is
> clearly using data from the disk which is not part of the ATA-5 standard
> (and which IBM has not documented for the rest of us). So it's not
> surprising that it can do a better job of predicting disk failures.
> It is surprising that the disk does not return failing smart status for
> this case.

Indeedy.

>>Also, I have a 60GXP

That should be "120GXP"

>>drive where the Raw_Read_Error_Rate value wanders
>>up and down between 94 and 100. It has given a "worst" value for this
>>parameter of 95. However, the current "cooked" value is currently
>>100 again, and the "worst" value has gone back up to 100. (!?!)
>>
>>I'd have somewhat expected the "worst" value to stay at 94.
>
> Me too.
>
> Unfortunately, once again, the disk is ATA-5. Starting with ATA-4 the
> SMART spec removed all interpretation from the SMART data. Prior to that,
> the current, worst, and threshold values were all part of the spec. But
> for ATA-5 they are all "vendor specific".
>
> So although smartctl reports the values of these fields, in fact the
> meaning that they have is purely historical. Strictly speaking, the ATA-5
> spec does not require that they have any standard meaning.

Heh. I think consumer organisations should get more involved in
electronic standards. There are quite a few which impact end-users'
interests, but most are designed with end-user interests as collateral
damage. Whatever.

> I have a constructive suggestion. Get a copy of the IBM OEM manual for
> your disk drive (~200 pages) which you can find on the web. They document
> quite clearly what the different attributes etc mean.

Hmm. Neither the manuals nor the ATA4/5 documents I have
(T13 1153D r18 & 1321D r3) seemed to have any mention of the "worst"
values. I'll have another look.

> Between these two logs, the "worst" Raw_Read_Error_Rate has gone up.
>
> Clear enough.

OTOH, IBM could be using some partial ordering relation other than
the usual arithmetic total ordering we would naturally expect...

>>Incidentally, maybe there's a case for printing the date in smartctl
>>reports.
>
> I agree. This is easy to do. Look for it in 5.1.x
>
> I also think it might be sensible for smartd to print when the "worst"
> value or the "threshold" value changes. I'll try and add this in to 5.1.x
> sometime in early January. Or perhaps someone else will volunteer to do
> this.

I might even hack it myself.

> I'll add both of these to the "to-do" list.

Regards,

John.
From: Bruce A. <ba...@gr...> - 2002-12-25 05:38:57
Hi John,

> > I have a constructive suggestion. Get a copy of the IBM OEM manual for
> > your disk drive (~200 pages) which you can find on the web. They document
> > quite clearly what the different attributes etc mean.
>
> Hmm. Neither the manuals nor the ATA4/5 documents I have
> (T13 1153D r18 & 1321D r3) seemed to have any mention of the "worst"
> values. I'll have another look.

IBM does something very nice -- they document all the SMART behavior in
the publicly available OEM product manuals. It's around a 200 page
manual. Let me know if you can't find it. A pointer to a similar manual
is given in the smartctl man page.

(Based on the existence of these nice manuals, I actually consider IBM to
be "good guys" when it comes to sharing information.)

Cheers,

Bruce