From: Erwan V. <er...@ma...> - 2003-10-30 11:27:21
|
When I read my SMART information, I see that my disk is making a lot of ECC corrections, about 100/sec. Should I consider the disk broken? My system is running fine, and my kernel has never reported any trouble reading from or writing to the disk.

I know that ECC exists to correct data on the fly, but I'm worried about the rate... 100/sec! What do you think about it?

The disk is a Maxtor 6Y120L0, and it was reported as not being in the smartd database.

195 Hardware_ECC_Recovered  0x000a 253 252 000 Old_age Always  - 10621
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector  0x0008 253 253 000 Old_age Offline - 4
198 Offline_Uncorrectable   0x0008 252 252 000 Old_age Offline - 1

--
Erwan Velu
Linux Cluster Distribution Project Manager
MandrakeSoft
43 rue d'aboukir 75002 Paris
Phone Number : +33 (0) 1 40 41 17 94
Fax Number : +33 (0) 1 40 41 92 00
Web site : http://www.mandrakesoft.com
OpenPGP key : http://www.mandrakesecure.net/cks/
|
From: Bruce A. <ba...@gr...> - 2003-10-30 12:32:41
|
Hi Erwan,

Nice to hear from you -- you are one of our biggest packagers/distributors!

On Thu, 30 Oct 2003, Erwan Velu wrote:

> When I'm reading my SMART information, my disk is making a lot of ECC
> corrections ~100/sec. Should I consider this as a broken disk
> Disk is a Maxtor 6Y120L0 and was said to not be in the smartd database.
>
> 195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age
> Always - 10621

The flag here is 0x0a = 1010 binary. The first "1" is the Error-rate flag, which says that this Attribute is an error rate rather than a count. I'm not inclined to worry about it, since the normalized Attribute value (253) is very large, and the worst-ever value is 252. I just looked at a similar (trouble-free) Maxtor disk, and it has:

195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 45382

so yours looks comparable. On the other hand:

> 196 Reallocated_Event_Count 0x0008 252 252 000 Old_age
> Offline - 1
> 197 Current_Pending_Sector 0x0008 253 253 000 Old_age
> Offline - 4

These indicate that one sector has been reallocated (a bad sector) and 4 are currently unreadable.

> 198 Offline_Uncorrectable 0x0008 252 252 000 Old_age
> Offline - 1

And this indicates that in an offline test, at least one sector was (at some time) unreadable. At least with the Maxtor disks that I have experience with, this CAN be a sign of trouble.

Have you done a long self-test ("smartctl -t long") recently? I'd suggest doing this, and then, when it's done, checking the self-test log ("smartctl -l selftest").

Cheers,
Bruce
|
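[Editor's note] The reasoning in Bruce's reply can be sketched as a small Python check over the four Attribute rows Erwan posted. This is only an illustration, not part of smartmontools: the helper names (parse_row, review) and the set SECTOR_COUNTERS are made up for this note, and the column layout is assumed to be ID, name, flag, value, worst, threshold, type, updated, when-failed, raw. The "value at or below threshold means failed" rule is the one described in the smartctl man page; the "non-zero raw count deserves a look" rule is just Bruce's advice from this thread.

```python
# Attribute rows copied from Erwan's message.
ROWS = """\
195 Hardware_ECC_Recovered  0x000a 253 252 000 Old_age Always  - 10621
196 Reallocated_Event_Count 0x0008 252 252 000 Old_age Offline - 1
197 Current_Pending_Sector  0x0008 253 253 000 Old_age Offline - 4
198 Offline_Uncorrectable   0x0008 252 252 000 Old_age Offline - 1
"""

# Hypothetical set for this sketch: Attribute IDs whose non-zero raw
# count Bruce treats as a possible sign of trouble in this thread.
SECTOR_COUNTERS = {196, 197, 198}


def parse_row(line):
    """Split one whitespace-separated Attribute row into named fields."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "flag": int(f[2], 16),
        "value": int(f[3]),    # normalized value
        "worst": int(f[4]),    # worst normalized value seen so far
        "thresh": int(f[5]),   # failure threshold
        "raw": int(f[-1]),     # vendor-specific raw value (last column)
    }


def review(attr):
    """Return a short verdict for one parsed Attribute."""
    if attr["value"] <= attr["thresh"]:
        return "FAILED (normalized value at or below threshold)"
    if attr["id"] in SECTOR_COUNTERS and attr["raw"] > 0:
        return "check (non-zero sector count)"
    return "ok"


attrs = [parse_row(line) for line in ROWS.splitlines()]
for a in attrs:
    print(a["id"], a["name"], review(a))
```

With the rows above, Attribute 195 passes despite its large raw value, while 196, 197 and 198 are flagged for a closer look, which matches the conclusion in the reply.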
From: Erwan V. <er...@ma...> - 2003-10-30 13:00:14
|
> Nice to hear from you -- you are one of our biggest
> packagers/distributors!

Thanks :)

> > 195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age
> > Always - 10621
> The flag here 0x0a = 1010 binary. The first "1" is the Error-rate flag,
> saying that this is an error rate, rather than a count.

Oh, OK... That wasn't really obvious at first reading :) (0x000a)
Is there any way to turn this value into something human-readable? The explanation you gave me is much easier to understand than 0x000a :)

> I'm not inclined to worry about it, since the normalized Attribute value
> (253) is very large, and the worst-ever value is 252.

OK. So should I understand that my drive once reached 252, but the maximum is 253? But what is that value? What does it refer to?

> so yours looks comparable.

Phew!

> And this indicates that in an offline test, at least one sector was (at
> some time) unreadable. At least with the Maxtor disks that I am
> experienced with, this CAN be a sign of trouble.

Ouch!

> Have you done a long self test "smartctl -t long" recently? I'd suggest
> this, and then when it's done, check the self-test log "smartctl -l
> selftest".

The test is running now... I'll keep you posted.
|
From: Bruce A. <ba...@gr...> - 2003-10-30 13:09:24
|
> > > 195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age
> > > Always - 10621
> > The flag here 0x0a = 1010 binary. The first "1" is the Error-rate flag,
> > saying that this is an error rate, rather than a count.
> Oh ok... The first reading wasn't really obvious :) (0x000a) Is there
> any way to transform this value in a human readable value ? The
> information you gave me is more understandable than 0x00a :)

It's in atacmds.h (see the comments). I still don't know how many of these bit flags are used correctly by the vendors. Only bit 0 is documented (in SFF-8035i and ATA-4). IBM has documented bits 1 and 2 for *some* of their disks. The other bits I found by reading source code. In a future release I may add these flags to the Attribute table and document them.

> > I'm not inclined to worry about it, since the normalized Attribute value
> > (253) is very large, and the worst-ever value is 252.
> ok.. So should I understand my drive had reach 252 but the maximum is 253 ?
> But what's that value ? What does it refers to ?

man smartctl

Start reading at:

    -A, --attributes
        Prints only the vendor specific SMART Attributes. The
        Attributes are numbered from 1 to 253 and have specific
        names and ID numbers. For example Attribute 12 is "power
        cycle count": how many times has the disk been powered up.

        Each Attribute has a "Raw" value, printed under the heading
        "RAW_VALUE", and a "Normalized" value printed under the
        heading "VALUE". [Note: smartctl prints these val-
        ....

> > Have you done a long self test "smartctl -t long" recently? I'd suggest
> > this, and then when it's done, check the self-test log "smartctl -l
> > selftest".
> This this are running... I'll keep you in touch.

OK. If a self-test fails with a read error, there's a problem.

Bruce
|
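[Editor's note] As a rough answer to "how do I make 0x000a human-readable", here is a minimal Python sketch that splits the flag word into named bits. The bit names are assumptions based on one reading of the flag macros in smartmontools' atacmds.h, not a quotation from it; as Bruce says above, only bit 0 is documented in the standards, and vendors may not use the other bits consistently. FLAG_BITS and decode_flags are hypothetical names invented for this note.

```python
# Assumed meaning of each flag bit. Only bit 0 is documented in
# SFF-8035i / ATA-4; the remaining names are assumptions and may not
# match what a given vendor's firmware actually means.
FLAG_BITS = [
    (0x01, "pre-failure"),      # bit 0: pre-failure if set, old-age if clear
    (0x02, "updated online"),   # bit 1: "Always" if set, "Offline" if clear
    (0x04, "performance"),      # bit 2
    (0x08, "error rate"),       # bit 3: the leading 1 of binary 1010
    (0x10, "event count"),      # bit 4
    (0x20, "self-preserving"),  # bit 5
]


def decode_flags(flag):
    """Return the names of all bits that are set in a SMART flag word."""
    return [name for mask, name in FLAG_BITS if flag & mask]


for flag in (0x000A, 0x0008):
    print(f"0x{flag:04x} = {flag:04b} binary ->", ", ".join(decode_flags(flag)))
```

Under these assumptions, 0x000a decodes to "updated online" plus "error rate", and 0x0008 to "error rate" alone, which agrees with the "Old_age Always" and "Old_age Offline" columns smartctl printed for those rows.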
From: Erwan V. <er...@ma...> - 2003-10-30 13:57:09
|
> Have you done a long self test "smartctl -t long" recently? I'd suggest
> this, and then when it's done, check the self-test log "smartctl -l
> selftest".

I received the following mail :(

"The following warning/error was logged by the smartd daemon:
Device: /dev/hdb, Self-Test Log error count increased from 3 to 4"

"$ smartctl /dev/hdb -H -l selftest" gave me:

Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  40%        131              0x003ff4a7
# 2  Short offline     Completed: read failure  60%        128              0x00490037
# 3  Short offline     Completed: read failure  60%        62               0x003e82b4
# 4  Short captive     Completed: read failure  60%        12               0x003e82b4

Should I consider my disk dead, or is this a minor error?
|
From: Bruce A. <ba...@gr...> - 2003-10-30 15:20:38
|
Erwan,

> > Have you done a long self test "smartctl -t long" recently? I'd suggest
> > this, and then when it's done, check the self-test log "smartctl -l
> > selftest".
>
> I received the following mail :(
> "The following warning/error was logged by the smartd daemon:
> Device: /dev/hdb, Self-Test Log error count increased from 3 to 4"
>
> "$ smartctl /dev/hdb -H -l selftest" gave me
>
> Num Test_Description Status Remaining
> LifeTime(hours) LBA_of_first_error
> # 1 Extended offline Completed: read failure 40%
> 131 0x003ff4a7
> # 2 Short offline Completed: read failure 60%
> 128 0x00490037
> # 3 Short offline Completed: read failure 60%
> 62 0x003e82b4
> # 4 Short captive Completed: read failure 60%
> 12 0x003e82b4
>
> Can I consider my disk has dead or is this a minimal error ?

This is NOT a minor error: something is wrong with your disk. It looks like a brand new disk as well, so this is not a good sign.

(1) Save your data.

(2) You may be able to fix it using Maxtor's "PowerMax" utility. This can force reallocation of unreadable sectors, but may corrupt/destroy the file system.

(3) PowerMax may tell you to send the disk to Maxtor for replacement. If not, when it is done, run another long self-test. If it fails, tell Maxtor you want a new disk anyway.

Good luck!

Cheers,
Bruce
|
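[Editor's note] The self-test log quoted above can be summarized with a few lines of Python. This is a sketch under stated assumptions: the rows are retyped from Erwan's message with columns separated by two or more spaces (real smartctl output would need a more careful parser), and the names parse, entries, failures and repeated are made up for this note.

```python
import re
from collections import Counter

# Self-test log rows retyped from the thread; columns are separated by
# two or more spaces so that multi-word fields stay together.
LOG = """\
# 1  Extended offline  Completed: read failure  40%  131  0x003ff4a7
# 2  Short offline     Completed: read failure  60%  128  0x00490037
# 3  Short offline     Completed: read failure  60%   62  0x003e82b4
# 4  Short captive     Completed: read failure  60%   12  0x003e82b4
"""


def parse(line):
    """Turn one log row into a dict, converting numeric columns."""
    num, test, status, remaining, hours, lba = re.split(r"\s{2,}", line.strip())
    return {
        "num": int(num.lstrip("# ")),
        "test": test,
        "status": status,
        "remaining": int(remaining.rstrip("%")),  # percent of test left
        "hours": int(hours),                      # power-on hours at test time
        "lba": int(lba, 16),                      # first failing LBA
    }


entries = [parse(line) for line in LOG.splitlines()]
failures = [e for e in entries if "read failure" in e["status"]]

# An LBA that fails in more than one test suggests a persistent bad sector.
lba_counts = Counter(e["lba"] for e in failures)
repeated = sorted(lba for lba, n in lba_counts.items() if n > 1)

print(f"{len(failures)} of {len(entries)} self-tests ended in a read failure")
print("first failure at power-on hour", min(e["hours"] for e in failures))
print("LBAs failing more than once:", [hex(lba) for lba in repeated])
```

On this data every logged test failed, the same LBA (0x003e82b4) failed in two separate tests, and the failures span power-on hours 12 to 131, which is consistent with Bruce's remark that this looks like a nearly new disk with a real problem.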