From: Christopher W. <sma...@th...> - 2003-06-17 04:54:59
|
Can anyone explain these results? It shows no errors in the "Error Log", but a number of "failures" in the tests. The drive reports OK, so what's causing these problems at the same point each time, and why isn't the drive fixing them automatically?

smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen

=== START OF INFORMATION SECTION ===
Device Model:     MAXTOR 6L040J2
Serial Number:    662216026392
Firmware Version: A93.0500
ATA Version is:   5
ATA Standard is:  ATA/ATAPI-5 T13 1321D revision 1
Local Time is:    Mon Jun 16 23:48:56 2003 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status: (0x02) Offline data collection activity completed without error.
Self-test execution status: ( 112) The previous self-test completed having the read element of the test failed.
Total time to complete off-line data collection: ( 35) seconds.
Offline data collection capabilities: (0x1b) SMART execute Offline immediate. Automatic timer ON/OFF support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 20) minutes.

SMART Attributes Data Structure revision number: 11
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0029   100   100   020    Pre-fail  -           0
  3 Spin_Up_Time            0x0027   080   080   020    Pre-fail  -           2587
  4 Start_Stop_Count        0x0032   100   100   008    Old_age   -           68
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  -           2
  7 Seek_Error_Rate         0x000b   100   093   023    Pre-fail  -           0
  9 Power_On_Hours          0x0012   096   096   001    Old_age   -           3090
 10 Spin_Retry_Count        0x0026   100   100   000    Old_age   -           0
 11 Calibration_Retry_Count 0x0013   100   100   020    Pre-fail  -           0
 12 Power_Cycle_Count       0x0032   100   100   008    Old_age   -           28
 13 Read_Soft_Error_Rate    0x000b   100   093   023    Pre-fail  -           0
194 Temperature_Celsius     0x0022   082   079   042    Old_age   -           47
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   -           12255
196 Reallocated_Event_Count 0x0010   100   099   020    Old_age   -           0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   -           1
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   -           0
199 UDMA_CRC_Error_Count    0x001a   200   200   000    Old_age   -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log, version number 1
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended off-line  Completed: read failure  90%        3068             0x0007d6e6
# 2  Short off-line     Completed: read failure  50%        2901             0x0007d6e6
# 3  Short off-line     Completed: read failure  50%        2735             0x0007d6e6
# 4  Extended off-line  Completed: read failure  90%        2568             0x0007d6e6
# 5  Short off-line     Completed: read failure  50%        2402             0x0007d6e6
# 6  Extended off-line  Completed: read failure  90%        2235             0x0007d6e6
# 7  Short off-line     Completed: read failure  50%        2068             0x0007d6e6
# 8  Extended off-line  Completed: read failure  90%        1902             0x0007d6e6
# 9  Short off-line     Completed: read failure  50%        1735             0x0007d6e6
#10  Short off-line     Completed: read failure  50%        1569             0x0007d6e6
#11  Extended off-line  Completed: read failure  90%        1402             0x0007d6e6
#12  Short off-line     Completed: read failure  50%        1305             0x0007d6e6
#13  Extended off-line  Completed                00%        358              -
#14  Short off-line     Completed                00%        357              -
#15  Short off-line     Completed                00%        357              -
#16  Short off-line     Completed                00%        0                -
|
From: Bruce A. <ba...@gr...> - 2003-06-17 10:19:10
|
Hi Chris,

> Can anyone explain these results?

Short version: your drive is going bad. Back up your data and get the drive replaced.

> It shows no errors in the "Error Log", but a number of "failures" in
> the tests.

This is because there are no errors in the "ATA spec" interactions between the motherboard's IO controller and the drive. In other words, you don't have a bad IDE cable, bad motherboard controller chipset, etc.

> The drive reports OK, so what's causing these problems at the same
> point each time, and why isn't the drive fixing them automatically?

The drive is doing some fixing. Comments below:

> Off-line data collection status: (0x02) Offline data collection activity
> completed without error.

Use -o on to enable automatic offline data collection (be sure to use a recent version; there was a bug in the code for older versions).

> Self-test execution status: ( 112) The previous self-test completed having
> the read element of the test failed.

Self-explanatory.

>   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  -   2

Note the two reallocated sectors. Typically these Maxtor drives can reallocate up to a few hundred sectors. But at least two sectors on the disk are bad and were eliminated from the "usable sectors" list.

> 194 Temperature_Celsius     0x0022   082   079   042    Old_age   -   47

Wow -- this is VERY high. I suggest you try to get a fan blowing air on this disk. It's awfully hot...

> SMART Self-test log, version number 1
> Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended off-line  Completed: read failure  90%        3068             0x0007d6e6
> # 2  Short off-line     Completed: read failure  50%        2901             0x0007d6e6
> # 3  Short off-line     Completed: read failure  50%        2735             0x0007d6e6
> # 4  Extended off-line  Completed: read failure  90%        2568             0x0007d6e6
> # 5  Short off-line     Completed: read failure  50%        2402             0x0007d6e6
> # 6  Extended off-line  Completed: read failure  90%        2235             0x0007d6e6
> # 7  Short off-line     Completed: read failure  50%        2068             0x0007d6e6
> # 8  Extended off-line  Completed: read failure  90%        1902             0x0007d6e6
> # 9  Short off-line     Completed: read failure  50%        1735             0x0007d6e6
> #10  Short off-line     Completed: read failure  50%        1569             0x0007d6e6
> #11  Extended off-line  Completed: read failure  90%        1402             0x0007d6e6
> #12  Short off-line     Completed: read failure  50%        1305             0x0007d6e6
> #13  Extended off-line  Completed                00%        358              -
> #14  Short off-line     Completed                00%        357              -
> #15  Short off-line     Completed                00%        357              -
> #16  Short off-line     Completed                00%        0                -

The disk is having real problems. If it were mine, I would replace it without much delay. The fact that it is failing the self-tests in the same place, and has been failing them for a couple of months, should not give you a false sense of reassurance.

Bottom line: you need a new disk. It looks like it's less than a year old (much less than 8000 hours usage) so Maxtor's warranty should cover it.

Cheers,
Bruce
|
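To make the "same place, for a couple of months" observation concrete, here is a minimal Python sketch that tallies the self-test log quoted above. The rows are copied from that log; the regular expression and the helper names (`parse_log`, `summarize`) are illustrative assumptions, not part of smartmontools, and the time span is measured in power-on hours rather than calendar time.

```python
import re

# Rows copied from the self-test log quoted in the message above.
SELF_TEST_LOG = """\
# 1  Extended off-line  Completed: read failure  90%  3068  0x0007d6e6
# 2  Short off-line     Completed: read failure  50%  2901  0x0007d6e6
# 3  Short off-line     Completed: read failure  50%  2735  0x0007d6e6
# 4  Extended off-line  Completed: read failure  90%  2568  0x0007d6e6
# 5  Short off-line     Completed: read failure  50%  2402  0x0007d6e6
# 6  Extended off-line  Completed: read failure  90%  2235  0x0007d6e6
# 7  Short off-line     Completed: read failure  50%  2068  0x0007d6e6
# 8  Extended off-line  Completed: read failure  90%  1902  0x0007d6e6
# 9  Short off-line     Completed: read failure  50%  1735  0x0007d6e6
#10  Short off-line     Completed: read failure  50%  1569  0x0007d6e6
#11  Extended off-line  Completed: read failure  90%  1402  0x0007d6e6
#12  Short off-line     Completed: read failure  50%  1305  0x0007d6e6
#13  Extended off-line  Completed                00%   358  -
#14  Short off-line     Completed                00%   357  -
#15  Short off-line     Completed                00%   357  -
#16  Short off-line     Completed                00%     0  -
"""

# Hypothetical pattern for the rows above; real smartctl output may differ
# between versions, so treat this as a sketch.
ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+(?P<test>Short|Extended) off-line\s+"
    r"(?P<status>Completed(?:: read failure)?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)"
)


def parse_log(text):
    """Return one dict per self-test log row that matches ROW."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            row = match.groupdict()
            row["hours"] = int(row["hours"])
            rows.append(row)
    return rows


def summarize(rows):
    """Count read failures, collect their LBAs, and measure their span."""
    failures = [r for r in rows if r["status"].endswith("read failure")]
    lbas = {r["lba"] for r in failures}
    hours = [r["hours"] for r in failures]
    span_days = (max(hours) - min(hours)) / 24 if hours else 0.0
    return len(failures), lbas, span_days


if __name__ == "__main__":
    count, lbas, span_days = summarize(parse_log(SELF_TEST_LOG))
    print(f"{count} read failures at {sorted(lbas)} "
          f"over {span_days:.1f} days of power-on time")
```

Every failed test in this log stops at the same LBA, and the failures cover roughly ten weeks of power-on time, which is consistent with the "couple of months" estimate.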
From: Christopher W. <sma...@th...> - 2003-06-17 17:30:54
|
Thanks for responding. I still don't understand the overall "picture". I've embedded some questions/comments below.

At 05:18 AM 6/17/2003 -0500, Bruce Allen wrote:
>Hi Chris,
>
> > Can anyone explain these results?
>
>Short version: your drive is going bad. Back up your data and get the
>drive replaced.

I'm confused as to why the internal diagnostics don't think so.

> > Off-line data collection status: (0x02) Offline data collection activity
> > completed without error.
>
>use -o on to enable automatic offline data collection (be sure to use a
>recent version, there was a bug in the code for older versions).

Is the 5.1-4 version I'm using OK?

> >   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  -   2
>
>Note the two reallocated sectors. Typically these maxtor drives can
>reallocate up to a few hundred sectors. But at least two sectors on the
>disk are bad and were eliminated from the "usable sectors" list.

Why does it keep giving a read error on sector 0x0007d6e6? Why doesn't it relocate it if it has free sectors to do so?

> > 194 Temperature_Celsius     0x0022   082   079   042    Old_age   -   47
>
>Wow -- this is VERY high. I suggest you try and get some fan to blow air
>on this disk. It's awfully hot...

Its "running idle" temp is around 42 degrees C, so this is a minor increase while it's doing something. Even with the case off and a fan blowing directly on it, 42 is about as low as I can go. I have a physical temp monitor that verifies this measurement as real.

But the drive can see this from its sensor, and I'd assume if it was a problem it'd be reflected in the Temp value/worst/threshold series, no?

> # 1  Extended off-line  Completed: read failure  90%  3068  0x0007d6e6
> # 2  Short off-line     Completed: read failure  50%  2901  0x0007d6e6
> # 3  Short off-line     Completed: read failure  50%  2735  0x0007d6e6
> # 4  Extended off-line  Completed: read failure  90%  2568  0x0007d6e6

>The disk is having real problems. If it were mine, I would replace it
>without much delay. The fact that it is failing the self-tests in the
>same place, and has been failing them for a couple of months, should not
>give you a false sense of reassurance.

So why isn't the drive reallocating these bad sectors? Why is it always at the same address? I assume it's not reallocating them because there are 13 failures but the reallocation attribute only shows 2 reallocations.

Why do all the read attributes show raw values of zero even though the tests keep failing on read errors?

>Bottom line: you need a new disk. It looks like it's less than a year old
>(much less than 8000 hours usage) so Maxtor's warranty should cover it.

I'll probably need to provide them some sort of proof, no? Guess I'll need to find and try their diagnostics and see what those say.

-W
|
From: Bruce A. <ba...@gr...> - 2003-06-17 17:44:59
|
Hi Chris,

On Tue, 17 Jun 2003, Christopher Wolf wrote:

> Thanks for responding. I still don't understand the overall
> "picture". I've embedded some questions/comments below.
>
> At 05:18 AM 6/17/2003 -0500, Bruce Allen wrote:
> >Hi Chris,
> >
> > > Can anyone explain these results?
> >
> >Short version: your drive is going bad. Back up your data and get the
> >drive replaced.
>
> I'm confused as to why the internal diagnostics don't think so.

The internal diagnostics DO think so. However, the drive does not yet report a failing SMART status, which would indicate a predicted lifetime of < 24 hours.

> > > Off-line data collection status: (0x02) Offline data collection activity
> > > completed without error.

The key word here is 'offline'. The drive's offline tests are still running without error. But the self-tests are failing. Please read the smartctl man page for an explanation of the difference.

> >use -o on to enable automatic offline data collection (be sure to use a
> >recent version, there was a bug in the code for older versions).
>
> Is the 5.1-4 version I'm using OK?

Note -- use 5.1-11 or better yet 5.1-14.

> > >   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  -   2
> >
> >Note the two reallocated sectors. Typically these maxtor drives can
> >reallocate up to a few hundred sectors. But at least two sectors on the
> >disk are bad and were eliminated from the "usable sectors" list.
>
> Why does it keep giving a read error on sector 0x0007d6e6? Why doesn't it
> relocate it if it has free sectors to do so?

I don't know -- I can only conjecture that the read error may not be due to a bad sector.

> > > 194 Temperature_Celsius     0x0022   082   079   042    Old_age   -   47
> >
> >Wow -- this is VERY high. I suggest you try and get some fan to blow air
> >on this disk. It's awfully hot...
>
> Its "running idle" temp is around 42 degrees C, so this is a minor
> increase while it's doing something. Even with the case off and a fan
> blowing directly on it, 42 is about as low as I can go. I have a physical
> temp monitor that verifies this measurement as real.

Hmm, what's the ambient room temperature? I have a bunch (well, hundreds) of Maxtor drives in a 21 C room and they all run at 23-27 C.

> But the drive can see this from its sensor, and I'd assume if it was a
> problem it'd be reflected in the Temp value/worst/threshold series, no?

Sure -- the drive is not failing. Just remember that if the normalized value does get down to the threshold value or below, that means predicted failure in < 24 hours. Speaking from extensive experience and industry studies, each 5 C temp increase doubles the failure rate.

> > # 1  Extended off-line  Completed: read failure  90%  3068  0x0007d6e6
> > # 2  Short off-line     Completed: read failure  50%  2901  0x0007d6e6
> > # 3  Short off-line     Completed: read failure  50%  2735  0x0007d6e6
> > # 4  Extended off-line  Completed: read failure  90%  2568  0x0007d6e6
>
> >The disk is having real problems. If it were mine, I would replace it
> >without much delay. The fact that it is failing the self-tests in the
> >same place, and has been failing them for a couple of months, should not
> >give you a false sense of reassurance.
>
> So why isn't the drive reallocating these bad sectors? Why is it always at
> the same address? I assume it's not reallocating them because there are 13
> failures but the reallocation attribute only shows 2 reallocations.

I don't know. Apparently the read problem cannot be solved by reallocating sectors.

> Why do all the read attributes show raw values of zero even though the
> tests keep failing on read errors?

I don't know. It may be that the read Attributes show error RATES rather than total error numbers. So unless you are trying to read data from the bad LBA, the error rate remains zero.

> >Bottom line: you need a new disk. It looks like it's less than a year old
> >(much less than 8000 hours usage) so Maxtor's warranty should cover it.
>
> I'll probably need to provide them some sort of proof, no? Guess I'll need
> to find and try their diagnostics and see what those say.

If you just tell them that the drive is failing its SMART short & extended self-tests at the same address each time, that should be enough for them to replace it. That, after all, is what the self-tests are for.

Cheers,
Bruce
|
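The threshold rule described above (an Attribute counts as failed once its normalized VALUE is at or below THRESH) can be sketched in a few lines of Python. The four rows are copied from the attribute table posted at the start of the thread; the `Attribute` tuple and the helper names are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple

# A few rows copied from the attribute table at the start of the thread.
ATTRIBUTE_ROWS = """\
  1 Raw_Read_Error_Rate     0x0029 100 100 020 Pre-fail - 0
  5 Reallocated_Sector_Ct   0x0033 100 100 020 Pre-fail - 2
194 Temperature_Celsius     0x0022 082 079 042 Old_age  - 47
197 Current_Pending_Sector  0x0032 100 100 020 Old_age  - 1
"""


class Attribute(NamedTuple):
    """One parsed attribute row (illustrative structure, not a smartctl API)."""
    id: int
    name: str
    value: int    # normalized value, higher is healthier
    worst: int    # lowest normalized value seen so far
    thresh: int   # failure threshold set by the vendor
    kind: str     # "Pre-fail" or "Old_age"
    raw: int      # vendor-specific raw value


def parse_attributes(text):
    """Split each row into its nine whitespace-separated columns."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[0].isdigit():
            continue  # skip headers or anything that is not an attribute row
        attrs.append(Attribute(
            id=int(fields[0]), name=fields[1],
            value=int(fields[3]), worst=int(fields[4]), thresh=int(fields[5]),
            kind=fields[6], raw=int(fields[8]),
        ))
    return attrs


def has_failed(attr):
    """An Attribute has failed once VALUE is less than or equal to THRESH."""
    return attr.value <= attr.thresh


def margin(attr):
    """Normalized points remaining before the threshold is reached."""
    return attr.value - attr.thresh


if __name__ == "__main__":
    for attr in parse_attributes(ATTRIBUTE_ROWS):
        state = "FAILED" if has_failed(attr) else "ok"
        print(f"{attr.id:3d} {attr.name:<24} margin={margin(attr):3d} {state}")
```

On these rows no Attribute is at or below its threshold, which matches the PASSED overall-health result even though the self-tests are failing.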
From: Christopher W. <sma...@th...> - 2003-06-17 19:00:43
|
At 12:44 PM 6/17/2003 -0500, Bruce Allen wrote:
>The internal diagnostics DO think so. However the drive does not yet have
>failing SMART status, which indicates a predicted lifetime < 24 hours.

Ahh. Got it.

> > >Wow -- this is VERY high. I suggest you try and get some fan to blow air
> > >on this disk. It's awfully hot...
> >
> > Its "running idle" temp is around 42 degrees C, so this is a minor
> > increase while it's doing something. Even with the case off and a fan
> > blowing directly on it, 42 is about as low as I can go. I have a physical
> > temp monitor that verifies this measurement as real.
>
>Hmm, what's the ambient room temperature? I have a bunch (well,
>hundreds) of maxtor drives in a 21 C room and they all run at 23-27 C.

About 79 degrees F. (26 degrees C?)

I was playing around with it and I found if I point a box fan at it from about 2 ft. away, I can get it down to 38 C, but this does not seem a smart condition to continue in....

> > >Bottom line: you need a new disk. It looks like it's less than a year old
> > >(much less than 8000 hours usage) so Maxtor's warranty should cover it.
> >
> > I'll probably need to provide them some sort of proof, no? Guess I'll need
> > to find and try their diagnostics and see what those say.
>
>If you just tell them that the drive is failing it's SMART short &
>extended self-test at the same address each time, that should be enough
>for them to replace it. That, after all, is what the self-tests are for.

They had me run their PowerMAX tests, which reported an error code and said the utility "might" be able to fix it; I'm to call back only if it cannot. But there also "might" be data loss, so I guess I need to back the whole doggone thing up and try that. Gonna blow a few hours today....

Running the Maxtor program did add an error entry in the "Error Log", though (reproduced here just FYI):

SMART Error Log Version: 1
ATA Error Count: 1
        DCR = Device Control Register
        FR  = Features Register
        SC  = Sector Count Register
        SN  = Sector Number Register
        CL  = Cylinder Low Register
        CH  = Cylinder High Register
        D/H = Device/Head Register
        CR  = Content written to Command Register
        ER  = Error register
        STA = Status register
Timestamp is seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 1 occurred at disk power-on lifetime: 3104 hours
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER:40 SC:1a SN:e6 CL:d6 CH:07 D/H:e0 ST:d1
Sequence of commands leading to the command that caused the error were:
DCR  FR  SC  SN  CL  CH  D/H  CR  Timestamp
07   00  00  00  d6  07  e0   41  144.450
07   00  00  00  d5  07  e0   41  144.447
07   00  00  00  d4  07  e0   41  144.444
07   00  00  00  d3  07  e0   41  144.440
07   00  00  00  d2  07  e0   41  144.437

And I just noticed an old Quantum Fireball in another system acting the exact same way, so I guess I'll need to do that one too....both root drives. Yuck.

Thanks for the info!

-W
|
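The two unit conversions that appear in this message (79 degrees F as Celsius, and the 2^32 msec timestamp wrap expressed in days) are easy to check. A small Python sketch, with function names invented for this example:

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def milliseconds_to_days(msec):
    """Convert a millisecond count to days (86400 seconds per day)."""
    return msec / 1000.0 / 86400.0


if __name__ == "__main__":
    # Ambient room temperature quoted in the message.
    print(f"79 F = {fahrenheit_to_celsius(79):.1f} C")
    # Wrap period of the 32-bit millisecond counter mentioned in the error log.
    print(f"2^32 msec = {milliseconds_to_days(2 ** 32):.3f} days")
```

Both figures agree with the message: about 26 C for the room, and a wrap period of 49.710 days.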
From: Bruce A. <ba...@gr...> - 2003-06-17 20:39:34
|
Hi Chris,

> > > Its "running idle" temp is around 42 degrees C, so this is a minor
> > > increase while it's doing something. Even with the case off and a fan
> > > blowing directly on it, 42 is about as low as I can go. I have a physical
> > > temp monitor that verifies this measurement as real.
> >
> >Hmm, what's the ambient room temperature? I have a bunch (well,
> >hundreds) of maxtor drives in a 21 C room and they all run at 23-27 C.
>
> About 79 degrees F. (26 degrees C?)
>
> I was playing around with it and I found if I point a box fan at it from
> about 2 ft. away, I can get it down to 38 C, but this does not seem a smart
> condition to continue in....

You need to get an 80mm case fan (one of the kind that plugs into a spare power connector in your system) and find a way to set it up to blow air directly by the disk.

Another solution: for $50 you can buy a system case that has fans which blow air directly on/by the disks. It'll save the next one from being cooked to death.

> > > >Bottom line: you need a new disk. It looks like it's less than a year old
> > > >(much less than 8000 hours usage) so Maxtor's warranty should cover it.
> > >
> > > I'll probably need to provide them some sort of proof, no? Guess I'll need
> > > to find and try their diagnostics and see what those say.
> >
> >If you just tell them that the drive is failing it's SMART short &
> >extended self-test at the same address each time, that should be enough
> >for them to replace it. That, after all, is what the self-tests are for.
>
> They had me run their PowerMAX tests, which reported an error code and
> reported it "might" be able to fix it and call back only if it cannot. But
> there also "might" be data loss, so I guess I need to back the whole
> doggone thing up and try that. Gonna blow a few hours today....

Backing up is a very good idea. As is cooling the poor disk.

> Running the Maxtor program did add an error entry in the "Error Log",
> though (reproduced here just FYI):

See comments below.

> SMART Error Log Version: 1
> ATA Error Count: 1
>         DCR = Device Control Register
>         FR  = Features Register
>         SC  = Sector Count Register
>         SN  = Sector Number Register
>         CL  = Cylinder Low Register
>         CH  = Cylinder High Register
>         D/H = Device/Head Register
>         CR  = Content written to Command Register
>         ER  = Error register
>         STA = Status register
> Timestamp is seconds since the previous disk power-on.
> Note: timestamp "wraps" after 2^32 msec = 49.710 days.
>
> Error 1 occurred at disk power-on lifetime: 3104 hours
> When the command that caused the error occurred, the device was in an
> unknown state.
> After command completion occurred, registers were:
> ER:40 SC:1a SN:e6 CL:d6 CH:07 D/H:e0 ST:d1
> Sequence of commands leading to the command that caused the error were:
> DCR  FR  SC  SN  CL  CH  D/H  CR  Timestamp
> 07   00  00  00  d6  07  e0   41  144.450
> 07   00  00  00  d5  07  e0   41  144.447
> 07   00  00  00  d4  07  e0   41  144.444
> 07   00  00  00  d3  07  e0   41  144.440
> 07   00  00  00  d2  07  e0   41  144.437

FYI, the command that failed was Command Register (CR) 0x41. This is the ATA-4 command READ VERIFY SECTOR(S). It's reporting a failure at this LBA address [look at the register values after command completion, above]:

(bottom 4 bits of D/H)(CH)(CL)(SN) = (0)(07)(d6)(e6) = 0x0007d6e6

This is the same LBA where the device self-test is failing. So I stand my ground -- tell Maxtor to replace the disk.

> And I just noticed an old Quantum Fireball in another system acting
> the exact same way, so I guess I'll need to do that one too....both
> root drives. Yuck.

When you say "the same way" I hope you don't mean exactly the same way, just something going wrong...

Cheers,
Bruce
|
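The register arithmetic above can be written out as a short Python sketch. It assumes 28-bit LBA addressing as used in this error log, where bit 6 of the Device/Head register selects LBA mode and its low four bits carry LBA bits 27-24; the function names are made up for the example.

```python
def is_lba_mode(device_head):
    """Bit 6 of the Device/Head register is set when the command used LBA."""
    return bool(device_head & 0x40)


def lba28_from_registers(device_head, cyl_high, cyl_low, sector_number):
    """Assemble a 28-bit LBA from the four task file registers.

    Layout: (low 4 bits of D/H)(CH)(CL)(SN), most significant first.
    """
    return (
        ((device_head & 0x0F) << 24)
        | (cyl_high << 16)
        | (cyl_low << 8)
        | sector_number
    )


if __name__ == "__main__":
    # Register values after command completion, copied from the error log:
    # ER:40 SC:1a SN:e6 CL:d6 CH:07 D/H:e0 ST:d1
    device_head, cyl_high, cyl_low, sector_number = 0xE0, 0x07, 0xD6, 0xE6

    lba = lba28_from_registers(device_head, cyl_high, cyl_low, sector_number)
    print("LBA mode:", is_lba_mode(device_head))
    print(f"Failing LBA: 0x{lba:08x} ({lba} decimal)")
```

The result, 0x0007d6e6, is the same value that appears in the LBA_of_first_error column of the self-test log.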
From: Christopher W. <sma...@th...> - 2003-06-17 21:31:06
|
At 03:31 PM 6/17/2003 -0500, Bruce Allen wrote:
>You need to get an 80mm case fan (one of the kind that plugs into a spare
>power connector in your system) and find a way to set it up to blow air
>directly by the disk.

I added some (more, bay) fans and opened a grill directly behind the disk in my tower case so air is drawn directly across the drive and out.

>This is the same LBA where the device self-test is failing. So I stand my
>ground -- tell Maxtor to replace the disk.

After backing up, I ran the PowerMAX utility they told me to, and when it gave me the option to "repair", I activated it.

The disk seems to be happy now; the self-tests now pass without problems.

At this point, I put it back together and will wait to see what happens.

(The other, old Fireball drive could not be repaired and so will have to be replaced (no warranty).)

-W
|
From: Bruce A. <ba...@gr...> - 2003-06-17 21:34:20
|
Hi Christopher,

> At 03:31 PM 6/17/2003 -0500, Bruce Allen wrote:
> >You need to get an 80mm case fan (one of the kind that plugs into a spare
> >power connector in your system) and find a way to set it up to blow air
> >directly by the disk.
>
> I added some (more, bay) fans and opened a grill directly behind the disk
> in my tower case so air is drawn directly across the drive and out.

That sounds like a very good move. Is the disk temperature now lower?

> >This is the same LBA where the device self-test is failing. So I stand my
> >ground -- tell Maxtor to replace the disk.
>
> After backing up, I ran the PowerMAX utility they told me to, and when it
> gave me the option to "repair", I activated it.
>
> The disk seems to be happy now, the self tests now pass without problems.

Hmmm, interesting. I'd appreciate it if you could again post the output of smartctl -a, so I could see what the utility did to the disk.

> At this point, I put it back together and will wait to see what happens.
>
> (The other, old Fireball drive, could not be repaired and so will have to
> be replaced (no warranty).)

C'est la vie.

Cheers,
Bruce
|