From: Andreas L. <ale...@cs...> - 2004-07-20 08:44:12
|
Hi,

Recently I bought a Hitachi Travelstar. In general, it works fine. However, if I start my computer after it has been turned off for a longer period, it takes several minutes for the drive to work properly. I believe this period is getting longer and longer.

I've used the vendor tool to check the disk. It reports no error. However, smartmontools reports some UNC (0x40) errors that I cannot really interpret.

Do you have any suggestions?

Device Model:     IC25N060ATMR04-0
Serial Number:    MRG377K3HYP0AH
Firmware Version: MO3OAD4A
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 3a
Local Time is:    Tue Jul 20 09:18:04 2004 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05) Offline data collection activity was
                                        aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 645) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  53) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   092   092   062    Pre-fail  Always       -       2883584
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   136   136   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       329
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       192
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       13201
194 Temperature_Celsius     0x0002   137   137   000    Old_age   Always       -       40 (Lifetime Min/Max 19/57)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       27
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Timestamp = decimal seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 6 occurred at disk power-on lifetime: 192 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 3f 4d 98 18 e2  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 ff 3f 4d 98 18 e2 00     205.800  READ DMA
  c8 ff 3f 00 00 00 e0 00     205.800  READ DMA
  10 ff 3f 01 fe 3f af 00     205.800  RECALIBRATE [OBS-4]
  91 ff 3f 01 fe 3f af 00     205.800  INITIALIZE DEVICE PARAMETERS [OBS-6]
  c8 ff 00 16 00 00 e0 04     205.800  READ DMA

Error 5 occurred at disk power-on lifetime: 192 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 3f 4d 98 18 e2  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 ff 3f 4d 98 18 e2 00      36.400  READ DMA
  c8 ff 3f 00 00 00 e0 00      36.400  READ DMA
  10 ff 3f 01 fe 3f af 00      36.400  RECALIBRATE [OBS-4]
  91 ff 3f 01 fe 3f af 00      36.400  INITIALIZE DEVICE PARAMETERS [OBS-6]
  c8 ff 00 16 00 00 e0 04      36.300  READ DMA

Error 4 occurred at disk power-on lifetime: 192 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 21 6b 98 18 e2  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 ff 3f 4d 98 18 e2 00       7.800  READ DMA
  c8 ff 3f 00 00 00 e0 00       7.800  READ DMA
  10 ff 3f 01 fe 3f af 00       7.800  RECALIBRATE [OBS-4]
  91 ff 3f 01 fe 3f af 00       7.800  INITIALIZE DEVICE PARAMETERS [OBS-6]
  c8 ff 00 16 00 00 e0 04       7.700  READ DMA

Error 3 occurred at disk power-on lifetime: 186 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 1e 12 e6 ef

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  fe a2 01 1e 12 06 e0 02     311.200  [VENDOR SPECIFIC]
  fe a2 01 1d 12 06 e0 02     311.200  [VENDOR SPECIFIC]
  fe a2 01 1c 12 06 e0 02     311.200  [VENDOR SPECIFIC]
  fe a2 01 1b 12 06 e0 02     311.200  [VENDOR SPECIFIC]
  fe a2 01 1a 12 06 e0 02     311.200  [VENDOR SPECIFIC]

Error 2 occurred at disk power-on lifetime: 186 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 1e 12 e6 ef

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  fe a2 01 1e 12 06 e0 02     301.300  [VENDOR SPECIFIC]
  fe a2 01 1d 12 06 e0 02     301.300  [VENDOR SPECIFIC]
  fe a2 01 1c 12 06 e0 02     301.300  [VENDOR SPECIFIC]
  fe a2 01 1b 12 06 e0 02     301.200  [VENDOR SPECIFIC]
  fe a2 01 1a 12 06 e0 02     301.200  [VENDOR SPECIFIC]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       188         -
# 2  Short offline       Completed without error       00%       183         -
|
From: Andreas L. <ale...@cs...> - 2004-07-21 08:02:16
Attachments:
smart3.txt
|
Hi,

I've run smartctl -t long twice. The first run reports a bad block. However, the second run completed without any error.

After a warm-up period the drive works (after several restarts). I think I will exchange the drive.

Is it possible to precisely describe the error from the register values? The second register (ST) always has the value 51, which I think means "drive not ready" and occurs on the first tries to start the system?

Andreas

> Please run an extended self-test "-t long" to test for errors. After it
> has completed, please send the complete output of "smartctl -a" as a .txt
> email attachment. Please copy all email to the mailing list.
>
> Bruce
>
> On Tue, 20 Jul 2004, Andreas Leicher wrote:
>
> > [...]
|
From: Bruce A. <ba...@gr...> - 2004-07-21 16:02:02
|
Hi Andreas,

> I've run smartctl -t long twice. The first run reports a bad block.
> However, the second run completed without any error.

I think that the drive has reallocated this bad sector. From looking at the current pending sector count though, it looks like there are a lot of other sectors that need reallocation. I don't understand why the reallocated sector count is still zero. Can IBM's Drive Fitness Test 'repair' the disk (make the current pending sector count zero)?

> After a warm up period the drive works (after several restarts). I
> think I will exchange the drive.

That's probably the right thing to do.

> Is it possible to precisely describe the error from the register values?
> The second register (ST) always has value 51 which I think means "drive
> not ready" and occurs on the first tries to start the system?

This sounds right but I am not sure. Look in the ATA specs and in the IBM OEM documentation to see the exact definition of this register. Also, smartmontools 5.32 might do a slightly more verbose job of printing the error log.

Cheers,
Bruce
|
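On the ST question, the register values can be decoded by hand. The sketch below is a minimal illustration, assuming the Status and Error register bit layout as I recall it from the ATA/ATAPI specifications (several bits are obsolete or command-dependent in ATA/ATAPI-6, so check the spec and the vendor documentation before relying on the names). The helper names `decode_bits` and `lba28` are made up for this example. If that layout is right, 0x51 is DRDY | DSC | ERR, which reads as "device ready, error bit set" rather than "drive not ready", and 0x40 in the Error register is UNC, matching what smartctl prints.

```python
# Bit names for the ATA Status and Error registers, listed from bit 7 down
# to bit 0. These are the traditional names as I recall them; some are
# obsolete or command-dependent in ATA/ATAPI-6. Verify against the spec.
STATUS_BITS = {
    0x80: "BSY",   # busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # device seek complete (historical name)
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data (obsolete)
    0x02: "IDX",   # index (obsolete)
    0x01: "ERR",   # error: see the Error register
}
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data error
    0x20: "MC",    # media changed
    0x10: "IDNF",  # ID not found
    0x08: "MCR",   # media change request
    0x04: "ABRT",  # command aborted
    0x02: "NM",    # no media (older: track 0 not found)
    0x01: "AMNF",  # address mark not found (obsolete)
}


def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in names.items() if value & mask]


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the SN, CL, CH and DH registers.

    Only meaningful when the LBA bit (0x40) of DH is set.
    """
    if not dh & 0x40:
        raise ValueError("DH bit 6 is clear: registers hold CHS, not LBA")
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers from Error 6 in the report: ER=40 ST=51 SN=4d CL=98 CH=18 DH=e2
status = decode_bits(0x51, STATUS_BITS)
error = decode_bits(0x40, ERROR_BITS)
lba = lba28(0x4D, 0x98, 0x18, 0xE2)
print(status, error, hex(lba), lba)
```

The same register arithmetic reproduces the LBA that smartctl itself prints in the report posted later in this thread (SN=2e, CL=02, CH=34, DH=e0 gives 0x0034022e), which is a reasonable sanity check on the formula.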
From: Geoffrey K. <ge...@ge...> - 2004-07-22 18:25:45
|
Bruce Allen <ba...@gr...> writes:

> Hi Andreas,
>
> > I've run smartctl -t long twice. The first run reports a bad block.
> > However, the second run completed without any error.
>
> I think that the drive has reallocated this bad sector. From looking at
> the current pending sector count though, it looks like there are a lot of
> other sectors that need reallocation. I don't understand why the
> reallocated sector count is still zero.

This can happen if the sectors are corrupt at the raw level (the ECC data doesn't match) but are fixed by re-writing the track, so no reallocation is necessary; or if there was a transient read problem.
|
From: Bruce A. <ba...@gr...> - 2004-07-22 20:30:26
|
Hi Geoff,

> > > I've run smartctl -t long twice. The first run reports a bad block.
> > > However, the second run completed without any error.
> >
> > I think that the drive has reallocated this bad sector. From looking at
> > the current pending sector count though, it looks like there are a lot of
> > other sectors that need reallocation. I don't understand why the
> > reallocated sector count is still zero.
>
> This can happen if the sectors are corrupt at the raw level (the ECC
> data doesn't match) but are fixed by re-writing the track, so no
> reallocation is necessary; or a transient read problem.

If the problem has been fixed by re-writing the track, won't the pending sector count return to zero? In other words, won't the sectors then be removed from the pending list?

In the drive in question, the pending sector count was large, the reallocated sector count was zero, and the drive HAD shown one read scan error, but this error had disappeared in the most recent read scan, which showed zero read errors.

Since the most recent read scan showed no errors, this would imply to me that any bad sectors had either been reallocated or fixed by rewriting the track. So in this case the pending sector count ought to be zero.

What's wrong with this logic?

Cheers,
Bruce
|
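The counters this argument turns on can be pulled out of the attribute table mechanically. The following is a minimal sketch, assuming the column layout of the attribute table in the first report of this thread; the names `parse_attributes`, `watched_raw` and `at_or_below_threshold` are hypothetical, and the sample rows are a subset copied from that report. The threshold test reflects my understanding that an attribute counts as failing when its normalized VALUE is at or below a nonzero THRESH.

```python
import re

# A few rows copied from the attribute table in the original report.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   092   092   062    Pre-fail  Always       -       2883584
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       27
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

# One table row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# when_failed, and the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)

WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(text):
    """Map attribute name to its normalized value, threshold and raw value."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m["name"]] = {
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            }
    return attrs


def watched_raw(attrs):
    """Raw counts of the sector-health attributes discussed in this thread."""
    return {name: attrs[name]["raw"] for name in WATCHED if name in attrs}


def at_or_below_threshold(attrs):
    """Attributes whose normalized VALUE has reached a nonzero THRESH."""
    return [n for n, a in attrs.items() if a["thresh"] > 0 and a["value"] <= a["thresh"]]


attrs = parse_attributes(SAMPLE)
print(watched_raw(attrs))
print(at_or_below_threshold(attrs))
```

For the posted report this gives a pending count of 4, a Reallocated_Sector_Ct of 0 and a Reallocated_Event_Count of 27, with no attribute at its threshold, which is consistent with the PASSED self-assessment.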
From: Geoff K. <ge...@ge...> - 2004-07-23 00:53:44
Attachments:
smime.p7s
|
On 22/07/2004, at 1:30 PM, Bruce Allen wrote:

> Hi Geoff,
>
>>>> I've run smartctl -t long twice. The first run reports a bad block.
>>>> However, the second run completed without any error.
>>>
>>> I think that the drive has reallocated this bad sector. From looking at
>>> the current pending sector count though, it looks like there are a lot of
>>> other sectors that need reallocation. I don't understand why the
>>> reallocated sector count is still zero.
>>
>> This can happen if the sectors are corrupt at the raw level (the ECC
>> data doesn't match) but are fixed by re-writing the track, so no
>> reallocation is necessary; or a transient read problem.
>
> If the problem has been fixed by re-writing the track, won't the pending
> sector count return to zero? In other words, won't the sectors then be
> removed from the pending list?

Yes, that's how I've seen it work. Take a drive with one bad sector, overwrite that sector, and the pending count drops to zero.

> In the drive in question, the pending sector count was large, the
> reallocated sector count was zero, and the drive HAD shown one read scan
> error, but this error had disappeared in the most recent read scan, which
> showed zero read errors.
>
> Since the most recent read scan showed no errors, this would imply to me
> that any bad sectors had either been reallocated or fixed by rewriting the
> track. So in this case the pending sector count ought to be zero.
>
> What's wrong with this logic?

My best guess: the sector is read as invalid, and so it gets put in the pending list, but the read scan later reads the same sector and sees it as OK (but, I assume, doesn't remove it from the pending list). When the sector is written, the drive successfully rewrites the sector, so it gets removed from the pending list and not reallocated.

That matches the original description, which is that the trouble happens at startup but not later.
|
From: Bruce A. <ba...@gr...> - 2004-07-23 14:38:36
|
> Yes, that's how I've seen it work. Take a drive with one bad sector,
> overwrite that sector, and the pending count drops to zero.

Agreed: this is what I have also seen.

> > In the drive in question, the pending sector count was large, the
> > reallocated sector count was zero, and the drive HAD shown one read scan
> > error, but this error had disappeared in the most recent read scan, which
> > showed zero read errors.
> >
> > Since the most recent read scan showed no errors, this would imply to me
> > that any bad sectors had either been reallocated or fixed by rewriting the
> > track. So in this case the pending sector count ought to be zero.
> >
> > What's wrong with this logic?
>
> My best guess: The sector is read as invalid, and so it gets put in the
> pending list, but the read scan later reads the same sector and sees it
> as OK (but, I assume, doesn't remove it from the pending list). When
> the sector is written, the drive successfully rewrites the sector, so
> it gets removed from the pending list and not reallocated.
>
> That matches the original description, which is that the trouble
> happens at startup but not later.

Your explanation is consistent with what's observed.

I had assumed that the sector would be removed from the pending sector list as soon as the drive can read that sector (i.e., as part of the read scan). Your explanation matches the observations better. To restate it, a sector will ONLY be removed from the pending sector list on a WRITE, never on a READ. The algorithm for removing it on a WRITE is that first, you try to write to the (supposedly) bad sector, and then immediately do a verifying READ. If that succeeds, you remove the sector from the pending sector list without reallocating it. If the verifying READ fails, then you reallocate the sector to a spare sector and remove it from the pending sector list.

Geoff, do you know if all vendors do this? I think I've seen this behavior consistently on IBM and IBM/Hitachi drives. I think that Maxtor drives might remove the sector from the pending list as soon as it can be read (as part of an offline scan) but I'm not sure. Geoff, if you don't know I'll ask a contact at Maxtor about this.

Cheers,
Bruce
|
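The write-then-verify rule restated above can be written down as a toy model. This is only a sketch of the hypothesis discussed in this thread, not a description of any vendor's firmware; the class `ToyDrive`, its fields, and the split into "flaky" sectors (unreadable only while the drive is cold) and "dead" sectors (unreadable even after a rewrite) are assumptions made up for the illustration.

```python
class ToyDrive:
    """Toy model of the hypothesised pending-list rules; real firmware may differ."""

    def __init__(self, flaky=(), dead=()):
        self.flaky = set(flaky)    # unreadable only while the drive is cold
        self.dead = set(dead)      # unreadable, and rewriting in place does not help
        self.cold = True
        self.pending = set()       # stands in for Current_Pending_Sector
        self.reallocated = set()   # stands in for Reallocated_Sector_Ct

    def read(self, lba):
        """Return True on success.

        A failed read puts the sector on the pending list; a successful
        read never takes it off again (the READ half of the hypothesis).
        """
        if lba in self.reallocated:
            return True
        failed = lba in self.dead or (self.cold and lba in self.flaky)
        if failed:
            self.pending.add(lba)
        return not failed

    def write(self, lba):
        """Write, then verify by reading back; only this path clears pending."""
        if lba in self.reallocated:
            return
        self.flaky.discard(lba)          # assumption: fresh data and ECC cure a flaky sector
        verify_ok = lba not in self.dead
        if not verify_ok:
            self.dead.discard(lba)
            self.reallocated.add(lba)    # remap to a spare sector
        self.pending.discard(lba)


drive = ToyDrive(flaky={100, 101}, dead={200})
cold_reads = [drive.read(lba) for lba in (100, 101, 200)]  # cold start: all fail
drive.cold = False
warm_scan = [drive.read(lba) for lba in (100, 101)]        # later read scan succeeds
pending_after_scan = set(drive.pending)                    # nothing was cleared
drive.write(100)                                           # rewritten in place
drive.write(200)                                           # verify fails, so remapped
print(sorted(pending_after_scan), sorted(drive.pending), sorted(drive.reallocated))
```

In this model a clean read scan leaves the pending count untouched, which is the combination reported for the drive in question: pending sectors present, no reallocated sectors, and a self-test that completes without error.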
From: Geoff K. <ge...@ge...> - 2004-07-23 17:11:31
Attachments:
smime.p7s
|
On 23/07/2004, at 7:38 AM, Bruce Allen wrote:

> I had assumed that the sector would be removed from the pending sector
> list as soon as the drive can read that sector (ie, as part of the read
> scan). Your explanation matches the observations better. To restate it,
> a sector will ONLY be removed from the pending sector list on a WRITE,
> never on a READ. The algorithm for removing it on a WRITE is that first,
> you try to write to the (supposedly) bad sector, and then immediately do a
> verifying READ. If that succeeds, you remove the sector from the pending
> sector list without reallocating it. If the verifying READ fails, then
> you reallocate the sector to a spare sector and remove it from the pending
> sector list.
>
> Geoff, do you know if all vendors do this? I think I've seen this behavior
> consistently on IBM and IBM/Hitachi drives. I think that Maxtor's might
> remove the sector from the pending list as soon as it can be read (as part
> of an offline scan) but I'm not sure. Geoff, if you don't know I'll ask a
> contact at Maxtor about this.

No, I don't know. I wouldn't be surprised if it varied by vendor; both ways seem like reasonable implementations.
|
From: Hugo C. <hc...@ma...> - 2004-07-23 16:54:35
|
Dear All,

I have a disk which is consistently failing its offline self-test (see output below). Nevertheless, all the SMART attributes are OK and therefore the self-assessment for the disk status is PASS.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 2e 02 34 e0  Error: UNC 64 sectors at LBA = 0x0034022e = 3408430

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 00 40 00 02 34 e0 00     376.950  READ DMA
  c8 00 40 c0 01 34 e0 00     376.950  READ DMA
  c8 00 40 80 01 34 e0 00     376.950  READ DMA
  c8 00 40 40 01 34 e0 00     376.950  READ DMA
  c8 00 40 00 01 34 e0 00     376.950  READ DMA

Error 12 occurred at disk power-on lifetime: 968 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 70 ff 33 e0  Error: UNC 64 sectors at LBA = 0x0033ff70 = 3407728

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 00 40 40 ff 33 e0 00     372.950  READ DMA
  c8 00 40 40 ff 33 e0 00     369.400  READ DMA
  c8 00 40 00 ff 33 e0 00     369.400  READ DMA
  c8 00 40 c0 fe 33 e0 00     369.400  READ DMA
  c8 00 40 80 fe 33 e0 00     369.400  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%       700         0x00340280
# 2  Short offline       Completed: read failure       90%       677         0x00340280
# 3  Short offline       Completed: read failure       90%       653         0x00340280
# 4  Short offline       Completed: read failure       90%       630         0x00340280
# 5  Short offline       Completed: read failure       90%       606         0x00340280
# 6  Extended offline    Completed: read failure       90%       584         0x00340280
# 7  Short offline       Completed: read failure       90%       583         0x00340280
# 8  Short offline       Completed: read failure       90%       559         0x00340280
# 9  Short offline       Completed: read failure       90%       536         0x00340280
#10  Short offline       Completed: read failure       90%       512         0x00340280
#11  Short offline       Completed: read failure       90%       489         0x00340280
#12  Short offline       Completed: read failure       90%       465         0x00340280
#13  Short offline       Completed: read failure       90%       442         0x00340280
#14  Extended offline    Completed: read failure       90%       419         0x00340280
#15  Short offline       Completed: read failure       90%       418         0x00340280
#16  Short offline       Completed: read failure       90%       395         0x00340280
#17  Short offline       Completed: read failure       90%       371         0x00340280
#18  Short offline       Completed: read failure       90%       348         0x00340280
#19  Short offline       Completed: read failure       90%       324         0x00340280
#20  Short offline       Completed: read failure       90%       301         0x00340280
#21  Short offline       Completed: read failure       90%       277         0x00340280

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD200BB-00CLB0
Serial Number:    WD-WMAAR1878298
Firmware Version: 05.04E05
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Jul 23 18:44:17 2004 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

What should be the procedure in this case: replace the disk? How can I get more information about what is causing the disk to fail its offline test?

Thank you for the help.

--
Hugo Caçote @ CERN IT
Office phone: +41227674341
Mobile phone: +41797649072
|
From: Bruce A. <ba...@gr...> - 2004-07-24 13:38:23
|
Hi Hugo, > I have a disk which is permanently failing an offline (see output bellow)= ,=20 >=20 > Nevertheless all the SMART attributes are OK and therefore the=20 > self-assessment for the disk status is PASS. This is a common situation: please see the smartmontools FAQ and the page http://smartmontools.sourceforge.net/BadBlockHowTo.txt for additional details. The disk has an unreadable (uncorrectable) sector at the indicated Logical Block Address. The disk status is PASS because nothing is 'wrong' with the disk. But it can't read this 512 bytes of data because the ECC code is inconsistent. You can 'repair' the sector (force the disk to allocate a good spare sector in its place) by writing to the bad sector -- but you will lose the 512 bytes that was stored there (well, it's already lost). Is this disk behind a 3ware controller? That controller should force the sector reallocation all by itself. By the way Hugo, have you gotten any 3ware 9000 series controllers and tried out the new smartmontools support for those? Cheers, =09Bruce > After command completion occurred, registers were: > ER ST SC SN CL CH DH > -- -- -- -- -- -- -- > 40 51 40 2e 02 34 e0 Error: UNC 64 sectors at LBA =3D 0x0034022e =3D= =20 > 3408430 >=20 > Commands leading to the command that caused the error were: > CR FR SC SN CL CH DH DC Timestamp Command/Feature_Name > -- -- -- -- -- -- -- -- --------- -------------------- > c8 00 40 00 02 34 e0 00 376.950 READ DMA > c8 00 40 c0 01 34 e0 00 376.950 READ DMA > c8 00 40 80 01 34 e0 00 376.950 READ DMA > c8 00 40 40 01 34 e0 00 376.950 READ DMA > c8 00 40 00 01 34 e0 00 376.950 READ DMA >=20 > Error 12 occurred at disk power-on lifetime: 968 hours > When the command that caused the error occurred, the device was active= =20 > or idle. 
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 40 70 ff 33 e0  Error: UNC 64 sectors at LBA = 0x0033ff70 = 3407728
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
> -- -- -- -- -- -- -- --   ---------  --------------------
> c8 00 40 40 ff 33 e0 00     372.950  READ DMA
> c8 00 40 40 ff 33 e0 00     369.400  READ DMA
> c8 00 40 00 ff 33 e0 00     369.400  READ DMA
> c8 00 40 c0 fe 33 e0 00     369.400  READ DMA
> c8 00 40 80 fe 33 e0 00     369.400  READ DMA
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed: read failure        90%              700  0x00340280
> # 2  Short offline       Completed: read failure        90%              677  0x00340280
> # 3  Short offline       Completed: read failure        90%              653  0x00340280
> # 4  Short offline       Completed: read failure        90%              630  0x00340280
> # 5  Short offline       Completed: read failure        90%              606  0x00340280
> # 6  Extended offline    Completed: read failure        90%              584  0x00340280
> # 7  Short offline       Completed: read failure        90%              583  0x00340280
> # 8  Short offline       Completed: read failure        90%              559  0x00340280
> # 9  Short offline       Completed: read failure        90%              536  0x00340280
> #10  Short offline       Completed: read failure        90%              512  0x00340280
> #11  Short offline       Completed: read failure        90%              489  0x00340280
> #12  Short offline       Completed: read failure        90%              465  0x00340280
> #13  Short offline       Completed: read failure        90%              442  0x00340280
> #14  Extended offline    Completed: read failure        90%              419  0x00340280
> #15  Short offline       Completed: read failure        90%              418  0x00340280
> #16  Short offline       Completed: read failure        90%              395  0x00340280
> #17  Short offline       Completed: read failure        90%              371  0x00340280
> #18  Short offline       Completed: read failure        90%              348  0x00340280
> #19  Short offline       Completed: read failure        90%              324  0x00340280
> #20  Short offline       Completed: read failure        90%              301  0x00340280
> #21  Short offline       Completed: read failure        90%              277  0x00340280
>
>
> === START OF INFORMATION SECTION ===
> Device Model:     WDC WD200BB-00CLB0
> Serial Number:    WD-WMAAR1878298
> Firmware Version: 05.04E05
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   5
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Fri Jul 23 18:44:17 2004 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
>
> What should be the procedure for this case .... replace the disk?? How can
> I get more information of what is causing the disk to fail its offline
> test??
>
>
> Thank you for the Help
>
>
> --
> Hugo Caçote @ CERN IT
> Office phone: +41227674341
> Mobile phone: +41797649072
>
>
> _______________________________________________
> Smartmontools-support mailing list
> Sma...@li...
> https://lists.sourceforge.net/lists/listinfo/smartmontools-support
|
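The "indicated Logical Block Address" in the quoted error log can be recomputed from the raw register dump. The following is only a minimal sketch, assuming the drive is using 28-bit LBA addressing (the high bits of DH are set to 0xe0 in every logged command) and that bit 6 of the error register (0x40) flags an uncorrectable read, as the "UNC" label in the log suggests. The function names are hypothetical and are not part of smartmontools.

```python
# Sketch: rebuild the 28-bit LBA that smartctl prints from the logged
# ATA registers. Function names are illustrative, not a smartmontools API.

REGISTER_NAMES = ("ER", "ST", "SC", "SN", "CL", "CH", "DH")


def parse_register_row(row: str) -> dict:
    """Turn a row such as '40 51 40 2e 02 34 e0' into {'ER': 0x40, ...}."""
    values = [int(token, 16) for token in row.split()[:len(REGISTER_NAMES)]]
    return dict(zip(REGISTER_NAMES, values))


def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine SN (bits 0-7), CL (8-15), CH (16-23) and the low nibble
    of DH (24-27) into one 28-bit logical block address."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


regs = parse_register_row("40 51 40 2e 02 34 e0")
lba = lba28_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])
is_unc = bool(regs["ER"] & 0x40)  # assumption: bit 6 of ER marks UNC

print(f"UNC={is_unc} LBA=0x{lba:08x} = {lba}")
```

For the first quoted error this reproduces the value in the log, LBA 0x0034022e = 3408430; the second register row (SN=70, CL=ff, CH=33) gives 0x0033ff70 = 3407728 in the same way.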
From: Bruce A. <ba...@gr...> - 2004-07-20 16:57:18
|
Please run an extended self-test '-t long' to test for errors. After it has completed, please send the complete output of 'smartctl -a' as a .txt email attachment.

Please copy all email to the mailing list.

Bruce

On Tue, 20 Jul 2004, Andreas Leicher wrote:

> Hi,
>
> Recently, I've bought a hitachi travelstar. In general, it works fine.
> However, if I start my computer after I had turned it off for a longer
> period, it takes several minutes for the drive to work properly. I
> believe this period gets longer and longer ...
>
> I've used the vendor tool to check the disk. It reports no error.
> However, smartmontools reports some UNC (0x40) errors that I cannot
> really interpret:
>
> Do you have any suggestions:
>
> Device Model:     IC25N060ATMR04-0
> Serial Number:    MRG377K3HYP0AH
> Firmware Version: MO3OAD4A
> Device is:        Not in smartctl database [for details use: -P showall]
> ATA Version is:   6
> ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 3a
> Local Time is:    Tue Jul 20 09:18:04 2004 UTC
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x05) Offline data collection activity was
>                                         aborted by an interrupting command from host.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 ( 645) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        (  53) minutes.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   092   092   062    Pre-fail  Always   -           2883584
>   2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline  -           0
>   3 Spin_Up_Time            0x0007   136   136   033    Pre-fail  Always   -           1
>   4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always   -           329
>   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always   -           0
>   7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always   -           0
>   8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline  -           0
>   9 Power_On_Hours          0x0012   100   100   000    Old_age   Always   -           192
>  10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always   -           0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           72
> 191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always   -           0
> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           1
> 193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always   -           13201
> 194 Temperature_Celsius     0x0002   137   137   000    Old_age   Always   -           40 (Lifetime Min/Max 19/57)
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -           27
> 197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always   -           4
> 198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline  -           0
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always   -           0
>
> SMART Error Log Version: 1
> ATA Error Count: 6 (device log contains only the most recent five errors)
> CR = Command Register [HEX]
> FR = Features Register [HEX]
> SC = Sector Count Register [HEX]
> SN = Sector Number Register [HEX]
> CL = Cylinder Low Register [HEX]
> CH = Cylinder High Register [HEX]
> DH = Device/Head Register [HEX]
> DC = Device Command Register [HEX]
> ER = Error register [HEX]
> ST = Status register [HEX]
> Timestamp = decimal seconds since the previous disk power-on.
> Note: timestamp "wraps" after 2^32 msec = 49.710 days.
>
> Error 6 occurred at disk power-on lifetime: 192 hours
> When the command that caused the error occurred, the device was
> active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 3f 4d 98 18 e2  Error: UNC
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
> -- -- -- -- -- -- -- --   ---------  --------------------
> c8 ff 3f 4d 98 18 e2 00     205.800  READ DMA
> c8 ff 3f 00 00 00 e0 00     205.800  READ DMA
> 10 ff 3f 01 fe 3f af 00     205.800  RECALIBRATE [OBS-4]
> 91 ff 3f 01 fe 3f af 00     205.800  INITIALIZE DEVICE PARAMETERS [OBS-6]
> c8 ff 00 16 00 00 e0 04     205.800  READ DMA
>
> Error 5 occurred at disk power-on lifetime: 192 hours
> When the command that caused the error occurred, the device was
> active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 3f 4d 98 18 e2  Error: UNC
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
> -- -- -- -- -- -- -- --   ---------  --------------------
> c8 ff 3f 4d 98 18 e2 00      36.400  READ DMA
> c8 ff 3f 00 00 00 e0 00      36.400  READ DMA
> 10 ff 3f 01 fe 3f af 00      36.400  RECALIBRATE [OBS-4]
> 91 ff 3f 01 fe 3f af 00      36.400  INITIALIZE DEVICE PARAMETERS [OBS-6]
> c8 ff 00 16 00 00 e0 04      36.300  READ DMA
>
> Error 4 occurred at disk power-on lifetime: 192 hours
> When the command that caused the error occurred, the device was
> active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 21 6b 98 18 e2  Error: UNC
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
> -- -- -- -- -- -- -- --   ---------  --------------------
> c8 ff 3f 4d 98 18 e2 00       7.800  READ DMA
> c8 ff 3f 00 00 00 e0 00       7.800  READ DMA
> 10 ff 3f 01 fe 3f af 00       7.800  RECALIBRATE [OBS-4]
> 91 ff 3f 01 fe 3f af 00       7.800  INITIALIZE DEVICE PARAMETERS [OBS-6]
> c8 ff 00 16 00 00 e0 04       7.700  READ DMA
>
> Error 3 occurred at disk power-on lifetime: 186 hours
> When the command that caused the error occurred, the device was
> active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 01 1e 12 e6 ef
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
> -- -- -- -- -- -- -- --   ---------  --------------------
> fe a2 01 1e 12 06 e0 02     311.200  [VENDOR SPECIFIC]
> fe a2 01 1d 12 06 e0 02     311.200  [VENDOR SPECIFIC]
> fe a2 01 1c 12 06 e0 02     311.200  [VENDOR SPECIFIC]
> fe a2 01 1b 12 06 e0 02     311.200  [VENDOR SPECIFIC]
> fe a2 01 1a 12 06 e0 02     311.200  [VENDOR SPECIFIC]
>
> Error 2 occurred at disk power-on lifetime: 186 hours
> When the command that caused the error occurred, the device was
> active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 01 1e 12 e6 ef
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
> -- -- -- -- -- -- -- --   ---------  --------------------
> fe a2 01 1e 12 06 e0 02     301.300  [VENDOR SPECIFIC]
> fe a2 01 1d 12 06 e0 02     301.300  [VENDOR SPECIFIC]
> fe a2 01 1c 12 06 e0 02     301.300  [VENDOR SPECIFIC]
> fe a2 01 1b 12 06 e0 02     301.200  [VENDOR SPECIFIC]
> fe a2 01 1a 12 06 e0 02     301.200  [VENDOR SPECIFIC]
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error        00%              188  -
> # 2  Short offline       Completed without error        00%              183  -
>
|
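Bruce's request above (start an extended self-test with '-t long', then send the full 'smartctl -a' output as a .txt attachment) could be scripted roughly as follows. This is a sketch only: the device path /dev/hda and the report file name are assumptions, the 53-minute wait is simply the "Extended self-test routine recommended polling time" the drive reports in the quoted output, and the code only prints the command lines rather than running them.

```python
# Sketch: assemble the two smartctl invocations Bruce asks for.
# Nothing is executed; the device path and file name are assumptions.
import shlex
from typing import List

EXTENDED_TEST_MINUTES = 53  # polling time reported by the drive above


def extended_test_plan(device: str,
                       wait_minutes: int = EXTENDED_TEST_MINUTES,
                       report_file: str = "smartctl-report.txt") -> List[str]:
    """Return the shell lines for: start the long self-test, wait for it,
    then capture the complete 'smartctl -a' output in a text file."""
    start_test = ["smartctl", "-t", "long", device]
    full_report = ["smartctl", "-a", device]
    return [
        shlex.join(start_test),
        f"# wait about {wait_minutes} minutes for the self-test to finish",
        f"{shlex.join(full_report)} > {shlex.quote(report_file)}",
    ]


for line in extended_test_plan("/dev/hda"):
    print(line)
```

The resulting text file is what would be attached to the reply, with the mailing list kept in copy.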