From: Volker K. <lis...@pa...> - 2003-09-19 20:05:21
Smartmontools has some problems reporting the failed status of this
disk. The disk isn't making it easy by the looks of it, but
smartmontools needs to IMHO pick up on failed selftests.

The disk is throwing read-errors left right and center, filesystem is
corrupted, and some data is lost while other data is still retrievable,
and that s**t-disk is still reporting

SMART overall-health self-assessment test result: PASSED

Yeaaaa... right. I don't think so.

http://smartmontools.sourceforge.net/ version 5.1.4
SuSE 8.2, kernel 2.4.20

Device Model:     Maxtor 51536H2
Serial Number:    F2119J0C
Firmware Version: JAC61HU0
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 0

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status: (0x02) Offline data collection activity
                                        completed without error.
Self-test execution status:      ( 118) The previous self-test completed having
                                        the read element of the test failed.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   253   252   000    Old_age   -           32
  3 Spin_Up_Time            0x0027   228   228   063    Pre-fail  -           7753
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   -           327
  5 Reallocated_Sector_Ct   0x0033   212   212   063    Pre-fail  -           105
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  -           0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   -           0
  8 Seek_Time_Performance   0x0027   251   246   187    Pre-fail  -           40371
  9 Power_On_Hours          0x0032   247   247   000    Old_age   -           3639
 10 Spin_Retry_Count        0x002b   253   252   223    Pre-fail  -           0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  -           0
 12 Power_Cycle_Count       0x0032   251   251   000    Old_age   -           1101
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   -           0
197 Current_Pending_Sector  0x0008   237   237   000    Old_age   -           16
198 Offline_Uncorrectable   0x0008   252   252   000    Old_age   -           1
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   -           0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   -           0
201 Unknown_Attribute       0x000a   253   252   000    Old_age   -           1
202 Unknown_Attribute       0x000a   253   252   000    Old_age   -           0
203 Unknown_Attribute       0x000b   253   252   180    Pre-fail  -           3
204 Unknown_Attribute       0x000a   253   252   000    Old_age   -           0
205 Unknown_Attribute       0x000a   252   171   000    Old_age   -           2
207 Unknown_Attribute       0x002a   253   252   000    Old_age   -           0
208 Unknown_Attribute       0x002a   253   252   000    Old_age   -           0
209 Unknown_Attribute       0x0024   193   190   000    Old_age   -           0
 96 Unknown_Attribute       0x0004   253   253   000    Old_age   -           0
 97 Unknown_Attribute       0x0004   253   253   000    Old_age   -           0
 98 Unknown_Attribute       0x0004   253   253   000    Old_age   -           0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   -           0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   -           0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   -           0

SMART Self-test log, version number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed: read failure        60%             2104  0x00007d94
# 2  Extended off-line   Completed: read failure        40%             2104  0x00007d94
# 3  Extended off-line   Completed: read failure        40%             2103  0x00007d94
# 4  Short off-line      Completed: read failure        60%             2103  0x00007d94
# 5  Short off-line      Completed: read failure        60%             2103  0x00007d94

Ok, let's see whether smartd would actually ring the alarm bells:

# SMART config is:
# /dev/hda -d ata -S on -o on -a

Sep 19 20:06:56 Rescue smartd[744]: smartd version 5.1-4: S.M.A.R.T. Monitoring Daemon
Sep 19 20:06:56 Rescue smartd[744]: Home page is http://smartmontools.sourceforge.net/
Sep 19 20:06:56 Rescue smartd[744]: Using configuration file /etc/smartd.conf
Sep 19 20:06:56 Rescue smartd[746]: Device: /dev/hda, opened
Sep 19 20:06:56 Rescue smartd[746]: Device: /dev/hda, enabled SMART Attribute Autosave.
Sep 19 20:06:56 Rescue smartd[746]: Device: /dev/hda, enabled SMART Automatic Offline Testing.
Sep 19 20:06:57 Rescue smartd[746]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
Sep 19 20:06:57 Rescue smartd[746]: Started monitoring 1 ATA and 0 SCSI devices
#
# running smartctl -t short here, which fails after 60% with read error
#
Sep 19 20:21:37 Rescue smartd[746]: Signal USR1 - checking devices now rather than in 929 seconds.
Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 250
Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, SMART Usage Attribute: 209 Unknown_Attribute changed from 193 to 192
Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, Self-Test Log error count increased from 5 to 6

Not really. smartd doesn't tell me that this disk is essentially already
dead. Especially, it should pick up on

Self-test execution status: ( 118) The previous self-test completed having
the read element of the test failed.

This leaves me with a question: smartd doesn't run any self-tests. Am I
supposed to set up cron jobs for that? It would be more sensible for
smartd to take care of it.

Thanks much for the software,
Volker
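The status value ( 118) in the output above packs two fields into one byte. Here is a minimal sketch of the decoding, assuming the ATA layout in which the upper four bits hold the result code and the lower four bits hold the tenths of the test still remaining. The function name and the small lookup table are illustrative only and are not taken from smartmontools.

```python
from typing import Tuple

# Sketch: split the SMART self-test execution status byte into its two fields.
# Assumed layout: upper nibble = result code, lower nibble = tenths remaining.


def decode_self_test_status(status_byte: int) -> Tuple[int, int]:
    """Return (result_code, percent_remaining) for a status byte in 0..255."""
    if not 0 <= status_byte <= 0xFF:
        raise ValueError("status byte must fit in 8 bits")
    result_code = status_byte >> 4
    percent_remaining = (status_byte & 0x0F) * 10
    return result_code, percent_remaining


# Only the codes relevant to this thread are named; the rest stay numeric.
RESULT_TEXT = {
    0: "completed without error (or no self-test has been run)",
    7: "completed having the read element of the test failed",
}

code, remaining = decode_self_test_status(118)
print(code, remaining, RESULT_TEXT.get(code, "see the ATA specification"))
```

For 118 (0x76) this gives result code 7 with 60% remaining, which agrees with the "read failure 60%" entry for the short test in the self-test log.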
From: Bruce A. <ba...@gr...> - 2003-09-22 10:12:11
Hi Volker,

> Smartmontools has some problems reporting the failed status of this
> disk.

Please remember that Smartmontools is only *reporting* what the disk has
decided. It's not making these judgements itself.

> The disk isn't making it easy by the looks of it, but
> smartmontools needs to IMHO pick up on failed selftests.
> The disk is throwing read-errors left right and center, filesystem is
> corrupted, and some data is lost while other data is still retrievable,
> and that s**t-disk is still reporting
> SMART overall-health self-assessment test result: PASSED
> Yeaaaa... right. I don't think so.

In fact this is "correct" in the following sense: the firmware is
reporting that there is nothing intrinsically wrong with the disk. E.g.
the servo system is not failing, the motor drive is not failing, etc.

What is going wrong is that your disk has a set of bad sectors:

>   5 Reallocated_Sector_Ct 0x0033 212 212 063 Pre-fail - 105

which is very common on a good normal disk. However there are (either 1
or 16) sectors:

> 197 Current_Pending_Sector 0x0008 237 237 000 Old_age - 16
> 198 Offline_Uncorrectable 0x0008 252 252 000 Old_age - 1

which cannot be read. In other words, the disk has lost the data on
these 1 or 16 sectors. However there is nothing "wrong" with the disk --
or at least the firmware thinks not.

> SMART Self-test log, version number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Short off-line Completed: read failure 60% 2104 0x00007d94
> # 2 Extended off-line Completed: read failure 40% 2104 0x00007d94
> # 3 Extended off-line Completed: read failure 40% 2103 0x00007d94
> # 4 Short off-line Completed: read failure 60% 2103 0x00007d94
> # 5 Short off-line Completed: read failure 60% 2103 0x00007d94

This LBA 0x00007d94 is where the data has been lost.

> Ok, let's see whether smartd would actually ring the alarm bells:

It will, if you run a self-test while smartd is running in the
background. When the self-test finds the error, you'll get a report.

> # running smartctl -t short here, which fails after 60% with read error
> #
> Sep 19 20:21:37 Rescue smartd[746]: Signal USR1 - checking devices now rather than in 929 seconds.
> Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 250
> Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, SMART Usage Attribute: 209 Unknown_Attribute changed from 193 to 192
> Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, Self-Test Log error count increased from 5 to 6
>
> Not really. smartd doesn't tell me that this disk is essentially already
> dead. Especially, it should pick up on

In fact the disk is not "dead", in the sense that if it can be told to
reallocate the bad sectors, it should work OK again. Please remember
that this is not *my* choice of logic -- it's only what the disk
firmware has decided to do.

You should try running the Maxtor MaxSafe utility -- it may be able to
repair the disk. You should also be able to use some file system
recovery tools to determine what file(s) live at the LBA above.

> Self-test execution status: ( 118) The previous self-test completed having
> the read element of the test failed.
>
> This leaves me with a question: smartd doesn't run any self-tests. Am I
> supposed to set up cron jobs for that? It would be more sensible for
> smartd to take care of it.

Correct -- smartd does NOT run self-tests. I will probably add an option
to it to run short/long self-tests at regular intervals.

Cheers,
Bruce
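The distinction drawn above, a disk whose normalized attributes all sit above their thresholds while the raw counters still show unreadable sectors, can be checked mechanically. What follows is only a rough sketch: it assumes the nine-column attribute layout quoted in this thread, treats an attribute as failing when VALUE is at or below a non-zero THRESH, and uses hypothetical helper names (`parse_rows`, `below_threshold`, `unreadable_sectors`) that are not part of smartmontools.

```python
from typing import Dict, List, NamedTuple


class Attr(NamedTuple):
    ident: int
    name: str
    value: int   # normalized VALUE column
    thresh: int  # THRESH column
    raw: int     # RAW_VALUE column


# Three rows copied from the smartctl output in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 212 212 063 Pre-fail - 105
197 Current_Pending_Sector  0x0008 237 237 000 Old_age  - 16
198 Offline_Uncorrectable   0x0008 252 252 000 Old_age  - 1
"""


def parse_rows(text: str) -> List[Attr]:
    """Parse rows of: ID NAME FLAG VALUE WORST THRESH TYPE WHEN_FAILED RAW."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[0].isdigit():
            continue  # skip headers and anything not shaped like a row
        rows.append(Attr(int(fields[0]), fields[1], int(fields[3]),
                         int(fields[5]), int(fields[8])))
    return rows


def below_threshold(rows: List[Attr]) -> List[str]:
    """Names of attributes whose VALUE is at or below a non-zero THRESH."""
    return [a.name for a in rows if a.thresh > 0 and a.value <= a.thresh]


def unreadable_sectors(rows: List[Attr]) -> Dict[str, int]:
    """Non-zero raw counts for attributes 197 and 198."""
    return {a.name: a.raw for a in rows if a.ident in (197, 198) and a.raw > 0}


rows = parse_rows(SAMPLE)
print(below_threshold(rows))
print(unreadable_sectors(rows))
```

On these rows no attribute is at or below its threshold, yet the raw counts for 197 and 198 are 16 and 1. That matches the pattern described in the reply: an overall assessment of PASSED alongside sectors whose data has been lost.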