From: Bruce A. <ba...@gr...> - 2003-09-22 10:12:11
Hi Volker,

> Smartmontools has some problems reporting the failed status of this
> disk.

Please remember that Smartmontools is only *reporting* what the disk has decided. It's not making these judgements itself.

> The disk isn't making it easy by the looks of it, but
> smartmontools needs to IMHO pick up on failed selftests.
> The disk is throwing read-errors left right and center, filesystem is
> corrupted, and some data is lost while other data is still retrievable,
> and that s**t-disk is still reporting
> SMART overall-health self-assessment test result: PASSED
> Yeaaaa... right. I don't think so.

In fact this is "correct" in the following sense: the firmware is reporting that there is nothing intrinsically wrong with the disk. E.g. the servo system is not failing, the motor drive is not failing, etc. What is going wrong is that your disk has a set of bad sectors:

> 5 Reallocated_Sector_Ct 0x0033 212 212 063 Pre-fail - 105

which is very common on a good normal disk. However, there are (either 1 or 16) sectors:

> 197 Current_Pending_Sector 0x0008 237 237 000 Old_age - 16
> 198 Offline_Uncorrectable 0x0008 252 252 000 Old_age - 1

which cannot be read. In other words, the disk has lost the data on these 1 or 16 sectors. However, there is nothing "wrong" with the disk -- or at least the firmware thinks not.

> SMART Self-test log, version number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Short off-line Completed: read failure 60% 2104 0x00007d94
> # 2 Extended off-line Completed: read failure 40% 2104 0x00007d94
> # 3 Extended off-line Completed: read failure 40% 2103 0x00007d94
> # 4 Short off-line Completed: read failure 60% 2103 0x00007d94
> # 5 Short off-line Completed: read failure 60% 2103 0x00007d94

This LBA 0x00007d94 is where the data has been lost.

> Ok, let's see whether smartd would actually ring the alarm bells:

It will, if you run a self-test while smartd is running in the background. When the self-test finds the error, you'll get a report.

> # running smartctl -t short here, which fails after 60% with read error
> #
> Sep 19 20:21:37 Rescue smartd[746]: Signal USR1 - checking devices now rather than in 929 seconds.
> Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 250
> Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, SMART Usage Attribute: 209 Unknown_Attribute changed from 193 to 192
> Sep 19 20:21:37 Rescue smartd[746]: Device: /dev/hda, Self-Test Log error count increased from 5 to 6
>
> Not really. smartd doesn't tell me that this disk is essentially already
> dead. Especially, it should pick up on

In fact the disk is not "dead", in the sense that if it can be told to reallocate the bad sectors, it should work OK again. Please remember that this is not *my* choice of logic -- it's only what the disk firmware has decided to do.

You should try running the Maxtor MaxSafe utility -- it may be able to repair the disk. You should also be able to use some file system recovery tools to determine which file(s) live at the LBA above.

> Self-test execution status: ( 118) The previous self-test completed having
> the read element of the test failed.
>
> This leaves me with a question: smartd doesn't run any self-tests. Am I
> supposed to set up cron jobs for that? It would be more sensible for
> smartd to take care of it.

Correct -- smartd does NOT run self-tests. I will probably add an option to it to run short/long self-tests at regular intervals.

Cheers,
Bruce
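
For reference, the two numbers quoted in this message (the self-test execution status value 118 and the failing LBA 0x00007d94) can be unpacked by hand. The sketch below is only an illustration, not code from smartmontools: the function names are made up for this example, the status table is a partial list of the ATA self-test status codes, and the byte-offset calculation assumes 512-byte sectors, which nothing in the report confirms.

```python
# Illustrative helpers; names are hypothetical, not part of smartmontools.

# Upper nibble of the self-test execution status byte (partial list).
SELF_TEST_STATUS = {
    0: "completed without error, or no self-test has been run",
    1: "aborted by the host",
    2: "interrupted by a host reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed: an unknown test element failed",
    5: "completed: the electrical element failed",
    6: "completed: the servo/seek element failed",
    7: "completed: the read element failed",
    15: "self-test in progress",
}

SECTOR_SIZE = 512  # ASSUMPTION: bytes per sector; not stated in the report


def decode_self_test_status(value):
    """Split the status byte into (status code, percent remaining)."""
    code = (value >> 4) & 0x0F       # upper nibble: outcome of the test
    remaining = (value & 0x0F) * 10  # lower nibble: tenths of the test left
    return code, remaining


def lba_to_byte_offset(lba_hex, sector_size=SECTOR_SIZE):
    """Convert an LBA printed in hex into a byte offset from the disk start."""
    return int(lba_hex, 16) * sector_size


if __name__ == "__main__":
    code, remaining = decode_self_test_status(118)
    print(code, SELF_TEST_STATUS.get(code, "reserved"), f"{remaining}% remaining")
    print(int("0x00007d94", 16), lba_to_byte_offset("0x00007d94"))
```

With the value 118 (0x76), the upper nibble is 7 (read element failed) and the lower nibble is 6, i.e. 60% of the test remaining, which agrees with the "read failure 60%" entries in the self-test log. The LBA 0x00007d94 is sector 32148 in decimal; to map it to a file you would still have to subtract the partition's start sector and divide by the file system block size, which are not given here.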