From: Guy C. <gm...@sa...> - 2002-10-23 13:00:13
I have a misbehaving drive that has failed some diagnostics, but smartctl still reports it as being OK.

Syslog output:

Oct 22 10:54:07 rlx-1-1-13 kernel: EXT2-fs error (device ide1(22,1)): ext2_write_inode: unable to read inode block - inode=14, block=4
Oct 22 10:54:09 rlx-1-1-13 kernel: end_request: I/O error, dev 16:01 (hdc), sector 32
Oct 22 10:54:09 rlx-1-1-13 kernel: end_request: I/O error, dev 16:01 (hdc), sect

After running SMART tests on the drive I get the following results:

/usr/sbin/smartctl -L /dev/hdc
<snip>
smartctl version 5.0-10 Copyright (C) 2002 Bruce Allen
SMART Self-test log, version number 1
Num  Test_Description   Status                         Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended off-line  Completed: servo/seek failure  90%        2319
# 2  Short off-line     Completed: servo/seek failure  90%        2319
# 3  Short off-line     Completed: servo/seek failure  90%        2319
# 4  Short off-line     Completed                      00%        0

However, smartctl -c still reports the drive as being OK:

sh-2.05# /usr/sbin/smartctl -c /dev/hdc
<snip>
Device Model: FUJITSU MHR2040AT
Serial Number: NJ29T2712MSM
Firmware Version: 30BA
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 1
SMART support is: Enabled
SMART overall-health self-assessment test result: PASSED

sh-2.05# /usr/sbin/smartctl -l /dev/hdc
smartctl version 5.0-10 Copyright (C) 2002 Bruce Allen
Home page of smartctl is http://smartmontools.sourceforge.net/
SMART Error Log
SMART Error Logging Version: 1
No Errors Logged

Is this a bug in smartmontools, or is this not really a disk hardware error?

Cheers,

Guy Coates

--
Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199
From: Guy C. <gm...@sa...> - 2002-10-23 13:50:08
After running an extended test on the drive I now get:

sh-2.05# /usr/sbin/smartctl -c /dev/hdc
smartctl version 5.0-10 Copyright (C) 2002 Bruce Allen
Home page of smartctl is http://smartmontools.sourceforge.net/
Device Model: FUJITSU MHR2040AT
Serial Number: NJ29T2712MSM
Firmware Version: 30BA
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 1
SMART support is: Enabled
Attribute ID 2 Failed
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA
sh-2.05# echo $?
0

However, smartctl -c still exits with status 0 under these conditions. I would like to use smartctl in a script to poll for drives that are failing. Can smartctl be fixed to return a non-zero exit code when a disk has failed?

Secondly, should I be running extended or short tests? These machines are in a compute farm and I don't want to degrade I/O performance unless I really need to. Will the short test pick up most failures eventually?

Cheers,

Guy
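For the polling use case described above, one workaround that does not depend on the exit status is to match the overall-health line in the output text. The following is only a minimal sketch, assuming the report format quoted in this thread (smartctl 5.0-10, where -c prints the health line); health_status is a hypothetical helper name, not part of smartmontools.

```shell
#!/bin/sh
# Sketch: classify a smartctl health report by its text, not its exit status.
# health_status is a hypothetical helper; it reads the report on stdin.
health_status() {
    # Pick out the overall-health line as quoted in the messages above.
    # "|| true" keeps a missing line from aborting scripts run with set -e.
    result=$(grep 'SMART overall-health self-assessment test result:' || true)
    case "$result" in
        *PASSED*) echo passed;  return 0 ;;
        *FAILED*) echo failed;  return 1 ;;
        *)        echo unknown; return 2 ;;  # no health line in the input
    esac
}

# Example use (assumption: -c is the health check, as in smartctl 5.0-10):
#   /usr/sbin/smartctl -c /dev/hdc | health_status || echo "check hdc"
```

Matching on text is fragile if the wording changes between versions, so this is a stopgap rather than a substitute for a proper exit status.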
From: Guy C. <gm...@sa...> - 2002-10-23 14:39:03
Attachments:
ataprint.diff
This patch makes smartctl exit with 1 if it detects a SMART failure.

Guy
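With a smartctl that exits non-zero on a detected failure, as the patch above describes, a polling script can test the exit status directly. This is a rough sketch under that assumption; SMARTCTL and poll_drives are illustrative names, and the device list is whatever the caller passes in.

```shell
#!/bin/sh
# Sketch: poll drives by exit status, assuming a smartctl patched to exit 1
# on a detected SMART failure. SMARTCTL and poll_drives are illustrative names.
SMARTCTL=${SMARTCTL:-/usr/sbin/smartctl}

poll_drives() {
    failed=0
    for dev in "$@"; do
        # -c is the health-check option used in this thread (smartctl 5.0-10).
        if ! "$SMARTCTL" -c "$dev" >/dev/null 2>&1; then
            echo "SMART failure reported: $dev"
            failed=1
        fi
    done
    # Return 1 if any drive was reported, 0 otherwise.
    return "$failed"
}

# Example use:
#   poll_drives /dev/hda /dev/hdc || echo "at least one drive needs attention"
```

Note that any non-zero exit is treated as a failure here, so a missing binary or an unreadable device would also be reported.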