From: Bill H. <hud...@ve...> - 2006-01-17 06:30:34
[Per the support page, I'm not subscribed to this email list. Please CC me on your replies! Thx]

This is not a new story. I've found other instances of my issue...

I have Fedora Core-4 on a laptop. It has two hard drives, one of which is a 60-GB Fujitsu. After getting smartd running, it started reporting 4 Offline uncorrectable sectors. It reported this every half hour, when it woke up. Every day. All day.

I ran, on numerous occasions, offline, short, and long tests, and all passed. They always pass. See smartctl -a /dev/hdc at the end of this message. You'll see errors from a *long* time ago - and IIRC it was the drive connection that was loose (it's been a while... :-)

I finally used the HOWTO at sourceforge. The file system on hdc6, which contained the 'bad' sector(s), was reiserfs. Just to go along with the HOWTO, I laid down an ext3 file system in its place (I am lucky in this instance, just old data on this partition, and a copy at that).

While creating the ext3 file system, I used -cc to run badblocks & automatically update the bad block list. The test wrote & then read the various patterns (10101010, 01010101, 11111111...) and reported no errors to stdout/stderr (or in syslog).

Because of this, I didn't bother to play with tune2fs or the latest dd (with oflag=direct). Instead, I ran a 'long' test again. No errors.

So, I killed smartd (the daemon) and started it from the cmd line with debug options --- and it starts to spew debug output, then, far too quickly I think, it reports that there have been 4 offline uncorrectable sectors and sends email. This happens in under 5 seconds:

1 root petey 100% /var/local/src/freeTTS > smartd -d -r ioctl
smartd version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Opened configuration file /etc/smartd.conf
Configuration file /etc/smartd.conf parsed.
Device: /dev/hda, opened
REPORT-IOCTL: DeviceFD=3 Command=IDENTIFY DEVICE
REPORT-IOCTL: DeviceFD=3 Command=IDENTIFY DEVICE returned 0
Device: /dev/hda, found in smartd database.
REPORT-IOCTL: DeviceFD=3 Command=SMART ENABLE
REPORT-IOCTL: DeviceFD=3 Command=SMART ENABLE returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE THRESHOLDS
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE THRESHOLDS returned 0
Device: /dev/hda, is SMART capable. Adding to "monitor" list.
Device: /dev/hdc, opened
REPORT-IOCTL: DeviceFD=3 Command=IDENTIFY DEVICE
REPORT-IOCTL: DeviceFD=3 Command=IDENTIFY DEVICE returned 0
Device: /dev/hdc, found in smartd database.
REPORT-IOCTL: DeviceFD=3 Command=SMART ENABLE
REPORT-IOCTL: DeviceFD=3 Command=SMART ENABLE returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE THRESHOLDS
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE THRESHOLDS returned 0
Device: /dev/hdc, is SMART capable. Adding to "monitor" list.
Monitoring 2 ATA and 0 SCSI devices
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned 0
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES
REPORT-IOCTL: DeviceFD=3 Command=SMART READ ATTRIBUTE VALUES returned 0
Device: /dev/hdc, 4 Offline uncorrectable sectors
Sending warning via mail to ro...@pe...daceks.home ...
Warning via mail to ro...@pe...daceks.home: successful

With the error being reported every 30 minutes, I can only guess that perhaps smartd is not "remembering" what the error log on the drive shows. Every time it wakes up it's like the guy in "50 First Dates" who introduces himself every 15 seconds... :-)

The trouble is, it's polluting my syslog and I'm getting emails. And I worry that one day it will report an actual error and I'll ignore it.

I've searched the 'net for how to wipe or reset the SMART error log, but it does not seem to be possible. Or is it?

I would prefer to keep running smartd, but if I can't get it to shut up about errors that occurred more than 2,000 hours ago...or am I completely missing the point here?

Can anyone help?
TIA
Bill Hudacek

================================================================================
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     FUJITSU MHS2060AT
Serial Number:    NL00T3213PC4
Firmware Version: 8004
User Capacity:    60,011,642,880 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 3a
Local Time is:    Tue Jan 17 01:09:22 2006 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 492) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  83) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME              FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate         0x000f  100   100   046    Pre-fail Always  -           85813
  2 Throughput_Performance      0x0005  100   100   030    Pre-fail Offline -           306
  3 Spin_Up_Time                0x0003  100   100   025    Pre-fail Always  -           25601
  4 Start_Stop_Count            0x0032  094   094   000    Old_age  Always  -           3618
  5 Reallocated_Sector_Ct       0x0033  100   100   024    Pre-fail Always  -           0
  7 Seek_Error_Rate             0x000f  100   089   047    Pre-fail Always  -           3768
  8 Seek_Time_Performance       0x0005  100   100   019    Pre-fail Offline -           0
  9 Power_On_Seconds            0x0032  001   001   000    Old_age  Always  -           21250h+20m+19s
 10 Spin_Retry_Count            0x0013  100   100   020    Pre-fail Always  -           0
 12 Power_Cycle_Count           0x0032  091   091   000    Old_age  Always  -           1363
192 Emergency_Retract_Cycle_Ct  0x0032  099   099   000    Old_age  Always  -           27
193 Load_Cycle_Count            0x0032  053   053   000    Old_age  Always  -           174310
194 Temperature_Celsius         0x0022  100   090   000    Old_age  Always  -           41 (Lifetime Min/Max 15/57)
195 Hardware_ECC_Recovered      0x001a  100   100   000    Old_age  Always  -           8973
196 Reallocated_Event_Count     0x0032  100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector      0x0012  100   080   000    Old_age  Always  -           0
198 Off-line_Scan_UNC_Sector_Ct 0x0010  098   098   000    Old_age  Offline -           4
199 UDMA_CRC_Error_Count        0x003e  200   200   000    Old_age  Always  -           0
200 Write_Error_Count           0x000f  100   100   060    Pre-fail Always  -           28468
203 Run_Out_Cancel              0x0002  100   100   000    Old_age  Always  -           3728051929362

SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
<snip>
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22 occurred at disk power-on lifetime: 19400 hours (808 days + 8 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 0a 1d 44 f6 e0  Error: UNC 10 sectors at LBA = 0x00f6441d = 16139293

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 10 17 44 f6 e0 00  1d+02:10:54.015   READ DMA

Error 21 occurred at disk power-on lifetime: 19400 hours (808 days + 8 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 0a 1d 44 f6 e0  Error: UNC 10 sectors at LBA = 0x00f6441d = 16139293

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 10 17 44 f6 e0 00  1d+02:08:04.778   READ DMA

Error 20 occurred at disk power-on lifetime: 19376 hours (807 days + 8 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 02 1d 44 f6 e0  Error: UNC 2 sectors at LBA = 0x00f6441d = 16139293

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 17 44 f6 e0 00  01:41:24.273      READ DMA

Error 19 occurred at disk power-on lifetime: 19376 hours (807 days + 8 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 0a 1d 44 f6 e0  Error: UNC 10 sectors at LBA = 0x00f6441d = 16139293

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 10 17 44 f6 e0 00  01:34:38.528      READ DMA

Error 18 occurred at disk power-on lifetime: 19274 hours (803 days + 2 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 02 1d 44 f6 e0  Error: UNC 2 sectors at LBA = 0x00f6441d = 16139293

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 17 44 f6 e0 00  2d+02:26:08.609   READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%            21246  -
# 2  Extended offline    Completed without error        00%            21227  -
# 3  Conveyance offline  Completed without error        00%            20864  -
# 4  Extended offline    Completed without error        00%            20864  -
# 5  Extended offline    Completed: read failure        90%            19310  16139293
# 6  Extended offline    Completed: read failure        90%            19261  16139293
# 7  Conveyance offline  Completed without error        00%            11531  -
# 8  Extended offline    Completed without error        00%             7136  -
# 9  Short offline       Completed without error        00%             7134  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
From: Bruce A. <ba...@gr...> - 2006-01-18 18:27:31
> With the error being reported every 30 minutes, I can only guess that
> perhaps smartd is not "remembering" what the error log on the drive
> shows. Every time it wakes up it's like the guy in "50 First Dates"
> who introduces himself every 15 seconds... :-)
>
> The trouble is, it's polluting my syslog and I'm getting emails. And
> I worry that one day it will report an actual error and I'll ignore
> it.
>
> I've searched the 'net for how to wipe or reset the SMART error log,
> but it does not seem to be possible. Or is it?
>
> I would prefer to keep running smartd, but if I can't get it to shut
> up about errors that occurred more than 2,000 hours ago...or am I
> completely missing the point here?
>
> Can anyone help?

It sounds as if your disk is one of those that does not reset the pending or uncorrectable sector counts. Add '-C 0' or '-U 0' to your smartd.conf file. See the smartd or smartd.conf man page for more details.

Cheers,
Bruce
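
[Editor's illustration] A minimal sketch of what the suggested change might look like in /etc/smartd.conf, assuming the affected drive is /dev/hdc as in this thread. The use of `-a` and the `-m root` mail target are assumptions, not the poster's actual configuration. Per the smartd.conf man page, `-a` turns on the default checks, which include `-C 197` and `-U 198`; a `-U 0` placed after it switches off only the offline-uncorrectable warning.

```
# /etc/smartd.conf (illustrative fragment, not the poster's real file)
# -a      default set of checks (includes -C 197 and -U 198)
# -U 0    do not warn about the Offline Uncorrectable count (attribute 198)
# -m root mail warnings to root (assumed address)
/dev/hdc -a -U 0 -m root
```

If the file begins with a DEVICESCAN entry, smartd ignores the lines after it, so a per-device line like this one only takes effect once DEVICESCAN is removed or commented out. smartd needs a restart (or a SIGHUP) to re-read the file.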
From: Bill H. <hud...@ve...> - 2006-01-19 02:29:22
Bruce Allen wrote:

>> With the error being reported every 30 minutes, I can only guess that
>> perhaps smartd is not "remembering" what the error log on the drive
>> shows. Every time it wakes up it's like the guy in "50 First Dates"
>> who introduces himself every 15 seconds... :-)
>>
>> The trouble is, it's polluting my syslog and I'm getting emails. And
>> I worry that one day it will report an actual error and I'll ignore
>> it.
>>
>> I've searched the 'net for how to wipe or reset the SMART error log,
>> but it does not seem to be possible. Or is it?
>>
>> I would prefer to keep running smartd, but if I can't get it to shut
>> up about errors that occurred more than 2,000 hours ago...or am I
>> completely missing the point here?
>>
>> Can anyone help?
>
>
> It sounds as if your disk is one of those that does not reset the
> pending or uncorrectable sector counts. Add '-C 0' or '-U 0' to your
> smartd.conf file. See the smartd or smartd.conf man page for more
> details.
>
> Cheers,
> Bruce
>

OK. I was afraid of that...

If I'm understanding your directions and the man page, -U 0 is the option and value to use - however this will leave me exposed, though it's /much/ more prudent than setting aside the use of the SMART tools altogether.

With -U 0, smartd no longer reports the problem.

Thanks for the help. This should be a FAQ - Fujitsu makes *a lot* of laptop hard drives :-)

Regards,
Bill
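
[Editor's illustration] On the exposure concern: later smartmontools releases (not, as far as I can tell, the 5.33 used in this thread) document a '+' suffix for `-C` and `-U` in the smartd.conf man page. With it, smartd reports only when the raw count has increased between two check cycles, so a stale count of 4 would stay quiet while a fifth sector would still raise a warning. A hedged sketch, with the same assumed device and mail target as before:

```
# /etc/smartd.conf (illustrative; requires a smartd that supports the '+' suffix)
# -U 198+  warn only when the Offline Uncorrectable count (attribute 198) grows
/dev/hdc -a -U 198+ -m root
```

Check the smartd.conf man page shipped with the installed version before relying on this form.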