From: Jan S. <jan...@we...> - 2003-03-16 23:49:47
Hi out there!

I'm running smartmontools 5.1-4 on a SuSE 8.0 system. It has already done a great job by telling me that one of my hard disks was going to die within the next 24 hours, so I had the chance to create a backup. The disk finally crashed as expected, but I didn't lose a single byte of data. Thank you for this great tool!

Now I have a problem with the replacement disk which is a brand new IBM IC35L060AVV207-0 connected to a Promise Ultra100TX2 (my kernel detects this controller as a PDC20268 chip) as secondary master (/dev/hdc): When smartd turns on SMART monitoring for this disk, it usually works fine. But when I switched on the system this evening, the disk was not accessible after starting smartd. The disk LED was permanently lit and I had the following entries in my /var/log/messages:

smartd[766]: Using configuration file /etc/smartd.conf
smartd[777]: Device: /dev/hda, opened
smartd[777]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
smartd[777]: Device: /dev/hdc, opened
kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
kernel: hdc: drive not ready for command
smartd[777]: Device: /dev/hdc, Read SMART Error Log Failed
smartd[777]: Device: /dev/hdc, is SMART capable. Adding to "monitor" list.
smartd[777]: Started monitoring 2 ATA and 0 SCSI devices
smartd[777]: Device: /dev/hdc, not capable of SMART self-check
smartd[777]: Device: /dev/hdc, failed to read SMART Attribute Data
smartd[777]: Device: /dev/hdc, Read SMART Self Test Log Failed
kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
kernel: hdc: drive not ready for command

A simple reboot made the drive work well.
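
For readers who want to check the kernel message, the status byte 0x58 can be decoded bit by bit. The following is a minimal sketch, assuming the conventional ATA status register layout; the three names that appear in the log (DriveReady, SeekComplete, DataRequest) come from the message above, while the names for the remaining bits and the helper `decode_ata_status` are illustrative assumptions.

```python
# Sketch: decode an ATA status byte into flag names.
# Bit positions follow the conventional ATA status register layout.
# Only DriveReady, SeekComplete and DataRequest are confirmed by the
# log above; the other names are assumed labels for illustration.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of all flags set in the status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


if __name__ == "__main__":
    print(decode_ata_status(0x58))
```

For 0x58 the set bits are 0x40, 0x10 and 0x08, which matches the flags the kernel printed. Note that the error bit (0x01) is not set; the drive is ready and still requesting a data transfer.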

Just in case this is important: My smartd.conf contains only the two lines

/dev/hda -a
/dev/hdc -a

smartctl -a /dev/hdc creates the following output:

=== START OF INFORMATION SECTION ===
Device Model:     IC35L060AVV207-0
Serial Number:    VNVB01G2RAY87G
Firmware Version: V22OA63A
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 3a
Local Time is:    Mon Mar 17 00:28:24 2003 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status: (0x00) Offline data collection activity was never started.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete off-line data collection: (1452) seconds.
Offline data collection capabilities: (0x1b)
    SMART execute Offline immediate.
    Automatic timer ON/OFF support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 24) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b  100   100   060    Pre-fail  -           0
  2 Throughput_Performance  0x0005  100   100   050    Pre-fail  -           0
  3 Spin_Up_Time            0x0007  100   100   024    Pre-fail  -           12895649956
  4 Start_Stop_Count        0x0012  100   100   000    Old_age   -           29
  5 Reallocated_Sector_Ct   0x0033  100   100   005    Pre-fail  -           0
  7 Seek_Error_Rate         0x000b  100   100   067    Pre-fail  -           0
  8 Seek_Time_Performance   0x0005  100   100   020    Pre-fail  -           0
  9 Power_On_Hours          0x0012  100   100   000    Old_age   -           146
 10 Spin_Retry_Count        0x0013  100   100   060    Pre-fail  -           0
 12 Power_Cycle_Count       0x0032  100   100   000    Old_age   -           29
192 Power-Off_Retract_Count 0x0032  100   100   050    Old_age   -           33
193 Load_Cycle_Count        0x0012  100   100   050    Old_age   -           33
194 Temperature_Celsius     0x0002  189   189   000    Old_age   -           29 (Lifetime Min/Max 17/32)
196 Reallocated_Event_Count 0x0032  100   100   000    Old_age   -           0
197 Current_Pending_Sector  0x0022  100   100   000    Old_age   -           0
198 Offline_Uncorrectable   0x0008  100   100   000    Old_age   -           0
199 UDMA_CRC_Error_Count    0x000a  200   200   000    Old_age   -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log, version number 1
Num  Test_Description  Status     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive  Completed  00%        144              -
# 2  Short captive     Completed  00%        144              -

Due to being short of time I haven't run IBM's Drive Fitness Test until now, but as you can see from smartctl's output, the drive doesn't report any errors after an extended SMART self-test which I executed after rebooting the machine.

This is the first time this hard disk behaved strangely; it runs well for hours and hours even under heavy load, so I don't think it's a general problem with this disk, the controller or the combination of these components.

I'd be very pleased if you could give me a hint to solve this problem - be it by telling me that my system is faulty or by changing smartmontools! ;-)

Greetings
Jan

From: Bruce A. <ba...@gr...> - 2003-03-17 01:50:00
Hi Jan,

On Mon, 17 Mar 2003, Jan Statzner wrote:

> Hi out there!
>
> I'm running smartmontools 5.1-4 on a SuSE 8.0 system. It has already
> done a great job by telling me that one of my hard disks was going to
> die within the next 24 hours, so I had the chance to create a backup.
> The disk finally crashed as expected, but I didn't lose a single byte
> of data. Thank you for this great tool!

That's wonderful -- I'm delighted that it worked as it was supposed to.

> Now I have a problem with the replacement disk which is a brand new IBM
> IC35L060AVV207-0

I'm sorry you are having problems again. Some good news -- I own a lot of these GXP180 disks (320 or 620 of them, depending upon how you count, though the 120 GB versions) and like them *very* much. And they have well-documented SMART functionality (thanks, IBM).

> connected to a Promise Ultra100TX2 (my kernel detects this controller
> as a PDC20268 chip)

We've had a warning about problems with some Promise controller chips. See the WARNINGS file:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/smartmontools/sm5/WARNINGS?rev=HEAD&content-type=text/vnd.viewcvs-markup
where it says:

SYSTEM:   Box with Promise 20265 IDE-controller (pdc202xx-driver) and > 2.4.18 kernel with ide-taskfile support
PROBLEM:  Smartctl locks system solid when used on /dev/hd[ef].
REPORTER: Georg Acher <ac...@in...>
LINK:     http://sourceforge.net/mailarchive/forum.php?thread_id=1457979&forum_id=12495
NOTE:     Lockup doesn't happen with 2.4.18 kernel, and doesn't affect /dev/hd[a-d]
          This appears to be a problem with the pdc202xx-driver and has been reported to the pcdx maintainers. If you enable the Promise-BIOS (ATA100-BIOS) then everything will work fine. But if you disable it, then the machine will hang.

I wonder if you are perhaps seeing some manifestation of this problem. But it doesn't really sound that way...

> as secondary master (/dev/hdc): When
> smartd turns on SMART monitoring for this disk, it usually works fine.
> But when I switched on the system this evening, the disk was not
> accessible after starting smartd. The disk LED was permanently lit and I
> had the following entries in my /var/log/messages:
>
> smartd[766]: Using configuration file /etc/smartd.conf
> smartd[777]: Device: /dev/hda, opened
> smartd[777]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
> smartd[777]: Device: /dev/hdc, opened
> kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
> kernel: hdc: drive not ready for command
> smartd[777]: Device: /dev/hdc, Read SMART Error Log Failed
> smartd[777]: Device: /dev/hdc, is SMART capable. Adding to "monitor" list.
> smartd[777]: Started monitoring 2 ATA and 0 SCSI devices
> smartd[777]: Device: /dev/hdc, not capable of SMART self-check
> smartd[777]: Device: /dev/hdc, failed to read SMART Attribute Data
> smartd[777]: Device: /dev/hdc, Read SMART Self Test Log Failed
> kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
> kernel: hdc: drive not ready for command
>
> A simple reboot made the drive work well.
>
> Just in case this is important: My smartd.conf contains only
> the two lines
> /dev/hda -a
> /dev/hdc -a

This is fine.

> smartctl -a /dev/hdc creates the following output:
>
> === START OF INFORMATION SECTION ===
> Device Model: IC35L060AVV207-0
> Serial Number: VNVB01G2RAY87G
<SNIP>
> Automatic timer ON/OFF support.

You might want to add the -o Directive for the IBM disk, which supports this automatic timer functionality.

> 3 Spin_Up_Time 0x0007 100 100 024 Pre-fail - 12895649956

If you upgrade to 5.1-9 you'll get this presented as average/last spin up time -- slightly more readable.
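
To illustrate why the raw value looks so odd, here is a minimal sketch that splits it into 16-bit words. It assumes the raw attribute value packs several small counters into consecutive 16-bit fields; the helper name `split_words` and the interpretation of each word are assumptions, not taken from the smartmontools source.

```python
# Sketch: split a packed SMART raw value into 16-bit words.
# Assumption: the raw value holds several 16-bit fields, lowest word
# first. Which word is "last" and which is "average" spin-up time is
# not confirmed here.
def split_words(raw, count=3):
    """Return `count` 16-bit words of `raw`, least significant first."""
    return [(raw >> (16 * i)) & 0xFFFF for i in range(count)]


if __name__ == "__main__":
    print(split_words(12895649956))
```

For 12895649956 (0x000300A400A4) the words are 164, 164 and 3. The two lowest words are equal, so under this assumption the last and the average spin-up figures would both read 164, whichever order they are stored in.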

> SMART Self-test log, version number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended captive Completed 00% 144 -
> # 2 Short captive Completed 00% 144

I notice that you've been running captive self-tests -- these WOULD busy out the drive -- is it possible that running one of these tests was what turned the drive light on solid (for tens of minutes)?

> Due to being short of time I haven't run IBM's Drive Fitness Test until
> now, but as you can see from smartctl's output, the drive doesn't report
> any errors after an extended SMART self-test which I executed after
> rebooting the machine.

OK, that answers my question.

> This is the first time this hard disk behaved strangely; it runs well for
> hours and hours even under heavy load, so I don't think it's a general
> problem with this disk, the controller or the combination of these
> components.

I agree with you -- the disk looks healthy and I am also sceptical that the problem is due to the Promise controller.

> I'd be very pleased if you could give me a hint to solve this problem -
> be it by telling me that my system is faulty or by changing
> smartmontools! ;-)

I'm afraid that I don't know what is wrong. One thing that is worth checking is the drive cable -- I have seen these "light turns on, drive doesn't respond" problems with faulty ATA cables.

About the only other thing I can think of is that it is somehow related to the problems that have been reported for the Promise pdc202xx-driver. But it doesn't sound that way.

Please let me know if you are able to learn anything more about this. I am sorry that I can't be more helpful.

Sincerely,
Bruce Allen