From: Jonas P. <jon...@xm...> - 2008-08-30 09:23:38
|
[ I wasn't a member on my first submit and it seems that moderation is somewhat slow so I resend this now ] Hi experts, I've tried searching every archive I can find, and my priminary conclusion is that my drive must have an odd fault, but I'd love to hear your confirmation. On my MacBookPro 4,1 (penryn) - which is less than a month old - I started to have strange random hangs after a couple of weeks. It often happens after about 5 minutes of idle time (either after resume or reboot), but also long after that when a burst of disk activity starts (say an apt-get install). My point is that I never see this problem just after boot, nor at the point of resuming from suspend-to-ram. Suspending to ram seems unrelated as it is just as likely to happen over the weekend after a fresh boot on friday night and no suspending. When I'm lucky, things revert to normal after a while, when I'm unlucky the disk is marked read-only and all I can do is reboot. Here is a snippet from this morning - this time it came back to life: [46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen [46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 dma 4096 in [46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [46874.898709] ata3.00: status: { DRDY } [46879.643962] ata3: port is slow to respond, please be patient (Status 0xd0) [46884.473195] ata3: device not ready (errno=-16), forcing hardreset [46884.473202] ata3: soft resetting link [46912.740010] ata3.00: qc timeout (cmd 0xec) [46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) [46912.740023] ata3.00: revalidation failed (errno=-5) [46912.740028] ata3: failed to recover some devices, retrying in 5 secs [46917.458070] ata3: soft resetting link [46917.636464] ata3.00: configured for UDMA/100 [46917.636482] ata3: EH complete [46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB) [46917.699257] sd 2:0:0:0: [sda] Write Protect is off [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00 [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA I've seen some references that this COULD be a problem with smartmontools pre 5.38 (though no explanation how), but still I am using 5.38-1ubuntu1~hardy1 and in fact, initially I didn't have smartd running, so I doubt that could be the problem. My drive is a: Device Model: FUJITSU MHY2250BH Serial Number: K43CT862GMK9 I have run self tests that seem OK: SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Conveyance offline Completed without error 00% 837 - # 2 Extended offline Completed without error 00% 827 - # 3 Short offline Completed without error 00% 825 - I have a very similar driver (just smaller) in a somewhat older MacBookPro running the same installation and it works perfectly. Tried both with the most recent standard kernels (2.6.24-*-generic) and also a self compiled 2.6.26 without finding any more clues. So I guess I will have to replace it anyway - unless you have any ideas for me. Thanks in advance / Jonas |
From: Jonas P. <jon...@xm...> - 2008-08-29 00:12:14
|
Hi experts, I've tried searching every archive I can find, and my priminary conclusion is that my drive must have an odd fault, but I'd love to hear your confirmation. On my MacBookPro 4,1 (penryn) - which is less than a month old - I started to have strange random hangs after a couple of weeks. It often happens after about 5 minutes of idle time (either after resume or reboot), but also long after that when a burst of disk activity starts (say an apt-get install). My point is that I never see this problem just after boot, nor at the point of resuming from suspend-to-ram. Suspending to ram seems unrelated as it is just as likely to happen over the weekend after a fresh boot on friday night and no suspending. When I'm lucky, things revert to normal after a while, when I'm unlucky the disk is marked read-only and all I can do is reboot. Here is a snippet from this morning - this time it came back to life: [46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen [46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 dma 4096 in [46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [46874.898709] ata3.00: status: { DRDY } [46879.643962] ata3: port is slow to respond, please be patient (Status 0xd0) [46884.473195] ata3: device not ready (errno=-16), forcing hardreset [46884.473202] ata3: soft resetting link [46912.740010] ata3.00: qc timeout (cmd 0xec) [46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) [46912.740023] ata3.00: revalidation failed (errno=-5) [46912.740028] ata3: failed to recover some devices, retrying in 5 secs [46917.458070] ata3: soft resetting link [46917.636464] ata3.00: configured for UDMA/100 [46917.636482] ata3: EH complete [46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB) [46917.699257] sd 2:0:0:0: [sda] Write Protect is off [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00 [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA I've seen some references that this COULD be a problem with smartmontools pre 5.38 (though no explanation how), but still I am using 5.38-1ubuntu1~hardy1 and in fact, initially I didn't have smartd running, so I doubt that could be the problem. My drive is a: Device Model: FUJITSU MHY2250BH Serial Number: K43CT862GMK9 I have run self tests that seem OK: SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Conveyance offline Completed without error 00% 837 - # 2 Extended offline Completed without error 00% 827 - # 3 Short offline Completed without error 00% 825 - I have a very similar driver (just smaller) in a somewhat older MacBookPro running the same installation and it works perfectly. Tried both with the most recent standard kernels (2.6.24-*-generic) and also a self compiled 2.6.26 without finding any more clues. So I guess I will have to replace it anyway - unless you have any ideas for me. Thanks in advance / Jonas |
From: Justin P. <jp...@lu...> - 2008-08-30 11:09:44
|
On Sat, 30 Aug 2008, Jonas Petersson wrote: > [ I wasn't a member on my first submit and it seems that moderation is > somewhat slow so I resend this now ] > > Device Model: FUJITSU MHY2250BH > Serial Number: K43CT862GMK9 > If the drive is capable of NCQ, I would try disable/enabling it (via kernel) parameter and see if that makes any difference (doubt it) but something to try. Also, do/not run smart tests during heavy I/O? Have you tried disabline smartmontools to see if the problem persists? Justin. |
From: Jonas P. <jon...@xm...> - 2008-08-30 20:31:23
|
Hi Justin and thanks for your prompt response, Justin Piszcz skrev: > On Sat, 30 Aug 2008, Jonas Petersson wrote: >> [ I wasn't a member on my first submit and it seems that moderation is >> somewhat slow so I resend this now ] >> >> Device Model: FUJITSU MHY2250BH >> Serial Number: K43CT862GMK9 > > If the drive is capable of NCQ, I would try disable/enabling it (via > kernel) parameter and see if that makes any difference (doubt it) but > something to try. That would be /sys/block/sda/device/queue_depth, right? I already checked that and it was "1" so I guess there is nothing to do. > Also, do/not run smart tests during heavy I/O? They are certainly OK without heavy I/O. Those that take some time to run where run during "normal" I/O - I kept using the system normally. The problem occurs during low I/O conditions as far as I can tell. For example during the night (slocate may run at some stage, but according to the logs the problems has occured at 4 different time during the night and even a fairly long slocate session can't possibly have covered more than 2 of them at worst). > Have you tried disabline smartmontools to see if the problem persists? Yes: Initially, it happened before I even installed smartmontools. It kept happening with while smartd was running for a while. It kept happening when I disabled smartd and just ran smartctl now and then. It also happened (one) about 5 minutes after rebooting when no smartmontools programs either smartd nor smartctl had efter run. In this particular case, the boot process involved automatic installation of the nvidia binary driver which may have caised a bit of heavy I/O for a while. Does this give you any further clues? Would a full dump of smartctl --all help? Best / Jonas |
From: Justin P. <jp...@lu...> - 2008-08-30 21:14:07
|
On Sat, 30 Aug 2008, Jonas Petersson wrote: > Hi Justin and thanks for your prompt response, > > That would be /sys/block/sda/device/queue_depth, right? I already checked > that and it was "1" so I guess there is nothing to do. Correct.. >> Also, do/not run smart tests during heavy I/O? > > They are certainly OK without heavy I/O. Those that take some time to run > where run during "normal" I/O - I kept using the system normally. The problem > occurs during low I/O conditions as far as I can tell. For example during the > night (slocate may run at some stage, but according to the logs the problems > has occured at 4 different time during the night and even a fairly long > slocate session can't possibly have covered more than 2 of them at worst). I do not have the same exact problem you have, but I have a similar problem: [189770.219773] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen [189770.219784] ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out [189770.219786] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [189770.219790] ata1.00: status: { DRDY } [189770.219795] ata1: hard resetting link [189770.524770] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [189770.543960] ata1.00: configured for UDMA/133 [189770.543977] ata1: EH complete I think its a chipset issue personally, I've spoken to various people about it on LKML, and other mailing lists basically there is no ultimate box, I've tried disabling noapic, apci=off etc, no luck. I just ignore it, it doesn't seem to affect anything (in my case) as of yet and via smart tests/etc I cannot/do not see any form of drive failure, all tests pass normally and the smart output looks good. > > >> Have you tried disabline smartmontools to see if the problem persists? > > Yes: Initially, it happened before I even installed smartmontools. It kept > happening with while smartd was running for a while. It kept happening when I > disabled smartd and just ran smartctl now and then. It also happened (one) > about 5 minutes after rebooting when no smartmontools programs either smartd > nor smartctl had efter run. In this particular case, the boot process > involved automatic installation of the nvidia binary driver which may have > caised a bit of heavy I/O for a while. > > Does this give you any further clues? Would a full dump of smartctl --all > help? smartctl -a would be useful (#1) An "lshw" run would also be useful, I am curious which chipset they are using for the SATA I/O on that hardware type. Justin. |
From: Jonas P. <jon...@xm...> - 2008-08-30 21:57:57
Attachments:
lshw
|
Justin Piszcz skrev: > On Sat, 30 Aug 2008, Jonas Petersson wrote: >> [...] > smartctl -a would be useful (#1) # smartctl -a /dev/sda smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Device Model: FUJITSU MHY2250BH Serial Number: K43CT862GMK9 Firmware Version: 0081000D User Capacity: 250 059 350 016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 3f Local Time is: Sat Aug 30 23:52:36 2008 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (1009) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 143) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x003d) SCT Status supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 100 100 046 Pre-fail Always - 0 2 Throughput_Performance 0x0005 100 100 030 Pre-fail Offline - 0 3 Spin_Up_Time 0x0003 100 100 025 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 117 5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 100 087 047 Pre-fail Always - 0 8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 0 9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1021 10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 109 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 65547 193 Load_Cycle_Count 0x0032 099 099 000 Old_age Always - 26215 194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 35 (Lifetime Min/Max 21/53) 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 0 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 253 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x000f 100 100 060 Pre-fail Always - 0 203 Run_Out_Cancel 0x0002 100 100 000 Old_age Always - 0 240 Head_Flying_Hours 0x003e 200 200 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Conveyance offline Completed without error 00% 837 - # 2 Extended offline Completed without error 00% 827 - # 3 Short offline Completed without error 00% 825 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. > An "lshw" run would also be useful, I am curious which chipset they are > using for the SATA I/O on that hardware type. Bit big, but I've attached it. As I mentioned, this is a pretty recent MacBookPro 4,1 with the new multitouch ("iphone" style) mousepad and the higher resolution screen. Was pretty cool the first two week - until I ran into this issue... :-( Best / Jonas |
From: Justin P. <jp...@lu...> - 2008-08-30 22:12:54
|
On Sat, 30 Aug 2008, Jonas Petersson wrote: > Justin Piszcz skrev: >> On Sat, 30 Aug 2008, Jonas Petersson wrote: >>> [...] >> smartctl -a would be useful (#1) > > # smartctl -a /dev/sda > smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen > Home page is http://smartmontools.sourceforge.net/ I have the same controller in my host as well, but it does not appear to matter whether it happens on the ICH8 controller or other controllers. I have noticed on Velociraptors I seem to get the same/similar error that you do as well, and I ran all the same tests as you, to no avail as to getting any closer to finding the root cause/problem. (.. more so than the regular old raptor150s) Besides the annoying messages in the kernel log/syslog/dmesg, does it affect your system stability in any way as of yet? I must add a very important note here though, you are using an ICH8 chipset and so am I, we both have same/similar problems-- however, I also have another machine setup VERY similarly (except different HDDs) for the RAID5 but the RAID1 is the same as one of my ICH8 boxes (dual raptor150s)-- and to date it has never? or rarely thrown the frozen error except when a disk actually failed (or when NCQ is enabled for a WD drive), (NCQ+Linux for WD) is broken. I have disks in a raid set (both raid1 and raid5) that get same/similar warnings as I mentioned above and so far it has not had any impact that I have noticed in relation to these specific errors. I think for now we just have to live with them, I am not sure what else to say here.. CC'ing linux-ide and linux-kernel with your original error from the start of this e-mail thread: Here is a snippet from this morning - this time it came back to life: [46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen [46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 dma 4096 in [46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [46874.898709] ata3.00: status: { DRDY } [46879.643962] ata3: port is slow to respond, please be patient (Status 0xd0) [46884.473195] ata3: device not ready (errno=-16), forcing hardreset [46884.473202] ata3: soft resetting link [46912.740010] ata3.00: qc timeout (cmd 0xec) [46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) [46912.740023] ata3.00: revalidation failed (errno=-5) [46912.740028] ata3: failed to recover some devices, retrying in 5 secs [46917.458070] ata3: soft resetting link [46917.636464] ata3.00: configured for UDMA/100 [46917.636482] ata3: EH complete [46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB) [46917.699257] sd 2:0:0:0: [sda] Write Protect is off [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00 [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Here is an example from my host (same/similar issue): Aug 23 20:00:32 p34 kernel: [189770.219773] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Aug 23 20:00:32 p34 kernel: [189770.219784] ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out Aug 23 20:00:32 p34 kernel: [189770.219786] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Aug 23 20:00:32 p34 kernel: [189770.219790] ata1.00: status: { DRDY } Aug 23 20:00:32 p34 kernel: [189770.219795] ata1: hard resetting link Aug 23 20:00:32 p34 kernel: [189770.524770] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Aug 23 20:00:32 p34 kernel: [189770.543960] ata1.00: configured for UDMA/133 Aug 23 20:00:32 p34 kernel: [189770.543977] ata1: EH complete Aug 23 20:00:32 p34 kernel: [189770.544810] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB) Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Write Protect is off Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Aug 23 20:00:32 p34 kernel: [189770.863810] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA What is the root cause of this? It still seems to be a mystery to most as far as I can tell, but the one thing in common is we are both using ICH8 chipsets, which, just may happen to be part of the problem? Justin. |
From: Jonas P. <jon...@xm...> - 2008-08-31 10:01:05
|
Hi again Justin, Justin Piszcz skrev: > On Sat, 30 Aug 2008, Jonas Petersson wrote: >> Justin Piszcz skrev: >>> On Sat, 30 Aug 2008, Jonas Petersson wrote: >>>> [...] >>> smartctl -a would be useful (#1) >> # smartctl -a /dev/sda >> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen >> Home page is http://smartmontools.sourceforge.net/ > > I have the same controller in my host as well, but it does not appear to > matter whether it happens on the ICH8 controller or other controllers. > > I have noticed on Velociraptors I seem to get the same/similar error that > you do as well, and I ran all the same tests as you, to no avail as to getting > any closer to finding the root cause/problem. > (.. more so than the regular old raptor150s) > > Besides the annoying messages in the kernel log/syslog/dmesg, does it > affect your system stability in any way as of yet? Very much so, yes. At best, all disk access will hang for a while and then resume after the reset has worked out - this often happens a couple of times per day now. At worst, the reset will not work and the disk is remounted read-only and I can sort of use the system a bit this way. It seems somewhat random how much still works: Up until today I could at least always use dmesg and tail various logs to try to hunt down what happened, but this morning dmesg could not be found and I got I/O errors when accessing anything in /var/log. Rebooting helped as usual. This fatal variant has happened about every second day lately. The first two weeks I had the system showed nothing at all like this: I have log files since July 26 and the first recorded (reset-able) glitch is from Aug 16. Obviously, any non-resetable problem would have been easy to spot. > I must add a very important note here though, you are using an ICH8 chipset > and so am I, we both have same/similar problems-- however, I also have > another machine setup VERY similarly (except different HDDs) for the RAID5 > but the RAID1 is the same as one of my ICH8 boxes (dual raptor150s)-- > and to date it has never? or rarely thrown the frozen error except when a disk > actually failed (or when NCQ is enabled for a WD drive), (NCQ+Linux for WD) is > broken. Yes, I would not point fingers to the ICH8 chipset either: The other MacBookPro I have experimented with now is a 2,2 (ATI based) and has ICH7, but I'm 99.9% sure my previous MacBookPro 3,1 (nvidia based) was ICH8 and it worked flawlessly (I saw no reason to swap for the 4,1 version, but it was stolen from me in June). As far as I know the significant differences with my current MBP are just: higher screen resolution, multitouch ("iphone") touchpad and more memory. Alas, I didn't keep a lshw dump. > [...] > CC'ing linux-ide and linux-kernel with your original error from the start > of this e-mail thread: > > Here is a snippet from this morning - this time it came back to life: > > [46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 > frozen > [46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 > dma 4096 in > [46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask > 0x4 (timeout) > [46874.898709] ata3.00: status: { DRDY } > [46879.643962] ata3: port is slow to respond, please be patient (Status > 0xd0) > [46884.473195] ata3: device not ready (errno=-16), forcing hardreset > [46884.473202] ata3: soft resetting link > [46912.740010] ata3.00: qc timeout (cmd 0xec) > [46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) > [46912.740023] ata3.00: revalidation failed (errno=-5) > [46912.740028] ata3: failed to recover some devices, retrying in 5 secs > [46917.458070] ata3: soft resetting link > [46917.636464] ata3.00: configured for UDMA/100 > [46917.636482] ata3: EH complete > [46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors > (250059 MB) > [46917.699257] sd 2:0:0:0: [sda] Write Protect is off > [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00 > [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: > enabled, doesn't support DPO or FUA I'll just clarify that the errno after "revalidation failed" is not always -5. When it ends up fatal I've also seen -3 and possibly something else too. I would have taken a screen shot this morning if only dmesg had worked. :-( > What is the root cause of this? It still seems to be a mystery to most as far > as I can tell, but the one thing in common is we are both using ICH8 chipsets, > which, just may happen to be part of the problem? For the record: My current theory is that it is some kind of hardware problem - either in the disk or on the motherboard so I have persuaded my local AppleStore to swap the harddisk on Monday and then they will run their full hardware stress test (4+ hours according to him). The stress test was apparently suggested from the central repair people (who have no idea I run Linux on it - the local techie knows, but has no problem with it as long as I keep a small OSX partition) so I guess this sort of hints that they are aware of hardware issues. (Note: I've had the same techie replace a broken motherboard in the past when the Linux messages where at least as clear as the OSX ones - in that case drives would in the end only show up in the boot menue when the system had cooled down for at least 20 minutes. To be on the safe side, I've upped the minimum fan speed by 50% to ensure all sensors give me happy readings all the time - luckily the 4,1 fans are very silent compared to the 2,2) I hope to have everything back in shape on Wednesday and I'll let you know how it fares. BTW: For a while I displayed the hddtemp sensor all the time along with coretemp etc, but I now understand that this is also SMART based so I've turned it off in the past weeks experimentation. Again, it seemed to work flawlessly for months on my previous (stolen) MBP 3,1. Best / Jonas |
From: Bruce A. <ba...@gr...> - 2008-09-02 06:49:24
|
Jonas, Justin, <speculation> One possible theory is that the drive is being put into some type of sleep/idle state and not responding quickly enough to commands. I don't know if this is linked to internal drive settings (that could be tweaked with hdparm, for example) or to something external in the kernel or utilities that is sending the drive into a lower-power idle or sleep state. </speculation> Cheers, Bruce On Sat, 30 Aug 2008, Jonas Petersson wrote: > Justin Piszcz skrev: >> On Sat, 30 Aug 2008, Jonas Petersson wrote: >> > [...] >> smartctl -a would be useful (#1) > > # smartctl -a /dev/sda > smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen > Home page is http://smartmontools.sourceforge.net/ > > === START OF INFORMATION SECTION === > Device Model: FUJITSU MHY2250BH > Serial Number: K43CT862GMK9 > Firmware Version: 0081000D > User Capacity: 250 059 350 016 bytes > Device is: Not in smartctl database [for details use: -P showall] > ATA Version is: 8 > ATA Standard is: ATA-8-ACS revision 3f > Local Time is: Sat Aug 30 23:52:36 2008 CEST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > === START OF READ SMART DATA SECTION === > SMART overall-health self-assessment test result: PASSED > > General SMART Values: > Offline data collection status: (0x00) Offline data collection activity > was never started. > Auto Offline Data Collection: > Disabled. > Self-test execution status: ( 0) The previous self-test routine > completed > without error or no self-test has > ever > been run. > Total time to complete Offline > data collection: (1009) seconds. > Offline data collection > capabilities: (0x7b) SMART execute Offline immediate. > Auto Offline data collection on/off > support. > Suspend Offline collection upon new > command. > Offline surface scan supported. > Self-test supported. > Conveyance Self-test supported. > Selective Self-test supported. > SMART capabilities: (0x0003) Saves SMART data before entering > power-saving mode. > Supports SMART auto save timer. > Error logging capability: (0x01) Error logging supported. > General Purpose Logging supported. > Short self-test routine > recommended polling time: ( 2) minutes. > Extended self-test routine > recommended polling time: ( 143) minutes. > Conveyance self-test routine > recommended polling time: ( 2) minutes. > SCT capabilities: (0x003d) SCT Status supported. > SCT Feature Control supported. > SCT Data Table supported. > > SMART Attributes Data Structure revision number: 16 > Vendor Specific SMART Attributes with Thresholds: > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED > WHEN_FAILED RAW_VALUE > 1 Raw_Read_Error_Rate 0x000f 100 100 046 Pre-fail Always > - 0 > 2 Throughput_Performance 0x0005 100 100 030 Pre-fail Offline > - 0 > 3 Spin_Up_Time 0x0003 100 100 025 Pre-fail Always > - 0 > 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always > - 117 > 5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always > - 0 > 7 Seek_Error_Rate 0x000f 100 087 047 Pre-fail Always > - 0 > 8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline > - 0 > 9 Power_On_Hours 0x0032 098 098 000 Old_age Always > - 1021 > 10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always > - 0 > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always > - 109 > 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always > - 65547 > 193 Load_Cycle_Count 0x0032 099 099 000 Old_age Always > - 26215 > 194 Temperature_Celsius 0x0022 100 100 000 Old_age Always > - 35 (Lifetime Min/Max 21/53) > 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always > - 0 > 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always > - 0 > 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always > - 0 > 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline > - 0 > 199 UDMA_CRC_Error_Count 0x003e 200 253 000 Old_age Always > - 0 > 200 Multi_Zone_Error_Rate 0x000f 100 100 060 Pre-fail Always > - 0 > 203 Run_Out_Cancel 0x0002 100 100 000 Old_age Always > - 0 > 240 Head_Flying_Hours 0x003e 200 200 000 Old_age Always > - 0 > > SMART Error Log Version: 1 > No Errors Logged > > SMART Self-test log structure revision number 1 > Num Test_Description Status Remaining LifeTime(hours) > LBA_of_first_error > # 1 Conveyance offline Completed without error 00% 837 - > # 2 Extended offline Completed without error 00% 827 - > # 3 Short offline Completed without error 00% 825 > - > > SMART Selective self-test log data structure revision number 1 > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS > 1 0 0 Not_testing > 2 0 0 Not_testing > 3 0 0 Not_testing > 4 0 0 Not_testing > 5 0 0 Not_testing > Selective self-test flags (0x0): > After scanning selected spans, do NOT read-scan remainder of disk. > If Selective self-test is pending on power-up, resume after 0 minute delay. > >> An "lshw" run would also be useful, I am curious which chipset they are >> using for the SATA I/O on that hardware type. > > > Bit big, but I've attached it. As I mentioned, this is a pretty recent > MacBookPro 4,1 with the new multitouch ("iphone" style) mousepad and the > higher resolution screen. Was pretty cool the first two week - until I ran > into this issue... :-( > > Best / Jonas > > |
From: Jonas P. <jon...@xm...> - 2008-09-02 07:04:50
|
Hi Bruce, Bruce Allen wrote: > <speculation> > One possible theory is that the drive is being put into some type of > sleep/idle state and not responding quickly enough to commands. I don't > know if this is linked to internal drive settings (that could be tweaked > with hdparm, for example) or to something external in the kernel or > utilities that is sending the drive into a lower-power idle or sleep > state. > </speculation> Yes, I've had that theory too, so I tried fiddling with hdparam -B 255 to turn it off, but the problem did not go away. (I also upped the minimum fan speed by 50% in case it was heat related) As it is somewhat random in occuring it have a vague feeling that it show up somewhat less often, but I wouldn't put any money on it. Anyway: I did a complete reinstall on a brand new drive last night and so far I have not seen any hint of the problem (which proves nothing really). I'll let you know in a few days if it is stable. (Oddly, ndiswrapper only worked once after reinstallation, but that can hardly be related - everything else seems fine so far.) Best / Jonas |
From: Owen M. <om...@ne...> - 2008-09-02 13:39:25
|
This looks like a timeout during a read command: ata3.00: cmd c8/00:08:90:3c:59 Read dma of 8 blocks from 0x903c59 Next time it happens, see if it is the same LBA. Since the drive came back after the bus reset makes me think it was probably in error recovery for an extended amount of time. Sorry, but I am new to using smartmontools for decoding SMART attributes. Your previous email showed: Device is: Not in smartctl database [for details use: -P showall] Does that imply the tool will not know the exact meaning of all the attributes? I am not familiar with Fujitsu's implementation. >From the data you sent about the attributes before, it looks like the pending and reallocated sector counts are zero, so the block must have not failed recovery. Can you try to dump the sector using hdparm-8.9 to see if it reproduces? hdparm --read-sector 9452633 /dev/sda What is the timeout set to? cat /sys/block/sda/device/timeout Maybe try to increase that. You want to be sure that it is not a drive issue by verifying the block is readable and the raw values from the pending, uncorrectable or reallocated sector attributes don't change. I was seeing the exact same thing when I was trying to run the SMART selftest in captive mode (not using smartmon). When I increased the timeout it was able to complete. Aug 27 02:36:42 spu0201 user.err kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen Aug 27 02:36:42 spu0201 user.err kernel: ata1.00: cmd b0/d4:00:83:4f:c2/00:00:00:00:00/00 tag 0 Aug 27 02:36:42 spu0201 user.warn kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Justin's error is from a write: ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out That typically only happens in a high vibration environment. Since the write is open loop, typically the only thing that can prevent it from completing is position error. It might be a PHY issue, but without a bus analyzer, it is hard to tell. The new Seagate drives have attribute 199, SATA R-err count, which might help to identify the issue, if you think it is related to the chipset/PHY. -Owen -----Original Message----- From: sma...@li... [mailto:sma...@li...] On Behalf Of Justin Piszcz Sent: Saturday, August 30, 2008 6:13 PM To: Jonas Petersson Cc: lin...@vg...; sma...@li...; lin...@vg... Subject: Re: [smartmontools-support] exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen On Sat, 30 Aug 2008, Jonas Petersson wrote: > Justin Piszcz skrev: >> On Sat, 30 Aug 2008, Jonas Petersson wrote: >>> [...] >> smartctl -a would be useful (#1) > > # smartctl -a /dev/sda > smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen > Home page is http://smartmontools.sourceforge.net/ I have the same controller in my host as well, but it does not appear to matter whether it happens on the ICH8 controller or other controllers. I have noticed on Velociraptors I seem to get the same/similar error that you do as well, and I ran all the same tests as you, to no avail as to getting any closer to finding the root cause/problem. (.. more so than the regular old raptor150s) Besides the annoying messages in the kernel log/syslog/dmesg, does it affect your system stability in any way as of yet? I must add a very important note here though, you are using an ICH8 chipset and so am I, we both have same/similar problems-- however, I also have another machine setup VERY similarly (except different HDDs) for the RAID5 but the RAID1 is the same as one of my ICH8 boxes (dual raptor150s)-- and to date it has never? or rarely thrown the frozen error except when a disk actually failed (or when NCQ is enabled for a WD drive), (NCQ+Linux for WD) is broken. I have disks in a raid set (both raid1 and raid5) that get same/similar warnings as I mentioned above and so far it has not had any impact that I have noticed in relation to these specific errors. I think for now we just have to live with them, I am not sure what else to say here.. CC'ing linux-ide and linux-kernel with your original error from the start of this e-mail thread: Here is a snippet from this morning - this time it came back to life: [46874.898690] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen [46874.898703] ata3.00: cmd c8/00:08:90:3c:59/00:00:00:00:00/ef tag 0 dma 4096 in [46874.898705] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) [46874.898709] ata3.00: status: { DRDY } [46879.643962] ata3: port is slow to respond, please be patient (Status 0xd0) [46884.473195] ata3: device not ready (errno=-16), forcing hardreset [46884.473202] ata3: soft resetting link [46912.740010] ata3.00: qc timeout (cmd 0xec) [46912.740020] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) [46912.740023] ata3.00: revalidation failed (errno=-5) [46912.740028] ata3: failed to recover some devices, retrying in 5 secs [46917.458070] ata3: soft resetting link [46917.636464] ata3.00: configured for UDMA/100 [46917.636482] ata3: EH complete [46917.699224] sd 2:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB) [46917.699257] sd 2:0:0:0: [sda] Write Protect is off [46917.699263] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00 [46917.699300] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Here is an example from my host (same/similar issue): Aug 23 20:00:32 p34 kernel: [189770.219773] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Aug 23 20:00:32 p34 kernel: [189770.219784] ata1.00: cmd 35/00:40:9a:d9:7a/00:00:12:00:00/e0 tag 0 dma 32768 out Aug 23 20:00:32 p34 kernel: [189770.219786] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) Aug 23 20:00:32 p34 kernel: [189770.219790] ata1.00: status: { DRDY } Aug 23 20:00:32 p34 kernel: [189770.219795] ata1: hard resetting link Aug 23 20:00:32 p34 kernel: [189770.524770] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Aug 23 20:00:32 p34 kernel: [189770.543960] ata1.00: configured for UDMA/133 Aug 23 20:00:32 p34 kernel: [189770.543977] ata1: EH complete Aug 23 20:00:32 p34 kernel: [189770.544810] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB) Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Write Protect is off Aug 23 20:00:32 p34 kernel: [189770.551810] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Aug 23 20:00:32 p34 kernel: [189770.863810] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA What is the root cause of this? It still seems to be a mystery to most as far as I can tell, but the one thing in common is we are both using ICH8 chipsets, which, just may happen to be part of the problem? Justin. ------------------------------------------------------------------------ - This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ _______________________________________________ Smartmontools-support mailing list Sma...@li... https://lists.sourceforge.net/lists/listinfo/smartmontools-support |
From: Jonas P. <jon...@xm...> - 2008-09-02 15:49:33
|
Hi Owen, Owen Martin wrote: > This looks like a timeout during a read command: > > ata3.00: cmd c8/00:08:90:3c:59 > > Read dma of 8 blocks from 0x903c59 > > Next time it happens, see if it is the same LBA. Since the drive came > back after the bus reset makes me think it was probably in error > recovery for an extended amount of time. Sounds like a good idea. However, I had the drive swapped yesterday and have now reinstalled on a (seemingly) identical one which so far seems to be free from these messages. Hence, I keep my fingers crossed that this was indeed a hw error. As it was on warranty I was not allowed to keep the bad drive for further experiments. > Sorry, but I am new to using smartmontools for decoding SMART > attributes. Your previous email showed: > > Device is: Not in smartctl database [for details use: -P showall] > > Does that imply the tool will not know the exact meaning of all the > attributes? I am not familiar with Fujitsu's implementation. I believe you are correct. >>From the data you sent about the attributes before, it looks like the > pending and reallocated sector counts are zero, so the block must have > not failed recovery. Can you try to dump the sector using hdparm-8.9 to > see if it reproduces? > > hdparm --read-sector 9452633 /dev/sda Would if I could... The messages I sent were indeed only from cases where the driver succeeded write to the disk in the end (extracts from /var/log/messages). In the failure cases I did not make a hard copy. > What is the timeout set to? > > cat /sys/block/sda/device/timeout 30 > Maybe try to increase that. You want to be sure that it is not a drive > issue by verifying the block is readable and the raw values from the > pending, uncorrectable or reallocated sector attributes don't change. Will do if I ever see it again. Best / Jonas |
From: Jonas P. <jon...@xm...> - 2008-09-07 20:48:58
|
For the record: Jonas Petersson wrote: > Owen Martin wrote: > > This looks like a timeout during a read command: > > > > ata3.00: cmd c8/00:08:90:3c:59 > > > > Read dma of 8 blocks from 0x903c59 > > > > Next time it happens, see if it is the same LBA. Since the drive came > > back after the bus reset makes me think it was probably in error > > recovery for an extended amount of time. > > Sounds like a good idea. However, I had the drive swapped yesterday and > have now reinstalled on a (seemingly) identical one which so far seems > to be free from these messages. Hence, I keep my fingers crossed that > this was indeed a hw error. > [...] I've now stressed the new disk for almost a week and seen no indication at all to the previous error. Everything else is the same as before - I even installed from the very same DVD. My conclusion is therefore that I really had a disk that was broken in a way that normal tests will not detect. Hence, my tip to anyone having a similar experience: Don't blame the driver, nor the motherboard/chipset - just replace the drive. It would of course be even nicer if the error message could spell this out somewhat clearer too, but I guess the "I/O error" in the middle is a fair hint in retrospect. Best / Jonas |