From: Tim S. <ti...@bu...> - 2010-02-05 14:08:03
Hi,

I have a couple of Debian Lenny ("2.6.26-2-amd64") boxes on rented hardware, each with a couple of SATA drives:

One has 2x 1TB Seagate Barracuda 7200.11 model ST31000333AS firmware SD35

The other has 2x 2TB WD Caviar Green model WDC WD20EADS-00R6B0 firmware 01.00A01

... the machines are currently set up to run smartd, and also log HDD temp via munin. ata_piix is the driver in use.

The WD machine did this sort of thing a couple of times, which got my attention.

[119061.717865] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[119061.717865] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[119061.717865] ata1.00: status: { DRDY }
[119071.117368] ata1: link is slow to respond, please be patient (ready=0)
[119079.800059] ata1: device not ready (errno=-16), forcing hardreset
[119079.800091] ata1: soft resetting link
[119087.950128] ata1: link is slow to respond, please be patient (ready=0)
[119097.895803] ata1: SRST failed (errno=-16)
[119097.895881] ata1: soft resetting link
[119107.170874] ata1: link is slow to respond, please be patient (ready=0)
[119114.902193] ata1: SRST failed (errno=-16)
[119114.902219] ata1: soft resetting link
[119123.749111] ata1: link is slow to respond, please be patient (ready=0)
[119176.735727] ata1: SRST failed (errno=-16)
[119176.735761] ata1: soft resetting link
[119185.513569] ata1: SRST failed (errno=-16)
[119185.513593] ata1: reset failed, giving up
[119185.513622] ata1.00: disabled
[119185.513643] ata1.01: disabled
[119185.513680] end_request: I/O error, dev sda, sector 39069887
[119185.516684] ata1: EH complete
[119186.013456] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[119186.013456] end_request: I/O error, dev sda, sector 36525807

If I run a continuous "dd of=file ; sync ; rm file ; sync" to a file on the RAID1 mirror of both drives, at the same time as running a continuous "smartctl -s on -a /dev/sdX > /dev/null || echo failed", then:

1. The smartctl command fails about once in 20 times, and I get a lot of this happening:

[93058.989603] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93058.989645] ata1.01: cmd 35/00:00:a4:f2:51/00:04:03:00:00/f0 tag 0 dma 524288 out
[93058.993582] ata1.01: status: { DRDY }
[93058.993582] ata1: soft resetting link
[93090.804353] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93090.804395] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[93090.804427] ata1.01: status: { DRDY }
[93090.804458] ata1: soft resetting link
[93252.493902] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[93252.493913] ata1.01: cmd c8/00:80:4c:d0:83/00:00:00:00:00/fa tag 0 dma 65536 in
[93252.493913] ata1.01: status: { DRDY }
[93252.493913] ata1: soft resetting link
[96265.917847] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96265.917889] ata1.01: cmd c8/00:80:4c:2c:c1/00:00:00:00:00/fa tag 0 dma 65536 in
[96265.921800] ata1.01: status: { DRDY }
[96265.921800] ata1: soft resetting link
[96405.491834] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[96405.491834] ata1.01: cmd 25/00:00:cc:a6:c3/00:04:0a:00:00/f0 tag 0 dma 524288 in
[96405.491834] ata1.01: status: { DRDY }
[96413.900149] ata1: link is slow to respond, please be patient (ready=0)
[99772.901861] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[99772.901861] ata1.01: cmd ca/00:08:cc:d3:54/00:00:00:00:00/f3 tag 0 dma 4096 out
[99772.901861] ata1.01: status: { DRDY }
[99783.604235] ata1: link is slow to respond, please be patient (ready=0)
[100012.860158] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100012.860201] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[100012.860247] ata1.01: status: { DRDY }
[100012.860281] ata1: soft resetting link
[100256.314912] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[100256.314950] ata1.01: cmd c8/00:80:cc:12:13/00:00:00:00:00/fb tag 0 dma 65536 in
[100256.314997] ata1.01: status: { DRDY }
[100256.315025] ata1: soft resetting link
[101528.503318] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[101528.503318] ata1.01: cmd c8/00:00:4c:c4:2c/00:00:00:00:00/fb tag 0 dma 131072 in
[101528.503318] ata1.01: status: { DRDY }
[101535.883662] ata1: link is slow to respond, please be patient (ready=0)
[107747.382563] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107747.382605] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107747.386545] ata1.01: status: { DRDY }
[107747.386545] ata1: soft resetting link
[107918.831736] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[107918.831736] ata1.01: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
[107918.831736] ata1.01: status: { DRDY }
[107918.831736] ata1: soft resetting link

Sometimes the "resetting link" happens a few times, and if it happens enough times, then ata_piix gives up and disables BOTH drives (like the first time), which is a bit annoying - this reset-fails behaviour normally seems to happen when the drives are not doing much (i.e. in normal operation rather than under test).

If I disable SMART data collection (smartd and munin), then the errors seem to stop - which I can do obviously, but would prefer not to.

smartctl -x reports the following interesting-looking stuff on the device which I've been stressing with smartctl:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
...
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x8000  4        79322  Vendor specific

and this on the one where I haven't:

0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x8000  4         6779  Vendor specific

...
so I would suspect that this is a bug in the WD drives, except that the same thing seems to occasionally happen on the machine with the Seagate drives:

[1718254.879156] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1718254.879211] ata1.00: cmd c8/00:08:3c:f1:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1718254.879213]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1718254.879316] ata1.00: status: { DRDY }
[1718262.237404] ata1: link is slow to respond, please be patient (ready=0)
[1718270.057698] ata1: device not ready (errno=-16), forcing hardreset
[1718270.057732] ata1: soft resetting link
[1718277.841779] ata1: link is slow to respond, please be patient (ready=0)
[1718281.134473] ata1.00: configured for UDMA/133
[1718281.192815] ata1.01: configured for UDMA/133
[1718281.192815] ata1: EH complete
[1729049.865692] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1729049.865692] ata1.00: cmd c8/00:08:dc:b3:bf/00:00:00:00:00/e9 tag 0 dma 4096 in
[1729049.865692]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[1729049.865692] ata1.00: status: { DRDY }
[1729059.627313] ata1: link is slow to respond, please be patient (ready=0)
[1729068.499782] ata1: device not ready (errno=-16), forcing hardreset
[1729068.499823] ata1: soft resetting link
[1729078.434813] ata1: link is slow to respond, please be patient (ready=0)
[1729088.807850] ata1: SRST failed (errno=-16)
[1729088.807881] ata1: soft resetting link
[1729089.582856] ata1.00: configured for UDMA/133

with this on the stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           10  Device-to-host register FISes sent due to a COMRESET

and this on the non-stressed drive:

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET

I'd be happy to put a newer kernel on one or both machines to see if that'd have any effect.
I also tried doing "hdparm -I" instead of "smartctl -a" for a few hours but that didn't elicit any "frozen" messages (although I should probably run it for a bit longer to have more confidence in that statement).

So, err, I suppose that this could be a bug in:

. smartctl
. both HD firmwares
. ata_piix (certainly disabling both drives seems a bit drastic, but I don't know if this is a function of the hardware)
. the ICH7 hardware

Unfortunately, as I don't own the hardware, I'm not in a position to get a different SATA controller in the boxes to eliminate the last two.

Any ideas welcome....

Cheers,

Tim.
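A note on the "|| echo failed" test in the message above: it treats any non-zero smartctl exit status as a failure. smartctl's exit status is a bit mask, and per its man page only bits 0 to 2 report that the command itself failed; higher bits report drive health conditions. The following is a minimal sketch of a polling loop that counts only those three bits. The function name `count_command_failures` and the commented-out invocation are illustrative, not part of the original test.

```python
import subprocess

# Bits 0-2 of smartctl's exit status: command line did not parse (bit 0),
# device open failed (bit 1), a SMART or ATA command to the disk failed (bit 2).
# Higher bits describe drive health, not whether the command itself worked.
COMMAND_FAILURE_MASK = 0x07


def count_command_failures(cmd, attempts):
    """Run cmd `attempts` times; count runs whose exit status has bits 0-2 set."""
    failures = 0
    for _ in range(attempts):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode & COMMAND_FAILURE_MASK:
            failures += 1
    return failures


# Hypothetical usage (needs root and a real device in place of /dev/sdX):
# count_command_failures(["smartctl", "-s", "on", "-a", "/dev/sdX"], 20)
```

Counting this way should keep a drive that merely has old entries in its error log from being reported as a failed poll.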
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Justin P. <jp...@lu...> - 2010-02-05 14:17:13
On Fri, 5 Feb 2010, Tim Small wrote:

> Hi,
>
> I have a couple of Debian Lenny ("2.6.26-2-amd64") boxes on rented
> hardware, each has a couple of SATA drives:
>
> One has 2x 1TB Seagate Barracuda 7200.11 model ST31000333AS firmware SD35
>
> The other has 2x 2TB WD Caviar Green model WDC WD20EADS-00R6B0 firmware
> 01.00A01
>
> ... the machines are currently set up to run smartd, and also log HDD
> temp via munin. ata_piix is the driver in use.
>
> The WD machine did this sort of thing a couple of times, which got my
> attention.
>

I have seen people report similar problems with the following drives:

1 - Velociraptors (me/others) (don't work at all in raid correctly)
http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/

2 - Green Drives (search this list, there are similar problems) in Linux.

I have Caviar Black and WD RE3, they work OK in Linux.

--

The WD Velociraptors do not work (at least in RAID).
The Green, again, search the list I recall seeing people have problems.

--
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tim S. <ti...@bu...> - 2010-02-05 14:32:05
Justin Piszcz wrote:
> I have seen people report similar problems with the following drives:
>
> 1 - Velociraptors (me/others) (don't work at all in raid correctly)
> http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/
>
> 2 - Green Drives (search this list, there are similar problems) in Linux.
>
> I have Caviar Black and WD RE3, they work OK in Linux.

OK, but:

1. Unlike the link you sent, there's nothing suspicious in any of the SMART attributes on any of the four drives - no bad sectors or other errors i.e. the following raw values are all zero on the WD drive which I've been stressing:

Raw_Read_Error_Rate
Reallocated_Sector_Ct
Seek_Error_Rate
Spin_Retry_Count
Calibration_Retry_Count
Reallocated_Event_Count
Current_Pending_Sector
Offline_Uncorrectable
UDMA_CRC_Error_Count
Multi_Zone_Error_Rate

... as well as empty SMART errors logs.

2. A few failures were seen with the Seagate drives as well (see last bits of the email), similarly with no apparent bad SMART attributes.

Thanks,

Tim.
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Justin P. <jp...@lu...> - 2010-02-05 14:48:24
On Fri, 5 Feb 2010, Tim Small wrote:

> Justin Piszcz wrote:
>> I have seen people report similar problems with the following drives:
>>
>> 1 - Velociraptors (me/others) (don't work at all in raid correctly)
>> http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/
>> 2 - Green Drives (search this list, there are similar problems) in Linux.
>>
>> I have Caviar Black and WD RE3, they work OK in Linux.
>
> OK, but:
>
> 1. Unlike the link you sent, there's nothing suspicious in any of the SMART
> attributes on any of the four drives - no bad sectors or other errors i.e.
> the following raw values are all zero on the WD drive which I've been
> stressing:
>
> Raw_Read_Error_Rate
> Reallocated_Sector_Ct
> Seek_Error_Rate
> Spin_Retry_Count
> Calibration_Retry_Count
> Reallocated_Event_Count
> Current_Pending_Sector
> Offline_Uncorrectable
> UDMA_CRC_Error_Count
> Multi_Zone_Error_Rate
>
> ... as well as empty SMART errors logs.

Hi,

They seem to have this problem with and without errors, but you should run the following (from a boot CD for the last one, or from a recovery image) if you have console access:

1. smartctl -t short
2. smartctl -t long
3. smartctl -t offline # and don't touch the host/machine for the amount
                       # of time it recommends
4. then show smartctl -a output

The obvious things are:

1. Try/ask to get the cables replaced/check connectors.
2. Check to make sure the PSU is ok (w/ lm sensors etc)

--

When these errors occur and/or when you reboot, do you ever notice any corruption or files in /lost+found?

--

Does it happen if you leave the drives alone (do not poll them with smart)?

--

Some other misc/info that seems like it might be useful:

http://www.newegg.com/Product/ProductReview.aspx?Item=22-136-351&SortField=0&SummaryType=0&Pagesize=10&SelectedRating=-1&PurchaseMark=&VideoOnlyMark=False&VendorMark=&Page=1&Keywords=linux

Cons: WD changed the firmware Oct 2009 to disable SCT ERC (Error Recovery Control). These drives are desktop drives that have a 2 minute ERC setting. Most hardware RAID controllers require a maximum of 7 seconds ERC or your drives will be kicked out of the RAID array if it takes too long to recover a sector. Note the RAID controller can recover the troubled sector itself from the parity disk so it doesn't need the drives to try very hard to recover. This timing is the main difference between RAID drives and desktop drives. Prior to Oct 2009 you could use a WD (leaked) utility to enable the WD equivalent of ERC (called TLER - Time Limited Error Recovery) that would then set the recovery timer to 7 seconds. On the newer drives, you should only use them as desktop drives.

Other Thoughts: In addition, the green feature of the drive parks the head very often (every 8 seconds I think). If you use the drive as a Linux OS drive, chances are the drive head will be parked/unparked so often that it will exceed the rated 300,000 load cycles in less than a year. There is another WD (leaked) utility that allows you to set the park timer from 8 seconds to a maximum of 5 minutes. It uses a little more power but prolongs the drive life span. If you use the drive mainly as storage, then there should be nothing to worry about.

Are you using them in raid or as a single disk?

http://doug.warner.fm/d/blog/2009/11/Western-Digital-15TB-Green-Drives-Not-your-Linux-Software-RAID

I'm not sure what to do with these WD drives; while they seem to work fine independently, they don't perform correctly at all when put into a RAID array. I'm beginning to get afraid that as the hard drives get larger and larger the complexity of the firmware is growing too quickly for drive manufacturers to keep them performing reliably.
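The load-cycle claim in the quoted review can be checked with simple arithmetic. This is a rough sketch that takes the review's own figures at face value (a 300,000-cycle rating and an 8-second park timer, the latter hedged with "I think" in the review) and assumes one park/unpark per interval around the clock; the function name `days_to_reach_rating` is illustrative, and real workloads will park far less regularly.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_reach_rating(rated_cycles, seconds_per_cycle):
    """Days until rated_cycles load cycles accumulate at one cycle per interval."""
    return rated_cycles * seconds_per_cycle / SECONDS_PER_DAY


# Worst case: a park/unpark every 8 seconds, around the clock.
print(round(days_to_reach_rating(300_000, 8), 1))    # 27.8
# Even one cycle every 100 seconds on average falls short of a year.
print(round(days_to_reach_rating(300_000, 100), 1))  # 347.2
```

Under those assumptions, "less than a year" holds for any average interval shorter than about 105 seconds between load cycles.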
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Mark L. <ke...@te...> - 2010-02-05 22:21:24
Tim Small wrote:
> Justin Piszcz wrote:
>> I have seen people report similar problems with the following drives:
>>
>> 1 - Velociraptors (me/others) (don't work at all in raid correctly)
>> http://forums.storagereview.com/index.php/topic/27303-velociraptor-premature-failure-rate-bad-drives-premature-to-market/
>>
>> 2 - Green Drives (search this list, there are similar problems) in Linux.
>>
>> I have Caviar Black and WD RE3, they work OK in Linux.
>
> OK, but:
>
> 1. Unlike the link you sent, there's nothing suspicious in any of the
> SMART attributes on any of the four drives - no bad sectors or other
> errors i.e. the following raw values are all zero on the WD drive which
> I've been stressing:
>
> Raw_Read_Error_Rate
> Reallocated_Sector_Ct
> Seek_Error_Rate
> Spin_Retry_Count
> Calibration_Retry_Count
> Reallocated_Event_Count
> Current_Pending_Sector
> Offline_Uncorrectable
> UDMA_CRC_Error_Count
> Multi_Zone_Error_Rate
>
> ... as well as empty SMART errors logs.
>
>
> 2. A few failures were seen with the Seagate drives as well (see last
> bits of the email), similarly with no apparent bad SMART attributes.
..

I have observed (and reported) the same issue in the past, on Hitachi and Seagate drives.

The only constants seem to be libata and ICH7/8.
We must have a bug somewhere in there.

-ml
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tejun H. <tj...@ke...> - 2010-02-06 08:08:14
Hello,

On 02/06/2010 06:47 AM, Mark Lord wrote:
>> 2. A few failures were seen with the Seagate drives as well (see last
>> bits of the email), similarly with no apparent bad SMART attributes.
> ..
>
> I have observed (and reported) the same issue in the past,
> on Hitachi and Seagate drives.
>
> The only constants seem to be libata and ICH7/8.
> We must have a bug somewhere in there.

In piix mode or ahci mode? If in piix mode, ich7 and 8 would behave quite differently. ICH8 has SIDPR so it can hardreset while 7 can't. ICH SIDPR access had a hardware problem where write to SControl to clear DET is sometimes ignored which led to occasional hardreset failure which got fixed recently. The reason why ich's are involved in those incidents could just be that they are extremely popular.

Things to try after such a complete drive shutdown are...

* Disconnect the drive from the host but do not remove power. Reconnect the drive to a different port and/or controller, does the drive work there?

* Power-cycle the drive (and issue manual rescan if necessary). Does the drive get recognized again?

* Disconnect the drive and connect a different drive to the port. Does the port work?

* Soft reset the machine. Can BIOS recognize the drive?

In many cases I've seen, it's usually that the drive's firmware is completely hung and only power cycling the drive brought it back. But then again, there have been some number of cases which didn't get diagnosed properly, so it's definitely possible that we're doing something wrong in the driver.

Anyways, if it happens again, please try the above and try to find out whether the controller or the drive is hung. Also, please keep in mind that timeouts on 0xEA (flush) are very often indicative of power related issues. FLUSH spikes power consumption and surprisingly many PSUs fail to sustain proper voltage over that, so powering up a separate PSU, connecting only the hard drive to it and seeing what happens is often interesting too.

Thanks.

--
tejun
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tim S. <ti...@bu...> - 2010-02-06 15:26:22
Tejun Heo wrote:
>> The only constants seem to be libata and ICH7/8.
>> We must have a bug somewhere in there.
>>
>
> In piix mode or ahci mode? If in piix mode, ich7 and 8 would behave
> quite differently. ICH8 has SIDPR so it can hardreset while 7 can't.
> ICH SIDPR access had a hardware problem where write to SControl to
> clear DET is sometimes ignored which led to occasional hardreset
> failure which got fixed recently. The reason why ich's are involved
> in those incidents could just be that they are extremely popular.
>

It's a non-AHCI capable ICH7, so it's in piix mode.

> Things to try after such a complete drive shutdown are...
>

Unfortunately I can't do much with this box, as it's a rented box in a datacentre, however....

> * Soft reset the machine. Can BIOS recognize the drive?
>

Yes, if I 'echo b > /proc/sysrq-trigger', then the BIOS recognises the drive, and the box reboots normally.

> In many cases I've seen, it's usually that the drive's firmware is
> completely hung and only power cycling the drive brought it back. But
> then again, there have been some number of cases which didn't get
> diagnosed properly, so it's definitely possible that we're doing
> something wrong in the driver.
>
> Anyways, if it happens again, please try the above and try to find out
> whether the controller or the drive is hung. Also, please keep in
> mind that timeouts on 0xEA (flush) are very often indicative of power
>

OK, I didn't think I was seeing those - is it possible to tell from the detail which I posted in my original message? As for the potential for PSU shenanigans - I don't have access to the box to fiddle with that, unfortunately, but I believe I can stress the I/O subsystem quite heavily with dd and/or bonnie, and it's only when polling for SMART status that these errors show up. I've just started dd (to RAID mirror) + hdparm -I again to check...

Do the SMART error counters in the OP make this suspicious? Is there likely to be any difference between running smartctl -a and hdparm -I in terms of code path taken through the kernel, or timings on the hardware, as far as you know?

Cheers,

Tim.
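On whether the posted logs show 0xEA timeouts: the first byte of the `cmd` field in a libata exception line is the ATA command opcode, so it can be read straight from the log. Below is a minimal sketch that maps the opcodes appearing in this thread to their ATA command names; the names `OPCODES`, `CMD_RE` and `decode_cmd` are illustrative, and the table deliberately covers only opcodes seen here.

```python
import re

# ATA command opcodes that appear in this thread's logs.
OPCODES = {
    0x25: "READ DMA EXT",
    0x34: "WRITE SECTOR(S) EXT",
    0x35: "WRITE DMA EXT",
    0x61: "WRITE FPDMA QUEUED",
    0xB0: "SMART",
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
    0xEA: "FLUSH CACHE EXT",
    0xEC: "IDENTIFY DEVICE",
}

# Matches e.g. "ata1.00: cmd ea/00:00:..." and captures the opcode byte.
CMD_RE = re.compile(r"\bcmd ([0-9a-f]{2})/")


def decode_cmd(line):
    """Return (opcode, name) for a libata 'cmd' line, or None if no match."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    return opcode, OPCODES.get(opcode, "unknown")


first = "[119061.717865] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"
print(decode_cmd(first))
```

Read this way, the very first exception in the opening message (`cmd ea/...`) appears to be a FLUSH CACHE EXT timeout, while the exceptions under the smartctl stress test are a mix of SMART (`b0`, with feature `d5`, SMART READ LOG) and ordinary read/write commands.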
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Mark L. <ke...@te...> - 2010-02-06 17:31:05
Tim Small wrote:
> Tejun Heo wrote:
>>> The only constants seem to be libata and ICH7/8.
>>> We must have a bug somewhere in there.
>>>
>> In piix mode or ahci mode? If in piix mode, ich7 and 8 would behave
>> quite differently. ICH8 has SIDPR so it can hardreset while 7 can't.
>> ICH SIDPR access had a hardware problem where write to SControl to
>> clear DET is sometimes ignored which led to occasional hardreset
>> failure which got fixed recently. The reason why ich's are involved
>> in those incidents could just be that they are extremely popular.
>>
>
> It's a non-AHCI capable ICH7, so it's in piix mode.
>
>> Things to try after such a complete drive shutdown are...
>>
>
> Unfortunately I can't do much with this box, as it's a rented box in a
> datacentre, however....
>
>> * Soft reset the machine. Can BIOS recognize the drive?
>>
>
> Yes, if I 'echo b > /proc/sysrq-trigger', then the BIOS
> recognises the drive, and the box reboots normally.
>
>> In many cases I've seen, it's usually that the drive's firmware is
>> completely hung and only power cycling the drive brought it back. But
>> then again, there have been some number of cases which didn't get
>> diagnosed properly, so it's definitely possible that we're doing
>> something wrong in the driver.
>>
>> Anyways, if it happens again, please try the above and try to find out
>> whether the controller or the drive is hung. Also, please keep in
>> mind that timeouts on 0xEA (flush) are very often indicative of power
>>
>
> OK, I didn't think I was seeing those - is it possible to tell from the
> detail which I posted in my original message? As for the potential for
> PSU shenanigans - I don't have access to the box to fiddle with that,
> unfortunately, but I believe I can stress the I/O subsystem quite
> heavily with dd and/or bonnie, and it's only when polling for SMART
> status that these errors show up. I've just started dd (to RAID mirror)
> + hdparm -I again to check...
>
> Do the SMART error counters in the OP make this suspicious? Is there
> likely to be any difference between running smartctl -a and hdparm -I in
> terms of code path taken through the kernel, or timings on the hardware,
> as far as you know?
..

My theory on the problem when I first had it here, was that doing a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent the problem. This was never explored further (by me or others).

Cheers
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tejun H. <tj...@ke...> - 2010-02-08 02:43:34
Hello,

On 02/07/2010 02:30 AM, Mark Lord wrote:
>>> * Soft reset the machine. Can BIOS recognize the drive?
>>
>> Yes, if I 'echo b > /proc/sysrq-trigger', then the BIOS
>> recognises the drive, and the box reboots normally.

Hmmm... this means one of the following.

1. The controller side is hung and needs some sort of reset or reinitialization to get working again.

2. The drive is hung, requiring a hardreset to continue. ata_piix currently can't do hardresets on ich7 but resetting the machine will definitely generate hardresets.

3. The BIOS actually power-cycles the machine when told to reboot. Some BIOSen do this.

No chance you can access the machine there?

>>> Anyways, if it happens again, please try the above and try to find out
>>> whether the controller or the drive is hung. Also, please keep in
>>> mind that timeouts on 0xEA (flush) are very often indicative of power
>>>
>>
>> OK, I didn't think I was seeing those - is it possible to tell from the
>> detail which I posted in my original message? As for the potential for
>> PSU shenanigans - I don't have access to the box to fiddle with that,
>> unfortunately, but I believe I can stress the I/O subsystem quite
>> heavily with dd and/or bonnie, and it's only when polling for SMART
>> status that these errors show up. I've just started dd (to RAID mirror)
>> + hdparm -I again to check...

Oh... if that's the case, PSU problem wouldn't be very probable.

>> Do the SMART error counters in the OP make this suspicious? Is there
>> likely to be any difference between running smartctl -a and hdparm -I in
>> terms of code path taken through the kernel, or timings on the hardware,
>> as far as you know?

From the driver's POV, hdparm and smart commands behave pretty much the same. They travel through the same high/mid layer paths and get issued using the same command protocol. From the drive's POV, I imagine it can be pretty different tho.

> My theory on the problem when I first had it here, was that doing
> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
> the problem. This was never explored further (by me or others).

If that's the case, what would that mean? Would it be some nasty interaction inside the drive firmware?

Thanks.

--
tejun
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tim S. <ti...@bu...> - 2010-02-06 22:22:47
Mark Lord wrote:
> My theory on the problem when I first had it here, was that doing
> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
> the problem. This was never explored further (by me or others).
>

Would using "option libata force=pio4" be a simple way to start to test this hypothesis?

Ta,

Tim.
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Mark L. <ke...@te...> - 2010-02-07 04:51:57
Tim Small wrote:
> Mark Lord wrote:
>> My theory on the problem when I first had it here, was that doing
>> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
>> the problem. This was never explored further (by me or others).
>>
>
> Would using "option libata force=pio4" be a simple way to start to test
> this hypothesis?
..

Yup. If the hypothesis is FALSE, then you'll still see trouble.
Otherwise, it *might* be correct. ;)
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tejun H. <tj...@ke...> - 2010-02-08 02:34:56
Hello,

On 02/07/2010 01:51 PM, Mark Lord wrote:
> Tim Small wrote:
>> Mark Lord wrote:
>>> My theory on the problem when I first had it here, was that doing
>>> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
>>> the problem. This was never explored further (by me or others).
>>>
>>
>> Would using "option libata force=pio4" be a simple way to start to test
>> this hypothesis?
> ..
>
> Yup. If the hypothesis is FALSE, then you'll still see trouble.
> Otherwise, it *might* be correct. ;)

But that would be a big *might*. The effect of PIO is a bit too drastic to indicate positivity (as opposed to ruling out stuff). Anyways, yeap, no harm in trying.

--
tejun
Re: [smartmontools-support] SATA drive reset/disable events on ICH7
ata_piix when polling SMART info
From: Tim S. <ti...@bu...> - 2010-02-08 14:12:27
Mark Lord wrote:
> Tim Small wrote:
>> Mark Lord wrote:
>>> My theory on the problem when I first had it here, was that doing
>>> a FLUSH_CACHE[_EXT] before any PIO command (eg. SMART) should prevent
>>> the problem. This was never explored further (by me or others).
>>>
>>
>> Would using "option libata force=pio4" be a simple way to start to test
>> this hypothesis?
> ..
>
> Yup. If the hypothesis is FALSE, then you'll still see trouble.
> Otherwise, it *might* be correct. ;)

It looks like it is false then....

[59745.632984] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[59745.633036] ata1.01: cmd 34/00:00:87:c6:f7/00:04:00:00:00/f0 tag 0 pio 524288 out
[59745.633086] ata1.01: status: { DRDY }
[59745.633117] ata1: soft resetting link
[59747.094498] ata1.00: FORCE: xfer_mask set to pio4
[59747.094498] ata1.01: FORCE: xfer_mask set to pio4
[59747.102353] ata1.00: configured for PIO4
[59747.108610] ata1.01: configured for PIO4
[59747.108610] ata1: EH complete
[59747.437125] sd 0:0:0:0: [sda] 3907029168 512-byte hardware sectors (2000399 MB)
[59747.499739] sd 0:0:0:0: [sda] Write Protect is off
[59747.499739] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[59747.844755] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[59748.047834] sd 0:0:1:0: [sdb] 3907029168 512-byte hardware sectors (2000399 MB)

...

7 14:20:32: [101181.209812] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
7 14:20:32: [101181.209865] ata1.01: cmd 34/00:00:0f:4d:f0/00:04:00:00:00/f0 tag 0 pio 524288 out
7 14:20:32: [101181.209909] ata1.01: status: { DRDY }
7 14:20:32: [101181.209946] ata1: soft resetting link
--
7 15:54:12: [110247.451925] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
7 15:54:12: [110247.451979] ata1.01: cmd 34/00:00:bf:8e:e8/00:04:00:00:00/f0 tag 0 pio 524288 out
7 15:54:12: [110247.452028] ata1.01: status: { DRDY }
7 15:54:12: [110247.452062] ata1: soft resetting link
--
7 23:47:13: [155689.544839] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
7 23:47:13: [155689.544892] ata1.01: cmd 34/00:00:d7:0f:fe/00:04:00:00:00/f0 tag 0 pio 524288 out
7 23:47:13: [155689.544935] ata1.01: status: { DRDY }
7 23:47:13: [155689.544974] ata1: soft resetting link
--
8 00:59:30: [162616.848048] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 00:59:30: [162616.848099] ata1.01: cmd 34/00:00:5f:6b:e9/00:04:00:00:00/f0 tag 0 pio 524288 out
8 00:59:30: [162616.848143] ata1.01: status: { DRDY }
8 00:59:30: [162616.848175] ata1: soft resetting link
--
8 01:01:22: [162789.662299] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 01:01:22: [162789.662338] ata1.01: cmd 34/00:00:5f:6c:ed/00:04:00:00:00/f0 tag 0 pio 524288 out
8 01:01:22: [162789.662381] ata1.01: status: { DRDY }
8 01:01:22: [162789.662418] ata1: soft resetting link
--
8 01:14:43: [164059.753030] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 01:14:43: [164059.753082] ata1.01: cmd ec/00:00:00:00:00/00:00:00:00:00/10 tag 0 pio 512 in
8 01:14:43: [164059.753129] ata1.01: status: { DRDY }
8 01:14:48: [164067.298313] ata1: link is slow to respond, please be patient (ready=0)
--
8 01:56:33: [168105.660062] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 01:56:33: [168105.660115] ata1.01: cmd 34/00:00:0f:2f:e6/00:04:00:00:00/f0 tag 0 pio 524288 out
8 01:56:33: [168105.660164] ata1.01: status: { DRDY }
8 01:56:33: [168105.660193] ata1: soft resetting link
--
8 02:11:42: [169562.773251] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 02:11:42: [169562.773303] ata1.01: cmd 34/00:00:87:8c:ef/00:04:00:00:00/f0 tag 0 pio 524288 out
8 02:11:42: [169562.773352] ata1.01: status: { DRDY }
8 02:11:42: [169562.773386] ata1: soft resetting link
--
8 04:35:16: [183417.972749] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 04:35:16: [183417.972749] ata1.01: cmd 34/00:40:a7:7f:fc/00:01:00:00:00/f0 tag 0 pio 163840 out
8 04:35:16: [183417.972749] ata1.01: status: { DRDY }
8 04:35:16: [183417.972749] ata1: soft resetting link
--
8 07:11:47: [198460.847454] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 07:11:47: [198460.847507] ata1.01: cmd 34/00:00:67:2c:ef/00:04:00:00:00/f0 tag 0 pio 524288 out
8 07:11:47: [198460.847555] ata1.01: status: { DRDY }
8 07:11:47: [198460.847583] ata1: soft resetting link
--
8 07:40:48: [201232.970903] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
8 07:40:48: [201232.970903] ata1.01: cmd 34/00:00:c7:2d:e5/00:04:00:00:00/f0 tag 0 pio 524288 out
8 07:40:48: [201232.970903] ata1.01: status: { DRDY }
8 07:40:48: [201232.970903] ata1: soft resetting link

...
but, it turns out that I have another box at home which I've been able to provoke into doing similar things:

16:46:49: [1130032.307185] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xe frozen
16:46:49: [1130032.307197] ata1.00: irq_stat 0x00000040, connection status changed
16:46:49: [1130032.307200] ata1: SError: { DevExch }
16:46:49: [1130032.307205] ata1.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
16:46:49: [1130032.307207]          res 40/00:4c:1f:fa:9a/00:00:06:00:00/40 Emask 0x10 (ATA bus error)
16:46:49: [1130032.307210] ata1.00: status: { DRDY }
16:46:49: [1130032.307219] ata1: hard resetting link
16:46:55: [1130038.083028] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
16:47:25: [1130068.090133] ata1.00: qc timeout (cmd 0xec)
16:47:25: [1130068.090148] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
16:47:25: [1130068.090152] ata1.00: revalidation failed (errno=-5)
16:47:25: [1130068.090156] ata1: failed to recover some devices, retrying in 5 secs
16:47:30: [1130073.094116] ata1: hard resetting link
16:47:30: [1130073.414133] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
16:47:30: [1130073.436396] ata1.00: configured for UDMA/133
16:47:30: [1130073.436396] ata1: EH complete
16:47:30: [1130073.436396] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
16:47:30: [1130073.436396] sd 0:0:0:0: [sda] Write Protect is off
16:47:30: [1130073.436396] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
17:21:21: [1132149.195367] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
17:21:21: [1132149.195378] ata1.00: irq_stat 0x00000040, connection status changed
17:21:21: [1132149.195384] ata1: SError: { CommWake DevExch }
17:21:21: [1132149.195394] ata1.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
17:21:21: [1132149.195397]          res 40/00:2c:77:ad:63/00:00:06:00:00/40 Emask 0x10 (ATA bus error)
17:21:21: [1132149.195403] ata1.00: status: { DRDY }
--
18:28:29: [1136257.076898] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
18:28:29: [1136257.076898] ata1.00: cmd 61/00:00:27:b5:89/04:00:06:00:00/40 tag 0 ncq 524288 out
18:28:29: [1136257.076898]          res 40/00:f4:27:b1:89/00:00:06:00:00/40 Emask 0x4 (timeout)
18:28:29: [1136257.076898] ata1.00: status: { DRDY }
18:28:29: [1136257.076898] ata1.00: cmd 61/00:08:27:b9:89/04:00:06:00:00/40 tag 1 ncq 524288 out
18:28:29: [1136257.076898]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
--
18:53:19: [1137768.517637] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
18:53:19: [1137768.517637] ata1.00: irq_stat 0x00000040, connection status changed
18:53:19: [1137768.517637] ata1: SError: { CommWake DevExch }
18:53:19: [1137768.517637] ata1.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
18:53:19: [1137768.517637]          res 40/00:0c:7b:99:09/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
18:53:19: [1137768.517637] ata1.00: status: { DRDY }

This also has an ICH7, but it's in AHCI mode, so ata_piix would seem to be off the hook in this case. I have a couple of other SATA controllers in that box (JMicron 20360/20363 and a SiI 3132), so I should be able to put the drive on those controllers instead to see if the same thing happens.

Annoyingly (but only from the PoV of that issue), I'm about to go on holiday, but I'll try and do this before I go....

Cheers,

Tim.
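The AHCI-mode logs above print the SError register both as a raw value and as decoded flag names, which makes the bit assignments easy to cross-check. The sketch below decodes only the two flags that these logs confirm (0x4000000 printed as DevExch, 0x4040000 as CommWake DevExch); the names `SERR_BITS` and `decode_serr` are illustrative, and any other set bits are returned undecoded rather than guessed at.

```python
# SError bits whose names are confirmed by the logs in this thread:
# SErr 0x4000000 is printed as { DevExch }, 0x4040000 as { CommWake DevExch }.
SERR_BITS = {
    18: "CommWake",
    26: "DevExch",
}


def decode_serr(value):
    """Return (known flag names, remaining undecoded bits) for an SErr value."""
    known = [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]
    known_mask = sum(1 << bit for bit in SERR_BITS)
    return known, value & ~known_mask


print(decode_serr(0x4040000))
```

Both flags describe link-level events (a wake signal and a device exchange being detected), which is consistent with the "connection status changed" lines logged alongside them.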