From: micah a. <mi...@ri...> - 2009-12-02 16:51:07
Hi,

Excerpts from Tim Small's message of Tue Dec 01 14:58:04 -0500 2009:
> Micah Anderson wrote:
> > SMART Self-test log structure revision number 1
> > Num  Test_Description  Status                    Remaining  LifeTime(hours)  LBA_of_first_error
> > # 1  Extended offline  Interrupted (host reset)  40%        11051            -
> >
> > Looking in my dmesg, I see the following happen:
> >
> > [1006133.798423] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> > [1006133.805959] ata1.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
> > [1006133.805963]          res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> >
> > I'm trying to figure out where to go from here, is this a kernel issue
> > in the SATA subsystem or something similar?
> >
> > I'm running the Debian Lenny provided 2.6.26-2 kernel, using
> > 5.38-2+lenny1 version of smartmontools.
>
> More likely to be a drive firmware bug, I would have thought - looks
> like a command is issued, and the drive goes away. Does the same thing
> work with different drive models?

I've got another system with the same SATA controller but a different
drive model; both are Western Digital. The drives we have been talking
about, the ones getting these resets, are model WDC WD1001FALS-00J7B0,
and the other system with the same SATA controller has model
WDC WD5001AALS-00L3B2.

> Also what SATA controller are you using (plus drive model/firmware rev
> etc.?),

lspci shows:

02:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09)

dmesg shows:

[ 6.815861] sata_mv 0000:02:01.0: Gen-II 32 slots 8 ports SCSI mode IRQ via INTx

> and have you checked for drive firmware updates?

I just looked at Western Digital's site for these drives and no
firmware updates have been issued.

> You could also try turning NCQ off (set queue length to 1 via the
> control file under /sys ) - although the error report shows tag 0, so
> this probably isn't it...

Ok, I did that:

# echo 1 > /sys/block/sda/device/queue_depth
# cat /sys/block/sda/device/queue_depth
1
# smartctl -t long /dev/sda

Some time later... the long test completed without a host reset. So
that is interesting. I tried it on /dev/sdb and got the same result.

Then, as a control, I tried it on /dev/sdc without setting
/sys/block/sdc/device/queue_depth to 1, and mysteriously it also
completed successfully. Hmm....

So I tried launching a long test on /dev/sda *and* /dev/sdb at the same
time (well, I ran the two commands one right after the other). Both of
these have queue_depth set to 1. A few minutes later, /dev/sda had a
host reset, and a little afterwards /dev/sdb also had a host reset.

So it seems like things stop working when more than one drive is
running a self-test. This is what was put in dmesg:

[1134931.137300] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1134931.144627] ata1.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
[1134931.144630]          res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[1134931.149276] ata1.00: status: { DRDY }
[1134931.153148] ata1: hard resetting link
[1134931.629795] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1134931.653325] ata1.00: max_sectors limited to 256 for NCQ
[1134931.681337] ata1.00: max_sectors limited to 256 for NCQ
[1134931.685346] ata1.00: configured for UDMA/133
[1134931.689871] ata1: EH complete
[1134931.715382] sd 1:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
[1134931.723589] sd 1:0:0:0: [sda] Write Protect is off
[1134931.728635] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[1134931.731389] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1135528.516798] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1135528.524091] ata2.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
[1135528.524094]          res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[1135528.528774] ata2.00: status: { DRDY }
[1135528.532657] ata2: hard resetting link
[1135529.009396] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1135529.048855] ata2.00: max_sectors limited to 256 for NCQ
[1135529.070996] ata2.00: max_sectors limited to 256 for NCQ
[1135529.074998] ata2.00: configured for UDMA/133
[1135529.078999] ata2: EH complete
[1135529.095005] sd 2:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
[1135529.102508] sd 2:0:0:0: [sdb] Write Protect is off
[1135529.107034] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[1135529.107172] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

micah
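
When repeating an experiment like this one, it may help to tally which ports were reset instead of reading through dmesg by eye. The helper below is only a sketch: the function name count_link_resets is made up for illustration, it assumes the kernel reports resets with the "ataN: hard resetting link" wording seen in the log in this message, and it relies on grep's -o option (available in GNU and BSD grep).

```shell
# Hypothetical helper: read kernel log text on stdin and print one line
# per ATA port that logged a hard link reset, as "<port> <count>".
# Assumes the "ataN: hard resetting link" message format shown in this thread.
count_link_resets() {
    grep -oE 'ata[0-9]+: hard resetting link' |  # keep only the reset messages
        cut -d: -f1 |                            # reduce each to the port name
        sort |                                   # group identical ports together
        uniq -c |                                # count occurrences per port
        awk '{ print $2, $1 }'                   # reorder to "<port> <count>"
}
```

It could be run as "dmesg | count_link_resets" before and after starting the self-tests, to see which ports picked up new resets in between.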