From: Justin P. <jp...@lu...> - 2008-12-05 13:20:10
|
I swapped out my power supply, changed ALL cables and bought a $1000 raid controller with BBU, the drives are still having problems, when writing to them in a RAID10 configuration, it locks up the card: SMART shows the same thing on many of the disks in the RAID10: Error 1 occurred at disk power-on lifetime: 3708 hours (154 days + 12 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 10 51 00 00 8d b8 40 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 61 80 18 00 6c 91 19 08 00:57:50.230 WRITE FPDMA QUEUED 61 80 18 00 cd b7 19 08 00:57:50.148 WRITE FPDMA QUEUED 61 80 e8 80 cc b7 19 08 00:57:50.147 WRITE FPDMA QUEUED 61 80 18 00 cc b7 19 08 00:57:50.147 WRITE FPDMA QUEUED 61 80 e8 80 cb b7 19 08 00:57:50.146 WRITE FPDMA QUEUED The card has the latest 3ware BIOS/Firmware etc, the diag output from the card itself: DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0) DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0) DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0) DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0) E=1019 T=19:57:26 : Drive removed task file written out : cd dh ch cl sn sc ft : 61 59 B8 8E 00 80 80 E=1019 T=19:57:26 P=Bh: Hard reset drive P=Bh: HardResetDriveWait task file read back : st dh ch cl sn sc er : 50 00 00 00 01 01 01 E=1019 T=19:57:26 P=B : Soft reset drive E=0207 T=19:57:26 P=B : ResetDriveWait E=1019 T=19:57:26 P=B : Inserting Set UDMA command E=1019 T=19:57:26 P=B : Check power mode, active E=1019 T=19:57:26 P=B : Check drive swap, same drive E=1019 T=19:57:26 P=B : Check power cycles, initial=57, current=57 E=1019 T=19:57:26 P=Bh: exitCode = 0 Retrying chain DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0) DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0) Hm the last thing I will try I suppose is disabling NCQ and see if the problem recurs. So I did this, after 3-4 times of running the following with NCQ enabled, I would try it once more, the final time with NCQ disabled. dd if=/dev/zero of=bigfile bs=1M on the raid10 and having it crash everytime when NCQ was enabled. -- I don't want to get too excited yet but after disabling NCQ I was able to write to the RAID10 - over the entire array without it crashing! Where before it would get to 700-985GiB/1.4TiB and then all processes would go into D-state and I could not even echo b > sysrq-trigger to bring the host back up, it required a manual reboot. -- I will let it run a few more times before making any further comments though. echo "writing to raid10" writing to raid10 dd if=/dev/zero of=file2 bs=1M dd: writing `file2': No space left on device 1430328+0 records in 1430327+0 records out 1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s Just as with Linux-- when using NCQ Western Digital Raptor Drives (150) or Velociraptor Drives (300s) the drives in RAID; whether in Linux SW RAID or 3ware HW RAID, the result is the same, unstable drives, they appear to 'reset' or timeout and this obviously will cause problems with any RAID implementation. NCQ+Velociraptor => Bad in raid configuration, in non-raid it may be OK, I have not tested. I also have another host here that uses the P35 chipset and uses raptors in sw raid 1-- when NCQ is enabled (and with the 750s as well) it gets nasty NCQ errors and drive timeouts etc. With this data, essentially the NCQ implementation on raptors or WD 750s in a RAID configuration has problems. I also have a 750 which I have used for 1-2 years now (using NCQ) in a single-disk configuration with NO issues, so whatever the problem is-- only relates to when the disk is in a RAID configuration and in my case in Linux SW RAID or 3ware HW RAID. Justin. |
From: Justin P. <jp...@lu...> - 2008-12-05 14:00:04
|
On Fri, 5 Dec 2008, Justin Piszcz wrote: > I swapped out my power supply, changed ALL cables and bought a $1000 raid > controller with BBU, the drives are still having problems, when writing to > them in a RAID10 configuration, it locks up the card: > > SMART shows the same thing on many of the disks in the RAID10: > > Error 1 occurred at disk power-on lifetime: 3708 hours (154 days + 12 hours) > When the command that caused the error occurred, the device was active or > idle. I spoke to soon, turning off NCQ helped dramatically, it worked three times! writing to raid10 dd: writing `file2': No space left on device 1430328+0 records in 1430327+0 records out 1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s Fri Dec 5 06:00:25 EST 2008 writing to raid10 dd: writing `file2': No space left on device 1430328+0 records in 1430327+0 records out 1499806973952 bytes (1.5 TB) copied, 4063.25 s, 369 MB/s Fri Dec 5 07:08:11 EST 2008 writing to raid10 dd: writing `file2': No space left on device 1430328+0 records in 1430327+0 records out 1499806973952 bytes (1.5 TB) copied, 3926.71 s, 382 MB/s Fri Dec 5 08:13:41 EST 2008 Then it crashed again, with NCQ enabled, it would not even complete one test, So basically a new system, new PSU, new cables, its on a new APC UPS and the problem persists even when all disks are on a RAID card, SW raid, it does not matter, Velociraptors have problems, I think its time for me to get regular 1TiB disks and be done with it. Justin. |
From: Justin P. <jp...@lu...> - 2008-12-06 09:26:30
|
On Fri, 5 Dec 2008, Justin Piszcz wrote: > > > On Fri, 5 Dec 2008, Justin Piszcz wrote: > >> I swapped out my power supply, changed ALL cables and bought a $1000 raid >> controller with BBU, the drives are still having problems, when writing to >> them in a RAID10 configuration, it locks up the card: >> >> SMART shows the same thing on many of the disks in the RAID10: >> >> Error 1 occurred at disk power-on lifetime: 3708 hours (154 days + 12 >> hours) >> When the command that caused the error occurred, the device was active or >> idle. > > I spoke to soon, turning off NCQ helped dramatically, it worked three times! > > writing to raid10 > dd: writing `file2': No space left on device > 1430328+0 records in > 1430327+0 records out > 1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s > Fri Dec 5 06:00:25 EST 2008 > writing to raid10 > dd: writing `file2': No space left on device > 1430328+0 records in > 1430327+0 records out > 1499806973952 bytes (1.5 TB) copied, 4063.25 s, 369 MB/s > Fri Dec 5 07:08:11 EST 2008 > writing to raid10 > dd: writing `file2': No space left on device > 1430328+0 records in > 1430327+0 records out > 1499806973952 bytes (1.5 TB) copied, 3926.71 s, 382 MB/s > Fri Dec 5 08:13:41 EST 2008 > > Then it crashed again, with NCQ enabled, it would not even complete one test, > So basically a new system, new PSU, new cables, its on a new APC UPS and the > problem persists even when all disks are on a RAID card, SW raid, it does not > matter, Velociraptors have problems, I think its time for me to get regular > 1TiB disks and be done with it. > > Justin. > > I have swapped my disks: Removed my 12 velociraptors. Inserted my old 12 raptor150s (all ADFD). The 12 velociraptors are now in a test system using md/raid. The 12 raptor 150s are now in my main machine w/3ware. I am going to run the same tests, benchmarks, etc and see if any problems repeat with the 150s and I will also run a bunch more tests on the velociraptors as well. So far no problems with the good ol' raptor 150s and I have been running the same dd test for the past 8 hours+ on the same raid type and settings as I had with the velociraptors: Fri Dec 5 21:07:43 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2747.77 s, 273 MB/s Fri Dec 5 21:53:32 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2742.29 s, 273 MB/s Fri Dec 5 22:39:17 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2685.21 s, 279 MB/s Fri Dec 5 23:24:05 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2600.05 s, 288 MB/s Sat Dec 6 00:07:28 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2550.45 s, 294 MB/s Sat Dec 6 00:50:02 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2507.31 s, 299 MB/s Sat Dec 6 01:31:52 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2247.31 s, 334 MB/s Sat Dec 6 02:09:22 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2248.2 s, 334 MB/s Sat Dec 6 02:46:53 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2245.41 s, 334 MB/s Sat Dec 6 03:24:22 EST 2008 dd: writing `/t/bigfile': No space left on device 715067+0 records in 715066+0 records out 749801271296 bytes (750 GB) copied, 2494.1 s, 301 MB/s Sat Dec 6 04:05:59 EST 2008 So far no problems, just like before when I used to use these drives in the past, they are rock solid (will continue to test, the VRs are another story)-- more info soon to follow. Justin. |