From: Ross S. W. W. <RW...@me...> - 2010-12-07 22:56:42
Sean McCreadie [mailto:smccreadie@CanyonPartners.com] wrote:
>
> I have been trying to use IET to serve up some disk to my ESX
> servers to use for VMFS. I have actually had the system
> working for more than a year with no issues, until just recently.
>
> Up until now I have only put Windows VMs on the iscsi storage
> and reported good performance and no issues. This week I
> tried to install Centos 5.5 on a VM, and when it got the end
> of the installer and started formatting the partitions the VM
> froze and these errors began erupting in the messages log.
> Subsequent testing revealed this would happen every time with
> the Centos VM, no issues with Windows. I even hammered the
> storage with IOMeter and SQLIO in Windows VM with no errors.

The SCSI timeout values are higher on Windows than on Linux. I would
look at the storage setup again. Look for errors in the ESX/ESXi logs,
and look at the datastore performance graphs for access times.

> The errors were originally on my Centos 5.5 IET server, and I
> even built an Openfiler 2.3 box to see if that build had
> fixed the issue. Errors are the same on both.

If I were to bet, the problem has been there all along and only came
up due to Linux's lower SCSI timeout settings.

> Dec 7 13:41:30 SAN-A-01 kernel: [ 6906.594671] iscsi_trgt:
> cmnd_abort(1167) 49e10402 1 0 42 131072 0 0
>
> Dec 7 13:41:30 SAN-A-01 kernel: [ 6906.594962] iscsi_trgt:
> Abort Task (01) issued on tid:1 lun:0 by sid:1689949375889920
> (Unknown Task)

Hmmm, an Unknown Task. This will only happen if the command window has
extended beyond the Abort Command's target SN, which means there is
probably a lot of IO and a SCSI request or its matching response was
dropped at the network level.
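To see how often each abort outcome occurs, the "Abort Task" lines can be tallied by their trailing status. The following is a minimal sketch, assuming each kernel message sits on a single line in the messages log (they are only wrapped here by the mail quoting). The regex is derived solely from the two "Abort Task" lines quoted in this thread, and the names `ABORT_RE`, `count_abort_statuses` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the two "Abort Task" lines quoted in this thread;
# other IET versions may format the message differently.
ABORT_RE = re.compile(
    r"iscsi_trgt: Abort Task \(\d+\) issued on "
    r"tid:(?P<tid>\d+) lun:(?P<lun>\d+) by sid:(?P<sid>\d+) "
    r"\((?P<status>[^)]+)\)"
)


def count_abort_statuses(lines):
    """Count abort responses per (tid, lun, status)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            key = (int(match["tid"]), int(match["lun"]), match["status"])
            counts[key] += 1
    return counts


# The four log lines from the original report, unwrapped to one line each.
SAMPLE = [
    "Dec 7 13:41:30 SAN-A-01 kernel: [ 6906.594671] iscsi_trgt: "
    "cmnd_abort(1167) 49e10402 1 0 42 131072 0 0",
    "Dec 7 13:41:30 SAN-A-01 kernel: [ 6906.594962] iscsi_trgt: "
    "Abort Task (01) issued on tid:1 lun:0 by sid:1689949375889920 "
    "(Unknown Task)",
    "Dec 7 13:41:41 SAN-A-01 kernel: [ 6917.676208] iscsi_trgt: "
    "cmnd_abort(1167) ede10402 1 200 42 512 0 1",
    "Dec 7 13:41:41 SAN-A-01 kernel: [ 6917.676298] iscsi_trgt: "
    "Abort Task (01) issued on tid:1 lun:0 by sid:1971424352600576 "
    "(Function Complete)",
]

if __name__ == "__main__":
    for (tid, lun, status), n in sorted(count_abort_statuses(SAMPLE).items()):
        print(f"tid:{tid} lun:{lun} {status}: {n}")
```

Against a real log, pass an open file object instead of `SAMPLE`. A high share of "Unknown Task" results would be consistent with requests or responses being lost, as opposed to commands that simply completed late.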
> Dec 7 13:41:41 SAN-A-01 kernel: [ 6917.676208] iscsi_trgt:
> cmnd_abort(1167) ede10402 1 200 42 512 0 1
>
> Dec 7 13:41:41 SAN-A-01 kernel: [ 6917.676298] iscsi_trgt:
> Abort Task (01) issued on tid:1 lun:0 by sid:1971424352600576
> (Function Complete)
>
> I googled the errors and tried several things which made it
> better but hasn't resolved the issue.
>
> 1. I updated to latest IET build 1.4.20.2 (On Centos 5.5
> host, latest updates)
>
> 2. I edited /etc/init.d/iscsi-target file and added the IP
> address in the "$LISTEN_ADDR field
>
> 3. Changed Queued Commands to 8 from default of 32 in ietd.conf
>
> 4. Changed all targets to fileio from blockio, and vice versa
>
> 5. Verified the scheduler is set to deadline on target devices

Did you look at the network statistics for dropped frames and
retransmissions?

> These errors are happening with the latest IET build and the
> version 1.4.19 that is on the Openfiler build.
>
> After my changes listed above, the LUN doesn't freeze anymore
> and in general the issue is much better, but I still get
> these errors in the logs and Linux VMs will seem to hang up
> briefly and continue, even when doing simple things like yum update.

Reducing the queue depth probably helped, but the problem is still
there, and I bet that either the storage controller, the network
controller, or the two together can't handle the IO load.

Look at the network layer first (NICs and switches). Then check whether
the NIC IRQs and the storage IRQs are conflicting, or whether so many
interrupts are being generated during these periods that the CPU can't
keep up (use vmstat to watch interrupt rates).

> Any ideas what is so different about Linux VM I/O to cause
> this behavior? Any ideas on a fix?

The Windows SCSI layer has longer timeouts, and its writes are cached
more aggressively than on Linux, which with ext3 forces a flush every
5 seconds.
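For the dropped-frame check suggested above, one option on the Linux target is to read the per-interface error and drop counters from /proc/net/dev. This is a rough sketch, not a definitive tool: the function names are hypothetical, it covers only the interface counters (TCP retransmissions are reported elsewhere, for example by `netstat -s`), and it says nothing about the switches or the ESX side.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into per-interface error/drop counters.

    Each data line is "iface: <8 receive fields> <8 transmit fields>";
    errs and drop are the 3rd and 4th field of each group.
    """
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


def interfaces_with_problems(stats):
    """Return only the interfaces that have any non-zero counter."""
    return {name: c for name, c in stats.items() if any(c.values())}


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as handle:
            current = parse_net_dev(handle.read())
    except OSError:
        current = {}  # not Linux, or /proc is not available
    for name, counters in interfaces_with_problems(current).items():
        print(name, counters)
```

The counters are cumulative since boot, so the useful signal is whether they grow while a Linux VM is stalling: take one reading before and one after reproducing the hang and compare.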
-Ross