From: Marc S. <mar...@mc...> - 2012-02-29 21:50:49
Hi,

We are currently using two SCST disk arrays with three volumes each, giving us six VMFS volumes in total. We have approximately 700 virtual machines spread across these six VMFS datastores and three ESXi 5 hosts. The volumes are backed by SATA SSDs and LSI MegaRAID SAS RAID controllers. We are not experiencing any performance issues that are noticeable to our users -- everything is extremely fast. However, we are seeing the following errors in the vmkernel log on each host:

--snip--
2012-02-29T21:22:28.394Z cpu58:636519)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x41258056cf40) to dev "eui.3533633631313666" on path "vmhba1:C0:T5:L105" Failed: H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:NONE
2012-02-29T21:22:28.394Z cpu58:636519)ScsiDeviceIO: 2305: Cmd(0x41258056cf40) 0x2a, CmdSN 0x8000001e to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.394Z cpu58:636519)ScsiDeviceIO: 2305: Cmd(0x412580540340) 0x2a, CmdSN 0x8000005f to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.415Z cpu60:8252)ScsiDeviceIO: 2305: Cmd(0x41258129c200) 0x2a, CmdSN 0x8000006c to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.415Z cpu60:8252)ScsiDeviceIO: 2305: Cmd(0x4125803f1d00) 0x2a, CmdSN 0x8000004e to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.416Z cpu60:8252)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4125801b8e00) to dev "eui.3533633631313666" on path "vmhba1:C0:T5:L105" Failed: H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:NONE
2012-02-29T21:22:28.441Z cpu50:98752)ScsiDeviceIO: 2305: Cmd(0x4125801d3a40) 0x2a, CmdSN 0x80000058 to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.441Z cpu50:98752)ScsiDeviceIO: 2305: Cmd(0x4125801d1a40) 0x2a, CmdSN 0x8000005f to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.496Z cpu62:636519)ScsiDeviceIO: 2305: Cmd(0x41258058a980) 0x2a, CmdSN 0x8000003c to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.496Z cpu62:636519)ScsiDeviceIO: 2305: Cmd(0x4125801d9fc0) 0x2a, CmdSN 0x8000003c to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.496Z cpu62:636519)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4125812eaf40) to dev "eui.3533633631313666" on path "vmhba1:C0:T5:L105" Failed: H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:NONE
2012-02-29T21:22:28.497Z cpu62:636519)ScsiDeviceIO: 2305: Cmd(0x4125812b0440) 0x2a, CmdSN 0x8000003e to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-29T21:22:28.497Z cpu62:636519)ScsiDeviceIO: 2305: Cmd(0x412580a218c0) 0x2a, CmdSN 0x8000004e to dev "eui.3533633631313666" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
--snip--

After reading some VMware knowledge base articles, it appears that the command above (0x2a) is the SCSI WRITE(10) command, and that the "D:0x28" device status is "VMK_SCSI_DEVICE_QUEUE_FULL (TASK SET FULL)":

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381

So that article is saying that on the SCST (array) side, the target has stopped accepting commands because its queue is full.
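For what it's worth, a quick way to gauge how often this is happening, and against which devices, is to scrape the log directly on each host (a rough sketch; assuming the BusyBox versions of grep/sort/uniq on ESXi accept these flags):

  # Count TASK SET FULL (D:0x28) completions per device in the current log:
  grep 'D:0x28' /var/log/vmkernel.log | grep -o 'dev "[^"]*"' | sort | uniq -c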
KB 1030381, in turn, recommends controlling the queue depth via throttling (the adaptive queue depth algorithm) on the initiator side until the initiator stops seeing TASK SET FULL from the device:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1008113

I guess I have a few questions, then. First, I totally understand we are exceeding the recommended number of VMs per VMFS datastore. We are working on deploying additional SCST disk arrays, but we are not there yet.

- I've read the queue depth material in the SCST README, and I see that controlling the queue depth from the initiator side is one solution. I feel our back-storage is quite fast on the SCST side, but we truly are just overwhelming the volumes with the 700 virtual machines. Is turning on the adaptive queue depth algorithm in VMware ESXi the recommended way to stop seeing these messages (or at least not so many of them)? Any downside to this? (A sketch of what I think that involves is in the P.S. below.)

- Is the queue depth limit we are hitting SCST_MAX_TGT_DEV_COMMANDS (in scst_priv.h)? It's 48 in the version of SCST we're using. Any advantages/disadvantages to increasing this? Recommended? Yay? Nay?

- Is there any way to actually monitor what the queue depth is on the SCST side? We built SCST for performance, as these are production machines, so none of the SCST debug options are enabled. (My best guess is in the P.S. as well.)

- Anything else we're missing? Suggestions?

Again, I think everything is working correctly, but we truly are just overloaded with the 700 virtual machines across the six volumes.

Thanks for your time.

--Marc
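P.S. To make the adaptive queue depth question concrete: as far as I can tell from KB 1008113, the throttling is driven by the Disk.QFullSampleSize and Disk.QFullThreshold advanced settings on each host. I had something like the following in mind (the values are only illustrative, not recommendations -- please correct me if this is the wrong knob):

  # Enable adaptive queue depth throttling (ESXi 5.x advanced settings):
  esxcli system settings advanced set -o /Disk/QFullSampleSize -i 32
  esxcli system settings advanced set -o /Disk/QFullThreshold -i 4

  # Verify the current values:
  esxcli system settings advanced list -o /Disk/QFullSampleSize
  esxcli system settings advanced list -o /Disk/QFullThreshold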
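P.P.S. On monitoring the SCST side: from my reading of the SCST 2.x README, each session exposes command counters in sysfs, so my best guess was to watch those. The paths and attribute names below are from the README, not something I've verified on our build, so corrections welcome:

  # Outstanding commands per initiator session:
  for s in /sys/kernel/scst_tgt/targets/*/*/sessions/*; do
      printf '%s: %s\n' "$s" "$(cat "$s/active_commands" 2>/dev/null)"
  done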