From: Chris W. <cw...@gm...> - 2007-12-11 17:43:34
On Dec 11, 2007 11:04 AM, Matteo Tescione <ma...@rm...> wrote:
> Chris,
>
> > On Dec 11, 2007 9:52 AM, Matteo Tescione <ma...@rm...> wrote:
> >>> It's coming. Currently the issues with VMware are the showstopper.
> >>
> >> Really?? I'm running a VMware infrastructure with 8 ESX hosts connected
> >> and a hundred VMs without any kind of problems.
> >>
> >
> > Can you share some details? I'm having critical issues.
>
> Here's my setup:
>
> 2 quad-core Xeons on an Intel psl5000 board with 4 GB of RAM, one 3ware
> 9650SE 8-port controller providing 8x500 GB SATA2 drives configured as
> RAID10.
> Running kernel 2.6.23 with all the patches suggested by SCST, DRBD on top
> of the hardware RAID, with 2 dedicated gigabit links provided by an Intel
> Pro dual-port card bonded together in round-robin mode.
> iscsi-scst is started on the primary DRBD node by heartbeat.
> The switch is a Cisco 2960G, but I tested Netgear switches too and they
> perform well as well.
> I intentionally left out the LVM part because of the nature of the VMFS
> partition (you can span or split even inside VMware datastores) and, to be
> clear, I didn't want to add yet another layer in the middle creating more
> confusion...
> DRBD is a robust solution, and the overhead is well worth it.

I'm using LVM because I intended to put local file systems on it too, for backing up guests with esXpress. I've also run tests with SCST in a VMware Workstation guest without LVM, just pointing vdisk at sdb, and I hit the exact same issues. I expected it to be worse with the target fully virtual, but it's not.

> Initiators are all 3.0.1 ESX hosts with VirtualCenter running. No major
> errors here; we've experienced the occasional crash, but we are aware of
> that.
>
> Can you describe your issues in more detail?

I don't think it's a networking issue, since when I compile SCST for debug it fails almost immediately after powering on a guest, even with only one guest and one ESX host connected.
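For reference, a round-robin bond of two dedicated gigabit links like the one Matteo describes can be brought up along these lines. This is only a sketch: the interface names (eth1/eth2) and the address are placeholders, not taken from his setup.

```shell
# Load the bonding driver in round-robin mode with link monitoring
# (interface names and address below are placeholders).
modprobe bonding mode=balance-rr miimon=100

# Enslave the two dedicated NICs to bond0
ip link set eth1 down
ip link set eth2 down
ifenslave bond0 eth1 eth2

# Address the bond and bring it up
ip addr add 10.0.0.1/24 dev bond0
ip link set bond0 up
```

With balance-rr, packets are striped across both links, which is why it suits a point-to-point DRBD replication path better than a switched client network.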
On the target, syslog looks like this:

Dec 10 10:46:46 file3 kernel: [ 636.942407] [5084]: scst_unblock_dev:2758:Device UNBLOCK(new 0), dev ffff81000e3d3600
Dec 10 10:46:46 file3 kernel: [ 636.944430] [5084]: scst_block_dev_cmd:2748:Needs unblocking cmd ffff81000f4bc510 (tag 3624075264)
Dec 10 10:46:46 file3 kernel: [ 636.944441] [5084]: __scst_block_dev:2725:Device BLOCK(new 1), dev ffff81000e3d3600
Dec 10 10:46:46 file3 kernel: [ 636.944446] [5084]: scst_block_dev:2736:Waiting during blocking outstanding 1 (on_dev_count 2)
Dec 10 10:46:50 file3 kernel: [ 641.053814] [5115]: iscsi-scst: execute_task_management:1589:TM cmd: req ffff810009587750, itt 3db, fn 1, rtt d7030000
Dec 10 10:46:50 file3 kernel: [ 641.054392] [5115]: __cmnd_abort:1420:Aborting cmd ffff810009587888, scst_cmd ffff81000f4bc360 (scst state 3, ref_cnt 2, itt 3d7, op 1, r2t_len 0, CDB op 2a, size to write 512, is_unsolicited_data 0, outstanding_r2t 0)
Dec 10 10:46:50 file3 kernel: [ 641.054466] [5115]: __cmnd_abort:1423:net_ref_cnt 0
Dec 10 10:46:50 file3 kernel: [ 641.054659] [5115]: scst: scst_rx_mgmt_fn:4314:sess=ffff810007138000, fn 0, tag_set 1, tag 3607298048, lun_set 1, lun=0, cmd_sn_set 1, cmd_sn 3da
Dec 10 10:46:50 file3 kernel: [ 641.054722] [5115]: scst_post_rx_mgmt_cmd:4249:Adding mgmt cmd ffff81000b35b410 to active mgmt cmd list
Dec 10 10:46:50 file3 kernel: [ 641.054884] [5086]: scst_mgmt_cmd_thread:4146:Deleting mgmt cmd ffff81000b35b410 from active cmd list
Dec 10 10:46:50 file3 kernel: [ 641.055040] [5086]: scst: scst_mgmt_cmd_init:3609:Cmd ffff81000f4bc360 for tag 3607298048 (sn 907, set 1, queue_type 1) found, aborting it
Dec 10 10:46:50 file3 kernel: [ 641.055108] [5086]: scst: scst_abort_cmd:3258:Aborting cmd ffff81000f4bc360 (tag 3607298048)
Dec 10 10:46:50 file3 kernel: [ 641.055212] [5086]: scst_call_dev_task_mgmt_fn:3223:Calling dev handler vdisk_blk task_mgmt_fn(fn=0)
Dec 10 10:46:50 file3 kernel: [ 641.055283] [5086]: scst_call_dev_task_mgmt_fn:3227:Dev handler vdisk_blk task_mgmt_fn() returned 1
Dec 10 10:46:50 file3 kernel: [ 641.055339] [5086]: scst: scst_abort_cmd:3295:cmd ffff81000f4bc360 (tag 3607298048) being executed/xmitted (state 6), deferring ABORT...
Dec 10 10:46:50 file3 kernel: [ 641.055642] [5086]: scst_set_mcmd_next_state:3325:cmd_wait_count(1) not 0, preparing to wait

This repeats a few times, until:

Dec 10 10:47:59 file3 kernel: [ 710.086628] [5117]: iscsi-scst: cmnd_rx_start:2053:***ERROR*** Error -4 (iSCSI opcode 2, ITT ffffffff, op ffffffff)
Dec 10 10:48:00 file3 kernel: [ 710.086680] [5117]: iscsi_cmnd_reject:607:Reject: req ffff810009587c30, reason 4
Dec 10 10:48:00 file3 kernel: [ 710.086820] [5117]: cmnd_prepare_skip_pdu:798:Skipping (ffff810009587c30, ffffffff 2 0 0, 0000000000000000, scst state 0)
Dec 10 10:48:01 file3 kernel: [ 711.695655] [0]: iscsi-scst: iscsi_state_change:180:***ERROR*** Connection with initiator iqn.1998-01.com.vmware:esx2-4cd7882d (ffff81000ce93c00) unexpectedly closed!
Dec 10 10:48:01 file3 kernel: [ 711.696278] [5115]: iscsi-scst: close_conn:127:Closing connection ffff81000ce93c00 (conn_ref_cnt=9)
Dec 10 10:48:01 file3 kernel: [ 711.696360] [5115]: conn_abort:1568:Aborting conn ffff81000ce93c00
Dec 10 10:48:01 file3 kernel: [ 711.696431] [5115]: __cmnd_abort:1420:Aborting cmd ffff810009587af8, scst_cmd 0000000000000000 (scst state 0, ref_cnt 1, itt 3df, op 2, r2t_len 0, CDB op 0, size to write 0, is_unsolicited_data 0, outstanding_r2t 0)
Dec 10 10:48:01 file3 kernel: [ 711.696494] [5115]: __cmnd_abort:1423:net_ref_cnt 0
Dec 10 10:48:01 file3 kernel: [ 711.696549] [5115]: __cmnd_abort:1420:Aborting cmd ffff810009587ea0, scst_cmd 0000000000000000 (scst state 0, ref_cnt 1, itt 0, op 0, r2t_len 0, CDB op 0, size to write 0, is_unsolicited_data 0, outstanding_r2t 0)
Dec 10 10:48:01 file3 iscsi-scstd: Connect from 192.168.0.253:-16382
Dec 10 10:48:01 file3 kernel: [ 711.696597] [5115]: __cmnd_abort:1423:net_ref_cnt 0
Dec 10 10:48:01 file3 kernel: [ 711.699409] [5115]: req_cmnd_release:320:Release aborted req cmd ffff810009587ea0 (scst cmd 0000000000000000, state 0)
Dec 10 10:48:01 file3 kernel: [ 711.699447] [5115]: cmnd_done:171:Done aborted cmd ffff810009587ea0 (scst cmd 0000000000000000, state 0)
Dec 10 10:48:01 file3 kernel: [ 711.707489] [5121]: iscsi_session_alloc:39:Creating session: target ffff81000f532800, tid 1, sid 0x20100003d0200
Dec 10 10:48:01 file3 kernel: [ 711.707529] [5121]: scst_suspend_activity:416:suspend_count 0
Dec 10 10:48:01 file3 kernel: [ 711.707533] [5121]: scst_suspend_activity:426:Waiting for 5 active commands to complete
Dec 10 10:48:01 file3 kernel: [ 711.760696] [5115]: iscsi-scst: close_conn:170:conn ffff81000ce93c00, conn_ref_cnt 8 left, wr_state 0

That last line repeats until I reboot the server. At that point I can't unload the iscsi-scst modules, so I have to hard reset. On the ESX side, the vmkernel log looks like this:

Dec 10 10:47:48 esx2 vmkernel: 0:15:42:46.664 cpu1:1060)LinSCSI: 3201: Abort failed for cmd with serial=217, status=bad0001, retval=bad0001
Dec 10 10:47:48 esx2 vmkernel: 0:15:42:46.664 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 987 abort for itt 983 task 0x9e01310 cmnd 0x3d20be00 cdb 0x2a to (4 0 1 0) at 5656667
Dec 10 10:47:58 esx2 vmkernel: 0:15:42:56.680 cpu0:1066)iSCSI: session 0x9e180a0 task mgmt 987 response timeout at 5657669
Dec 10 10:47:58 esx2 vmkernel: 0:15:42:56.680 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 988 abort task set to (4 0 1 0) at 5657669
Dec 10 10:48:03 esx2 vmkernel: 0:15:43:02.337 cpu1:1031)FS3: 5031: Waiting for timed-out heartbeat [HB state abcdef02 offset 3637248 gen 190 stamp 56561662734 uuid 475c3c67-0109f940-b0fa-0030488cf528 jrnl <FB 14196> drv 4.31]
Dec 10 10:48:08 esx2 vmkernel: 0:15:43:06.683 cpu0:1066)iSCSI: session 0x9e180a0 task mgmt 988 response timeout at 5658669
Dec 10 10:48:08 esx2 vmkernel: 0:15:43:06.684 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 989 LUN reset to (4 0 1 0) at 5658669
Dec 10 10:48:21 esx2 vmkernel: 0:15:43:20.059 cpu1:1060)LinSCSI: 3201: Abort failed for cmd with serial=215, status=bad0001, retval=bad0001
Dec 10 10:48:21 esx2 vmkernel: 0:15:43:20.059 cpu0:1060)LinSCSI: 3201: Abort failed for cmd with serial=0, status=bad0001, retval=bad0001
Dec 10 10:48:21 esx2 vmkernel: 0:15:43:20.116 cpu0:1060)LinSCSI: 3201: Abort failed for cmd with serial=216, status=bad0001, retval=bad0001
Dec 10 10:48:28 esx2 vmkernel: 0:15:43:26.665 cpu0:1060)LinSCSI: 3201: Abort failed for cmd with serial=217, status=bad0001, retval=bad0001
Dec 10 10:48:41 esx2 vmkernel: 0:15:43:40.053 cpu0:1066)iSCSI: session 0x9e180a0 task mgmt 989 response timeout at 5662006
Dec 10 10:48:41 esx2 vmkernel: 0:15:43:40.053 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 991 warm target reset to (4 0 1 *) at 5662006
Dec 10 10:48:45 esx2 vmkernel: 0:15:43:43.961 cpu1:1034)FSS: 390: Failed with status Timeout (ok to retry) for f530 28 2 4759c6ca 3c5ca6e8 30001bf0 2bf58c48 4 1 0 0 0 0 0
Dec 10 10:48:53 esx2 vmkernel: 0:15:43:52.339 cpu1:1031)BC: 2668: Failed to flush 1 buffers of size 131072 each for object f530 28 3 4759c6ca 3c5ca6e8 30001bf0 2bf58c48 2c01804 c 0 0 0 0 0: Busy

That all repeats once more, and then the ESX host spends the rest of the time trying to reconnect until I reboot it too.

Compiled for release, the logs are much less helpful, but the target will start and run guests with only one ESX host connected, though with numerous aborts and successful LUN resets. With two ESX hosts connected to the same LUN, the aborts and resets from one seem to cause problems with the other, and they go into reset loops: one host triggers the other, which in turn triggers the first again, until a SCSI host reset is requested, at which point the aacraid driver on the target logs a reset. If I do nothing the loops continue, usually ending in an SCST kernel Oops; but if I use esxcfg-iscsi to kill and restart the iSCSI stack on both ESX hosts, everything recovers until the next abort comes up.
> You point the finger at your network connections? Can you run iperf and
> report here?

I'll try that if I can get iperf installed on ESX. I don't expect network issues, though, since Intel cards are pretty damned reliable and I'm using pre-made Cat6 cables run directly between the hosts. The ESX hosts are servers I've been using for a while and they have shown no issues; the target host is a new server, my 4th from this vendor, so I really hope it's OK, since it passed their burn-in tests (which I know cover CPU, RAM, and HDDs, but I don't know about the network).

> What's inside your scst.conf?

[HANDLER vdisk]
DEVICE vms-main,/dev/mapper/vms-main,BIO,512

[GROUP Default_iqn.2007-01.com.wilsonmfg:storage.disks]

[ASSIGNMENT Default_iqn.2007-01.com.wilsonmfg:storage.disks]
DEVICE vms-main,0

I've also tried without BIO, and BIO seems better. I plan on making a new LUN and trying a larger block size, but I won't be able to do that for a couple of weeks since I'm about to go on vacation.
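For the record, a basic iperf throughput check between the target and an ESX service console would look something like the following. The address is a placeholder for the target's storage-network IP, and it assumes iperf can be installed on the ESX side at all.

```shell
# On the SCST target, start iperf in server mode:
iperf -s

# On the ESX service console, run a 30-second test with two parallel
# streams against the target (address below is a placeholder):
iperf -c 10.0.0.1 -t 30 -P 2
```

On a healthy dedicated gigabit link this should report close to wire speed; anything wildly below that, or asymmetric between directions, would point back at the network rather than SCST.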