From: Chris W. <cw...@gm...> - 2007-12-11 17:43:34
On Dec 11, 2007 11:04 AM, Matteo Tescione <ma...@rm...> wrote:
> Chris,
>
> > On Dec 11, 2007 9:52 AM, Matteo Tescione <ma...@rm...> wrote:
> >>> It's coming. Currently the issues with VMware are the showstopper.
> >>
> >> Really?? I'm running a VMware infrastructure with 8 ESX hosts connected
> >> and a hundred VMs without any kind of problems.
> >>
> >
> > Can you share some details? I'm having critical issues.
>
> Here's my setup:
>
> 2 quad-core Xeons on an Intel psl5000 board with 4 GB of RAM, one 3ware
> 9650SE 8-port controller providing 8x500 GB SATA2 drives configured as
> RAID10.
> Running kernel 2.6.23 with all the patches suggested by SCST, DRBD on top
> of the hardware RAID, with 2 dedicated gigabit links provided by an Intel
> Pro dual-port card bonded together in round-robin mode.
> iscsi-scst is started on the primary DRBD node by heartbeat.
> The switch is a Cisco 2960G, but I tested Netgear switches too and they
> perform well as well.
> I intentionally left out the LVM part because of the nature of the VMFS
> partition (you can span or split even inside VMware datastores) and, to be
> clear, I didn't want to add yet another layer in the middle creating more
> confusion...
> DRBD is a robust solution, and the overhead is well worth it.

I'm using LVM because I intended to put local file systems on it too, for backing up guests with esXpress. I've also run tests with SCST in a VMware Workstation guest without LVM, just pointing vdisk at sdb, and I hit the exact same issues. I expected it to be worse with the target fully virtual, but it's not.

> Initiators are all 3.0.1 ESX hosts with VirtualCenter running. No major
> errors here; we've experienced the occasional crash, but we are aware of
> that.
>
> Can you describe your issues in more detail?

I don't think it's a networking issue, since when I compile SCST for debug it fails almost immediately after powering on a guest, even with only one guest and one ESX host connected.
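For reference, a round-robin bond of two dedicated gigabit links like the one Matteo describes can be brought up along these lines. This is only a sketch: the interface names (eth1/eth2) and the address are placeholders, not taken from his setup.

```shell
# Load the bonding driver in round-robin mode with link monitoring
# (interface names and address below are placeholders).
modprobe bonding mode=balance-rr miimon=100

# Enslave the two dedicated NICs to bond0
ip link set eth1 down
ip link set eth2 down
ifenslave bond0 eth1 eth2

# Address the bond and bring it up
ip addr add 10.0.0.1/24 dev bond0
ip link set bond0 up
```

With balance-rr, packets are striped across both links, which is why it suits a point-to-point DRBD replication path better than a switched client network.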
On the target, syslog looks like this:

Dec 10 10:46:46 file3 kernel: [ 636.942407] [5084]: scst_unblock_dev:2758:Device UNBLOCK(new 0), dev ffff81000e3d3600
Dec 10 10:46:46 file3 kernel: [ 636.944430] [5084]: scst_block_dev_cmd:2748:Needs unblocking cmd ffff81000f4bc510 (tag 3624075264)
Dec 10 10:46:46 file3 kernel: [ 636.944441] [5084]: __scst_block_dev:2725:Device BLOCK(new 1), dev ffff81000e3d3600
Dec 10 10:46:46 file3 kernel: [ 636.944446] [5084]: scst_block_dev:2736:Waiting during blocking outstanding 1 (on_dev_count 2)
Dec 10 10:46:50 file3 kernel: [ 641.053814] [5115]: iscsi-scst: execute_task_management:1589:TM cmd: req ffff810009587750, itt 3db, fn 1, rtt d7030000
Dec 10 10:46:50 file3 kernel: [ 641.054392] [5115]: __cmnd_abort:1420:Aborting cmd ffff810009587888, scst_cmd ffff81000f4bc360 (scst state 3, ref_cnt 2, itt 3d7, op 1, r2t_len 0, CDB op 2a, size to write 512, is_unsolicited_data 0, outstanding_r2t 0)
Dec 10 10:46:50 file3 kernel: [ 641.054466] [5115]: __cmnd_abort:1423:net_ref_cnt 0
Dec 10 10:46:50 file3 kernel: [ 641.054659] [5115]: scst: scst_rx_mgmt_fn:4314:sess=ffff810007138000, fn 0, tag_set 1, tag 3607298048, lun_set 1, lun=0, cmd_sn_set 1, cmd_sn 3da
Dec 10 10:46:50 file3 kernel: [ 641.054722] [5115]: scst_post_rx_mgmt_cmd:4249:Adding mgmt cmd ffff81000b35b410 to active mgmt cmd list
Dec 10 10:46:50 file3 kernel: [ 641.054884] [5086]: scst_mgmt_cmd_thread:4146:Deleting mgmt cmd ffff81000b35b410 from active cmd list
Dec 10 10:46:50 file3 kernel: [ 641.055040] [5086]: scst: scst_mgmt_cmd_init:3609:Cmd ffff81000f4bc360 for tag 3607298048 (sn 907, set 1, queue_type 1) found, aborting it
Dec 10 10:46:50 file3 kernel: [ 641.055108] [5086]: scst: scst_abort_cmd:3258:Aborting cmd ffff81000f4bc360 (tag 3607298048)
Dec 10 10:46:50 file3 kernel: [ 641.055212] [5086]: scst_call_dev_task_mgmt_fn:3223:Calling dev handler vdisk_blk task_mgmt_fn(fn=0)
Dec 10 10:46:50 file3 kernel: [ 641.055283] [5086]: scst_call_dev_task_mgmt_fn:3227:Dev handler vdisk_blk task_mgmt_fn() returned 1
Dec 10 10:46:50 file3 kernel: [ 641.055339] [5086]: scst: scst_abort_cmd:3295:cmd ffff81000f4bc360 (tag 3607298048) being executed/xmitted (state 6), deferring ABORT...
Dec 10 10:46:50 file3 kernel: [ 641.055642] [5086]: scst_set_mcmd_next_state:3325:cmd_wait_count(1) not 0, preparing to wait

This repeats a few times, until:

Dec 10 10:47:59 file3 kernel: [ 710.086628] [5117]: iscsi-scst: cmnd_rx_start:2053:***ERROR*** Error -4 (iSCSI opcode 2, ITT ffffffff, op ffffffff)
Dec 10 10:48:00 file3 kernel: [ 710.086680] [5117]: iscsi_cmnd_reject:607:Reject: req ffff810009587c30, reason 4
Dec 10 10:48:00 file3 kernel: [ 710.086820] [5117]: cmnd_prepare_skip_pdu:798:Skipping (ffff810009587c30, ffffffff 2 0 0, 0000000000000000, scst state 0)
Dec 10 10:48:01 file3 kernel: [ 711.695655] [0]: iscsi-scst: iscsi_state_change:180:***ERROR*** Connection with initiator iqn.1998-01.com.vmware:esx2-4cd7882d (ffff81000ce93c00) unexpectedly closed!
Dec 10 10:48:01 file3 kernel: [ 711.696278] [5115]: iscsi-scst: close_conn:127:Closing connection ffff81000ce93c00 (conn_ref_cnt=9)
Dec 10 10:48:01 file3 kernel: [ 711.696360] [5115]: conn_abort:1568:Aborting conn ffff81000ce93c00
Dec 10 10:48:01 file3 kernel: [ 711.696431] [5115]: __cmnd_abort:1420:Aborting cmd ffff810009587af8, scst_cmd 0000000000000000 (scst state 0, ref_cnt 1, itt 3df, op 2, r2t_len 0, CDB op 0, size to write 0, is_unsolicited_data 0, outstanding_r2t 0)
Dec 10 10:48:01 file3 kernel: [ 711.696494] [5115]: __cmnd_abort:1423:net_ref_cnt 0
Dec 10 10:48:01 file3 kernel: [ 711.696549] [5115]: __cmnd_abort:1420:Aborting cmd ffff810009587ea0, scst_cmd 0000000000000000 (scst state 0, ref_cnt 1, itt 0, op 0, r2t_len 0, CDB op 0, size to write 0, is_unsolicited_data 0, outstanding_r2t 0)
Dec 10 10:48:01 file3 iscsi-scstd: Connect from 192.168.0.253:-16382
Dec 10 10:48:01 file3 kernel: [ 711.696597] [5115]: __cmnd_abort:1423:net_ref_cnt 0
Dec 10 10:48:01 file3 kernel: [ 711.699409] [5115]: req_cmnd_release:320:Release aborted req cmd ffff810009587ea0 (scst cmd 0000000000000000, state 0)
Dec 10 10:48:01 file3 kernel: [ 711.699447] [5115]: cmnd_done:171:Done aborted cmd ffff810009587ea0 (scst cmd 0000000000000000, state 0)
Dec 10 10:48:01 file3 kernel: [ 711.707489] [5121]: iscsi_session_alloc:39:Creating session: target ffff81000f532800, tid 1, sid 0x20100003d0200
Dec 10 10:48:01 file3 kernel: [ 711.707529] [5121]: scst_suspend_activity:416:suspend_count 0
Dec 10 10:48:01 file3 kernel: [ 711.707533] [5121]: scst_suspend_activity:426:Waiting for 5 active commands to complete
Dec 10 10:48:01 file3 kernel: [ 711.760696] [5115]: iscsi-scst: close_conn:170:conn ffff81000ce93c00, conn_ref_cnt 8 left, wr_state 0

That last line repeats until I reboot the server. At that point I can't unload the iscsi-scst modules, so I have to hard reset. On the ESX side, the vmkernel log looks like this:

Dec 10 10:47:48 esx2 vmkernel: 0:15:42:46.664 cpu1:1060)LinSCSI: 3201: Abort failed for cmd with serial=217, status=bad0001, retval=bad0001
Dec 10 10:47:48 esx2 vmkernel: 0:15:42:46.664 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 987 abort for itt 983 task 0x9e01310 cmnd 0x3d20be00 cdb 0x2a to (4 0 1 0) at 5656667
Dec 10 10:47:58 esx2 vmkernel: 0:15:42:56.680 cpu0:1066)iSCSI: session 0x9e180a0 task mgmt 987 response timeout at 5657669
Dec 10 10:47:58 esx2 vmkernel: 0:15:42:56.680 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 988 abort task set to (4 0 1 0) at 5657669
Dec 10 10:48:03 esx2 vmkernel: 0:15:43:02.337 cpu1:1031)FS3: 5031: Waiting for timed-out heartbeat [HB state abcdef02 offset 3637248 gen 190 stamp 56561662734 uuid 475c3c67-0109f940-b0fa-0030488cf528 jrnl <FB 14196> drv 4.31]
Dec 10 10:48:08 esx2 vmkernel: 0:15:43:06.683 cpu0:1066)iSCSI: session 0x9e180a0 task mgmt 988 response timeout at 5658669
Dec 10 10:48:08 esx2 vmkernel: 0:15:43:06.684 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 989 LUN reset to (4 0 1 0) at 5658669
Dec 10 10:48:21 esx2 vmkernel: 0:15:43:20.059 cpu1:1060)LinSCSI: 3201: Abort failed for cmd with serial=215, status=bad0001, retval=bad0001
Dec 10 10:48:21 esx2 vmkernel: 0:15:43:20.059 cpu0:1060)LinSCSI: 3201: Abort failed for cmd with serial=0, status=bad0001, retval=bad0001
Dec 10 10:48:21 esx2 vmkernel: 0:15:43:20.116 cpu0:1060)LinSCSI: 3201: Abort failed for cmd with serial=216, status=bad0001, retval=bad0001
Dec 10 10:48:28 esx2 vmkernel: 0:15:43:26.665 cpu0:1060)LinSCSI: 3201: Abort failed for cmd with serial=217, status=bad0001, retval=bad0001
Dec 10 10:48:41 esx2 vmkernel: 0:15:43:40.053 cpu0:1066)iSCSI: session 0x9e180a0 task mgmt 989 response timeout at 5662006
Dec 10 10:48:41 esx2 vmkernel: 0:15:43:40.053 cpu1:1082)iSCSI: session 0x9e180a0 sending mgmt 991 warm target reset to (4 0 1 *) at 5662006
Dec 10 10:48:45 esx2 vmkernel: 0:15:43:43.961 cpu1:1034)FSS: 390: Failed with status Timeout (ok to retry) for f530 28 2 4759c6ca 3c5ca6e8 30001bf0 2bf58c48 4 1 0 0 0 0 0
Dec 10 10:48:53 esx2 vmkernel: 0:15:43:52.339 cpu1:1031)BC: 2668: Failed to flush 1 buffers of size 131072 each for object f530 28 3 4759c6ca 3c5ca6e8 30001bf0 2bf58c48 2c01804 c 0 0 0 0 0: Busy

That all repeats once more, and then the ESX host spends the rest of the time trying to reconnect until I reboot it too.

Compiled for release, the logs are much less helpful, but the target will start and run guests with only one ESX host connected, though with numerous aborts and successful LUN resets. With two ESX hosts connected to the same LUN, the aborts and resets from one seem to cause problems with the other, and they go into reset loops: one host triggers the other, which in turn triggers the first again, until a SCSI host reset is requested, at which point the aacraid driver on the target logs a reset. If I do nothing the loops continue, usually ending in an SCST kernel Oops; but if I use esxcfg-iscsi to kill and restart the iSCSI stack on both ESX hosts, everything recovers until the next abort comes up.
> You point the finger at your network connections? Can you run iperf and
> report here?

I'll try that if I can get iperf installed on ESX. I don't expect network issues, though, since Intel cards are pretty damned reliable and I'm using pre-made Cat6 cables run directly between the hosts. The ESX hosts are servers I've been using for a while and they have shown no issues; the target host is a new server, my 4th from this vendor, so I really hope it's OK, since it passed their burn-in tests (which I know cover CPU, RAM, and HDDs, but I don't know about the network).

> What's inside your scst.conf?

[HANDLER vdisk]
DEVICE vms-main,/dev/mapper/vms-main,BIO,512

[GROUP Default_iqn.2007-01.com.wilsonmfg:storage.disks]

[ASSIGNMENT Default_iqn.2007-01.com.wilsonmfg:storage.disks]
DEVICE vms-main,0

I've also tried without BIO, and BIO seems better. I plan on making a new LUN and trying a larger block size, but I won't be able to do that for a couple of weeks since I'm about to go on vacation.
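For the record, a basic iperf throughput check between the target and an ESX service console would look something like the following. The address is a placeholder for the target's storage-network IP, and it assumes iperf can be installed on the ESX side at all.

```shell
# On the SCST target, start iperf in server mode:
iperf -s

# On the ESX service console, run a 30-second test with two parallel
# streams against the target (address below is a placeholder):
iperf -c 10.0.0.1 -t 30 -P 2
```

On a healthy dedicated gigabit link this should report close to wire speed; anything wildly below that, or asymmetric between directions, would point back at the network rather than SCST.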