From: Michal K. <mk...@da...> - 2011-06-20 18:57:29
On Wed, 15 Jun 2011 18:20:04 -0400 "Ross S. W. Walker" <RW...@me...> wrote:

> The Abort Tasks are merely the symptom of the problem. They are saying, hey,
> the initiator isn't getting its PDUs, is everything functioning OK?
>
> When I see these, the first thing that comes to mind is a network problem,
> because it's rarely storage, but I've seen Promise ATA cards go to la-la land
> under load, so I won't say never.
>
> The biggest thing that can cause this is dropped packets, and the biggest
> cause of dropped packets is trying to squeeze 50 lb of crap into a 5 lb bag.
>
> Check your flow control, make sure your switches can keep up with the
> traffic, and make sure your LAGs and switch ports are configured correctly
> (jumbo frames?).
>
> If you're doing iSCSI in a VM, make sure it can handle the network traffic,
> both interface- and CPU-wise.
>
> -Ross

Thanks for this information. Yes, we definitely have some underlying problem in our setup that we are trying to pinpoint, and yes, we are also running iSCSI inside a VM. Currently all our storage traffic goes through a vSwitch to the iSCSI VM (the CPU, memory, and configuration side looks good, e.g. page sharing is disabled and memory is reserved). There is no physical network in use in our test environment; however, we do not use jumbo frames and we run with the default flow-control settings. Can this cause problems, and if so, what is their impact?

Currently we are looking at dropped-packet statistics on the ESX virtual network infrastructure, but initial checks did not show any problem on this front. Is it possible that slow backing storage can cause delays large enough that ESX has to issue command aborts to our test iSCSI VM? Is there any way to prevent overwhelming the backing storage?

Coming back to the aborts: it would be desirable if, in the case of command aborts, the system could react properly; disconnecting the storage is problematic. We did some more testing, swapping IET for SCST (some logs below).
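As a complement to the ESX-side counters, the drop statistics can also be checked from inside the Linux iSCSI VM itself. Below is a minimal sketch (not from this thread; it assumes the standard two-header-line `/proc/net/dev` layout) that reports per-interface RX/TX drop counters:

```python
# Report per-interface RX/TX drop counters from /proc/net/dev.
# Assumes the standard layout: two header lines, then one line per
# interface with 8 RX columns followed by 8 TX columns.

def parse_net_dev(text):
    """Return {iface: (rx_drop, tx_drop)} from /proc/net/dev contents."""
    stats = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        if ':' not in line:
            continue
        iface, fields = line.split(':', 1)
        cols = fields.split()
        # RX columns: bytes packets errs drop fifo frame compressed multicast
        # TX columns: bytes packets errs drop fifo colls carrier compressed
        stats[iface.strip()] = (int(cols[3]), int(cols[11]))
    return stats

if __name__ == '__main__':
    with open('/proc/net/dev') as f:
        for iface, (rx_drop, tx_drop) in sorted(parse_net_dev(f.read()).items()):
            print(f'{iface}: rx_drop={rx_drop} tx_drop={tx_drop}')
```

Sampling this before and after a load run makes it easy to see whether the guest interface is shedding packets while the aborts occur.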
In that case we still see aborts, but for some reason SCST manages to prevent the storage disconnection on the ESX server. Do you have any idea what the culprit may be in the IET case? How can we debug it or help in fixing this issue?

1) SCST node warnings

Jun 8 16:09:58 * kernel: [ 5512.412235] [1638]: scst: scst_rx_mgmt_fn:6045:TM fn 0
Jun 8 16:09:58 * kernel: [ 5512.412244] [1638]: scst_rx_mgmt_fn:6057:sess=f3ac0000, tag_set 1, tag 211466, lun_set 1, lun=0, cmd_sn_set 1, cmd_sn 131026, priv ceb05e00
Jun 8 16:09:58 * kernel: [ 5512.412255] [1638]: scst_post_rx_mgmt_cmd:5976:Adding mgmt cmd ee7d33f0 to active mgmt cmd list
Jun 8 16:09:58 * kernel: [ 5512.413303] [1614]: scst_tm_thread:5862:Deleting mgmt cmd ee7d33f0 from active cmd list
Jun 8 16:09:58 * kernel: [ 5512.413315] [1614]: scst_abort_task:5545:Abortind task (cmd df8df3c0, sn 108737, set 1, tag 211466, queue_type 1)
Jun 8 16:09:58 * kernel: [ 5512.413323] [1614]: scst: scst_abort_cmd:4707:Aborting cmd df8df3c0 (tag 211466, op 2a)
Jun 8 16:09:58 * kernel: [ 5512.413331] [1614]: scst_call_dev_task_mgmt_fn:4670:Calling dev handler vdisk_blockio task_mgmt_fn(fn=0)
Jun 8 16:09:58 * kernel: [ 5512.413338] [1614]: scst_call_dev_task_mgmt_fn:4674:Dev handler vdisk_blockio task_mgmt_fn() returned 1
Jun 8 16:09:58 * kernel: [ 5512.413346] [1614]: scst_abort_cmd:4790:cmd df8df3c0 (tag 211466) is being executed
Jun 8 16:09:58 * kernel: [ 5512.413351] [1614]: scst_abort_cmd:4813:cmd df8df3c0 (tag 211466, sn 108737) being executed/xmitted (state 119, op 2a, proc time 5 sec., timeout 60 sec.), deferring ABORT (cmd_done_wait_count 1, cmd_finish_wait_count 1)
Jun 8 16:09:58 * kernel: [ 5512.413362] [1614]: scst_set_mcmd_next_state:4850:cmd_done_wait_count(1) not 0, preparing to wait
Jun 8 16:10:05 * kernel: [ 5519.431637] [4822]: scst_xmit_process_aborted_cmd:6870:Aborted cmd df8df3c0 done (cmd_ref 1, scst_cmd_count 18)
Jun 8 16:10:05 * kernel: [ 5519.431656] [4822]: scst_done_cmd_mgmt:4473:cmd df8df3c0 done (tag 211466)
Jun 8 16:10:05 * kernel: [ 5519.431662] [4822]: scst_done_cmd_mgmt:4486:mcmd ee7d33f0, mcmd->cmd_done_wait_count 1
Jun 8 16:10:05 * kernel: [ 5519.431667] [4822]: scst_done_cmd_mgmt:4501:Adding mgmt cmd ee7d33f0 to active mgmt cmd list
Jun 8 16:10:05 * kernel: [ 5519.431686] [4822]: scst_finish_cmd:3694:Aborted cmd df8df3c0 finished (cmd_ref 1, scst_cmd_count 18)
Jun 8 16:10:05 * kernel: [ 5519.431696] [4822]: scst_finish_cmd_mgmt:4630:cmd df8df3c0 finished (tag 211466)
Jun 8 16:10:05 * kernel: [ 5519.431701] [4822]: scst_finish_cmd_mgmt:4639:mcmd ee7d33f0, mcmd->cmd_finish_wait_count 1
Jun 8 16:10:05 * kernel: [ 5519.431709] [4822]: scst_free_cmd:4225:Freeing aborted cmd df8df3c0 (scst_cmd_count 18)
Jun 8 16:10:05 * kernel: [ 5519.432717] [1614]: scst_tm_thread:5862:Deleting mgmt cmd ee7d33f0 from active cmd list
Jun 8 16:10:05 * kernel: [ 5519.432732] [1614]: scst_call_task_mgmt_affected_cmds_done:5662:Target's iscsi task_mgmt_affected_cmds_done() returned
Jun 8 16:10:05 * kernel: [ 5519.432742] [1614]: scst: scst_mgmt_cmd_send_done:5709:TM fn 0 finished, status 0
Jun 8 16:10:05 * kernel: [ 5519.432752] iscsi-scst: iSCSI TM fn 1 finished, status 0

2) ESX side warnings

Jun 8 15:54:21 * vmkernel: 0:01:23:22.947 cpu2:4191)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410004027540) to NMP device "eui.6469736b30312062" failed on physical path "vmhba33:C0:T0:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
Jun 8 15:54:21 * vmkernel: 0:01:23:22.947 cpu2:4191)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "eui.6469736b30312062" state in doubt; requested fast path state update...
Jun 8 15:54:21 * vmkernel: 0:01:23:22.947 cpu2:4191)ScsiDeviceIO: 770: Command 0x2a to device "eui.6469736b30312062" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
Jun 8 16:09:07 * vmkernel: 0:01:38:08.980 cpu0:4191)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000401de40) to NMP device "eui.6469736b30312062" failed on physical path "vmhba33:C0:T0:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jun 8 16:09:07 * vmkernel: 0:01:38:08.980 cpu0:4191)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "eui.6469736b30312062" state in doubt; requested fast path state update...
Jun 8 16:09:07 * vmkernel: 0:01:38:08.980 cpu0:4191)ScsiDeviceIO: 770: Command 0x2a to device "eui.6469736b30312062" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jun 8 16:09:07 * vmkernel: 0:01:38:09.108 cpu2:4112)FS3: 328: Reclaimed heartbeat for volume 4defae7b-b4c6210c-e164-0019b9f36fc1 (snap-028d4637-<STORAGE-NAME>): [Timeout] [HB state abcdef02 offset 3723264 gen 357 stamp 5889107238 uuid 4defdc55-ce403fba-29e2-0019b9f36fc1 jrn

One more time, thanks for your help.

--
Michal Kurgan
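For reference, the H:/D:/sense status triplets in the vmkernel lines above can be decoded mechanically. Below is a minimal sketch covering only the codes seen in this thread; the mappings follow the standard SCSI and Linux `DID_*` conventions (host status 0x5 = DID_ABORT, sense key 0x5 with ASC/ASCQ 0x24/0x00 = ILLEGAL REQUEST / INVALID FIELD IN CDB, opcode 0x2a = WRITE(10)). Note ESX prints "Possible sense data" because with an aborted command the sense bytes may not be genuine:

```python
# Decode the H:/D:/sense triplets from the vmkernel warnings above.
# Tables cover only the codes that appear in this thread; extend as needed.

HOST_STATUS = {0x0: 'DID_OK', 0x5: 'DID_ABORT (command aborted)'}
DEVICE_STATUS = {0x0: 'GOOD', 0x2: 'CHECK CONDITION'}
SENSE_KEYS = {0x0: 'NO SENSE', 0x5: 'ILLEGAL REQUEST'}
ASC_ASCQ = {(0x24, 0x00): 'INVALID FIELD IN CDB'}
OPCODES = {0x28: 'READ(10)', 0x2a: 'WRITE(10)'}

def decode(op, h, d, key, asc, ascq):
    """Render one ESX NMP failure line as human-readable text."""
    return (f"{OPCODES.get(op, hex(op))}: "
            f"host={HOST_STATUS.get(h, hex(h))}, "
            f"device={DEVICE_STATUS.get(d, hex(d))}, "
            f"sense={SENSE_KEYS.get(key, hex(key))}"
            f"/{ASC_ASCQ.get((asc, ascq), f'{asc:#x},{ascq:#x}')}")

# The 15:54:21 warning: Command 0x2a ... H:0x5 D:0x0, sense 0x5 0x24 0x0
print(decode(0x2a, 0x5, 0x0, 0x5, 0x24, 0x0))
```

The takeaway is that the device itself reports GOOD status; the failure is at the host/transport level (DID_ABORT), which is consistent with the initiator giving up on the command rather than the target rejecting it.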