From: Lev V. <le...@za...> - 2016-07-24 09:46:10
Hi Vlad,

Sorry for the delay, I was on vacation last week.
Yes, window size 2048 should be fine as well.

Thanks,
-Lev.

-----Original Message-----
From: Vladislav Bolkhovitin
Sent: Wednesday, July 13, 2016 01:49
To: Lev Vainblat; scs...@li...
Cc: Shyam Kaushik; Yair Hershko
Subject: Re: [Scst-devel] Infinite loop of aborts from ESX

Hi,

Lev Vainblat wrote on 07/05/2016 08:41 AM:
> Hello,
>
> We're using SCST as a backend for the ESX datastore. The network is rather
> unstable, so there are a lot of abort commands. Usually SCST returns status 0
> on these aborts and then ESX retries the commands. The problem begins if the
> failed command is REPORT-LUNS. Then ESX issues 256 TUR commands to LUNs
> 0..255. This bumps cmd_sn, so if ESX now wants to abort some command that was
> sent before the rescan, and its RTT is not found, the abort fails in
> cmnd_abort_pre_checks(). There the difference between req_hdr->cmd_sn and
> req_hdr->ref_cmd_sn becomes > 128, so cmnd_abort_pre_checks() returns
> ISCSI_RESPONSE_UNKNOWN_TASK. In this case ESX retries the abort, which fails
> again, ESX retries, it fails again..., so ESX enters an infinite abort retry
> loop.
>
> I managed to reproduce the issue in the lab using black_hole:
>
> echo 1 > /sys/kernel/scst_tgt/targets/iscsi/iqn.2011-04.com.zadarastorage:vsa-0000007d:1/ini_groups/iqn.1998-01.com.vmware:localhost:1405654056:327680/black_hole
>
> ===== Successful aborts =====
>
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.223881] iscsi-scst: [25215] execute_task_management[2633]: iSCSI TM fn 1
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.223891] scst: [25215] scst_rx_mgmt_fn[6888]: TM fn 0 (mcmd ffff8800b595fc40, initiator iqn.1998-01.com.vmware:localhost:1405654056:327680, target iqn.2011-04.com.zadarastorage:vsa-0000007d:1)
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.223904] scst: [920] scst_mgmt_cmd_send_done[6554]: TM fn 0 (mcmd ffff8800b595fc40) finished, status -1
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.223906] iscsi-scst: [920] iscsi_send_task_mgmt_resp[3663]: iSCSI TM fn 1 finished, status 0, dropped 0
>
> ===== Target rescan =====
>
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.224370] scst: [25215] scst_translate_lun[4502]: tgt_dev for LUN 4 not found, command to unexisting LU (initiator iqn.1998-01.com.vmware:localhost:1405654056:327680, target iqn.2011-04.com.zadarastorage:vsa-0000007d:1)?
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.224691] scst: [25215] scst_translate_lun[4502]: tgt_dev for LUN 5 not found, command to unexisting LU (initiator iqn.1998-01.com.vmware:localhost:1405654056:327680, target iqn.2011-04.com.zadarastorage:vsa-0000007d:1)?
> ...
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.249344] scst: [25215] scst_translate_lun[4502]: tgt_dev for LUN 254 not found, command to unexisting LU (initiator iqn.1998-01.com.vmware:localhost:1405654056:327680, target iqn.2011-04.com.zadarastorage:vsa-0000007d:1)?
> Jul 5 12:48:26 vsa-0000007d-vc-0 kernel: [ 2099.249428] scst: [25215] scst_translate_lun[4502]: tgt_dev for LUN 255 not found, command to unexisting LU (initiator iqn.1998-01.com.vmware:localhost:1405654056:327680, target iqn.2011-04.com.zadarastorage:vsa-0000007d:1)?
>
> ===== Failed aborts =====
>
> Jul 5 12:48:57 vsa-0000007d-vc-0 kernel: [ 2130.376020] iscsi-scst: [25215] execute_task_management[2633]: iSCSI TM fn 1
> Jul 5 12:48:57 vsa-0000007d-vc-0 kernel: [ 2130.376026] iscsi-scst: [25215] iscsi_send_task_mgmt_resp[3663]: iSCSI TM fn 1 finished, status 1, dropped 0
>
> Jul 5 12:53:11 vsa-0000007d-vc-0 kernel: [ 2384.546141] scst: [23063] scst_write_trace[306]: Changed trace level for "iscsi": old 0x00006508, new 0x00006d08
>
> Jul 5 12:53:11 vsa-0000007d-vc-0 kernel: [ 2384.639184] iscsi-scst: [12920] execute_task_management[2633]: iSCSI TM fn 1
> Jul 5 12:53:11 vsa-0000007d-vc-0 kernel: [ 2384.639192] iscsi-scst: [12920] execute_task_management[2636]: TM req ffff880075491c80, ITT 30, RTT 24, sn 92153, con ffff880072bec000
> Jul 5 12:53:11 vsa-0000007d-vc-0 kernel: [ 2384.639196] iscsi-scst: [12920] cmnd_abort_pre_checks[2448]: cmd RTT 24 not found
> Jul 5 12:53:11 vsa-0000007d-vc-0 kernel: [ 2384.639199] iscsi-scst: [12920] iscsi_send_task_mgmt_resp[3661]: TM req ffff880075491c80 finished
> Jul 5 12:53:11 vsa-0000007d-vc-0 kernel: [ 2384.639202] iscsi-scst: [12920] iscsi_send_task_mgmt_resp[3663]: iSCSI TM fn 1 finished, status 1, dropped 0
>
> Jul 5 12:53:13 vsa-0000007d-vc-0 kernel: [ 2386.641391] iscsi-scst: [12920] execute_task_management[2633]: iSCSI TM fn 1
> Jul 5 12:53:13 vsa-0000007d-vc-0 kernel: [ 2386.641400] iscsi-scst: [12920] execute_task_management[2636]: TM req ffff880077ec8000, ITT 18, RTT 24, sn 92153, con ffff880072bec000
> Jul 5 12:53:13 vsa-0000007d-vc-0 kernel: [ 2386.641403] iscsi-scst: [12920] cmnd_abort_pre_checks[2448]: cmd RTT 24 not found
> Jul 5 12:53:13 vsa-0000007d-vc-0 kernel: [ 2386.641405] iscsi-scst: [12920] iscsi_send_task_mgmt_resp[3661]: TM req ffff880077ec8000 finished
> Jul 5 12:53:13 vsa-0000007d-vc-0 kernel: [ 2386.641407] iscsi-scst: [12920] iscsi_send_task_mgmt_resp[3663]: iSCSI TM fn 1 finished, status 1, dropped 0
> ...
>
> ESX continues to send aborts with RTT 24 forever. Even after I disable
> black_hole, ESX remains unresponsive.
>
> The workaround I've found is to always return
> ISCSI_RESPONSE_FUNCTION_COMPLETE in cmnd_abort_pre_checks() if the RTT was not
> found. Is this workaround safe enough? Is there some other way to fix the
> issue?

This is a very interesting case. Basically, with this change you make the past
CmdSN window unbounded, which gut feeling suggests is not too safe. On the
other hand, the 128-command window was chosen rather arbitrarily, based on the
typical queue depth of 32 at that time (almost 10 years ago). I'd say nowadays
we can safely increase it to something like 2048.

Will 2048 instead of 128 work for your case? If yes, I'll commit this change.

Thank you!
Vlad

P.S. VMware could be smarter in such recovery.

> We're using scst 3.0.2 and ESX 5.5U2.
>
> Thanks,
> -Lev.
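
For readers following the thread, the window check under discussion can be sketched in a few lines of C. This is only an illustration, not the SCST source: the function names, the enum, and the inclusive treatment of the window boundary are assumptions made for the example. The numeric status values match what the logs above print (status 0 for a completed abort, status 1 for the failed one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status codes as printed in the logs above: 0 on success, 1 on failure. */
enum tm_status {
    TM_FUNCTION_COMPLETE = 0,
    TM_UNKNOWN_TASK = 1,
};

/*
 * Hypothetical helper (not an SCST function): true if ref_cmd_sn lies at
 * most `window` commands behind cmd_sn. Unsigned subtraction wraps modulo
 * 2^32, so the check also holds when the sequence numbers wrap around.
 * Treating the boundary as inclusive is an assumption of this sketch.
 */
static bool ref_sn_in_past_window(uint32_t cmd_sn, uint32_t ref_cmd_sn,
                                  uint32_t window)
{
    /* The window must stay below half the sequence space to be meaningful. */
    assert(window < 0x80000000u);
    return (uint32_t)(cmd_sn - ref_cmd_sn) <= window;
}

/*
 * Hypothetical decision for an abort whose RTT was not found: report
 * success if the referenced CmdSN is recent enough, otherwise report
 * that the task is unknown.
 */
static enum tm_status abort_status_when_rtt_not_found(uint32_t cmd_sn,
                                                      uint32_t ref_cmd_sn,
                                                      uint32_t window)
{
    return ref_sn_in_past_window(cmd_sn, ref_cmd_sn, window)
               ? TM_FUNCTION_COMPLETE
               : TM_UNKNOWN_TASK;
}
```

Taking the CmdSN 92153 from the logs and a made-up RefCmdSN that is 257 commands behind it (one stalled command followed by 256 TURs), `abort_status_when_rtt_not_found(92153, 92153 - 257, 128)` yields `TM_UNKNOWN_TASK`, while the same call with a window of 2048 yields `TM_FUNCTION_COMPLETE`. The actual RefCmdSN does not appear in the logs, so the 257 figure is illustrative only.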