From: Karol S. <sar...@9l...> - 2024-06-05 08:22:26
|
Hi! We observed that in rare occurrences enabling the target port using scst's sysfs hangs for 5 minutes and eventually link enters Unknown state, from which it cannot be recovered. We attach qla2xxx logs. Could you please investigate the issue? *** Problem description *** We are exporting SCSI devices using SCST and Qlogic FC target driver. Sometimes we need to disable & enable target for maintenance. We observed that in rare occurrences Fibre Channel port won't go up. Enabling it via scst's sysfs (basically calling sqa_enable_tgt) hangs for 5 minutes and eventually leaves the link in Unknown state. Subsequent disable/enable operations also take 5 minutes, and the problem cannot be solved without a reboot (qla2xxx_scstkernel module cannot be unloaded, rmmod hangs forever). When looking at syslog on the system when the enable operation takes long for the first time, we see a tens of "Chip Reset in progress" messages, e.g.: May 21 21:43:22 HS4x000xMN02 kernel: sqatgt(5/0): Enabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 21:43:22 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-00af:5: Performing ISP error recovery - ha=000000001468f64f. May 21 21:43:22 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-0075:5: ZIO mode 6 enabled; timer delay (200 us). May 21 21:43:24 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d302:5: qla2x00_get_fw_version: FC-NVMe is Enabled (0x785a) May 21 21:43:24 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-11a3:5: SCM in FW: Not Supported May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-500a:5: LOOP UP detected (8 Gbps). May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d035:5: Chip Reset in progress. Purging Mbox cmd=0x22. May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d035:5: Chip Reset in progress. Purging Mbox cmd=0x27. May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-8033:5: Unable to reinitialize FCE (261). May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d035:5: Chip Reset in progress. Purging Mbox cmd=0x27. May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-8034:5: Unable to reinitialize EFT (261). May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d035:5: Chip Reset in progress. Purging Mbox cmd=0x20. May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d035:5: Chip Reset in progress. Purging Mbox cmd=0x69. May 21 21:43:26 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-d035:5: Chip Reset in progress. Purging Mbox cmd=0x69. (...) Finally after 20 seconds, there is"FAILED" log: May 21 21:43:46 HS4x000xMN02 kernel: qla2xxx [0000:40:00.0]-803b:5: Firmware ready **** FAILED ****. Subsequent disable/enable target executions take 5 minutes each: May 21 21:43:46 HS4x000xMN02 kernel: sqatgt(5/0): Disabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 21:48:46 HS4x000xMN02 kernel: sqatgt(5/0): Enabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 21:53:46 HS4x000xMN02 kernel: sqatgt(5/0): Disabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 21:58:46 HS4x000xMN02 kernel: sqatgt(5/0): Enabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 22:03:46 HS4x000xMN02 kernel: sqatgt(5/0): Disabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 22:08:46 HS4x000xMN02 kernel: sqatgt(5/0): Enabling target pwwn=21:00:00:24:ff:7f:8f:a8 May 21 22:13:46 HS4x000xMN02 kernel: sqatgt(5/0): Disabling target pwwn=21:00:00:24:ff:7f:8f:a8 (...) The system we are using: - HBA: QLE2692 - OS: Red Hat 8.4 - driver version: 10.02.09.100-k (from SCST 3.8.0) - firmware version: 8.08.204 *** Investigation trial *** We did some investigation. It seems to us that there may be some problem with driver or firmware. During the first failed disable/enable attempt of each occurrence, we saw that qla_hw_data->flags->mbox_busy is turned on and never turned off. We were unable to identify the code responsible for turning on this flagon those occurrences, so it looks like it is modified by FW. We also reproduced this issue with the newer firmware 9.14.00. From our observations, usually even if one port is affected by this issue, the second works normally. In our environment it takes a few hours of disabling/enabling port in a loop (a few hundred disable/enable cycles)on production server under load. We also reproduced this issue on the same hardware, but with RHEL8.4 OS without any extra software. Note that on this setup the problem reproduced less frequently (once every few days). To simplify the test environment we removed the server from the zones on switch and we stopped exporting any devices. The frequency of the issue re occurrence hasn't changed. Regards, Karol Sarnecki |