From: Rob T. <ro...@rt...> - 2022-05-04 15:22:22
|
This really does look like a SCST bug. I have not found the actual cause, but I did find why the crash happens. I added some extra code checks to qla24xx_delete_sess_fn() in qla_target.c, as the kernel oops pointed there. This is what I added: if (fcport->se_sess) { if (!ha->tgt.tgt_ops) { ql_log(ql_log_warn, fcport->vha, 0x0000, "%s tgt_ops NULL pointer!", __func__); return; } if (!ha->tgt.tgt_ops->shutdown_sess) { ql_log(ql_log_warn, fcport->vha, 0x0000, "%s tgt_ops->shutdown_sess NULL pointer!", __func__); return; } if (!ha->tgt.tgt_ops->put_sess) { ql_log(ql_log_warn, fcport->vha, 0x0000, "%s tgt_ops->put_sess NULL pointer!", __func__); return; } ha->tgt.tgt_ops->shutdown_sess(fcport); ha->tgt.tgt_ops->put_sess(fcport); } else { qlt_unreg_sess(fcport); } } And this is what now shows up in the dmesg log: [ 378.057952] [902]: sqa_qla2xxx_remove_target:1435:Unregistering target for host 35(00000000923b47d6) [ 378.057974] [902]: sqa_target_release:1475:sqatgt(35/1): Target release finished sqa_tgt 00000000d2aa2236 [ 378.057977] [902]: scst: Waiting for 1 active commands to complete... [ 378.075561] [902]: scst: All active commands completed [ 378.075628] [902]: scst: Target 51:01:04:f0:18:0c:02:00 for template qla2x00t unregistered successfully [ 378.075813] qla2xxx [0000:0b:00.1]-7088:35: VP[1] deleted. [ 378.964363] qla2xxx [0000:0b:00.1]-5013:34: RSCN database changed -- 0001 1701 0000. [ 380.088715] [902]: sqa_enable_tgt:1950:sqatgt(34/0): Disabling target pwwn=21:00:00:24:ff:2f:90:5f [ 380.088729] *qla2xxx [0000:0b:00.1]-0000:34: qla24xx_delete_sess_fn tgt_ops NULL pointer!* [ 390.219160] qla2xxx [0000:0b:00.1]-00af:34: Performing ISP error recovery - ha=000000002c6459b1. Because the function now returns before jumping into oblivion, the system no longer crashes. Any suggestions how to go about finding the root cause? Thanks, Rob On 5/1/2022 10:57 AM, Rob Turk wrote: > Reporting back after some more tests. > > It appears that adding and then deleting an NPIV port causes > corruption. This can still be a hardware issue on my end but I'd like > to rule out an SCST issue. > Just interacting with sysfs, something gets corrupted at kernel level > to the point where Linux hangs. > This is reproducable on my system, without any of my code. > Fresh power-on boot, no tools or code running. I'm sending this sequence: > > root@fctgt:~# echo > "1">/sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:2f:90:5f/enabled > root@fctgt:~# echo "add_target 51:01:04:f0:18:0c:02:00 > node_name=50:01:04:f0:18:0c:02:00;parent_host=21:00:00:24:ff:2f:90:5f">/sys/kernel/scst_tgt/targets/qla2x00t/mgmt > root@fctgt:~# echo > "1">/sys/kernel/scst_tgt/targets/qla2x00t/51:01:04:f0:18:0c:02:00/enabled > root@fctgt:~# echo > "0">/sys/kernel/scst_tgt/targets/qla2x00t/51:01:04:f0:18:0c:02:00/enabled > root@fctgt:~# echo "del_target > 51:01:04:f0:18:0c:02:00">/sys/kernel/scst_tgt/targets/qla2x00t/mgmt > root@fctgt:~# reboot > root@fctgt:~# > > On the VM console, Linux shows the system shutting down until SCST > being shutdown, counts up to 30 seconds, then hard hangs. Even the > cursor doesn't blink anymore. > > Dmesg output: > [Sun May 1 10:20:59 2022] sqatgt: Initializing SCST Cavium adapter > target driver interface - driver version=10.02.07.400-k, SCST > version=3.7.0-pre, Cavium version=10.02.07.400-k > [Sun May 1 10:20:59 2022] [889]: scst: Target template qla2x00t > registered successfully > [Sun May 1 10:20:59 2022] sqatgt(17/0): Registering target > pwwn=21:00:00:24:ff:2f:90:5e > [Sun May 1 10:20:59 2022] [889]: scst: Target 21:00:00:24:ff:2f:90:5e > for template qla2x00t registered successfully > [Sun May 1 10:20:59 2022] sqatgt(17/0): Registering target > pwwn=21:00:00:24:ff:2f:90:5e scst_tgt 0000000094cfc6a1 sqa_tgt > 000000003e29951e > [Sun May 1 10:20:59 2022] sqatgt(34/0): Registering target > pwwn=21:00:00:24:ff:2f:90:5f > [Sun May 1 10:20:59 2022] [889]: scst: Target 21:00:00:24:ff:2f:90:5f > for template qla2x00t registered successfully > [Sun May 1 10:20:59 2022] sqatgt(34/0): Registering target > pwwn=21:00:00:24:ff:2f:90:5f scst_tgt 00000000d067a987 sqa_tgt > 000000009815396b > [Sun May 1 10:20:59 2022] [892]: iscsi-scst: iser portal with cm_id > 000000000d66db6a listens on 0.0.0.0:3260 > [Sun May 1 10:20:59 2022] [892]: iscsi-scst: Created iser portal > cm_id:000000000d66db6a > [Sun May 1 10:20:59 2022] [892]: iscsi-scst: iser portal with cm_id > 00000000743a842f listens on [::]:3260 > [Sun May 1 10:20:59 2022] [892]: iscsi-scst: Created iser portal > cm_id:00000000743a842f > [Sun May 1 10:20:59 2022] [870]: scst: Removed all devices from group > copy_manager_tgt > [Sun May 1 10:20:59 2022] [870]: scst: Removed all devices from group > 21:00:00:24:ff:2f:90:5e > [Sun May 1 10:20:59 2022] [870]: scst: Removed all devices from group > 21:00:00:24:ff:2f:90:5f > [Sun May 1 10:20:59 2022] [870]: scst: Removed all devices from group > scst_local_tgt > [Sun May 1 10:21:01 2022] [870]: scst: Waiting for 1 active commands > to complete... > [Sun May 1 10:21:01 2022] [870]: scst: All active commands completed > [Sun May 1 10:21:01 2022] [870]: scst: Target scst_local_tgt for > template scst_local unregistered successfully > [Sun May 1 10:21:01 2022] loop7: detected capacity change from 0 to 8 > [Sun May 1 10:24:44 2022] [870]: scst: Using autogenerated relative > target id 1 for target 21:00:00:24:ff:2f:90:5f > [Sun May 1 10:24:44 2022] sqatgt(34/0): Enabling target > pwwn=21:00:00:24:ff:2f:90:5f > [Sun May 1 10:24:44 2022] qla2xxx [0000:0b:00.1]-00af:34: Performing > ISP error recovery - ha=000000009e05ce0f. > [Sun May 1 10:24:45 2022] qla2xxx [0000:0b:00.1]-500a:34: LOOP UP > detected (8 Gbps). > [Sun May 1 10:24:47 2022] sqatgt(34/0): Registering initiator: > pwwn=21:01:00:1b:32:2f:94:1e > [Sun May 1 10:24:47 2022] [5]: scst: Using security group > "21:00:00:24:ff:2f:90:5f" for initiator "21:01:00:1b:32:2f:94:1e" > (target 21:00:00:24:ff:2f:90:5f) > [Sun May 1 10:24:47 2022] qla2xxx [0000:0b:00.1]-d034:34: > qla24xx_do_nack_work create sess success 000000009cbacb66 > [Sun May 1 10:25:00 2022] qla2xxx [0000:0b:00.1]-7080:35: VP entry id > 1 assigned. > [Sun May 1 10:25:00 2022] scsi host35: qla2xxx > [Sun May 1 10:25:00 2022] sqatgt: add target 51:01:04:f0:18:0c:02:00 > [Sun May 1 10:25:00 2022] sqatgt(35/1): Registering target > pwwn=51:01:04:f0:18:0c:02:00 > [Sun May 1 10:25:00 2022] [870]: scst: Target 51:01:04:f0:18:0c:02:00 > for template qla2x00t registered successfully > [Sun May 1 10:25:00 2022] sqatgt(35/1): Registering target > pwwn=51:01:04:f0:18:0c:02:00 scst_tgt 0000000055d33ec2 sqa_tgt > 00000000a4d0e7e1 > [Sun May 1 10:25:00 2022] scsi host34: vport-34:0-0 created via > shost34 channel 0 > [Sun May 1 10:25:01 2022] qla2xxx [0000:0b:00.1]-5013:34: RSCN > database changed -- 0001 1701 0000. > [Sun May 1 10:25:02 2022] qla2xxx [0000:0b:00.1]-5013:34: RSCN > database changed -- 0001 1701 0000. > [Sun May 1 10:26:05 2022] [870]: scst: Using autogenerated relative > target id 2 for target 51:01:04:f0:18:0c:02:00 > [Sun May 1 10:26:05 2022] sqatgt(35/1): Enabling target > pwwn=51:01:04:f0:18:0c:02:00 > [Sun May 1 10:26:07 2022] qla2xxx [0000:0b:00.1]-5013:34: RSCN > database changed -- 0001 1701 0000. > [Sun May 1 10:26:27 2022] sqatgt(35/1): Disabling target > pwwn=51:01:04:f0:18:0c:02:00 > [Sun May 1 10:27:19 2022] qlt_del_vtarget: 510104f0180c0200 > [Sun May 1 10:27:19 2022] qla2xxx 0000:0b:00.1: qlt_del_vtarget: > port_name 510104f0180c0200 > [Sun May 1 10:27:19 2022] Unregistering target for host > 35(00000000d01d51a2) > [Sun May 1 10:27:19 2022] sqatgt(35/1): Target release finished > sqa_tgt 00000000a4d0e7e1 > [Sun May 1 10:27:19 2022] [870]: scst: Waiting for 1 active commands > to complete... > [Sun May 1 10:27:19 2022] [870]: scst: All active commands completed > [Sun May 1 10:27:19 2022] [870]: scst: Target 51:01:04:f0:18:0c:02:00 > for template qla2x00t unregistered successfully > [Sun May 1 10:27:19 2022] qla2xxx [0000:0b:00.1]-7088:35: VP[1] deleted. > [Sun May 1 10:27:20 2022] qla2xxx [0000:0b:00.1]-5013:34: RSCN > database changed -- 0001 1701 0000. > > Not sure if it's relevant but on the FC client side (other VM attached > to SAN) I see this console message: > [Sun May 1 10:28:41 2022] rport-35:0-0 blocked FC remote port time > out: removing target and saving binding > > Any ideas? > Rob > > On 4/28/2022 7:34 PM, Rob Turk wrote: >> I'm a few steps further in development, hitting an issue that >> triggers a kernel NULL pointer BUG report. >> >> Hardware setup: Two VM's (FC Target, FC CLient), both with their own >> dual-port FC adapter and pass-through to VM. >> One port of the FC client point-to-point to a port on the FC target. >> The other port for client and target attach to a Brocade 300 switch >> with NPIV support. >> >> The sequence I'm running first creates a SCST User device. It then >> registers the device on a port of the FC adapter. On the >> point-to-point physical port I add a LUN and that works fine. When I >> use the NPIV-capable port then I add an NPIV virtual port, and >> register the device as LUN 0 on that port. >> >> This all works well. But then I want to shutdown and remove the >> device and the NPIV virtual port, and then disable the physical port >> if nothing is running on it anymore. This is when I hit a kernel BUG >> report. >> >> This is the dmesg output, showing the creation of the device and the >> NPIV port, as well as the sequence of removing: >> >> [Thu Apr 28 18:56:23 2022] [16942]: scst: Virtual device handler >> UVTL_DEVICE12 for type 1 registered successfully >> [snip] > >> [Thu Apr 28 18:56:34 2022] qla2xxx [0000:0b:00.1]-5013:34: RSCN >> database changed -- 0001 1701 0000. >> [Thu Apr 28 18:56:44 2022] qla2xxx [0000:0b:00.1]-00af:34: Performing >> ISP error recovery - ha=000000003bca05dd. >> [Thu Apr 28 18:56:55 2022] sqatgt(21/0): Disabling target >> pwwn=21:00:00:24:ff:2f:90:5e >> [Thu Apr 28 18:57:05 2022] qla2xxx [0000:0b:00.0]-00af:21: Performing >> ISP error recovery - ha=000000003a674198. >> >> This is my software log which shows what commands I'm sending to the >> sysfs filesystem: >> 21:00:00:24:ff:2f:90:5e (point to point) and 21:00:00:24:ff:2f:90:5f >> (to switch) are physical ports on the QLogic card. >> 51:01:04:f0:18:0c:02:00 is the WWPN for the NPIV port. >> >> [16942I] 18:56:26.012 - buildDefaultIdentity: V=STK , >> P=T10000C , devid=12, type=1 (Tape), >> wwnn=50:01:04:f0:18:0c:02:00, wwpn=51:01:04:f0:18:0c:02:00 >> [snip] > >> [16942I] 18:57:19.024 - _clearFcPortList: Removing Physical port >> 21:00:00:24:ff:2f:90:5e >> [16942I] 18:57:19.024 - -- END OF TRACE for ID 12 V0.1 -- >> [16942] 18:57:19.024 - -- Trace end: Thursday, April 28, 2022 >> >> Filtering just the sysfs interactions: >> [16942I] 18:56:28.048 - writeStringToFile: Wrote "1" to >> /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:2f:90:5e/enabled >> [16942I] 18:56:30.096 - writeStringToFile: Wrote "1" to >> /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:2f:90:5f/enabled >> [16942I] 18:56:30.105 - writeStringToFile: Wrote "add_target >> 51:01:04:f0:18:0c:02:00 >> node_name=50:01:04:f0:18:0c:02:00;parent_host=21:00:00:24:ff:2f:90:5f" >> to /sys/kernel/scst_tgt/targets/qla2x00t/mgmt >> [16942I] 18:56:31.105 - writeStringToFile: Wrote "add UVTL_DEVICE12 >> 0" to >> /sys/kernel/scst_tgt/targets/qla2x00t/51:01:04:f0:18:0c:02:00/luns/mgmt >> [16942I] 18:56:31.115 - writeStringToFile: Wrote "1" to >> /sys/kernel/scst_tgt/targets/qla2x00t/51:01:04:f0:18:0c:02:00/enabled >> [16942I] 18:56:36.117 - writeStringToFile: Wrote "0" to >> /sys/kernel/scst_tgt/targets/qla2x00t/51:01:04:f0:18:0c:02:00/enabled >> [16942I] 18:56:36.148 - writeStringToFile: Wrote "del_target >> 51:01:04:f0:18:0c:02:00" to /sys/kernel/scst_tgt/targets/qla2x00t/mgmt >> [16942I] 18:56:57.520 - writeStringToFile: Wrote "0" to >> /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:2f:90:5f/enabled >> [16942I] 18:57:19.024 - writeStringToFile: Wrote "0" to >> /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:2f:90:5e/enabled >> >> After the event, the FC target VM sometimes completely freezes with >> CPU going to 100%, requiring a power down/up cycle. Sometimes this >> also happens when I manually echo "0" to one of the /enabled controls. >> >> To remove the SCST User device I simply close the corresponding >> handle. I then disable and remove the NPIV port. Is this safe or >> should I first remove the device from the NPIV port before disabling >> and removing the NPIV port? What is the proper cleanup sequence? >> >> Anything else I can do to debug this further? >> >> Thanks, >> Rob >> >> >> >> _______________________________________________ >> Scst-devel mailing list >> https://lists.sourceforge.net/lists/listinfo/scst-devel > > > > _______________________________________________ > Scst-devel mailing list > https://lists.sourceforge.net/lists/listinfo/scst-devel |