Re: [Scst-devel] panic in scst_rx_mgmt_fn()


hello.  does anyone have any insight into this?  any feedback would be appreciated.  thanks.

________________________________
From: Chan, Sam (Servers ERT - SGI) <sam...@hp...>
Sent: Friday, January 26, 2024 12:04 PM
To: scs...@li... <scs...@li...>
Cc: Chan, Sam (Servers ERT - SGI) <sam...@hp...>
Subject: panic in scst_rx_mgmt_fn()

hello.
a customer of mine hit a panic in SCST that appears to match this Red Hat issue...

https://access.redhat.com/solutions/7017557
Kernel panic due to BUG_ON() in the third party kernel module [scst] - Red Hat Customer Portal

the signature of the customer's incident looks like this...

crash.bin> sys
      KERNEL: vmlinux-3.10.0-1160.el7.x86_64.debug
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 36
        DATE: Fri Jul  7 16:30:28 2023
      UPTIME: 46 days, 00:31:15
LOAD AVERAGE: 0.07, 0.10, 0.13
       TASKS: 16860
    NODENAME: cirrus-rpool0.icexa.epcc.ed.ac.uk
     RELEASE: 3.10.0-1160.el7.x86_64
     VERSION: #1 SMP Tue Aug 18 14:50:17 EDT 2020
     MACHINE: x86_64  (3000 Mhz)
      MEMORY: 95.7 GB
       PANIC: "kernel BUG at /root/rpmbuild/BUILD/scst-master/scst/src/scst_targ.c:6927!"

crash.bin> bt
PID: 30180  TASK: ffff972e85e1b180  CPU: 1   COMMAND: "kworker/1:0"
 #0 [ffff972d3e40b920] machine_kexec at ffffffffb2e66294
 #1 [ffff972d3e40b980] __crash_kexec at ffffffffb2f22562
 #2 [ffff972d3e40ba50] crash_kexec at ffffffffb2f22650
 #3 [ffff972d3e40ba68] oops_end at ffffffffb358b798
 #4 [ffff972d3e40ba90] die at ffffffffb2e30a7b
 #5 [ffff972d3e40bac0] do_trap at ffffffffb358aee0
 #6 [ffff972d3e40bb10] do_invalid_op at ffffffffb2e2d2a4
 #7 [ffff972d3e40bbc0] invalid_op at ffffffffb35972ee
    [exception RIP: scst_rx_mgmt_fn+0x3f1]
    RIP: ffffffffc0ea96c1  RSP: ffff972d3e40bc78  RFLAGS: 00010292
    RAX: 0000000000000066  RBX: ffff972e32100bd0  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: 0000000000000246
    RBP: ffff972d3e40bcf0   R8: 0000000000000000   R9: ffff97303c1fc600
    R10: 0000000000024afd  R11: 0000000000000001  R12: 000000000000000a
    R13: ffff972c087a4d00  R14: 000000000000000a  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff972d3e40bc70] scst_rx_mgmt_fn at ffffffffc0ea96c1 [scst]
 #9 [ffff972d3e40bd88] srpt_unregister_ch at ffffffffc0432389 [ib_srpt]
#10 [ffff972d3e40bda0] srpt_do_compl_work at ffffffffc0436228 [ib_srpt]
#11 [ffff972d3e40be20] process_one_work at ffffffffb2ebdc4f
#12 [ffff972d3e40be68] worker_thread at ffffffffb2ebed66
#13 [ffff972d3e40bec8] kthread at ffffffffb2ec5c21

[3966906.028065] [3750]: scst: TM fn 0 (mcmd ffff9725aa7b6af0) finished, status -1
[3966906.028229] [16851]: scst: TM fn ABORT_TASK/0 (mcmd ffff9725aa7b6a80, initiator fe80:0000:0000:0000:e41d:2d03:0029:6330, target fe80:0000:0000:0000:b883:03ff:ffa0:b520)
[3966906.028238] [3750]: scst: TM fn 0 (mcmd ffff9725aa7b6a80) finished, status -1
[3966906.397460] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22886.
[3966906.398071] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22887.
[3966906.398471] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22888.
[3966906.399072] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22889.
[3966906.399670] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22890.
[3966906.399978] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22891.
[3966906.400273] ------------[ cut here ]------------
[3966906.400285] WARNING: CPU: 1 PID: 30180 at /root/rpmbuild/BUILD/scst-master/srpt/src/ib_srpt.c:2249 srpt_unregister_ch+0x56/0x90 [ib_srpt]
[3966906.400289] Modules linked in: ib_srpt(OE) scst_vdisk(OE) scst(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) vfat fat skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi sgi_xvm(POE) kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sgi_pm(POE) lpc_ich mei_me hpilo hpwdt sg mei wmi ipmi_si ipmi_devintf ipmi_msghandler sgi_r_pool(OE) acpi_power_meter sgi_os_lib(POE) knem(OE) ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) raid1 ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul ahci crct10dif_common crc32c_intel drm libahci mlx5_core(OE)
[3966906.400379]  nvme tg3 nvme_core libata mlxfw(OE) devlink mlx_compat(OE) ptp pps_core drm_panel_orientation_quirks uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
[3966906.400401] CPU: 1 PID: 30180 Comm: kworker/1:0 Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.el7.x86_64 #1
[3966906.400404] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/13/2019
[3966906.400412] Workqueue: srpt srpt_do_compl_work [ib_srpt]
[3966906.400415] Call Trace:
[3966906.400432]  [<ffffffffb3581340>] dump_stack+0x19/0x1b
[3966906.400441]  [<ffffffffb2e9b228>] __warn+0xd8/0x100
[3966906.400447]  [<ffffffffb2e9b36d>] warn_slowpath_null+0x1d/0x20
[3966906.400453]  [<ffffffffc04323a6>] srpt_unregister_ch+0x56/0x90 [ib_srpt]
[3966906.400460]  [<ffffffffc0436228>] srpt_do_compl_work+0x2c8/0x640 [ib_srpt]
[3966906.400470]  [<ffffffffb2ebdc4f>] process_one_work+0x17f/0x440
[3966906.400479]  [<ffffffffb2ebed66>] worker_thread+0x126/0x3c0
[3966906.400486]  [<ffffffffb2ebec40>] ? manage_workers.isra.26+0x2a0/0x2a0
[3966906.400492]  [<ffffffffb2ec5c21>] kthread+0xd1/0xe0
[3966906.400498]  [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.400506]  [<ffffffffb3593df7>] ret_from_fork_nospec_begin+0x21/0x21
[3966906.400512]  [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.400515] ---[ end trace 5d97197aa6829cf9 ]---
[3966906.400521] [30180]: scst: ***CRITICAL ERROR***: New mgmt cmd while shutting down the session ffff972c087a4d00 shut_phase 1
[3966906.411909] ------------[ cut here ]------------
[3966906.416723] kernel BUG at /root/rpmbuild/BUILD/scst-master/scst/src/scst_targ.c:6927!
[3966906.424767] invalid opcode: 0000 [#1] SMP
[3966906.429079] Modules linked in: ib_srpt(OE) scst_vdisk(OE) scst(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) vfat fat skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi sgi_xvm(POE) kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sgi_pm(POE) lpc_ich mei_me hpilo hpwdt sg mei wmi ipmi_si ipmi_devintf ipmi_msghandler sgi_r_pool(OE) acpi_power_meter sgi_os_lib(POE) knem(OE) ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) raid1 ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul ahci crct10dif_common crc32c_intel drm libahci mlx5_core(OE)
[3966906.501800]  nvme tg3 nvme_core libata mlxfw(OE) devlink mlx_compat(OE) ptp pps_core drm_panel_orientation_quirks uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
[3966906.515767] CPU: 1 PID: 30180 Comm: kworker/1:0 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-1160.el7.x86_64 #1
[3966906.527477] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/13/2019
[3966906.536221] Workqueue: srpt srpt_do_compl_work [ib_srpt]
[3966906.541743] task: ffff972e85e1b180 ti: ffff972d3e408000 task.ti: ffff972d3e408000
[3966906.549436] RIP: 0010:[<ffffffffc0ea96c1>]  [<ffffffffc0ea96c1>] scst_rx_mgmt_fn+0x3f1/0x430 [scst]
[3966906.558725] RSP: 0018:ffff972d3e40bc78  EFLAGS: 00010292
[3966906.564236] RAX: 0000000000000066 RBX: ffff972e32100bd0 RCX: 0000000000000006
[3966906.571581] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[3966906.578925] RBP: ffff972d3e40bcf0 R08: 0000000000000000 R09: ffff97303c1fc600
[3966906.586271] R10: 0000000000024afd R11: 0000000000000001 R12: 000000000000000a
[3966906.593617] R13: ffff972c087a4d00 R14: 000000000000000a R15: 0000000000000000
[3966906.600961] FS:  0000000000000000(0000) GS:ffff97304fc40000(0000) knlGS:0000000000000000
[3966906.609265] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3966906.615213] CR2: 00007f6661a3ebd0 CR3: 00000012fea10000 CR4: 00000000007607e0
[3966906.622557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3966906.629901] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3966906.637244] PKRU: 00000000
[3966906.640138] Call Trace:
[3966906.642770]  [<ffffffffb2e9e489>] ? vprintk_default+0x29/0x40
[3966906.648728]  [<ffffffffc0ea9771>] scst_unregister_session+0x71/0x150 [scst]
[3966906.655901]  [<ffffffffc0432389>] srpt_unregister_ch+0x39/0x90 [ib_srpt]
[3966906.662811]  [<ffffffffc0436228>] srpt_do_compl_work+0x2c8/0x640 [ib_srpt]
[3966906.669896]  [<ffffffffb2ebdc4f>] process_one_work+0x17f/0x440
[3966906.675931]  [<ffffffffb2ebed66>] worker_thread+0x126/0x3c0
[3966906.681706]  [<ffffffffb2ebec40>] ? manage_workers.isra.26+0x2a0/0x2a0
[3966906.688439]  [<ffffffffb2ec5c21>] kthread+0xd1/0xe0
[3966906.693514]  [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.699812]  [<ffffffffb3593df7>] ret_from_fork_nospec_begin+0x21/0x21
[3966906.706547]  [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.712843] Code: 48 89 44 24 08 4c 89 2c 24 41 b8 0e 1b 00 00 48 c7 c1 90 c9 eb c0 48 c7 c2 fe a3 ec c0 48 c7 c6 2d a4 ec c0 31 c0 e8 9f c3 fc ff <0f> 0b 83 f8 01 0f 84 6e fd ff ff 83 f8 02 74 17 85 c0 0f 84 23
[3966906.733764] RIP  [<ffffffffc0ea96c1>] scst_rx_mgmt_fn+0x3f1/0x430 [scst]
[3966906.741283]  RSP <ffff972d3e40bc78>

this seems to be triggered by IB connectivity issues: SCST tries to recover from them, but eventually panics.

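the comment above scst_rx_mgmt_fn() says that it must not be called in parallel with scst_unregister_session() for the same session.  but dmesg suggests that this is what happened: the call trace shows scst_rx_mgmt_fn() being reached from scst_unregister_session(), and SCST logs "New mgmt cmd while shutting down the session" with shut_phase 1.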

----------

/*
 * scst_rx_mgmt_fn() - create new management command and send it for execution
 *
 * Description:
 *    Creates new management command and sends it for execution.
 *
 *    Returns 0 for success, error code otherwise.
 *
 *    Must not be called in parallel with scst_unregister_session() for the
 *    same sess.
 */
int scst_rx_mgmt_fn(struct scst_session *sess,
        const struct scst_rx_mgmt_params *params)
{
        int res = -EFAULT;
        struct scst_mgmt_cmd *mcmd = NULL;
        char state_name[32];

----------
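
for illustration only, here is a minimal userspace sketch of the kind of guard a caller could use to honour the "must not be called in parallel" rule in that comment.  it is not SCST or ib_srpt code and it is not a proposed fix: struct demo_ch and the demo_ch_*() helpers are hypothetical names, and the sketch assumes the caller owns a per-channel flag that records whether teardown has already started.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a target-driver channel; not an SCST structure. */
struct demo_ch {
    atomic_bool unregistering; /* set once teardown has been claimed */
    int mgmt_cmds_sent;        /* management commands that were accepted */
    int unregister_calls;      /* times teardown actually ran */
};

static void demo_ch_init(struct demo_ch *ch)
{
    atomic_init(&ch->unregistering, false);
    ch->mgmt_cmds_sent = 0;
    ch->unregister_calls = 0;
}

/*
 * Claim teardown. Only the first caller flips the flag and gets true;
 * any later caller gets false and must not tear the session down again.
 */
static bool demo_ch_unregister(struct demo_ch *ch)
{
    bool expected = false;

    if (!atomic_compare_exchange_strong(&ch->unregistering, &expected, true))
        return false;
    ch->unregister_calls++; /* the real session teardown would go here */
    return true;
}

/*
 * Refuse new management commands once teardown has been claimed,
 * returning an error instead of handing the command to the core.
 */
static int demo_ch_send_mgmt(struct demo_ch *ch)
{
    if (atomic_load(&ch->unregistering))
        return -ESHUTDOWN;
    ch->mgmt_cmds_sent++; /* the real management call would go here */
    return 0;
}
```

with this sketch, a second demo_ch_unregister() on the same channel returns false instead of running teardown twice, and demo_ch_send_mgmt() returns -ESHUTDOWN once teardown has been claimed.  note that the flag alone does not close the window between the check and the send in demo_ch_send_mgmt(); a real driver would need a lock or a reference count around both paths.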

thanks.
sam