From: Chan, S. (S. E. - SGI) <sam...@hp...> - 2024-02-08 19:19:11
hello. does anyone have any insight into this? feedback would be appreciated. thanks.

________________________________
From: Chan, Sam (Servers ERT - SGI) <sam...@hp...>
Sent: Friday, January 26, 2024 12:04 PM
To: scs...@li... <scs...@li...>
Cc: Chan, Sam (Servers ERT - SGI) <sam...@hp...>
Subject: panic in scst_rx_mgmt_fn()

hello. a customer of mine hit a panic in SCST that appears to be this RHEL issue...

https://access.redhat.com/solutions/7017557
Kernel panic due to BUG_ON() in the third party kernel module [scst] - Red Hat Customer Portal

the signature of the customer's incident looks like this...

crash.bin> sys
      KERNEL: vmlinux-3.10.0-1160.el7.x86_64.debug
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 36
        DATE: Fri Jul 7 16:30:28 2023
      UPTIME: 46 days, 00:31:15
LOAD AVERAGE: 0.07, 0.10, 0.13
       TASKS: 16860
    NODENAME: cirrus-rpool0.icexa.epcc.ed.ac.uk
     RELEASE: 3.10.0-1160.el7.x86_64
     VERSION: #1 SMP Tue Aug 18 14:50:17 EDT 2020
     MACHINE: x86_64 (3000 Mhz)
      MEMORY: 95.7 GB
       PANIC: "kernel BUG at /root/rpmbuild/BUILD/scst-master/scst/src/scst_targ.c:6927!"
crash.bin> bt
PID: 30180  TASK: ffff972e85e1b180  CPU: 1  COMMAND: "kworker/1:0"
 #0 [ffff972d3e40b920] machine_kexec at ffffffffb2e66294
 #1 [ffff972d3e40b980] __crash_kexec at ffffffffb2f22562
 #2 [ffff972d3e40ba50] crash_kexec at ffffffffb2f22650
 #3 [ffff972d3e40ba68] oops_end at ffffffffb358b798
 #4 [ffff972d3e40ba90] die at ffffffffb2e30a7b
 #5 [ffff972d3e40bac0] do_trap at ffffffffb358aee0
 #6 [ffff972d3e40bb10] do_invalid_op at ffffffffb2e2d2a4
 #7 [ffff972d3e40bbc0] invalid_op at ffffffffb35972ee
    [exception RIP: scst_rx_mgmt_fn+0x3f1]
    RIP: ffffffffc0ea96c1  RSP: ffff972d3e40bc78  RFLAGS: 00010292
    RAX: 0000000000000066  RBX: ffff972e32100bd0  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: 0000000000000246
    RBP: ffff972d3e40bcf0   R8: 0000000000000000   R9: ffff97303c1fc600
    R10: 0000000000024afd  R11: 0000000000000001  R12: 000000000000000a
    R13: ffff972c087a4d00  R14: 000000000000000a  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff972d3e40bc70] scst_rx_mgmt_fn at ffffffffc0ea96c1 [scst]
 #9 [ffff972d3e40bd88] srpt_unregister_ch at ffffffffc0432389 [ib_srpt]
#10 [ffff972d3e40bda0] srpt_do_compl_work at ffffffffc0436228 [ib_srpt]
#11 [ffff972d3e40be20] process_one_work at ffffffffb2ebdc4f
#12 [ffff972d3e40be68] worker_thread at ffffffffb2ebed66
#13 [ffff972d3e40bec8] kthread at ffffffffb2ec5c21

[3966906.028065] [3750]: scst: TM fn 0 (mcmd ffff9725aa7b6af0) finished, status -1
[3966906.028229] [16851]: scst: TM fn ABORT_TASK/0 (mcmd ffff9725aa7b6a80, initiator fe80:0000:0000:0000:e41d:2d03:0029:6330, target fe80:0000:0000:0000:b883:03ff:ffa0:b520)
[3966906.028238] [3750]: scst: TM fn 0 (mcmd ffff9725aa7b6a80) finished, status -1
[3966906.397460] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22886.
[3966906.398071] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22887.
[3966906.398471] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22888.
[3966906.399072] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22889.
[3966906.399670] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22890.
[3966906.399978] ib_srpt: Received CM TimeWait exit for ch fe80:0000:0000:0000:e41d:2d03:001f:7c80-22891.
[3966906.400273] ------------[ cut here ]------------
[3966906.400285] WARNING: CPU: 1 PID: 30180 at /root/rpmbuild/BUILD/scst-master/srpt/src/ib_srpt.c:2249 srpt_unregister_ch+0x56/0x90 [ib_srpt]
[3966906.400289] Modules linked in: ib_srpt(OE) scst_vdisk(OE) scst(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) vfat fat skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi sgi_xvm(POE) kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sgi_pm(POE) lpc_ich mei_me hpilo hpwdt sg mei wmi ipmi_si ipmi_devintf ipmi_msghandler sgi_r_pool(OE) acpi_power_meter sgi_os_lib(POE) knem(OE) ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) raid1 ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul ahci crct10dif_common crc32c_intel drm libahci mlx5_core(OE)
[3966906.400379] nvme tg3 nvme_core libata mlxfw(OE) devlink mlx_compat(OE) ptp pps_core drm_panel_orientation_quirks uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
[3966906.400401] CPU: 1 PID: 30180 Comm: kworker/1:0 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.el7.x86_64 #1
[3966906.400404] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/13/2019
[3966906.400412] Workqueue: srpt srpt_do_compl_work [ib_srpt]
[3966906.400415] Call Trace:
[3966906.400432] [<ffffffffb3581340>] dump_stack+0x19/0x1b
[3966906.400441] [<ffffffffb2e9b228>] __warn+0xd8/0x100
[3966906.400447] [<ffffffffb2e9b36d>] warn_slowpath_null+0x1d/0x20
[3966906.400453] [<ffffffffc04323a6>] srpt_unregister_ch+0x56/0x90 [ib_srpt]
[3966906.400460] [<ffffffffc0436228>] srpt_do_compl_work+0x2c8/0x640 [ib_srpt]
[3966906.400470] [<ffffffffb2ebdc4f>] process_one_work+0x17f/0x440
[3966906.400479] [<ffffffffb2ebed66>] worker_thread+0x126/0x3c0
[3966906.400486] [<ffffffffb2ebec40>] ? manage_workers.isra.26+0x2a0/0x2a0
[3966906.400492] [<ffffffffb2ec5c21>] kthread+0xd1/0xe0
[3966906.400498] [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.400506] [<ffffffffb3593df7>] ret_from_fork_nospec_begin+0x21/0x21
[3966906.400512] [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.400515] ---[ end trace 5d97197aa6829cf9 ]---
[3966906.400521] [30180]: scst: ***CRITICAL ERROR***: New mgmt cmd while shutting down the session ffff972c087a4d00 shut_phase 1
[3966906.411909] ------------[ cut here ]------------
[3966906.416723] kernel BUG at /root/rpmbuild/BUILD/scst-master/scst/src/scst_targ.c:6927!
[3966906.424767] invalid opcode: 0000 [#1] SMP
[3966906.429079] Modules linked in: ib_srpt(OE) scst_vdisk(OE) scst(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) vfat fat skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi sgi_xvm(POE) kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sgi_pm(POE) lpc_ich mei_me hpilo hpwdt sg mei wmi ipmi_si ipmi_devintf ipmi_msghandler sgi_r_pool(OE) acpi_power_meter sgi_os_lib(POE) knem(OE) ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) raid1 ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul ahci crct10dif_common crc32c_intel drm libahci mlx5_core(OE)
[3966906.501800] nvme tg3 nvme_core libata mlxfw(OE) devlink mlx_compat(OE) ptp pps_core drm_panel_orientation_quirks uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
[3966906.515767] CPU: 1 PID: 30180 Comm: kworker/1:0 Kdump: loaded Tainted: P W OE ------------ 3.10.0-1160.el7.x86_64 #1
[3966906.527477] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/13/2019
[3966906.536221] Workqueue: srpt srpt_do_compl_work [ib_srpt]
[3966906.541743] task: ffff972e85e1b180 ti: ffff972d3e408000 task.ti: ffff972d3e408000
[3966906.549436] RIP: 0010:[<ffffffffc0ea96c1>] [<ffffffffc0ea96c1>] scst_rx_mgmt_fn+0x3f1/0x430 [scst]
[3966906.558725] RSP: 0018:ffff972d3e40bc78 EFLAGS: 00010292
[3966906.564236] RAX: 0000000000000066 RBX: ffff972e32100bd0 RCX: 0000000000000006
[3966906.571581] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[3966906.578925] RBP: ffff972d3e40bcf0 R08: 0000000000000000 R09: ffff97303c1fc600
[3966906.586271] R10: 0000000000024afd R11: 0000000000000001 R12: 000000000000000a
[3966906.593617] R13: ffff972c087a4d00 R14: 000000000000000a R15: 0000000000000000
[3966906.600961] FS: 0000000000000000(0000) GS:ffff97304fc40000(0000) knlGS:0000000000000000
[3966906.609265] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3966906.615213] CR2: 00007f6661a3ebd0 CR3: 00000012fea10000 CR4: 00000000007607e0
[3966906.622557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3966906.629901] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3966906.637244] PKRU: 00000000
[3966906.640138] Call Trace:
[3966906.642770] [<ffffffffb2e9e489>] ? vprintk_default+0x29/0x40
[3966906.648728] [<ffffffffc0ea9771>] scst_unregister_session+0x71/0x150 [scst]
[3966906.655901] [<ffffffffc0432389>] srpt_unregister_ch+0x39/0x90 [ib_srpt]
[3966906.662811] [<ffffffffc0436228>] srpt_do_compl_work+0x2c8/0x640 [ib_srpt]
[3966906.669896] [<ffffffffb2ebdc4f>] process_one_work+0x17f/0x440
[3966906.675931] [<ffffffffb2ebed66>] worker_thread+0x126/0x3c0
[3966906.681706] [<ffffffffb2ebec40>] ? manage_workers.isra.26+0x2a0/0x2a0
[3966906.688439] [<ffffffffb2ec5c21>] kthread+0xd1/0xe0
[3966906.693514] [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.699812] [<ffffffffb3593df7>] ret_from_fork_nospec_begin+0x21/0x21
[3966906.706547] [<ffffffffb2ec5b50>] ? insert_kthread_work+0x40/0x40
[3966906.712843] Code: 48 89 44 24 08 4c 89 2c 24 41 b8 0e 1b 00 00 48 c7 c1 90 c9 eb c0 48 c7 c2 fe a3 ec c0 48 c7 c6 2d a4 ec c0 31 c0 e8 9f c3 fc ff <0f> 0b 83 f8 01 0f 84 6e fd ff ff 83 f8 02 74 17 85 c0 0f 84 23
[3966906.733764] RIP [<ffffffffc0ea96c1>] scst_rx_mgmt_fn+0x3f1/0x430 [scst]
[3966906.741283] RSP <ffff972d3e40bc78>

this seems to be triggered by IB connectivity issues, which SCST tries to recover from but eventually panics. the source comments indicate that scst_rx_mgmt_fn() shouldn't be called at the same time as scst_unregister_session() for the same session, but dmesg indicates that this is what happened.
----------
/*
 * scst_rx_mgmt_fn() - create new management command and send it for execution
 *
 * Description:
 *    Creates new management command and sends it for execution.
 *
 *    Returns 0 for success, error code otherwise.
 *
 *    Must not be called in parallel with scst_unregister_session() for the
 *    same sess.
 */
int scst_rx_mgmt_fn(struct scst_session *sess,
	const struct scst_rx_mgmt_params *params)
{
	int res = -EFAULT;
	struct scst_mgmt_cmd *mcmd = NULL;
	char state_name[32];
----------

thanks.
sam
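
for illustration only: the "must not be called in parallel" rule quoted above can be sketched in plain userspace C. this is not SCST code and not a proposed fix. `demo_session`, `demo_rx_mgmt_fn()` and `demo_unregister_session()` are hypothetical names, and the mutex plus `shutting_down` flag are assumptions made for the sketch. it only shows the general idea of serializing "submit a management command" against "start session shutdown", and returning an error rather than asserting when a command arrives late.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a target session; NOT the real struct scst_session. */
struct demo_session {
	pthread_mutex_t lock;     /* serializes command submission against shutdown */
	bool shutting_down;       /* set once unregistration has begun */
	int mgmt_cmds_queued;     /* number of management commands accepted so far */
};

#define DEMO_SESSION_INIT { PTHREAD_MUTEX_INITIALIZER, false, 0 }

/*
 * Hypothetical analogue of submitting a management command.
 * Returns 0 if the command was accepted, -ESHUTDOWN if the session
 * has already started shutting down.
 */
static int demo_rx_mgmt_fn(struct demo_session *sess)
{
	int res = 0;

	assert(sess != NULL);

	pthread_mutex_lock(&sess->lock);
	if (sess->shutting_down)
		res = -ESHUTDOWN;         /* refuse the late command, do not crash */
	else
		sess->mgmt_cmds_queued++;
	pthread_mutex_unlock(&sess->lock);

	return res;
}

/* Hypothetical analogue of beginning session unregistration. */
static void demo_unregister_session(struct demo_session *sess)
{
	assert(sess != NULL);

	pthread_mutex_lock(&sess->lock);
	sess->shutting_down = true;   /* later submissions now see the flag */
	pthread_mutex_unlock(&sess->lock);
}
```

with a session initialized from `DEMO_SESSION_INIT`, a call to `demo_rx_mgmt_fn()` before `demo_unregister_session()` is accepted, and a call after it returns `-ESHUTDOWN`. whether the real fix belongs in the caller (ib_srpt) or in SCST itself is a separate question that this sketch does not answer.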