openipmi-developer Mailing List for Open IPMI
Brought to you by:
cminyard
You can subscribe to this list here.
| 2002 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
(8) |
Dec
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003 |
Jan
(4) |
Feb
(14) |
Mar
(40) |
Apr
(41) |
May
(17) |
Jun
(50) |
Jul
(16) |
Aug
(37) |
Sep
(57) |
Oct
(44) |
Nov
(48) |
Dec
(35) |
| 2004 |
Jan
(12) |
Feb
(3) |
Mar
(8) |
Apr
(8) |
May
(22) |
Jun
(23) |
Jul
(14) |
Aug
(51) |
Sep
(21) |
Oct
(38) |
Nov
(8) |
Dec
(17) |
| 2005 |
Jan
(27) |
Feb
(28) |
Mar
(50) |
Apr
(32) |
May
(55) |
Jun
(38) |
Jul
(26) |
Aug
(40) |
Sep
(67) |
Oct
(86) |
Nov
(25) |
Dec
(29) |
| 2006 |
Jan
(53) |
Feb
(19) |
Mar
(36) |
Apr
(25) |
May
(27) |
Jun
(56) |
Jul
(28) |
Aug
(15) |
Sep
(37) |
Oct
(63) |
Nov
(63) |
Dec
(105) |
| 2007 |
Jan
(54) |
Feb
(29) |
Mar
(23) |
Apr
(42) |
May
(6) |
Jun
(70) |
Jul
(51) |
Aug
(58) |
Sep
(27) |
Oct
(43) |
Nov
(52) |
Dec
(24) |
| 2008 |
Jan
(39) |
Feb
(76) |
Mar
(23) |
Apr
(18) |
May
(5) |
Jun
(7) |
Jul
(12) |
Aug
(7) |
Sep
(2) |
Oct
(6) |
Nov
(22) |
Dec
(31) |
| 2009 |
Jan
(4) |
Feb
(2) |
Mar
(32) |
Apr
(5) |
May
(22) |
Jun
(5) |
Jul
(9) |
Aug
(6) |
Sep
(12) |
Oct
(30) |
Nov
(27) |
Dec
(31) |
| 2010 |
Jan
(17) |
Feb
(2) |
Mar
(41) |
Apr
(8) |
May
(19) |
Jun
(11) |
Jul
(53) |
Aug
(1) |
Sep
(14) |
Oct
(31) |
Nov
(13) |
Dec
(10) |
| 2011 |
Jan
(10) |
Feb
(15) |
Mar
(6) |
Apr
(6) |
May
(4) |
Jun
|
Jul
(6) |
Aug
(5) |
Sep
(6) |
Oct
(9) |
Nov
(2) |
Dec
(3) |
| 2012 |
Jan
|
Feb
(10) |
Mar
(11) |
Apr
(3) |
May
(2) |
Jun
(6) |
Jul
(12) |
Aug
(1) |
Sep
(3) |
Oct
(23) |
Nov
(6) |
Dec
(11) |
| 2013 |
Jan
(9) |
Feb
(2) |
Mar
(8) |
Apr
(7) |
May
(40) |
Jun
(9) |
Jul
(47) |
Aug
(23) |
Sep
(52) |
Oct
(6) |
Nov
(9) |
Dec
(8) |
| 2014 |
Jan
(27) |
Feb
(15) |
Mar
(26) |
Apr
(36) |
May
(33) |
Jun
(4) |
Jul
(15) |
Aug
(2) |
Sep
(11) |
Oct
(120) |
Nov
(32) |
Dec
(27) |
| 2015 |
Jan
(30) |
Feb
(15) |
Mar
(7) |
Apr
(17) |
May
(27) |
Jun
(23) |
Jul
(15) |
Aug
(39) |
Sep
(19) |
Oct
(5) |
Nov
(26) |
Dec
(6) |
| 2016 |
Jan
(37) |
Feb
(35) |
Mar
(51) |
Apr
(18) |
May
(8) |
Jun
(11) |
Jul
(5) |
Aug
(7) |
Sep
(54) |
Oct
(6) |
Nov
(33) |
Dec
(11) |
| 2017 |
Jan
(15) |
Feb
(25) |
Mar
(25) |
Apr
(19) |
May
(17) |
Jun
(28) |
Jul
(11) |
Aug
(56) |
Sep
(53) |
Oct
(15) |
Nov
(19) |
Dec
(30) |
| 2018 |
Jan
(63) |
Feb
(44) |
Mar
(42) |
Apr
(41) |
May
(19) |
Jun
(22) |
Jul
(16) |
Aug
(38) |
Sep
(14) |
Oct
(6) |
Nov
(11) |
Dec
(12) |
| 2019 |
Jan
(44) |
Feb
(7) |
Mar
(11) |
Apr
(58) |
May
(10) |
Jun
(10) |
Jul
(42) |
Aug
(36) |
Sep
(3) |
Oct
(29) |
Nov
(29) |
Dec
(23) |
| 2020 |
Jan
(7) |
Feb
(22) |
Mar
(3) |
Apr
(38) |
May
(14) |
Jun
(7) |
Jul
(12) |
Aug
(48) |
Sep
(85) |
Oct
(71) |
Nov
(14) |
Dec
(4) |
| 2021 |
Jan
(11) |
Feb
(36) |
Mar
(65) |
Apr
(106) |
May
(73) |
Jun
(33) |
Jul
(25) |
Aug
(19) |
Sep
(19) |
Oct
(29) |
Nov
(95) |
Dec
(21) |
| 2022 |
Jan
(91) |
Feb
(30) |
Mar
(43) |
Apr
(95) |
May
(136) |
Jun
(47) |
Jul
(28) |
Aug
(36) |
Sep
(17) |
Oct
(46) |
Nov
(53) |
Dec
(15) |
| 2023 |
Jan
|
Feb
(15) |
Mar
(44) |
Apr
(9) |
May
(20) |
Jun
(18) |
Jul
(8) |
Aug
(18) |
Sep
(41) |
Oct
(67) |
Nov
(44) |
Dec
(2) |
| 2024 |
Jan
(4) |
Feb
(7) |
Mar
(45) |
Apr
(35) |
May
(4) |
Jun
(29) |
Jul
(4) |
Aug
(37) |
Sep
(16) |
Oct
(12) |
Nov
(6) |
Dec
(8) |
| 2025 |
Jan
(179) |
Feb
(49) |
Mar
(8) |
Apr
(41) |
May
(32) |
Jun
(35) |
Jul
(31) |
Aug
(46) |
Sep
(32) |
Oct
(18) |
Nov
(111) |
Dec
(19) |
| 2026 |
Jan
(21) |
Feb
(8) |
Mar
(18) |
Apr
(58) |
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
|
From: Corey M. <co...@mi...> - 2026-04-30 02:26:30
|
On Wed, Apr 29, 2026 at 04:22:55PM -0700, Tony Hutter wrote:
> Add driver to control the NVMe slot LEDs on the Cray ClusterStor E1000.
> The driver provides hotplug attention status callbacks for the 24 NVMe
> slots on the E1000. This allows users to access the E1000's locate and
> fault LEDs via the normal /sys/bus/pci/slots/<slot>/attention sysfs
> entries. This driver uses IPMI to communicate with the E1000 controller
> to toggle the LEDs.
>
> Signed-off-by: Tony Hutter <hu...@ll...>
For the IPMI portions:
Reviewed-by: Corey Minyard <co...@mi...>
Have you tested removing and adding the IPMI interface while this is up?
You can do that with the hotmod interface on IPMI. I didn't see any
issues, but it's always good to test.
-corey
> ---
> Changes in v8:
> - Remove unused variable in craye1k_get_attention_status().
>
> Changes in v7:
> - Update sysfs-bus-pci text from feedback.
> - Add DMI dependency to Kconfig.
> - Refactor pciehp_core.c to remove CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
> code blocks.
> - Move errno.h #include into correct alphabetical order.
> - Add comment describing the reasoning for the debugfs counters.
> - Move craye1k_init() call from pcie_hp_init() to init_slot().
> - Make craye1k mutex global rather than in craye1k->lock. This enables
> handling of craye1k_[get|set]_attention_status() calls before the craye1k
> driver is initialized.
> - Do driver cleanup on craye1k_smi_gone().
>
> Changes in v6:
> - Change some dev_info_ratelimited() calls to dev_info().
> - Don't call craye1k_init() if pcie_port_service_register() fails.
> - Fix stray indent in #define CRAYE1K_POST_CMD_WAIT_MS
>
> Changes in v5:
> - Removed unnecessary ipmi_smi.h #include.
> - Added WARN_ON() to craye1k_do_message() to sanity check that craye1k->lock
> is held.
> - Added additional comments for when craye1k->lock should be held.
>
> Changes in v4:
> - Fix typo in Kconfig: "is it" -> "it is"
> - Rename some #defines to CRAYE1K_SUBCMD_*
> - Use IS_ERR() check in craye1k_debugfs_init()
> - Return -EIO instead of -EINVAL when LED value check fails
>
> Changes in v3:
> - Add 'attention' values in Documentation/ABI/testing/sysfs-bus-pci.
> - Remove ACPI_PCI_SLOT dependency.
> - Cleanup craye1k_do_message() error checking.
> - Skip unneeded memcpy() on failure in __craye1k_do_command().
> - Merge craye1k_do_command_and_netfn() code into craye1k_do_command().
> - Make craye1k_is_primary() return boolean.
> - Return negative error code on failure in craye1k_set_primary().
>
> Changes in v2:
> - Integrated E1000 code into the pciehp driver as an built-in
> extention rather than as a standalone module.
> - Moved debug tunables and counters to debugfs.
> - Removed forward declarations.
> - Kept the /sys/bus/pci/slots/<slot>/attention interface rather
> than using NPEM/_DSM or led_classdev as suggested. The "attention"
> interface is more beneficial for our site, since it allows us to
> control the NVMe slot LEDs agnostically across different enclosure
> vendors and kernel versions using the same scripts. It is also
> nice to use the same /sys/bus/pci/slots/<slot>/ sysfs directory for
> both slot LED toggling ("attention") and slot power control
> ("power").
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 21 +
> MAINTAINERS | 5 +
> drivers/pci/hotplug/Kconfig | 10 +
> drivers/pci/hotplug/Makefile | 3 +
> drivers/pci/hotplug/pciehp.h | 20 +
> drivers/pci/hotplug/pciehp_core.c | 20 +-
> drivers/pci/hotplug/pciehp_craye1k.c | 687 ++++++++++++++++++++++++
> 7 files changed, 765 insertions(+), 1 deletion(-)
> create mode 100644 drivers/pci/hotplug/pciehp_craye1k.c
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 92debe879ffb..8536d2ff30d1 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -231,6 +231,27 @@ Description:
> - scXX contains the device subclass;
> - iXX contains the device class programming interface.
>
> +What: /sys/bus/pci/slots/.../attention
> +Date: February 2025
> +Contact: lin...@vg...
> +Description:
> + The attention attribute is used to read or write the attention
> + status for an enclosure slot. This is often used to set the
> + slot LED value on a NVMe storage enclosure.
> +
> + Common values:
> + 0 = OFF
> + 1 = ON
> + 2 = blink
> +
> + Using the Cray ClusterStor E1000 extensions:
> + 0 = fault LED OFF, locate LED OFF
> + 1 = fault LED ON, locate LED OFF
> + 2 = fault LED OFF, locate LED ON
> + 3 = fault LED ON, locate LED ON
> +
> + Other values are no-op, OFF, or ON depending on the driver.
> +
> What: /sys/bus/pci/slots/.../module
> Date: June 2009
> Contact: lin...@vg...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9ac254f4ec41..861576d60a36 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6543,6 +6543,11 @@ S: Maintained
> F: Documentation/filesystems/cramfs.rst
> F: fs/cramfs/
>
> +CRAY CLUSTERSTOR E1000 NVME SLOT LED DRIVER EXTENSIONS
> +M: Tony Hutter <hu...@ll...>
> +S: Maintained
> +F: drivers/pci/hotplug/pciehp_craye1k.c
> +
> CRC LIBRARY
> M: Eric Biggers <ebi...@ke...>
> R: Ard Biesheuvel <ar...@ke...>
> diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
> index 3207860b52e4..3cb84e5e13e9 100644
> --- a/drivers/pci/hotplug/Kconfig
> +++ b/drivers/pci/hotplug/Kconfig
> @@ -183,4 +183,14 @@ config HOTPLUG_PCI_S390
>
> When in doubt, say Y.
>
> +config HOTPLUG_PCI_PCIE_CRAY_E1000
> + bool "PCIe Hotplug extensions for Cray ClusterStor E1000"
> + depends on DMI && HOTPLUG_PCI_PCIE && IPMI_HANDLER=y
> + help
> + Say Y here if you have a Cray ClusterStor E1000 and want to control
> + your NVMe slot LEDs. Without this driver it is not possible
> + to control the fault and locate LEDs on the E1000's 24 NVMe slots.
> +
> + When in doubt, say N.
> +
> endif # HOTPLUG_PCI
> diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
> index 40aaf31fe338..82a1f0592d0a 100644
> --- a/drivers/pci/hotplug/Makefile
> +++ b/drivers/pci/hotplug/Makefile
> @@ -66,6 +66,9 @@ pciehp-objs := pciehp_core.o \
> pciehp_ctrl.o \
> pciehp_pci.o \
> pciehp_hpc.o
> +ifdef CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
> +pciehp-objs += pciehp_craye1k.o
> +endif
>
> shpchp-objs := shpchp_core.o \
> shpchp_ctrl.o \
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index debc79b0adfb..3a8173f3e159 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -199,6 +199,17 @@ int pciehp_get_raw_indicator_status(struct hotplug_slot *h_slot, u8 *status);
>
> int pciehp_slot_reset(struct pcie_device *dev);
>
> +#ifdef CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
> +int craye1k_init(void);
> +bool is_craye1k_board(void);
> +int craye1k_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
> +int craye1k_set_attention_status(struct hotplug_slot *hotplug_slot, u8 status);
> +#else
> +#define craye1k_init() (0)
> +#define craye1k_get_attention_status NULL
> +#define craye1k_set_attention_status NULL
> +#endif
> +
> static inline const char *slot_name(struct controller *ctrl)
> {
> return hotplug_slot_name(&ctrl->hotplug_slot);
> @@ -209,4 +220,13 @@ static inline struct controller *to_ctrl(struct hotplug_slot *hotplug_slot)
> return container_of(hotplug_slot, struct controller, hotplug_slot);
> }
>
> +static inline bool is_craye1k_slot(struct controller *ctrl)
> +{
> +#ifdef CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
> + return (PSN(ctrl) >= 1 && PSN(ctrl) <= 24 && is_craye1k_board());
> +#else
> + return false;
> +#endif
> +}
> +
> #endif /* _PCIEHP_H */
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index f59baa912970..3e8e2b3069bf 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -72,6 +72,22 @@ static int init_slot(struct controller *ctrl)
> } else if (ctrl->pcie->port->hotplug_user_indicators) {
> ops->get_attention_status = pciehp_get_raw_indicator_status;
> ops->set_attention_status = pciehp_set_raw_indicator_status;
> + } else if (is_craye1k_slot(ctrl)) {
> + /*
> + * The Cray E1000 driver controls slots 1-24. Initialize the
> + * Cray E1000 driver when slot 1 is seen.
> + */
> + if (PSN(ctrl) == 1) {
> + retval = craye1k_init();
> + if (retval) {
> + ctrl_err(ctrl,
> + "Error loading Cray E1000 extensions");
> + kfree(ops);
> + return retval;
> + }
> + }
> + ops->get_attention_status = craye1k_get_attention_status;
> + ops->set_attention_status = craye1k_set_attention_status;
> }
>
> /* register this slot with the hotplug pci core */
> @@ -376,8 +392,10 @@ int __init pcie_hp_init(void)
>
> retval = pcie_port_service_register(&hpdriver_portdrv);
> pr_debug("pcie_port_service_register = %d\n", retval);
> - if (retval)
> + if (retval) {
> pr_debug("Failure to register service\n");
> + return retval;
> + }
>
> return retval;
> }
> diff --git a/drivers/pci/hotplug/pciehp_craye1k.c b/drivers/pci/hotplug/pciehp_craye1k.c
> new file mode 100644
> index 000000000000..9c5bee81fdf8
> --- /dev/null
> +++ b/drivers/pci/hotplug/pciehp_craye1k.c
> @@ -0,0 +1,687 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2022-2024 Lawrence Livermore National Security, LLC
> + */
> +/*
> + * Cray ClusterStor E1000 hotplug slot LED driver extensions
> + *
> + * This driver controls the NVMe slot LEDs on the Cray ClusterStore E1000.
> + * It provides hotplug attention status callbacks for the 24 NVMe slots on
> + * the E1000. This allows users to access the E1000's locate and fault
> + * LEDs via the normal /sys/bus/pci/slots/<slot>/attention sysfs entries.
> + * This driver uses IPMI to communicate with the E1000 controller to toggle
> + * the LEDs.
> + *
> + * This driver is based off of ibmpex.c
> + */
> +
> +#include <linux/debugfs.h>
> +#include <linux/delay.h>
> +#include <linux/dmi.h>
> +#include <linux/errno.h>
> +#include <linux/ipmi.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
> +#include <linux/random.h>
> +#include "pciehp.h"
> +
> +/* Cray E1000 commands */
> +#define CRAYE1K_CMD_NETFN 0x3c
> +#define CRAYE1K_CMD_PRIMARY 0x33
> +#define CRAYE1K_CMD_FAULT_LED 0x39
> +#define CRAYE1K_CMD_LOCATE_LED 0x22
> +
> +/* Subcommands */
> +#define CRAYE1K_SUBCMD_GET_LED 0x0
> +#define CRAYE1K_SUBCMD_SET_LED 0x1
> +#define CRAYE1K_SUBCMD_SET_PRIMARY 0x1
> +
> +/*
> + * Milliseconds to wait after get/set LED command. 200ms value found though
> + * experimentation
> + */
> +#define CRAYE1K_POST_CMD_WAIT_MS 200
> +
> +struct craye1k {
> + struct device *dev; /* BMC device */
> + struct mutex lock;
> + struct completion read_complete;
> + struct ipmi_addr address;
> + struct ipmi_user *user;
> + int iface;
> +
> + long tx_msg_id;
> + struct kernel_ipmi_msg tx_msg;
> + unsigned char tx_msg_data[IPMI_MAX_MSG_LENGTH];
> + unsigned char rx_msg_data[IPMI_MAX_MSG_LENGTH];
> + unsigned long rx_msg_len;
> + unsigned char rx_result; /* IPMI completion code */
> +
> + /* Parent dir for all our debugfs entries */
> + struct dentry *parent;
> +
> + /* debugfs stats */
> + u64 check_primary;
> + u64 check_primary_failed;
> + u64 was_already_primary;
> + u64 was_not_already_primary;
> + u64 set_primary;
> + u64 set_initial_primary_failed;
> + u64 set_primary_failed;
> + u64 set_led_locate_failed;
> + u64 set_led_fault_failed;
> + u64 set_led_readback_failed;
> + u64 set_led_failed;
> + u64 get_led_failed;
> + u64 completion_timeout;
> + u64 wrong_msgid;
> + u64 request_failed;
> +
> + /* debugfs configuration options */
> +
> + /* Print info on spurious IPMI messages */
> + bool print_errors;
> +
> + /* Retries for kernel IPMI layer */
> + u32 ipmi_retries;
> +
> + /* Timeout in ms for IPMI (0 = use IPMI default_retry_ms) */
> + u32 ipmi_timeout_ms;
> +
> + /* Timeout in ms to wait for E1000 message completion */
> + u32 completion_timeout_ms;
> +};
> +
> +/*
> + * Make our craye1k a global so get/set_attention_status() can access it.
> + * This is safe since there's only one node controller on the board, and so it's
> + * impossible to instantiate more than one craye1k.
> + */
> +static struct craye1k *craye1k_global;
> +static DEFINE_MUTEX(craye1k_lock);
> +
> +/*
> + * The E1000 command timeout and retry values were found though experimentation
> + * by looking at the error counters. Keep the counters around to troubleshoot
> + * any issues with our current timeout/retry values.
> + */
> +static struct dentry *
> +craye1k_debugfs_init(struct craye1k *craye1k)
> +{
> + umode_t mode = 0644;
> + struct dentry *parent = debugfs_create_dir("pciehp_craye1k", NULL);
> +
> + if (IS_ERR(parent))
> + return NULL;
> +
> + debugfs_create_x64("check_primary", mode, parent,
> + &craye1k->check_primary);
> + debugfs_create_x64("check_primary_failed", mode, parent,
> + &craye1k->check_primary_failed);
> + debugfs_create_x64("was_already_primary", mode, parent,
> + &craye1k->was_already_primary);
> + debugfs_create_x64("was_not_already_primary", mode, parent,
> + &craye1k->was_not_already_primary);
> + debugfs_create_x64("set_primary", mode, parent,
> + &craye1k->set_primary);
> + debugfs_create_x64("set_initial_primary_failed", mode, parent,
> + &craye1k->set_initial_primary_failed);
> + debugfs_create_x64("set_primary_failed", mode, parent,
> + &craye1k->set_primary_failed);
> + debugfs_create_x64("set_led_locate_failed", mode, parent,
> + &craye1k->set_led_locate_failed);
> + debugfs_create_x64("set_led_fault_failed", mode, parent,
> + &craye1k->set_led_fault_failed);
> + debugfs_create_x64("set_led_readback_failed", mode, parent,
> + &craye1k->set_led_readback_failed);
> + debugfs_create_x64("set_led_failed", mode, parent,
> + &craye1k->set_led_failed);
> + debugfs_create_x64("get_led_failed", mode, parent,
> + &craye1k->get_led_failed);
> + debugfs_create_x64("completion_timeout", mode, parent,
> + &craye1k->completion_timeout);
> + debugfs_create_x64("wrong_msgid", mode, parent,
> + &craye1k->wrong_msgid);
> + debugfs_create_x64("request_failed", mode, parent,
> + &craye1k->request_failed);
> +
> + debugfs_create_x32("ipmi_retries", mode, parent,
> + &craye1k->ipmi_retries);
> + debugfs_create_x32("ipmi_timeout_ms", mode, parent,
> + &craye1k->ipmi_timeout_ms);
> + debugfs_create_x32("completion_timeout_ms", mode, parent,
> + &craye1k->completion_timeout_ms);
> + debugfs_create_bool("print_errors", mode, parent,
> + &craye1k->print_errors);
> +
> + /* Return parent dir dentry */
> + return parent;
> +}
> +
> +/*
> + * craye1k_msg_handler() - IPMI message response handler
> + */
> +static void craye1k_msg_handler(struct ipmi_recv_msg *msg, void *user_msg_data)
> +{
> + struct craye1k *craye1k = user_msg_data;
> +
> + if (msg->msgid != craye1k->tx_msg_id) {
> + craye1k->wrong_msgid++;
> + if (craye1k->print_errors) {
> + dev_warn_ratelimited(craye1k->dev,
> + "rx msgid %ld != %ld",
> + msg->msgid, craye1k->tx_msg_id);
> + }
> + ipmi_free_recv_msg(msg);
> + return;
> + }
> +
> + /* Set rx_result to the IPMI completion code */
> + if (msg->msg.data_len > 0)
> + craye1k->rx_result = msg->msg.data[0];
> + else
> + craye1k->rx_result = IPMI_UNKNOWN_ERR_COMPLETION_CODE;
> +
> + if (msg->msg.data_len > 1) {
> + /* Exclude completion code from data bytes */
> + craye1k->rx_msg_len = msg->msg.data_len - 1;
> + memcpy(craye1k->rx_msg_data, msg->msg.data + 1,
> + craye1k->rx_msg_len);
> + } else {
> + craye1k->rx_msg_len = 0;
> + }
> +
> + ipmi_free_recv_msg(msg);
> +
> + complete(&craye1k->read_complete);
> +}
> +
> +static const struct ipmi_user_hndl craye1k_user_hndl = {
> + .ipmi_recv_hndl = craye1k_msg_handler
> +};
> +
> +static void craye1k_new_smi(int iface, struct device *dev)
> +{
> + int rc;
> + struct craye1k *craye1k;
> +
> + craye1k = kzalloc(sizeof(*craye1k), GFP_KERNEL);
> + if (!craye1k)
> + return;
> +
> + craye1k->address.addr_type = IPMI_SYSTEM_INTERFACE_ADDR_TYPE;
> + craye1k->address.channel = IPMI_BMC_CHANNEL;
> + craye1k->iface = iface;
> + craye1k->dev = dev;
> + craye1k->tx_msg.data = craye1k->tx_msg_data;
> + craye1k->ipmi_retries = 4;
> + craye1k->ipmi_timeout_ms = 500;
> + craye1k->completion_timeout_ms = 300;
> +
> + init_completion(&craye1k->read_complete);
> +
> + dev_set_drvdata(dev, craye1k);
> +
> + rc = ipmi_create_user(craye1k->iface, &craye1k_user_hndl, craye1k,
> + &craye1k->user);
> + if (rc < 0) {
> + dev_err(dev, "Unable to register IPMI user, iface %d\n",
> + craye1k->iface);
> + kfree(craye1k);
> + dev_set_drvdata(dev, NULL);
> + return;
> + }
> +
> + mutex_lock(&craye1k_lock);
> +
> + /* There's only one node controller so driver data should not be set */
> + WARN_ON(craye1k_global);
> +
> + craye1k_global = craye1k;
> + craye1k->parent = craye1k_debugfs_init(craye1k);
> + mutex_unlock(&craye1k_lock);
> + if (!craye1k->parent)
> + dev_warn(dev, "Cannot create debugfs");
> +
> + dev_info(dev, "Cray ClusterStor E1000 slot LEDs registered");
> +}
> +
> +static void craye1k_smi_gone(int iface)
> +{
> + pr_warn("craye1k: Got unexpected smi_gone, iface=%d", iface);
> +
> + mutex_lock(&craye1k_lock);
> + if (craye1k_global) {
> + debugfs_remove_recursive(craye1k_global->parent);
> + kfree(craye1k_global);
> + craye1k_global = NULL;
> + }
> + mutex_unlock(&craye1k_lock);
> +}
> +
> +static struct ipmi_smi_watcher craye1k_smi_watcher = {
> + .owner = THIS_MODULE,
> + .new_smi = craye1k_new_smi,
> + .smi_gone = craye1k_smi_gone
> +};
> +
> +/*
> + * craye1k_send_message() - Send the message already setup in 'craye1k'
> + *
> + * Context: craye1k_lock is already held.
> + * Return: 0 on success, non-zero on error.
> + */
> +static int craye1k_send_message(struct craye1k *craye1k)
> +{
> + int rc;
> +
> + rc = ipmi_validate_addr(&craye1k->address, sizeof(craye1k->address));
> + if (rc) {
> + dev_err_ratelimited(craye1k->dev, "ipmi_validate_addr() = %d\n",
> + rc);
> + return rc;
> + }
> +
> + craye1k->tx_msg_id++;
> +
> + rc = ipmi_request_settime(craye1k->user, &craye1k->address,
> + craye1k->tx_msg_id, &craye1k->tx_msg, craye1k,
> + 0, craye1k->ipmi_retries,
> + craye1k->ipmi_timeout_ms);
> +
> + if (rc) {
> + craye1k->request_failed++;
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * craye1k_do_message() - Send the message in 'craye1k' and wait for a response
> + *
> + * Context: craye1k_lock is already held.
> + * Return: 0 on success, non-zero on error.
> + */
> +static int craye1k_do_message(struct craye1k *craye1k)
> +{
> + int rc;
> + struct completion *read_complete = &craye1k->read_complete;
> + unsigned long tout = msecs_to_jiffies(craye1k->completion_timeout_ms);
> +
> + WARN_ON(!mutex_is_locked(&craye1k_lock));
> +
> + rc = craye1k_send_message(craye1k);
> + if (rc)
> + return rc;
> +
> + rc = wait_for_completion_killable_timeout(read_complete, tout);
> + if (rc == 0) {
> + /* timed out */
> + craye1k->completion_timeout++;
> + return -ETIME;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * __craye1k_do_command() - Do an IPMI command
> + *
> + * Send a command with optional data bytes, and read back response bytes.
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: 0 on success, non-zero on error.
> + */
> +static int __craye1k_do_command(struct craye1k *craye1k, u8 netfn, u8 cmd,
> + u8 *send_data, u8 send_data_len, u8 *recv_data,
> + u8 recv_data_len)
> +{
> + int rc;
> +
> + craye1k->tx_msg.netfn = netfn;
> + craye1k->tx_msg.cmd = cmd;
> +
> + if (send_data) {
> + memcpy(&craye1k->tx_msg_data[0], send_data, send_data_len);
> + craye1k->tx_msg.data_len = send_data_len;
> + } else {
> + craye1k->tx_msg_data[0] = 0;
> + craye1k->tx_msg.data_len = 0;
> + }
> +
> + rc = craye1k_do_message(craye1k);
> + if (rc == 0)
> + memcpy(recv_data, craye1k->rx_msg_data, recv_data_len);
> +
> + return rc;
> +}
> +
> +/*
> + * craye1k_do_command() - Do a Cray E1000 specific IPMI command.
> + * @cmd: Cray E1000 specific command
> + * @send_data: Data to send after the command
> + * @send_data_len: Data length
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: the last byte from the response or 0 if response had no response
> + * data bytes, else -1 on error.
> + */
> +static int craye1k_do_command(struct craye1k *craye1k, u8 cmd, u8 *send_data,
> + u8 send_data_len)
> +{
> + int rc;
> +
> + rc = __craye1k_do_command(craye1k, CRAYE1K_CMD_NETFN, cmd, send_data,
> + send_data_len, NULL, 0);
> + if (rc != 0) {
> + /* Error attempting command */
> + return -1;
> + }
> +
> + if (craye1k->tx_msg.data_len == 0)
> + return 0;
> +
> + /* Return last received byte value */
> + return craye1k->rx_msg_data[craye1k->rx_msg_len - 1];
> +}
> +
> +/*
> + * __craye1k_set_primary() - Tell the BMC we want to be the primary server
> + *
> + * An E1000 board has two physical servers on it. In order to set a slot
> + * NVMe LED, this server needs to first tell the BMC that it's the primary
> + * server.
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: 0 on success, non-zero on error.
> + */
> +static int __craye1k_set_primary(struct craye1k *craye1k)
> +{
> + u8 bytes[2] = {CRAYE1K_SUBCMD_SET_PRIMARY, 1}; /* set primary to 1 */
> +
> + craye1k->set_primary++;
> + return craye1k_do_command(craye1k, CRAYE1K_CMD_PRIMARY, bytes, 2);
> +}
> +
> +/*
> + * craye1k_is_primary() - Are we the primary server?
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: true if we are the primary server, false otherwise.
> + */
> +static bool craye1k_is_primary(struct craye1k *craye1k)
> +{
> + u8 byte = 0;
> + int rc;
> +
> + /* Response byte is 0x1 on success */
> + rc = craye1k_do_command(craye1k, CRAYE1K_CMD_PRIMARY, &byte, 1);
> + craye1k->check_primary++;
> + if (rc == 0x1)
> + return true; /* success */
> +
> + craye1k->check_primary_failed++;
> + return false; /* We are not the primary server node */
> +}
> +
> +/*
> + * craye1k_set_primary() - Attempt to set ourselves as the primary server
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: 0 on success, -1 otherwise.
> + */
> +static int craye1k_set_primary(struct craye1k *craye1k)
> +{
> + int tries = 10;
> +
> + if (craye1k_is_primary(craye1k)) {
> + craye1k->was_already_primary++;
> + return 0;
> + }
> + craye1k->was_not_already_primary++;
> +
> + /* delay found through experimentation */
> + msleep(300);
> +
> + if (__craye1k_set_primary(craye1k) != 0) {
> + craye1k->set_initial_primary_failed++;
> + return -1; /* error */
> + }
> +
> + /*
> + * It can take 2 to 3 seconds after setting primary for the controller
> + * to report that it is the primary.
> + */
> + while (tries--) {
> + msleep(500);
> + if (craye1k_is_primary(craye1k))
> + break;
> + }
> +
> + if (tries == 0) {
> + craye1k->set_primary_failed++;
> + return -1; /* never reported that it's primary */
> + }
> +
> + /* Wait for primary switch to finish */
> + msleep(1500);
> +
> + return 0;
> +}
> +
> +/*
> + * craye1k_get_slot_led() - Get slot LED value
> + * @slot: Slot number (1-24)
> + * @is_locate_led: 0 = get fault LED value, 1 = get locate LED value
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: slot value on success, -1 on failure.
> + */
> +static int craye1k_get_slot_led(struct craye1k *craye1k, unsigned char slot,
> + bool is_locate_led)
> +{
> + u8 bytes[2];
> + u8 cmd;
> +
> + bytes[0] = CRAYE1K_SUBCMD_GET_LED;
> + bytes[1] = slot;
> +
> + cmd = is_locate_led ? CRAYE1K_CMD_LOCATE_LED : CRAYE1K_CMD_FAULT_LED;
> +
> + return craye1k_do_command(craye1k, cmd, bytes, 2);
> +}
> +
> +/*
> + * craye1k_set_slot_led() - Attempt to set the locate/fault LED to a value
> + * @slot: Slot number (1-24)
> + * @is_locate_led: 0 = use fault LED, 1 = use locate LED
> + * @value: Value to set (0 or 1)
> + *
> + * Check the LED value after calling this function to ensure it has been set
> + * properly.
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: 0 on success, non-zero on failure.
> + */
> +static int craye1k_set_slot_led(struct craye1k *craye1k, unsigned char slot,
> + unsigned char is_locate_led,
> + unsigned char value)
> +{
> + u8 bytes[3];
> + u8 cmd;
> +
> + bytes[0] = CRAYE1K_SUBCMD_SET_LED;
> + bytes[1] = slot;
> + bytes[2] = value;
> +
> + cmd = is_locate_led ? CRAYE1K_CMD_LOCATE_LED : CRAYE1K_CMD_FAULT_LED;
> +
> + return craye1k_do_command(craye1k, cmd, bytes, 3);
> +}
> +
> +/*
> + * __craye1k_get_attention_status() - Get LED value
> + *
> + * Context: craye1k_lock is already held.
> + * Returns: 0 on success, -EIO on failure.
> + */
> +static int __craye1k_get_attention_status(struct hotplug_slot *hotplug_slot,
> + u8 *status, bool set_primary)
> +{
> + unsigned char slot;
> + int locate, fault;
> + struct craye1k *craye1k;
> +
> + craye1k = craye1k_global;
> + slot = PSN(to_ctrl(hotplug_slot));
> +
> + if (set_primary) {
> + if (craye1k_set_primary(craye1k) != 0) {
> + craye1k->get_led_failed++;
> + return -EIO;
> + }
> + }
> +
> + locate = craye1k_get_slot_led(craye1k, slot, true);
> + if (locate == -1) {
> + craye1k->get_led_failed++;
> + return -EIO;
> + }
> + msleep(CRAYE1K_POST_CMD_WAIT_MS);
> +
> + fault = craye1k_get_slot_led(craye1k, slot, false);
> + if (fault == -1) {
> + craye1k->get_led_failed++;
> + return -EIO;
> + }
> + msleep(CRAYE1K_POST_CMD_WAIT_MS);
> +
> + *status = locate << 1 | fault;
> +
> + return 0;
> +}
> +
> +int craye1k_get_attention_status(struct hotplug_slot *hotplug_slot,
> + u8 *status)
> +{
> + int rc;
> +
> + if (mutex_lock_interruptible(&craye1k_lock) != 0)
> + return -EINTR;
> +
> + if (!craye1k_global) {
> + /* Driver isn't initialized yet */
> + mutex_unlock(&craye1k_lock);
> + return -EOPNOTSUPP;
> + }
> +
> + rc = __craye1k_get_attention_status(hotplug_slot, status, true);
> +
> + mutex_unlock(&craye1k_lock);
> + return rc;
> +}
> +
> +int craye1k_set_attention_status(struct hotplug_slot *hotplug_slot,
> + u8 status)
> +{
> + unsigned char slot;
> + int tries = 4;
> + int rc;
> + u8 new_status;
> + struct craye1k *craye1k;
> + bool locate, fault;
> +
> + if (mutex_lock_interruptible(&craye1k_lock) != 0)
> + return -EINTR;
> +
> + if (!craye1k_global) {
> + /* Driver isn't initialized yet */
> + mutex_unlock(&craye1k_lock);
> + return -EOPNOTSUPP;
> + }
> +
> + craye1k = craye1k_global;
> +
> + slot = PSN(to_ctrl(hotplug_slot));
> +
> + /* Retry to ensure all LEDs are set */
> + while (tries--) {
> + /*
> + * The node must first set itself to be the primary node before
> + * setting the slot LEDs (each board has two nodes, or
> + * "servers" as they're called by the manufacturer). This can
> + * lead to contention if both nodes are trying to set the LEDs
> + * at the same time.
> + */
> + rc = craye1k_set_primary(craye1k);
> + if (rc != 0) {
> + /* Could not set as primary node. Just retry again. */
> + continue;
> + }
> +
> + /* Write value twice to increase success rate */
> + locate = (status & 0x2) >> 1;
> + craye1k_set_slot_led(craye1k, slot, 1, locate);
> + if (craye1k_set_slot_led(craye1k, slot, 1, locate) != 0) {
> + craye1k->set_led_locate_failed++;
> + continue; /* fail, retry */
> + }
> +
> + msleep(CRAYE1K_POST_CMD_WAIT_MS);
> +
> + fault = status & 0x1;
> + craye1k_set_slot_led(craye1k, slot, 0, fault);
> + if (craye1k_set_slot_led(craye1k, slot, 0, fault) != 0) {
> + craye1k->set_led_fault_failed++;
> + continue; /* fail, retry */
> + }
> +
> + msleep(CRAYE1K_POST_CMD_WAIT_MS);
> +
> + rc = __craye1k_get_attention_status(hotplug_slot, &new_status,
> + false);
> +
> + msleep(CRAYE1K_POST_CMD_WAIT_MS);
> +
> + if (rc == 0 && new_status == status)
> + break; /* success */
> +
> + craye1k->set_led_readback_failed++;
> +
> + /*
> + * At this point we weren't successful in setting the LED and
> + * need to try again.
> + *
> + * Do a random back-off to reduce contention with other server
> + * node in the unlikely case that both server nodes are trying to
> + * trying to set a LED at the same time.
> + *
> + * The 500ms minimum in the back-off reduced the chance of this
> + * whole retry loop failing from 1 in 700 to none in 10000.
> + */
> + msleep(500 + (get_random_long() % 500));
> + }
> + mutex_unlock(&craye1k_lock);
> + if (tries == 0) {
> + craye1k->set_led_failed++;
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +bool is_craye1k_board(void)
> +{
> + return dmi_match(DMI_PRODUCT_NAME, "VSSEP1EC");
> +}
> +
> +int craye1k_init(void)
> +{
> + return ipmi_smi_watcher_register(&craye1k_smi_watcher);
> +}
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Tony Hutter <hu...@ll...>");
> +MODULE_DESCRIPTION("Cray E1000 NVMe Slot LED driver");
> --
> 2.43.7
>
>
|
|
From: Tony H. <hu...@ll...> - 2026-04-29 23:23:29
|
Add driver to control the NVMe slot LEDs on the Cray ClusterStor E1000.
The driver provides hotplug attention status callbacks for the 24 NVMe
slots on the E1000. This allows users to access the E1000's locate and
fault LEDs via the normal /sys/bus/pci/slots/<slot>/attention sysfs
entries. This driver uses IPMI to communicate with the E1000 controller
to toggle the LEDs.
Signed-off-by: Tony Hutter <hu...@ll...>
---
Changes in v8:
- Remove unused variable in craye1k_get_attention_status().
Changes in v7:
- Update sysfs-bus-pci text from feedback.
- Add DMI dependency to Kconfig.
- Refactor pciehp_core.c to remove CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
code blocks.
- Move errno.h #include into correct alphabetical order.
- Add comment describing the reasoning for the debugfs counters.
- Move craye1k_init() call from pcie_hp_init() to init_slot().
- Make craye1k mutex global rather than in craye1k->lock. This enables
handling of craye1k_[get|set]_attention_status() calls before the craye1k
driver is initialized.
- Do driver cleanup on craye1k_smi_gone().
Changes in v6:
- Change some dev_info_ratelimited() calls to dev_info().
- Don't call craye1k_init() if pcie_port_service_register() fails.
- Fix stray indent in #define CRAYE1K_POST_CMD_WAIT_MS
Changes in v5:
- Removed unnecessary ipmi_smi.h #include.
- Added WARN_ON() to craye1k_do_message() to sanity check that craye1k->lock
is held.
- Added additional comments for when craye1k->lock should be held.
Changes in v4:
- Fix typo in Kconfig: "is it" -> "it is"
- Rename some #defines to CRAYE1K_SUBCMD_*
- Use IS_ERR() check in craye1k_debugfs_init()
- Return -EIO instead of -EINVAL when LED value check fails
Changes in v3:
- Add 'attention' values in Documentation/ABI/testing/sysfs-bus-pci.
- Remove ACPI_PCI_SLOT dependency.
- Cleanup craye1k_do_message() error checking.
- Skip unneeded memcpy() on failure in __craye1k_do_command().
- Merge craye1k_do_command_and_netfn() code into craye1k_do_command().
- Make craye1k_is_primary() return boolean.
- Return negative error code on failure in craye1k_set_primary().
Changes in v2:
- Integrated E1000 code into the pciehp driver as an built-in
extention rather than as a standalone module.
- Moved debug tunables and counters to debugfs.
- Removed forward declarations.
- Kept the /sys/bus/pci/slots/<slot>/attention interface rather
than using NPEM/_DSM or led_classdev as suggested. The "attention"
interface is more beneficial for our site, since it allows us to
control the NVMe slot LEDs agnostically across different enclosure
vendors and kernel versions using the same scripts. It is also
nice to use the same /sys/bus/pci/slots/<slot>/ sysfs directory for
both slot LED toggling ("attention") and slot power control
("power").
---
Documentation/ABI/testing/sysfs-bus-pci | 21 +
MAINTAINERS | 5 +
drivers/pci/hotplug/Kconfig | 10 +
drivers/pci/hotplug/Makefile | 3 +
drivers/pci/hotplug/pciehp.h | 20 +
drivers/pci/hotplug/pciehp_core.c | 20 +-
drivers/pci/hotplug/pciehp_craye1k.c | 687 ++++++++++++++++++++++++
7 files changed, 765 insertions(+), 1 deletion(-)
create mode 100644 drivers/pci/hotplug/pciehp_craye1k.c
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 92debe879ffb..8536d2ff30d1 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -231,6 +231,27 @@ Description:
- scXX contains the device subclass;
- iXX contains the device class programming interface.
+What: /sys/bus/pci/slots/.../attention
+Date: February 2025
+Contact: lin...@vg...
+Description:
+ The attention attribute is used to read or write the attention
+ status for an enclosure slot. This is often used to set the
+ slot LED value on a NVMe storage enclosure.
+
+ Common values:
+ 0 = OFF
+ 1 = ON
+ 2 = blink
+
+ Using the Cray ClusterStor E1000 extensions:
+ 0 = fault LED OFF, locate LED OFF
+ 1 = fault LED ON, locate LED OFF
+ 2 = fault LED OFF, locate LED ON
+ 3 = fault LED ON, locate LED ON
+
+ Other values are no-op, OFF, or ON depending on the driver.
+
What: /sys/bus/pci/slots/.../module
Date: June 2009
Contact: lin...@vg...
diff --git a/MAINTAINERS b/MAINTAINERS
index 9ac254f4ec41..861576d60a36 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6543,6 +6543,11 @@ S: Maintained
F: Documentation/filesystems/cramfs.rst
F: fs/cramfs/
+CRAY CLUSTERSTOR E1000 NVME SLOT LED DRIVER EXTENSIONS
+M: Tony Hutter <hu...@ll...>
+S: Maintained
+F: drivers/pci/hotplug/pciehp_craye1k.c
+
CRC LIBRARY
M: Eric Biggers <ebi...@ke...>
R: Ard Biesheuvel <ar...@ke...>
diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
index 3207860b52e4..3cb84e5e13e9 100644
--- a/drivers/pci/hotplug/Kconfig
+++ b/drivers/pci/hotplug/Kconfig
@@ -183,4 +183,14 @@ config HOTPLUG_PCI_S390
When in doubt, say Y.
+config HOTPLUG_PCI_PCIE_CRAY_E1000
+ bool "PCIe Hotplug extensions for Cray ClusterStor E1000"
+ depends on DMI && HOTPLUG_PCI_PCIE && IPMI_HANDLER=y
+ help
+ Say Y here if you have a Cray ClusterStor E1000 and want to control
+ your NVMe slot LEDs. Without this driver it is not possible
+ to control the fault and locate LEDs on the E1000's 24 NVMe slots.
+
+ When in doubt, say N.
+
endif # HOTPLUG_PCI
diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
index 40aaf31fe338..82a1f0592d0a 100644
--- a/drivers/pci/hotplug/Makefile
+++ b/drivers/pci/hotplug/Makefile
@@ -66,6 +66,9 @@ pciehp-objs := pciehp_core.o \
pciehp_ctrl.o \
pciehp_pci.o \
pciehp_hpc.o
+ifdef CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
+pciehp-objs += pciehp_craye1k.o
+endif
shpchp-objs := shpchp_core.o \
shpchp_ctrl.o \
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index debc79b0adfb..3a8173f3e159 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -199,6 +199,17 @@ int pciehp_get_raw_indicator_status(struct hotplug_slot *h_slot, u8 *status);
int pciehp_slot_reset(struct pcie_device *dev);
+#ifdef CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
+int craye1k_init(void);
+bool is_craye1k_board(void);
+int craye1k_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
+int craye1k_set_attention_status(struct hotplug_slot *hotplug_slot, u8 status);
+#else
+#define craye1k_init() (0)
+#define craye1k_get_attention_status NULL
+#define craye1k_set_attention_status NULL
+#endif
+
static inline const char *slot_name(struct controller *ctrl)
{
return hotplug_slot_name(&ctrl->hotplug_slot);
@@ -209,4 +220,13 @@ static inline struct controller *to_ctrl(struct hotplug_slot *hotplug_slot)
return container_of(hotplug_slot, struct controller, hotplug_slot);
}
+static inline bool is_craye1k_slot(struct controller *ctrl)
+{
+#ifdef CONFIG_HOTPLUG_PCI_PCIE_CRAY_E1000
+ return (PSN(ctrl) >= 1 && PSN(ctrl) <= 24 && is_craye1k_board());
+#else
+ return false;
+#endif
+}
+
#endif /* _PCIEHP_H */
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index f59baa912970..3e8e2b3069bf 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -72,6 +72,22 @@ static int init_slot(struct controller *ctrl)
} else if (ctrl->pcie->port->hotplug_user_indicators) {
ops->get_attention_status = pciehp_get_raw_indicator_status;
ops->set_attention_status = pciehp_set_raw_indicator_status;
+ } else if (is_craye1k_slot(ctrl)) {
+ /*
+ * The Cray E1000 driver controls slots 1-24. Initialize the
+ * Cray E1000 driver when slot 1 is seen.
+ */
+ if (PSN(ctrl) == 1) {
+ retval = craye1k_init();
+ if (retval) {
+ ctrl_err(ctrl,
+ "Error loading Cray E1000 extensions");
+ kfree(ops);
+ return retval;
+ }
+ }
+ ops->get_attention_status = craye1k_get_attention_status;
+ ops->set_attention_status = craye1k_set_attention_status;
}
/* register this slot with the hotplug pci core */
@@ -376,8 +392,10 @@ int __init pcie_hp_init(void)
retval = pcie_port_service_register(&hpdriver_portdrv);
pr_debug("pcie_port_service_register = %d\n", retval);
- if (retval)
+ if (retval) {
pr_debug("Failure to register service\n");
+ return retval;
+ }
return retval;
}
diff --git a/drivers/pci/hotplug/pciehp_craye1k.c b/drivers/pci/hotplug/pciehp_craye1k.c
new file mode 100644
index 000000000000..9c5bee81fdf8
--- /dev/null
+++ b/drivers/pci/hotplug/pciehp_craye1k.c
@@ -0,0 +1,687 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2022-2024 Lawrence Livermore National Security, LLC
+ */
+/*
+ * Cray ClusterStor E1000 hotplug slot LED driver extensions
+ *
+ * This driver controls the NVMe slot LEDs on the Cray ClusterStore E1000.
+ * It provides hotplug attention status callbacks for the 24 NVMe slots on
+ * the E1000. This allows users to access the E1000's locate and fault
+ * LEDs via the normal /sys/bus/pci/slots/<slot>/attention sysfs entries.
+ * This driver uses IPMI to communicate with the E1000 controller to toggle
+ * the LEDs.
+ *
+ * This driver is based off of ibmpex.c
+ */
+
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/dmi.h>
+#include <linux/errno.h>
+#include <linux/ipmi.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/pci_hotplug.h>
+#include <linux/random.h>
+#include "pciehp.h"
+
+/* Cray E1000 commands */
+#define CRAYE1K_CMD_NETFN 0x3c
+#define CRAYE1K_CMD_PRIMARY 0x33
+#define CRAYE1K_CMD_FAULT_LED 0x39
+#define CRAYE1K_CMD_LOCATE_LED 0x22
+
+/* Subcommands */
+#define CRAYE1K_SUBCMD_GET_LED 0x0
+#define CRAYE1K_SUBCMD_SET_LED 0x1
+#define CRAYE1K_SUBCMD_SET_PRIMARY 0x1
+
+/*
+ * Milliseconds to wait after get/set LED command. 200ms value found though
+ * experimentation
+ */
+#define CRAYE1K_POST_CMD_WAIT_MS 200
+
+struct craye1k {
+ struct device *dev; /* BMC device */
+ struct mutex lock;
+ struct completion read_complete;
+ struct ipmi_addr address;
+ struct ipmi_user *user;
+ int iface;
+
+ long tx_msg_id;
+ struct kernel_ipmi_msg tx_msg;
+ unsigned char tx_msg_data[IPMI_MAX_MSG_LENGTH];
+ unsigned char rx_msg_data[IPMI_MAX_MSG_LENGTH];
+ unsigned long rx_msg_len;
+ unsigned char rx_result; /* IPMI completion code */
+
+ /* Parent dir for all our debugfs entries */
+ struct dentry *parent;
+
+ /* debugfs stats */
+ u64 check_primary;
+ u64 check_primary_failed;
+ u64 was_already_primary;
+ u64 was_not_already_primary;
+ u64 set_primary;
+ u64 set_initial_primary_failed;
+ u64 set_primary_failed;
+ u64 set_led_locate_failed;
+ u64 set_led_fault_failed;
+ u64 set_led_readback_failed;
+ u64 set_led_failed;
+ u64 get_led_failed;
+ u64 completion_timeout;
+ u64 wrong_msgid;
+ u64 request_failed;
+
+ /* debugfs configuration options */
+
+ /* Print info on spurious IPMI messages */
+ bool print_errors;
+
+ /* Retries for kernel IPMI layer */
+ u32 ipmi_retries;
+
+ /* Timeout in ms for IPMI (0 = use IPMI default_retry_ms) */
+ u32 ipmi_timeout_ms;
+
+ /* Timeout in ms to wait for E1000 message completion */
+ u32 completion_timeout_ms;
+};
+
+/*
+ * Make our craye1k a global so get/set_attention_status() can access it.
+ * This is safe since there's only one node controller on the board, and so it's
+ * impossible to instantiate more than one craye1k.
+ */
+static struct craye1k *craye1k_global;
+static DEFINE_MUTEX(craye1k_lock);
+
+/*
+ * The E1000 command timeout and retry values were found though experimentation
+ * by looking at the error counters. Keep the counters around to troubleshoot
+ * any issues with our current timeout/retry values.
+ */
+static struct dentry *
+craye1k_debugfs_init(struct craye1k *craye1k)
+{
+ umode_t mode = 0644;
+ struct dentry *parent = debugfs_create_dir("pciehp_craye1k", NULL);
+
+ if (IS_ERR(parent))
+ return NULL;
+
+ debugfs_create_x64("check_primary", mode, parent,
+ &craye1k->check_primary);
+ debugfs_create_x64("check_primary_failed", mode, parent,
+ &craye1k->check_primary_failed);
+ debugfs_create_x64("was_already_primary", mode, parent,
+ &craye1k->was_already_primary);
+ debugfs_create_x64("was_not_already_primary", mode, parent,
+ &craye1k->was_not_already_primary);
+ debugfs_create_x64("set_primary", mode, parent,
+ &craye1k->set_primary);
+ debugfs_create_x64("set_initial_primary_failed", mode, parent,
+ &craye1k->set_initial_primary_failed);
+ debugfs_create_x64("set_primary_failed", mode, parent,
+ &craye1k->set_primary_failed);
+ debugfs_create_x64("set_led_locate_failed", mode, parent,
+ &craye1k->set_led_locate_failed);
+ debugfs_create_x64("set_led_fault_failed", mode, parent,
+ &craye1k->set_led_fault_failed);
+ debugfs_create_x64("set_led_readback_failed", mode, parent,
+ &craye1k->set_led_readback_failed);
+ debugfs_create_x64("set_led_failed", mode, parent,
+ &craye1k->set_led_failed);
+ debugfs_create_x64("get_led_failed", mode, parent,
+ &craye1k->get_led_failed);
+ debugfs_create_x64("completion_timeout", mode, parent,
+ &craye1k->completion_timeout);
+ debugfs_create_x64("wrong_msgid", mode, parent,
+ &craye1k->wrong_msgid);
+ debugfs_create_x64("request_failed", mode, parent,
+ &craye1k->request_failed);
+
+ debugfs_create_x32("ipmi_retries", mode, parent,
+ &craye1k->ipmi_retries);
+ debugfs_create_x32("ipmi_timeout_ms", mode, parent,
+ &craye1k->ipmi_timeout_ms);
+ debugfs_create_x32("completion_timeout_ms", mode, parent,
+ &craye1k->completion_timeout_ms);
+ debugfs_create_bool("print_errors", mode, parent,
+ &craye1k->print_errors);
+
+ /* Return parent dir dentry */
+ return parent;
+}
+
+/*
+ * craye1k_msg_handler() - IPMI message response handler
+ */
+static void craye1k_msg_handler(struct ipmi_recv_msg *msg, void *user_msg_data)
+{
+ struct craye1k *craye1k = user_msg_data;
+
+ if (msg->msgid != craye1k->tx_msg_id) {
+ craye1k->wrong_msgid++;
+ if (craye1k->print_errors) {
+ dev_warn_ratelimited(craye1k->dev,
+ "rx msgid %ld != %ld",
+ msg->msgid, craye1k->tx_msg_id);
+ }
+ ipmi_free_recv_msg(msg);
+ return;
+ }
+
+ /* Set rx_result to the IPMI completion code */
+ if (msg->msg.data_len > 0)
+ craye1k->rx_result = msg->msg.data[0];
+ else
+ craye1k->rx_result = IPMI_UNKNOWN_ERR_COMPLETION_CODE;
+
+ if (msg->msg.data_len > 1) {
+ /* Exclude completion code from data bytes */
+ craye1k->rx_msg_len = msg->msg.data_len - 1;
+ memcpy(craye1k->rx_msg_data, msg->msg.data + 1,
+ craye1k->rx_msg_len);
+ } else {
+ craye1k->rx_msg_len = 0;
+ }
+
+ ipmi_free_recv_msg(msg);
+
+ complete(&craye1k->read_complete);
+}
+
+static const struct ipmi_user_hndl craye1k_user_hndl = {
+ .ipmi_recv_hndl = craye1k_msg_handler
+};
+
+static void craye1k_new_smi(int iface, struct device *dev)
+{
+ int rc;
+ struct craye1k *craye1k;
+
+ craye1k = kzalloc(sizeof(*craye1k), GFP_KERNEL);
+ if (!craye1k)
+ return;
+
+ craye1k->address.addr_type = IPMI_SYSTEM_INTERFACE_ADDR_TYPE;
+ craye1k->address.channel = IPMI_BMC_CHANNEL;
+ craye1k->iface = iface;
+ craye1k->dev = dev;
+ craye1k->tx_msg.data = craye1k->tx_msg_data;
+ craye1k->ipmi_retries = 4;
+ craye1k->ipmi_timeout_ms = 500;
+ craye1k->completion_timeout_ms = 300;
+
+ init_completion(&craye1k->read_complete);
+
+ dev_set_drvdata(dev, craye1k);
+
+ rc = ipmi_create_user(craye1k->iface, &craye1k_user_hndl, craye1k,
+ &craye1k->user);
+ if (rc < 0) {
+ dev_err(dev, "Unable to register IPMI user, iface %d\n",
+ craye1k->iface);
+ kfree(craye1k);
+ dev_set_drvdata(dev, NULL);
+ return;
+ }
+
+ mutex_lock(&craye1k_lock);
+
+ /* There's only one node controller so driver data should not be set */
+ WARN_ON(craye1k_global);
+
+ craye1k_global = craye1k;
+ craye1k->parent = craye1k_debugfs_init(craye1k);
+ mutex_unlock(&craye1k_lock);
+ if (!craye1k->parent)
+ dev_warn(dev, "Cannot create debugfs");
+
+ dev_info(dev, "Cray ClusterStor E1000 slot LEDs registered");
+}
+
+static void craye1k_smi_gone(int iface)
+{
+ pr_warn("craye1k: Got unexpected smi_gone, iface=%d", iface);
+
+ mutex_lock(&craye1k_lock);
+ if (craye1k_global) {
+ debugfs_remove_recursive(craye1k_global->parent);
+ kfree(craye1k_global);
+ craye1k_global = NULL;
+ }
+ mutex_unlock(&craye1k_lock);
+}
+
+static struct ipmi_smi_watcher craye1k_smi_watcher = {
+ .owner = THIS_MODULE,
+ .new_smi = craye1k_new_smi,
+ .smi_gone = craye1k_smi_gone
+};
+
+/*
+ * craye1k_send_message() - Send the message already setup in 'craye1k'
+ *
+ * Context: craye1k_lock is already held.
+ * Return: 0 on success, non-zero on error.
+ */
+static int craye1k_send_message(struct craye1k *craye1k)
+{
+ int rc;
+
+ rc = ipmi_validate_addr(&craye1k->address, sizeof(craye1k->address));
+ if (rc) {
+ dev_err_ratelimited(craye1k->dev, "ipmi_validate_addr() = %d\n",
+ rc);
+ return rc;
+ }
+
+ craye1k->tx_msg_id++;
+
+ rc = ipmi_request_settime(craye1k->user, &craye1k->address,
+ craye1k->tx_msg_id, &craye1k->tx_msg, craye1k,
+ 0, craye1k->ipmi_retries,
+ craye1k->ipmi_timeout_ms);
+
+ if (rc) {
+ craye1k->request_failed++;
+ return rc;
+ }
+
+ return 0;
+}
+
+/*
+ * craye1k_do_message() - Send the message in 'craye1k' and wait for a response
+ *
+ * Context: craye1k_lock is already held.
+ * Return: 0 on success, non-zero on error.
+ */
+static int craye1k_do_message(struct craye1k *craye1k)
+{
+ int rc;
+ struct completion *read_complete = &craye1k->read_complete;
+ unsigned long tout = msecs_to_jiffies(craye1k->completion_timeout_ms);
+
+ WARN_ON(!mutex_is_locked(&craye1k_lock));
+
+ rc = craye1k_send_message(craye1k);
+ if (rc)
+ return rc;
+
+ rc = wait_for_completion_killable_timeout(read_complete, tout);
+ if (rc == 0) {
+ /* timed out */
+ craye1k->completion_timeout++;
+ return -ETIME;
+ }
+
+ return 0;
+}
+
+/*
+ * __craye1k_do_command() - Do an IPMI command
+ *
+ * Send a command with optional data bytes, and read back response bytes.
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: 0 on success, non-zero on error.
+ */
+static int __craye1k_do_command(struct craye1k *craye1k, u8 netfn, u8 cmd,
+ u8 *send_data, u8 send_data_len, u8 *recv_data,
+ u8 recv_data_len)
+{
+ int rc;
+
+ craye1k->tx_msg.netfn = netfn;
+ craye1k->tx_msg.cmd = cmd;
+
+ if (send_data) {
+ memcpy(&craye1k->tx_msg_data[0], send_data, send_data_len);
+ craye1k->tx_msg.data_len = send_data_len;
+ } else {
+ craye1k->tx_msg_data[0] = 0;
+ craye1k->tx_msg.data_len = 0;
+ }
+
+ rc = craye1k_do_message(craye1k);
+ if (rc == 0)
+ memcpy(recv_data, craye1k->rx_msg_data, recv_data_len);
+
+ return rc;
+}
+
+/*
+ * craye1k_do_command() - Do a Cray E1000 specific IPMI command.
+ * @cmd: Cray E1000 specific command
+ * @send_data: Data to send after the command
+ * @send_data_len: Data length
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: the last byte from the response or 0 if response had no response
+ * data bytes, else -1 on error.
+ */
+static int craye1k_do_command(struct craye1k *craye1k, u8 cmd, u8 *send_data,
+ u8 send_data_len)
+{
+ int rc;
+
+ rc = __craye1k_do_command(craye1k, CRAYE1K_CMD_NETFN, cmd, send_data,
+ send_data_len, NULL, 0);
+ if (rc != 0) {
+ /* Error attempting command */
+ return -1;
+ }
+
+ if (craye1k->tx_msg.data_len == 0)
+ return 0;
+
+ /* Return last received byte value */
+ return craye1k->rx_msg_data[craye1k->rx_msg_len - 1];
+}
+
+/*
+ * __craye1k_set_primary() - Tell the BMC we want to be the primary server
+ *
+ * An E1000 board has two physical servers on it. In order to set a slot
+ * NVMe LED, this server needs to first tell the BMC that it's the primary
+ * server.
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: 0 on success, non-zero on error.
+ */
+static int __craye1k_set_primary(struct craye1k *craye1k)
+{
+ u8 bytes[2] = {CRAYE1K_SUBCMD_SET_PRIMARY, 1}; /* set primary to 1 */
+
+ craye1k->set_primary++;
+ return craye1k_do_command(craye1k, CRAYE1K_CMD_PRIMARY, bytes, 2);
+}
+
+/*
+ * craye1k_is_primary() - Are we the primary server?
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: true if we are the primary server, false otherwise.
+ */
+static bool craye1k_is_primary(struct craye1k *craye1k)
+{
+ u8 byte = 0;
+ int rc;
+
+ /* Response byte is 0x1 on success */
+ rc = craye1k_do_command(craye1k, CRAYE1K_CMD_PRIMARY, &byte, 1);
+ craye1k->check_primary++;
+ if (rc == 0x1)
+ return true; /* success */
+
+ craye1k->check_primary_failed++;
+ return false; /* We are not the primary server node */
+}
+
+/*
+ * craye1k_set_primary() - Attempt to set ourselves as the primary server
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: 0 on success, -1 otherwise.
+ */
+static int craye1k_set_primary(struct craye1k *craye1k)
+{
+ int tries = 10;
+
+ if (craye1k_is_primary(craye1k)) {
+ craye1k->was_already_primary++;
+ return 0;
+ }
+ craye1k->was_not_already_primary++;
+
+ /* delay found through experimentation */
+ msleep(300);
+
+ if (__craye1k_set_primary(craye1k) != 0) {
+ craye1k->set_initial_primary_failed++;
+ return -1; /* error */
+ }
+
+ /*
+ * It can take 2 to 3 seconds after setting primary for the controller
+ * to report that it is the primary.
+ */
+ while (tries--) {
+ msleep(500);
+ if (craye1k_is_primary(craye1k))
+ break;
+ }
+
+ if (tries == 0) {
+ craye1k->set_primary_failed++;
+ return -1; /* never reported that it's primary */
+ }
+
+ /* Wait for primary switch to finish */
+ msleep(1500);
+
+ return 0;
+}
+
+/*
+ * craye1k_get_slot_led() - Get slot LED value
+ * @slot: Slot number (1-24)
+ * @is_locate_led: 0 = get fault LED value, 1 = get locate LED value
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: slot value on success, -1 on failure.
+ */
+static int craye1k_get_slot_led(struct craye1k *craye1k, unsigned char slot,
+ bool is_locate_led)
+{
+ u8 bytes[2];
+ u8 cmd;
+
+ bytes[0] = CRAYE1K_SUBCMD_GET_LED;
+ bytes[1] = slot;
+
+ cmd = is_locate_led ? CRAYE1K_CMD_LOCATE_LED : CRAYE1K_CMD_FAULT_LED;
+
+ return craye1k_do_command(craye1k, cmd, bytes, 2);
+}
+
+/*
+ * craye1k_set_slot_led() - Attempt to set the locate/fault LED to a value
+ * @slot: Slot number (1-24)
+ * @is_locate_led: 0 = use fault LED, 1 = use locate LED
+ * @value: Value to set (0 or 1)
+ *
+ * Check the LED value after calling this function to ensure it has been set
+ * properly.
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: 0 on success, non-zero on failure.
+ */
+static int craye1k_set_slot_led(struct craye1k *craye1k, unsigned char slot,
+ unsigned char is_locate_led,
+ unsigned char value)
+{
+ u8 bytes[3];
+ u8 cmd;
+
+ bytes[0] = CRAYE1K_SUBCMD_SET_LED;
+ bytes[1] = slot;
+ bytes[2] = value;
+
+ cmd = is_locate_led ? CRAYE1K_CMD_LOCATE_LED : CRAYE1K_CMD_FAULT_LED;
+
+ return craye1k_do_command(craye1k, cmd, bytes, 3);
+}
+
+/*
+ * __craye1k_get_attention_status() - Get LED value
+ *
+ * Context: craye1k_lock is already held.
+ * Returns: 0 on success, -EIO on failure.
+ */
+static int __craye1k_get_attention_status(struct hotplug_slot *hotplug_slot,
+ u8 *status, bool set_primary)
+{
+ unsigned char slot;
+ int locate, fault;
+ struct craye1k *craye1k;
+
+ craye1k = craye1k_global;
+ slot = PSN(to_ctrl(hotplug_slot));
+
+ if (set_primary) {
+ if (craye1k_set_primary(craye1k) != 0) {
+ craye1k->get_led_failed++;
+ return -EIO;
+ }
+ }
+
+ locate = craye1k_get_slot_led(craye1k, slot, true);
+ if (locate == -1) {
+ craye1k->get_led_failed++;
+ return -EIO;
+ }
+ msleep(CRAYE1K_POST_CMD_WAIT_MS);
+
+ fault = craye1k_get_slot_led(craye1k, slot, false);
+ if (fault == -1) {
+ craye1k->get_led_failed++;
+ return -EIO;
+ }
+ msleep(CRAYE1K_POST_CMD_WAIT_MS);
+
+ *status = locate << 1 | fault;
+
+ return 0;
+}
+
+int craye1k_get_attention_status(struct hotplug_slot *hotplug_slot,
+ u8 *status)
+{
+ int rc;
+
+ if (mutex_lock_interruptible(&craye1k_lock) != 0)
+ return -EINTR;
+
+ if (!craye1k_global) {
+ /* Driver isn't initialized yet */
+ mutex_unlock(&craye1k_lock);
+ return -EOPNOTSUPP;
+ }
+
+ rc = __craye1k_get_attention_status(hotplug_slot, status, true);
+
+ mutex_unlock(&craye1k_lock);
+ return rc;
+}
+
+int craye1k_set_attention_status(struct hotplug_slot *hotplug_slot,
+ u8 status)
+{
+ unsigned char slot;
+ int tries = 4;
+ int rc;
+ u8 new_status;
+ struct craye1k *craye1k;
+ bool locate, fault;
+
+ if (mutex_lock_interruptible(&craye1k_lock) != 0)
+ return -EINTR;
+
+ if (!craye1k_global) {
+ /* Driver isn't initialized yet */
+ mutex_unlock(&craye1k_lock);
+ return -EOPNOTSUPP;
+ }
+
+ craye1k = craye1k_global;
+
+ slot = PSN(to_ctrl(hotplug_slot));
+
+ /* Retry to ensure all LEDs are set */
+ while (tries--) {
+ /*
+ * The node must first set itself to be the primary node before
+ * setting the slot LEDs (each board has two nodes, or
+ * "servers" as they're called by the manufacturer). This can
+ * lead to contention if both nodes are trying to set the LEDs
+ * at the same time.
+ */
+ rc = craye1k_set_primary(craye1k);
+ if (rc != 0) {
+ /* Could not set as primary node. Just retry again. */
+ continue;
+ }
+
+ /* Write value twice to increase success rate */
+ locate = (status & 0x2) >> 1;
+ craye1k_set_slot_led(craye1k, slot, 1, locate);
+ if (craye1k_set_slot_led(craye1k, slot, 1, locate) != 0) {
+ craye1k->set_led_locate_failed++;
+ continue; /* fail, retry */
+ }
+
+ msleep(CRAYE1K_POST_CMD_WAIT_MS);
+
+ fault = status & 0x1;
+ craye1k_set_slot_led(craye1k, slot, 0, fault);
+ if (craye1k_set_slot_led(craye1k, slot, 0, fault) != 0) {
+ craye1k->set_led_fault_failed++;
+ continue; /* fail, retry */
+ }
+
+ msleep(CRAYE1K_POST_CMD_WAIT_MS);
+
+ rc = __craye1k_get_attention_status(hotplug_slot, &new_status,
+ false);
+
+ msleep(CRAYE1K_POST_CMD_WAIT_MS);
+
+ if (rc == 0 && new_status == status)
+ break; /* success */
+
+ craye1k->set_led_readback_failed++;
+
+ /*
+ * At this point we weren't successful in setting the LED and
+ * need to try again.
+ *
+ * Do a random back-off to reduce contention with other server
+ * node in the unlikely case that both server nodes are trying to
+ * trying to set a LED at the same time.
+ *
+ * The 500ms minimum in the back-off reduced the chance of this
+ * whole retry loop failing from 1 in 700 to none in 10000.
+ */
+ msleep(500 + (get_random_long() % 500));
+ }
+ mutex_unlock(&craye1k_lock);
+ if (tries == 0) {
+ craye1k->set_led_failed++;
+ return -EIO;
+ }
+
+ return 0;
+}
+
+bool is_craye1k_board(void)
+{
+ return dmi_match(DMI_PRODUCT_NAME, "VSSEP1EC");
+}
+
+int craye1k_init(void)
+{
+ return ipmi_smi_watcher_register(&craye1k_smi_watcher);
+}
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tony Hutter <hu...@ll...>");
+MODULE_DESCRIPTION("Cray E1000 NVMe Slot LED driver");
--
2.43.7
|
|
From: Corey M. <co...@mi...> - 2026-04-28 18:12:24
|
The driver would just fetch events and receive messages until the BMC said it was done. To avoid issues with BMCs that never say they are done, add a limit of 10 fetches at a time. In addition, an si interface has an attn state it can return from the hardware which is supposed to cause a flag fetch to see if the driver needs to fetch events or message or a few other things. If the attn bit gets stuck, it's a similar problem. So allow messages in between flag fetches so the driver itself doesn't get stuck. This is a more general fix than the previous fix for the specific bad BMC, but should fix the more general issue of a BMC that won't stop saying it has data. This has been there from the beginning of the driver. It's not a bug per-se, but it is accounting for bugs in BMCs. Reported-by: Matt Fleming <mfl...@cl...> Closes: https://lore.kernel.org/lkml/202...@re.../ Fixes: <1da177e4c3f4> ("Linux-2.6.12-rc2") Cc: st...@vg... Signed-off-by: Corey Minyard <co...@mi...> --- I have added this problem as a capability in the openipmi library simulator so I can reproduce the issue and make sure everything works properly. drivers/char/ipmi/ipmi_si_intf.c | 54 +++++++++++++++++++++++++------- drivers/char/ipmi/ipmi_ssif.c | 23 ++++++++++++-- 2 files changed, 64 insertions(+), 13 deletions(-) diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c index 08c208cc64c5..7c3c463e08da 100644 --- a/drivers/char/ipmi/ipmi_si_intf.c +++ b/drivers/char/ipmi/ipmi_si_intf.c @@ -168,6 +168,10 @@ struct smi_info { OEM2_DATA_AVAIL) unsigned char msg_flags; + /* When requesting events and messages, don't do it forever. */ + unsigned int num_requests_in_a_row; + bool last_was_flag_fetch; + /* Does the BMC have an event buffer? */ bool has_event_buffer; @@ -410,7 +414,10 @@ static void start_getting_msg_queue(struct smi_info *smi_info) start_new_msg(smi_info, smi_info->curr_msg->data, smi_info->curr_msg->data_size); - smi_info->si_state = SI_GETTING_MESSAGES; + if (smi_info->si_state != SI_GETTING_MESSAGES) { + smi_info->num_requests_in_a_row = 0; + smi_info->si_state = SI_GETTING_MESSAGES; + } } static void start_getting_events(struct smi_info *smi_info) @@ -421,7 +428,10 @@ static void start_getting_events(struct smi_info *smi_info) start_new_msg(smi_info, smi_info->curr_msg->data, smi_info->curr_msg->data_size); - smi_info->si_state = SI_GETTING_EVENTS; + if (smi_info->si_state != SI_GETTING_EVENTS) { + smi_info->num_requests_in_a_row = 0; + smi_info->si_state = SI_GETTING_EVENTS; + } } /* @@ -595,6 +605,7 @@ static void handle_transaction_done(struct smi_info *smi_info) smi_info->si_state = SI_NORMAL; } else { smi_info->msg_flags = msg[3]; + smi_info->last_was_flag_fetch = true; handle_flags(smi_info); } break; @@ -646,6 +657,11 @@ static void handle_transaction_done(struct smi_info *smi_info) } else { smi_inc_stat(smi_info, events); + smi_info->num_requests_in_a_row++; + if (smi_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + smi_info->msg_flags &= ~EVENT_MSG_BUFFER_FULL; + /* * Do this before we deliver the message * because delivering the message releases the @@ -684,6 +700,11 @@ static void handle_transaction_done(struct smi_info *smi_info) } else { smi_inc_stat(smi_info, incoming_messages); + smi_info->num_requests_in_a_row++; + if (smi_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + smi_info->msg_flags &= ~RECEIVE_MSG_AVAIL; + /* * Do this before we deliver the message * because delivering the message releases the @@ -825,6 +846,26 @@ static enum si_sm_result smi_event_handler(struct smi_info *smi_info, goto out; } + /* + * If we are currently idle, or if the last thing that was + * done was a flag fetch and there is a message pending, try + * to start the next message. + * + * We do the waiting message check to avoid a stuck flag + * completely wedging the driver. Let a message through + * in between flag operations if that happens. + */ + if (si_sm_result == SI_SM_IDLE || + (si_sm_result == SI_SM_ATTN && smi_info->waiting_msg && + smi_info->last_was_flag_fetch)) { + smi_info->last_was_flag_fetch = false; + smi_inc_stat(smi_info, idles); + + si_sm_result = start_next_msg(smi_info); + if (si_sm_result != SI_SM_IDLE) + goto restart; + } + /* * We prefer handling attn over new messages. But don't do * this if there is not yet an upper layer to handle anything. @@ -852,15 +893,6 @@ static enum si_sm_result smi_event_handler(struct smi_info *smi_info, } } - /* If we are currently idle, try to start the next message. */ - if (si_sm_result == SI_SM_IDLE) { - smi_inc_stat(smi_info, idles); - - si_sm_result = start_next_msg(smi_info); - if (si_sm_result != SI_SM_IDLE) - goto restart; - } - if ((si_sm_result == SI_SM_IDLE) && (atomic_read(&smi_info->req_events))) { /* diff --git a/drivers/char/ipmi/ipmi_ssif.c b/drivers/char/ipmi/ipmi_ssif.c index b49500a1bd36..f3798f4e6a63 100644 --- a/drivers/char/ipmi/ipmi_ssif.c +++ b/drivers/char/ipmi/ipmi_ssif.c @@ -225,6 +225,9 @@ struct ssif_info { bool has_event_buffer; bool supports_alert; + /* When requesting events and messages, don't do it forever. */ + unsigned int num_requests_in_a_row; + /* * Used to tell what we should do with alerts. If we are * waiting on a response, read the data immediately. @@ -413,7 +416,10 @@ static void start_event_fetch(struct ssif_info *ssif_info, unsigned long *flags) } ssif_info->curr_msg = msg; - ssif_info->ssif_state = SSIF_GETTING_EVENTS; + if (ssif_info->ssif_state != SSIF_GETTING_EVENTS) { + ssif_info->num_requests_in_a_row = 0; + ssif_info->ssif_state = SSIF_GETTING_EVENTS; + } ipmi_ssif_unlock_cond(ssif_info, flags); msg->data[0] = (IPMI_NETFN_APP_REQUEST << 2); @@ -436,7 +442,10 @@ static void start_recv_msg_fetch(struct ssif_info *ssif_info, } ssif_info->curr_msg = msg; - ssif_info->ssif_state = SSIF_GETTING_MESSAGES; + if (ssif_info->ssif_state != SSIF_GETTING_MESSAGES) { + ssif_info->num_requests_in_a_row = 0; + ssif_info->ssif_state = SSIF_GETTING_MESSAGES; + } ipmi_ssif_unlock_cond(ssif_info, flags); msg->data[0] = (IPMI_NETFN_APP_REQUEST << 2); @@ -843,6 +852,11 @@ static void msg_done_handler(struct ssif_info *ssif_info, int result, ssif_info->msg_flags &= ~EVENT_MSG_BUFFER_FULL; handle_flags(ssif_info, flags); } else { + ssif_info->num_requests_in_a_row++; + if (ssif_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + ssif_info->msg_flags &= ~EVENT_MSG_BUFFER_FULL; + handle_flags(ssif_info, flags); ssif_inc_stat(ssif_info, events); deliver_recv_msg(ssif_info, msg); @@ -876,6 +890,11 @@ static void msg_done_handler(struct ssif_info *ssif_info, int result, ssif_info->msg_flags &= ~RECEIVE_MSG_AVAIL; handle_flags(ssif_info, flags); } else { + ssif_info->num_requests_in_a_row++; + if (ssif_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + ssif_info->msg_flags &= ~RECEIVE_MSG_AVAIL; + ssif_inc_stat(ssif_info, incoming_messages); handle_flags(ssif_info, flags); deliver_recv_msg(ssif_info, msg); -- 2.43.0 |
|
From: Corey M. <co...@mi...> - 2026-04-28 13:36:34
|
On Tue, Apr 28, 2026 at 06:45:15AM -0500, Corey Minyard wrote:
> On Tue, Apr 28, 2026 at 11:15:46AM +0100, Matt Fleming wrote:
> > On Sat, Apr 25, 2026 at 06:58:48PM -0500, Corey Minyard wrote:
> > >
> > > Oh, yeah.
> > >
> > > Moving it to handle_transaction_done() is not ideal, though. If
> > > something was spewing receive messages, it would never get to events,
> > > which is why I did it like I did.
> > >
> > > The following should fix this. You could also have different limits for
> > > receive messages and events, but I think the following is more clear.
> > >
> > > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > > index 2a739123270c..e46f4150ceb5 100644
> > > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > > @@ -413,8 +413,10 @@ static void start_getting_msg_queue(struct smi_info *smi_info)
> > >
> > > start_new_msg(smi_info, smi_info->curr_msg->data,
> > > smi_info->curr_msg->data_size);
> > > - smi_info->num_requests_in_a_row = 0;
> > > - smi_info->si_state = SI_GETTING_MESSAGES;
> > > + if (smi_info->si_state != SI_GETTING_MESSAGES) {
> > > + smi_info->num_requests_in_a_row = 0;
> > > + smi_info->si_state = SI_GETTING_MESSAGES;
> > > + }
> > > }
> > >
> > > static void start_getting_events(struct smi_info *smi_info)
> > > @@ -425,8 +427,10 @@ static void start_getting_events(struct smi_info *smi_info)
> > >
> > > start_new_msg(smi_info, smi_info->curr_msg->data,
> > > smi_info->curr_msg->data_size);
> > > - smi_info->num_requests_in_a_row = 0;
> > > - smi_info->si_state = SI_GETTING_EVENTS;
> > > + if (smi_info->si_state != SI_GETTING_EVENTS) {
> > > + smi_info->num_requests_in_a_row = 0;
> > > + smi_info->si_state = SI_GETTING_EVENTS;
> > > + }
> >
> > Thanks. Does this correctly handle a stream of mixed receive+event
> > flags, though? If we bounce between MESSAGES and EVENTS, won't we keep
> > resetting the counter on each state transition and never hit the limit?
>
> It should. Once the limit is reached it clears the bit in msg_flags.
> That should prevent the messages or events from being re-requested
> until the next flags fetch. So if a continuous stream of messages
> and events was coming in, it should fetch 10 messages, clear the bit,
> then fetch 10 events, then clear that bit, then go back to normal
> operation.
>
> Of course, the next flag fetch will start the process over.
Actually, probing deeper, this probably won't work. And I'm not sure
there is much I can do to fix it. It will be much harder. But it
depends on how the BMC handles this.
If there is something in the event queue or receive message queue, the
BMC is supposed to set an attention bit (ATTN) on the interface If
ATTN is set, the driver is supposed to fetch flags to know what it
needs to do.
I haven't tried, but in the qemu changes below, I'm fairly sure the ATTN
bit will never get cleared, thus when it goes through all this the
KCS state machine will return SI_SM_ATTN at the end and the flag fetch
will start again. You are still wedged.
The qemu version also runs with interrupts by default, which only
magnifies the problem. In that case, when ATTN is set and you aren't
running transactions, the interrupt is enabled. On KCS there is no way
in the KCS interface to disable the interrupt at the interface, you have
to send it a message or disable it with disable_irq().
But the actual failing BMC may not work this way. It may or may not
clear the Event Message Buffer Full flag. It may or may not do anything
with ATTN.
A driver can only do so much to account for broken hardware. The driver
is already too complex, a lot of it due to having to handle broken
hardware. Fixing this adds more complexity and penalizes hardware
that works properly.
Anyway, I'm going to need to get this failing in simulation and figure
out how to handle this. Yet more issues may come up, especially with
interrupts.
Is there any way you can just get the hardware fixed? It's never going
to work very well as it is. I'd be inclined to just denylist it.
-corey
>
> >
> > I was able to cook up a simple repro in Qemu for this class of bug.
>
> I was thinking about how to do an automated test for this. I use an
> external simulated BMC for the automated tests I have. So I'll work in
> that direction.
>
> But thanks, this should help me develop that test.
>
> -corey
>
> >
> > ---->8----
> >
> > diff --git a/hw/ipmi/ipmi_bmc_sim.c b/hw/ipmi/ipmi_bmc_sim.c
> > index fd875491f5..127db30c03 100644
> > --- a/hw/ipmi/ipmi_bmc_sim.c
> > +++ b/hw/ipmi/ipmi_bmc_sim.c
> > @@ -249,6 +249,21 @@ struct IPMIBmcSim {
> > uint8_t evtbuf[16];
> >
> > QTAILQ_HEAD(, IPMIRcvBufEntry) rcvbufs;
> > +
> > + /*
> > + * Fault injection: sticky EVENT_MSG_BUFFER_FULL.
> > + *
> > + * Simulates a BMC that continuously generates events (e.g. after a
> > + * cold reset causes sensor state changes). Once armed, every
> > + * READ_EVENT_MSG_BUFFER returns success but never clears the
> > + * EVT_BUF_FULL flag, starving waiting_msg in the SI layer's
> > + * handle_flags() loop. Reproduces the 517m277 / KRN-1233 wedge.
> > + */
> > + bool fi_sticky_events; /* enable via property */
> > + uint32_t fi_evt_arm_after; /* arm after N completed responses */
> > + uint32_t fi_evt_rsp_count; /* responses completed so far */
> > + bool fi_evt_armed; /* fault is active */
> > + uint32_t fi_evt_read_count; /* READ_EVENT_MSG_BUFFER calls served */
> > };
> >
> > #define IPMI_BMC_MSG_FLAG_WATCHDOG_TIMEOUT_MASK (1 << 3)
> > @@ -494,7 +509,7 @@ static int sel_add_event(IPMIBmcSim *ibs, uint8_t *event)
> > static int attn_set(IPMIBmcSim *ibs)
> > {
> > return IPMI_BMC_MSG_FLAG_RCV_MSG_QUEUE_SET(ibs)
> > - || IPMI_BMC_MSG_FLAG_EVT_BUF_FULL_SET(ibs)
> > + || (!ibs->fi_evt_armed && IPMI_BMC_MSG_FLAG_EVT_BUF_FULL_SET(ibs))
> > || IPMI_BMC_MSG_FLAG_WATCHDOG_TIMEOUT_MASK_SET(ibs);
> > }
> >
> > @@ -750,6 +765,48 @@ static void ipmi_sim_handle_command(IPMIBmc *b,
> > out:
> > k->handle_rsp(s, msg_id, rsp.buffer, rsp.len);
> >
> > + /*
> > + * Fault injection: count completed responses and arm sticky
> > + * EVENT_MSG_BUFFER_FULL after the configured threshold.
> > + */
> > + if (ibs->fi_sticky_events && !ibs->fi_evt_armed) {
> > + ibs->fi_evt_rsp_count++;
> > + if (ibs->fi_evt_rsp_count >= ibs->fi_evt_arm_after) {
> > + ibs->fi_evt_armed = true;
> > + ibs->fi_evt_read_count = 0;
> > +
> > + /*
> > + * Seed the event buffer with a synthetic sensor event
> > + * (sensor type 0x01 = Temperature, event type 0x01 =
> > + * threshold, evd1 = upper critical going high).
> > + * This matches what real BMCs generate after a cold reset.
> > + */
> > + memset(ibs->evtbuf, 0, 16);
> > + ibs->evtbuf[2] = 0x02; /* System event record */
> > + ibs->evtbuf[7] = ibs->parent.slave_addr;
> > + ibs->evtbuf[9] = 0x04; /* Format version */
> > + ibs->evtbuf[10] = 0x01; /* Sensor type: Temperature */
> > + ibs->evtbuf[11] = 0x01; /* Sensor number */
> > + ibs->evtbuf[12] = 0x01; /* Event type: threshold */
> > + ibs->evtbuf[13] = 0x09; /* Upper critical going high */
> > + ibs->evtbuf[14] = 0x57; /* Threshold value */
> > + ibs->evtbuf[15] = 0x00; /* Sequence (incremented on reads) */
> > +
> > + ibs->msg_flags |= IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> > +
> > + /* Ensure event message buffer is enabled in global enables
> > + * so the kernel sees it. Also enable event logging.
> > + */
> > + ibs->bmc_global_enables |= (1 << IPMI_BMC_EVENT_MSG_BUF_BIT)
> > + | (1 << IPMI_BMC_EVENT_LOG_BIT);
> > +
> > + k->set_atn(s, 1, attn_irq_enabled(ibs));
> > +
> > + fprintf(stderr, "[FI] sticky-events armed after %u responses\n",
> > + ibs->fi_evt_rsp_count);
> > + }
> > + }
> > +
> > next_timeout(ibs);
> > }
> >
> > @@ -1013,8 +1070,14 @@ static void clr_msg_flags(IPMIBmcSim *ibs,
> > {
> > IPMIInterface *s = ibs->parent.intf;
> > IPMIInterfaceClass *k = IPMI_INTERFACE_GET_CLASS(s);
> > + uint8_t clear_mask = cmd[2];
> > +
> > + if (ibs->fi_evt_armed) {
> > + /* Don't allow clearing EVT_BUF_FULL when sticky events active */
> > + clear_mask &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> > + }
> >
> > - ibs->msg_flags &= ~cmd[2];
> > + ibs->msg_flags &= ~clear_mask;
> > k->set_atn(s, attn_set(ibs), attn_irq_enabled(ibs));
> > }
> >
> > @@ -1040,7 +1103,19 @@ static void read_evt_msg_buf(IPMIBmcSim *ibs,
> > for (i = 0; i < 16; i++) {
> > rsp_buffer_push(rsp, ibs->evtbuf[i]);
> > }
> > - ibs->msg_flags &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> > +
> > + if (ibs->fi_evt_armed) {
> > + /*
> > + * Sticky mode: return success but keep EVT_BUF_FULL set.
> > + * Vary the event data slightly so the kernel doesn't
> > + * de-duplicate (increment evd3 as a sequence number).
> > + */
> > + ibs->fi_evt_read_count++;
> > + ibs->evtbuf[15] = (uint8_t)(ibs->fi_evt_read_count & 0xff);
> > + /* Don't clear the flag — starvation continues */
> > + } else {
> > + ibs->msg_flags &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> > + }
> > k->set_atn(s, attn_set(ibs), attn_irq_enabled(ibs));
> > }
> >
> > @@ -2670,6 +2745,8 @@ static const Property ipmi_sim_properties[] = {
> > DEFINE_PROP_STRING("lan.netmask", IPMIBmcSim, lan.arg_netmask),
> > DEFINE_PROP_STRING("lan.defgw_ipaddr", IPMIBmcSim, lan.arg_defgw_ipaddr),
> > DEFINE_PROP_MACADDR("lan.defgw_macaddr", IPMIBmcSim, lan.defgw_macaddr),
> > + DEFINE_PROP_BOOL("fi_sticky_events", IPMIBmcSim, fi_sticky_events, false),
> > + DEFINE_PROP_UINT32("fi_evt_arm_after", IPMIBmcSim, fi_evt_arm_after, 40),
> > };
> >
> > static void ipmi_sim_class_init(ObjectClass *oc, const void *data)
|
|
From: Corey M. <co...@mi...> - 2026-04-28 11:45:27
|
On Tue, Apr 28, 2026 at 11:15:46AM +0100, Matt Fleming wrote:
> On Sat, Apr 25, 2026 at 06:58:48PM -0500, Corey Minyard wrote:
> >
> > Oh, yeah.
> >
> > Moving it to handle_transaction_done() is not ideal, though. If
> > something was spewing receive messages, it would never get to events,
> > which is why I did it like I did.
> >
> > The following should fix this. You could also have different limits for
> > receive messages and events, but I think the following is more clear.
> >
> > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > index 2a739123270c..e46f4150ceb5 100644
> > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > @@ -413,8 +413,10 @@ static void start_getting_msg_queue(struct smi_info *smi_info)
> >
> > start_new_msg(smi_info, smi_info->curr_msg->data,
> > smi_info->curr_msg->data_size);
> > - smi_info->num_requests_in_a_row = 0;
> > - smi_info->si_state = SI_GETTING_MESSAGES;
> > + if (smi_info->si_state != SI_GETTING_MESSAGES) {
> > + smi_info->num_requests_in_a_row = 0;
> > + smi_info->si_state = SI_GETTING_MESSAGES;
> > + }
> > }
> >
> > static void start_getting_events(struct smi_info *smi_info)
> > @@ -425,8 +427,10 @@ static void start_getting_events(struct smi_info *smi_info)
> >
> > start_new_msg(smi_info, smi_info->curr_msg->data,
> > smi_info->curr_msg->data_size);
> > - smi_info->num_requests_in_a_row = 0;
> > - smi_info->si_state = SI_GETTING_EVENTS;
> > + if (smi_info->si_state != SI_GETTING_EVENTS) {
> > + smi_info->num_requests_in_a_row = 0;
> > + smi_info->si_state = SI_GETTING_EVENTS;
> > + }
>
> Thanks. Does this correctly handle a stream of mixed receive+event
> flags, though? If we bounce between MESSAGES and EVENTS, won't we keep
> resetting the counter on each state transition and never hit the limit?
It should. Once the limit is reached it clears the bit in msg_flags.
That should prevent the messages or events from being re-requested
until the next flags fetch. So if a continuous stream of messages
and events was coming in, it should fetch 10 messages, clear the bit,
then fetch 10 events, then clear that bit, then go back to normal
operation.
Of course, the next flag fetch will start the process over.
>
> I was able to cook up a simple repro in Qemu for this class of bug.
I was thinking about how to do an automated test for this. I use an
external simulated BMC for the automated tests I have. So I'll work in
that direction.
But thanks, this should help me develop that test.
-corey
>
> ---->8----
>
> diff --git a/hw/ipmi/ipmi_bmc_sim.c b/hw/ipmi/ipmi_bmc_sim.c
> index fd875491f5..127db30c03 100644
> --- a/hw/ipmi/ipmi_bmc_sim.c
> +++ b/hw/ipmi/ipmi_bmc_sim.c
> @@ -249,6 +249,21 @@ struct IPMIBmcSim {
> uint8_t evtbuf[16];
>
> QTAILQ_HEAD(, IPMIRcvBufEntry) rcvbufs;
> +
> + /*
> + * Fault injection: sticky EVENT_MSG_BUFFER_FULL.
> + *
> + * Simulates a BMC that continuously generates events (e.g. after a
> + * cold reset causes sensor state changes). Once armed, every
> + * READ_EVENT_MSG_BUFFER returns success but never clears the
> + * EVT_BUF_FULL flag, starving waiting_msg in the SI layer's
> + * handle_flags() loop. Reproduces the 517m277 / KRN-1233 wedge.
> + */
> + bool fi_sticky_events; /* enable via property */
> + uint32_t fi_evt_arm_after; /* arm after N completed responses */
> + uint32_t fi_evt_rsp_count; /* responses completed so far */
> + bool fi_evt_armed; /* fault is active */
> + uint32_t fi_evt_read_count; /* READ_EVENT_MSG_BUFFER calls served */
> };
>
> #define IPMI_BMC_MSG_FLAG_WATCHDOG_TIMEOUT_MASK (1 << 3)
> @@ -494,7 +509,7 @@ static int sel_add_event(IPMIBmcSim *ibs, uint8_t *event)
> static int attn_set(IPMIBmcSim *ibs)
> {
> return IPMI_BMC_MSG_FLAG_RCV_MSG_QUEUE_SET(ibs)
> - || IPMI_BMC_MSG_FLAG_EVT_BUF_FULL_SET(ibs)
> + || (!ibs->fi_evt_armed && IPMI_BMC_MSG_FLAG_EVT_BUF_FULL_SET(ibs))
> || IPMI_BMC_MSG_FLAG_WATCHDOG_TIMEOUT_MASK_SET(ibs);
> }
>
> @@ -750,6 +765,48 @@ static void ipmi_sim_handle_command(IPMIBmc *b,
> out:
> k->handle_rsp(s, msg_id, rsp.buffer, rsp.len);
>
> + /*
> + * Fault injection: count completed responses and arm sticky
> + * EVENT_MSG_BUFFER_FULL after the configured threshold.
> + */
> + if (ibs->fi_sticky_events && !ibs->fi_evt_armed) {
> + ibs->fi_evt_rsp_count++;
> + if (ibs->fi_evt_rsp_count >= ibs->fi_evt_arm_after) {
> + ibs->fi_evt_armed = true;
> + ibs->fi_evt_read_count = 0;
> +
> + /*
> + * Seed the event buffer with a synthetic sensor event
> + * (sensor type 0x01 = Temperature, event type 0x01 =
> + * threshold, evd1 = upper critical going high).
> + * This matches what real BMCs generate after a cold reset.
> + */
> + memset(ibs->evtbuf, 0, 16);
> + ibs->evtbuf[2] = 0x02; /* System event record */
> + ibs->evtbuf[7] = ibs->parent.slave_addr;
> + ibs->evtbuf[9] = 0x04; /* Format version */
> + ibs->evtbuf[10] = 0x01; /* Sensor type: Temperature */
> + ibs->evtbuf[11] = 0x01; /* Sensor number */
> + ibs->evtbuf[12] = 0x01; /* Event type: threshold */
> + ibs->evtbuf[13] = 0x09; /* Upper critical going high */
> + ibs->evtbuf[14] = 0x57; /* Threshold value */
> + ibs->evtbuf[15] = 0x00; /* Sequence (incremented on reads) */
> +
> + ibs->msg_flags |= IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> +
> + /* Ensure event message buffer is enabled in global enables
> + * so the kernel sees it. Also enable event logging.
> + */
> + ibs->bmc_global_enables |= (1 << IPMI_BMC_EVENT_MSG_BUF_BIT)
> + | (1 << IPMI_BMC_EVENT_LOG_BIT);
> +
> + k->set_atn(s, 1, attn_irq_enabled(ibs));
> +
> + fprintf(stderr, "[FI] sticky-events armed after %u responses\n",
> + ibs->fi_evt_rsp_count);
> + }
> + }
> +
> next_timeout(ibs);
> }
>
> @@ -1013,8 +1070,14 @@ static void clr_msg_flags(IPMIBmcSim *ibs,
> {
> IPMIInterface *s = ibs->parent.intf;
> IPMIInterfaceClass *k = IPMI_INTERFACE_GET_CLASS(s);
> + uint8_t clear_mask = cmd[2];
> +
> + if (ibs->fi_evt_armed) {
> + /* Don't allow clearing EVT_BUF_FULL when sticky events active */
> + clear_mask &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> + }
>
> - ibs->msg_flags &= ~cmd[2];
> + ibs->msg_flags &= ~clear_mask;
> k->set_atn(s, attn_set(ibs), attn_irq_enabled(ibs));
> }
>
> @@ -1040,7 +1103,19 @@ static void read_evt_msg_buf(IPMIBmcSim *ibs,
> for (i = 0; i < 16; i++) {
> rsp_buffer_push(rsp, ibs->evtbuf[i]);
> }
> - ibs->msg_flags &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> +
> + if (ibs->fi_evt_armed) {
> + /*
> + * Sticky mode: return success but keep EVT_BUF_FULL set.
> + * Vary the event data slightly so the kernel doesn't
> + * de-duplicate (increment evd3 as a sequence number).
> + */
> + ibs->fi_evt_read_count++;
> + ibs->evtbuf[15] = (uint8_t)(ibs->fi_evt_read_count & 0xff);
> + /* Don't clear the flag — starvation continues */
> + } else {
> + ibs->msg_flags &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
> + }
> k->set_atn(s, attn_set(ibs), attn_irq_enabled(ibs));
> }
>
> @@ -2670,6 +2745,8 @@ static const Property ipmi_sim_properties[] = {
> DEFINE_PROP_STRING("lan.netmask", IPMIBmcSim, lan.arg_netmask),
> DEFINE_PROP_STRING("lan.defgw_ipaddr", IPMIBmcSim, lan.arg_defgw_ipaddr),
> DEFINE_PROP_MACADDR("lan.defgw_macaddr", IPMIBmcSim, lan.defgw_macaddr),
> + DEFINE_PROP_BOOL("fi_sticky_events", IPMIBmcSim, fi_sticky_events, false),
> + DEFINE_PROP_UINT32("fi_evt_arm_after", IPMIBmcSim, fi_evt_arm_after, 40),
> };
>
> static void ipmi_sim_class_init(ObjectClass *oc, const void *data)
|
|
From: Matt F. <ma...@re...> - 2026-04-28 10:22:37
|
On Sat, Apr 25, 2026 at 06:58:48PM -0500, Corey Minyard wrote:
>
> Oh, yeah.
>
> Moving it to handle_transaction_done() is not ideal, though. If
> something was spewing receive messages, it would never get to events,
> which is why I did it like I did.
>
> The following should fix this. You could also have different limits for
> receive messages and events, but I think the following is more clear.
>
> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> index 2a739123270c..e46f4150ceb5 100644
> --- a/drivers/char/ipmi/ipmi_si_intf.c
> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> @@ -413,8 +413,10 @@ static void start_getting_msg_queue(struct smi_info *smi_info)
>
> start_new_msg(smi_info, smi_info->curr_msg->data,
> smi_info->curr_msg->data_size);
> - smi_info->num_requests_in_a_row = 0;
> - smi_info->si_state = SI_GETTING_MESSAGES;
> + if (smi_info->si_state != SI_GETTING_MESSAGES) {
> + smi_info->num_requests_in_a_row = 0;
> + smi_info->si_state = SI_GETTING_MESSAGES;
> + }
> }
>
> static void start_getting_events(struct smi_info *smi_info)
> @@ -425,8 +427,10 @@ static void start_getting_events(struct smi_info *smi_info)
>
> start_new_msg(smi_info, smi_info->curr_msg->data,
> smi_info->curr_msg->data_size);
> - smi_info->num_requests_in_a_row = 0;
> - smi_info->si_state = SI_GETTING_EVENTS;
> + if (smi_info->si_state != SI_GETTING_EVENTS) {
> + smi_info->num_requests_in_a_row = 0;
> + smi_info->si_state = SI_GETTING_EVENTS;
> + }
Thanks. Does this correctly handle a stream of mixed receive+event
flags, though? If we bounce between MESSAGES and EVENTS, won't we keep
resetting the counter on each state transition and never hit the limit?
I was able to cook up a simple repro in Qemu for this class of bug.
---->8----
diff --git a/hw/ipmi/ipmi_bmc_sim.c b/hw/ipmi/ipmi_bmc_sim.c
index fd875491f5..127db30c03 100644
--- a/hw/ipmi/ipmi_bmc_sim.c
+++ b/hw/ipmi/ipmi_bmc_sim.c
@@ -249,6 +249,21 @@ struct IPMIBmcSim {
uint8_t evtbuf[16];
QTAILQ_HEAD(, IPMIRcvBufEntry) rcvbufs;
+
+ /*
+ * Fault injection: sticky EVENT_MSG_BUFFER_FULL.
+ *
+ * Simulates a BMC that continuously generates events (e.g. after a
+ * cold reset causes sensor state changes). Once armed, every
+ * READ_EVENT_MSG_BUFFER returns success but never clears the
+ * EVT_BUF_FULL flag, starving waiting_msg in the SI layer's
+ * handle_flags() loop. Reproduces the 517m277 / KRN-1233 wedge.
+ */
+ bool fi_sticky_events; /* enable via property */
+ uint32_t fi_evt_arm_after; /* arm after N completed responses */
+ uint32_t fi_evt_rsp_count; /* responses completed so far */
+ bool fi_evt_armed; /* fault is active */
+ uint32_t fi_evt_read_count; /* READ_EVENT_MSG_BUFFER calls served */
};
#define IPMI_BMC_MSG_FLAG_WATCHDOG_TIMEOUT_MASK (1 << 3)
@@ -494,7 +509,7 @@ static int sel_add_event(IPMIBmcSim *ibs, uint8_t *event)
static int attn_set(IPMIBmcSim *ibs)
{
return IPMI_BMC_MSG_FLAG_RCV_MSG_QUEUE_SET(ibs)
- || IPMI_BMC_MSG_FLAG_EVT_BUF_FULL_SET(ibs)
+ || (!ibs->fi_evt_armed && IPMI_BMC_MSG_FLAG_EVT_BUF_FULL_SET(ibs))
|| IPMI_BMC_MSG_FLAG_WATCHDOG_TIMEOUT_MASK_SET(ibs);
}
@@ -750,6 +765,48 @@ static void ipmi_sim_handle_command(IPMIBmc *b,
out:
k->handle_rsp(s, msg_id, rsp.buffer, rsp.len);
+ /*
+ * Fault injection: count completed responses and arm sticky
+ * EVENT_MSG_BUFFER_FULL after the configured threshold.
+ */
+ if (ibs->fi_sticky_events && !ibs->fi_evt_armed) {
+ ibs->fi_evt_rsp_count++;
+ if (ibs->fi_evt_rsp_count >= ibs->fi_evt_arm_after) {
+ ibs->fi_evt_armed = true;
+ ibs->fi_evt_read_count = 0;
+
+ /*
+ * Seed the event buffer with a synthetic sensor event
+ * (sensor type 0x01 = Temperature, event type 0x01 =
+ * threshold, evd1 = upper critical going high).
+ * This matches what real BMCs generate after a cold reset.
+ */
+ memset(ibs->evtbuf, 0, 16);
+ ibs->evtbuf[2] = 0x02; /* System event record */
+ ibs->evtbuf[7] = ibs->parent.slave_addr;
+ ibs->evtbuf[9] = 0x04; /* Format version */
+ ibs->evtbuf[10] = 0x01; /* Sensor type: Temperature */
+ ibs->evtbuf[11] = 0x01; /* Sensor number */
+ ibs->evtbuf[12] = 0x01; /* Event type: threshold */
+ ibs->evtbuf[13] = 0x09; /* Upper critical going high */
+ ibs->evtbuf[14] = 0x57; /* Threshold value */
+ ibs->evtbuf[15] = 0x00; /* Sequence (incremented on reads) */
+
+ ibs->msg_flags |= IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
+
+ /* Ensure event message buffer is enabled in global enables
+ * so the kernel sees it. Also enable event logging.
+ */
+ ibs->bmc_global_enables |= (1 << IPMI_BMC_EVENT_MSG_BUF_BIT)
+ | (1 << IPMI_BMC_EVENT_LOG_BIT);
+
+ k->set_atn(s, 1, attn_irq_enabled(ibs));
+
+ fprintf(stderr, "[FI] sticky-events armed after %u responses\n",
+ ibs->fi_evt_rsp_count);
+ }
+ }
+
next_timeout(ibs);
}
@@ -1013,8 +1070,14 @@ static void clr_msg_flags(IPMIBmcSim *ibs,
{
IPMIInterface *s = ibs->parent.intf;
IPMIInterfaceClass *k = IPMI_INTERFACE_GET_CLASS(s);
+ uint8_t clear_mask = cmd[2];
+
+ if (ibs->fi_evt_armed) {
+ /* Don't allow clearing EVT_BUF_FULL when sticky events active */
+ clear_mask &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
+ }
- ibs->msg_flags &= ~cmd[2];
+ ibs->msg_flags &= ~clear_mask;
k->set_atn(s, attn_set(ibs), attn_irq_enabled(ibs));
}
@@ -1040,7 +1103,19 @@ static void read_evt_msg_buf(IPMIBmcSim *ibs,
for (i = 0; i < 16; i++) {
rsp_buffer_push(rsp, ibs->evtbuf[i]);
}
- ibs->msg_flags &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
+
+ if (ibs->fi_evt_armed) {
+ /*
+ * Sticky mode: return success but keep EVT_BUF_FULL set.
+ * Vary the event data slightly so the kernel doesn't
+ * de-duplicate (increment evd3 as a sequence number).
+ */
+ ibs->fi_evt_read_count++;
+ ibs->evtbuf[15] = (uint8_t)(ibs->fi_evt_read_count & 0xff);
+ /* Don't clear the flag — starvation continues */
+ } else {
+ ibs->msg_flags &= ~IPMI_BMC_MSG_FLAG_EVT_BUF_FULL;
+ }
k->set_atn(s, attn_set(ibs), attn_irq_enabled(ibs));
}
@@ -2670,6 +2745,8 @@ static const Property ipmi_sim_properties[] = {
DEFINE_PROP_STRING("lan.netmask", IPMIBmcSim, lan.arg_netmask),
DEFINE_PROP_STRING("lan.defgw_ipaddr", IPMIBmcSim, lan.arg_defgw_ipaddr),
DEFINE_PROP_MACADDR("lan.defgw_macaddr", IPMIBmcSim, lan.defgw_macaddr),
+ DEFINE_PROP_BOOL("fi_sticky_events", IPMIBmcSim, fi_sticky_events, false),
+ DEFINE_PROP_UINT32("fi_evt_arm_after", IPMIBmcSim, fi_evt_arm_after, 40),
};
static void ipmi_sim_class_init(ObjectClass *oc, const void *data)
|
|
From: Corey M. <co...@mi...> - 2026-04-26 00:04:50
|
On Sat, Apr 25, 2026 at 10:36:05AM +0100, Matt Fleming wrote: > On Tue, Apr 21, 2026 at 07:42:44AM -0500, Corey Minyard wrote: > > The driver would just fetch events and receive messages until the > > BMC said it was done. To avoid issues with BMCs that never say they are > > done, add a limit of 10 fetches at a time. > > > > This is a more general fix than the previous fix for the specific bad > > BMC, but should fix the more general issue of a BMC that won't stop > > saying it has data. > > > > This has been there from the beginning of the driver. > > > > Reported-by: Matt Fleming <mfl...@cl...> > > Closes: https://lore.kernel.org/lkml/202...@re.../ > > Fixes: <1da177e4c3f4> ("Linux-2.6.12-rc2") > > Cc: st...@vg... > > Signed-off-by: Corey Minyard <co...@mi...> > > --- > > drivers/char/ipmi/ipmi_si_intf.c | 15 +++++++++++++++ > > drivers/char/ipmi/ipmi_ssif.c | 15 +++++++++++++++ > > 2 files changed, 30 insertions(+) > > [...] > > > @@ -410,6 +413,7 @@ static void start_getting_msg_queue(struct smi_info *smi_info) > > > > start_new_msg(smi_info, smi_info->curr_msg->data, > > smi_info->curr_msg->data_size); > > + smi_info->num_requests_in_a_row = 0; > > smi_info->si_state = SI_GETTING_MESSAGES; > > } > > > > @@ -421,6 +425,7 @@ static void start_getting_events(struct smi_info *smi_info) > > > > start_new_msg(smi_info, smi_info->curr_msg->data, > > smi_info->curr_msg->data_size); > > + smi_info->num_requests_in_a_row = 0; > > smi_info->si_state = SI_GETTING_EVENTS; > > } > > > > Would it be better to move this zeroing to handle_transaction_done()? > Otherwise we reset the counter in handle_flags() -> > start_getting_events() and the threshold is never reached. Oh, yeah. Moving it to handle_transaction_done() is not ideal, though. If something was spewing receive messages, it would never get to events, which is why I did it like I did. The following should fix this. You could also have different limits for receive messages and events, but I think the following is more clear. diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c index 2a739123270c..e46f4150ceb5 100644 --- a/drivers/char/ipmi/ipmi_si_intf.c +++ b/drivers/char/ipmi/ipmi_si_intf.c @@ -413,8 +413,10 @@ static void start_getting_msg_queue(struct smi_info *smi_info) start_new_msg(smi_info, smi_info->curr_msg->data, smi_info->curr_msg->data_size); - smi_info->num_requests_in_a_row = 0; - smi_info->si_state = SI_GETTING_MESSAGES; + if (smi_info->si_state != SI_GETTING_MESSAGES) { + smi_info->num_requests_in_a_row = 0; + smi_info->si_state = SI_GETTING_MESSAGES; + } } static void start_getting_events(struct smi_info *smi_info) @@ -425,8 +427,10 @@ static void start_getting_events(struct smi_info *smi_info) start_new_msg(smi_info, smi_info->curr_msg->data, smi_info->curr_msg->data_size); - smi_info->num_requests_in_a_row = 0; - smi_info->si_state = SI_GETTING_EVENTS; + if (smi_info->si_state != SI_GETTING_EVENTS) { + smi_info->num_requests_in_a_row = 0; + smi_info->si_state = SI_GETTING_EVENTS; + } } /* > > Thanks, > Matt |
|
From: Matt F. <ma...@re...> - 2026-04-25 09:36:20
|
On Tue, Apr 21, 2026 at 07:42:44AM -0500, Corey Minyard wrote: > The driver would just fetch events and receive messages until the > BMC said it was done. To avoid issues with BMCs that never say they are > done, add a limit of 10 fetches at a time. > > This is a more general fix than the previous fix for the specific bad > BMC, but should fix the more general issue of a BMC that won't stop > saying it has data. > > This has been there from the beginning of the driver. > > Reported-by: Matt Fleming <mfl...@cl...> > Closes: https://lore.kernel.org/lkml/202...@re.../ > Fixes: <1da177e4c3f4> ("Linux-2.6.12-rc2") > Cc: st...@vg... > Signed-off-by: Corey Minyard <co...@mi...> > --- > drivers/char/ipmi/ipmi_si_intf.c | 15 +++++++++++++++ > drivers/char/ipmi/ipmi_ssif.c | 15 +++++++++++++++ > 2 files changed, 30 insertions(+) [...] > @@ -410,6 +413,7 @@ static void start_getting_msg_queue(struct smi_info *smi_info) > > start_new_msg(smi_info, smi_info->curr_msg->data, > smi_info->curr_msg->data_size); > + smi_info->num_requests_in_a_row = 0; > smi_info->si_state = SI_GETTING_MESSAGES; > } > > @@ -421,6 +425,7 @@ static void start_getting_events(struct smi_info *smi_info) > > start_new_msg(smi_info, smi_info->curr_msg->data, > smi_info->curr_msg->data_size); > + smi_info->num_requests_in_a_row = 0; > smi_info->si_state = SI_GETTING_EVENTS; > } > Would it be better to move this zeroing to handle_transaction_done()? Otherwise we reset the counter in handle_flags() -> start_getting_events() and the threshold is never reached. Thanks, Matt |
|
From: Jian Z. <zha...@by...> - 2026-04-22 04:45:36
|
> 2026年4月22日 06:24,Matt Fleming <ma...@re...> 写道: > > On Tue, Apr 21, 2026 at 07:42:42AM -0500, Corey Minyard wrote: >> Matt reported that there were issues with the IPMI driver getting wedged >> in some cases. It turns out that the BMC was not reporting an error as >> it should have (per the spec) when the event queue was empty. The IPMI >> driver would then request the next event, and so on, wedging the driver. > > Thanks for replying so quickly, Corey. I'll test these out. > > One bit of info I pulled out of the stuck machine is that the response > looks properly formed. > > I sampled the first 8 entries and they were all identical 19-byte > successful READ_EVENT_MSG_BUFFER responses: > > 1c 35 00 55 55 c0 41 a7 00 00 00 00 00 3a ff 00 ff ff ff > Perhaps I know where this data comes from. During a previous debugging session (where ipmitool v1.8.19 failed on sensor list due to an underflow in nr_numbers, which has since been fixed), I noticed this behavior. However, I ’m not sure why it is implemented this way or what exactly this command is intended to do. If you are running on OpenBMC, it is very likely related to this part, where a fixed value is always returned (especially if the KCS channel happens to be configured as 15): See: https://github.com/openbmc/phosphor-host-ipmid/blob/master/systemintfcmds.cpp#L35 Jian. > So on this machine, the event replies do not look short or malformed; > they look like repeated successful event-buffer reads with the same > payload. > > Thanks, > Matt > > > _______________________________________________ > Openipmi-developer mailing list > Ope...@li... > https://lists.sourceforge.net/lists/listinfo/openipmi-developer |
|
From: Matt F. <ma...@re...> - 2026-04-21 22:24:26
|
On Tue, Apr 21, 2026 at 07:42:42AM -0500, Corey Minyard wrote: > Matt reported that there were issues with the IPMI driver getting wedged > in some cases. It turns out that the BMC was not reporting an error as > it should have (per the spec) when the event queue was empty. The IPMI > driver would then request the next event, and so on, wedging the driver. Thanks for replying so quickly, Corey. I'll test these out. One bit of info I pulled out of the stuck machine is that the response looks properly formed. I sampled the first 8 entries and they were all identical 19-byte successful READ_EVENT_MSG_BUFFER responses: 1c 35 00 55 55 c0 41 a7 00 00 00 00 00 3a ff 00 ff ff ff So on this machine, the event replies do not look short or malformed; they look like repeated successful event-buffer reads with the same payload. Thanks, Matt |
|
From: Corey M. <co...@mi...> - 2026-04-21 17:55:03
|
Matt reported that there were issues with the IPMI driver getting wedged in some cases. It turns out that the BMC was not reporting an error as it should have (per the spec) when the event queue was empty. The IPMI driver would then request the next event, and so on, wedging the driver. The BMC sits on a fuzzy line between a trusted devices and a remote and possibly untrusted device. If you compromised a BMC you have all sorts of tools you can use to attack the host: the reset line, interrupts, and usually access to write the system firmware and possibly devices like disk drives, serial ports and VGA consoles. So attacking through this interface would not be the first thing you would do. But it is an possible attack point. I'm assuming that the BMC was delivering an empty message when this happens, so the first patch checks the message length to make sure it's a valid message. It's a good check no matter what, so it's in whether that's the issue or not. The second patch limits the number of events or messages that can be fetched at a time to 10. This is a good thing to do, anyway. If more message or events were present, the next flag check should get them. So it's a more general fix. I looked at adding the patch Matt suggested, doing a timeout on the wait, but that introduces some race conditions if the response comes back late. That will require some more thought. The timeouts with IPMI can be pretty long, the spec specifies fairly long timeouts, 5 seconds waiting for the BMC to respond to anything. So failing an operation can take some time, and reducing the timeouts is probably a bad idea. No rationale is given in the spec, but I'm guessing it expects that a BMC in restart can recover within 5 seconds, so it gives timeouts so the BMC is always available within that tie. The spec gives you the gist that the BMC should always be available on a system that has one. So the driver (at the beginning) followed that. Thus the driver tries 10 times for a message before it gives up, giving 50 seconds total failure time for a message. That is not in the spec (I don't think) so that could be made selectable on a per-message basis. There are already mechanisms for this available in the APIs; I'll look at that. -corey |
|
From: Corey M. <co...@mi...> - 2026-04-21 13:26:04
|
The driver would just fetch events and receive messages until the BMC said it was done. To avoid issues with BMCs that never say they are done, add a limit of 10 fetches at a time. This is a more general fix than the previous fix for the specific bad BMC, but should fix the more general issue of a BMC that won't stop saying it has data. This has been there from the beginning of the driver. Reported-by: Matt Fleming <mfl...@cl...> Closes: https://lore.kernel.org/lkml/202...@re.../ Fixes: <1da177e4c3f4> ("Linux-2.6.12-rc2") Cc: st...@vg... Signed-off-by: Corey Minyard <co...@mi...> --- drivers/char/ipmi/ipmi_si_intf.c | 15 +++++++++++++++ drivers/char/ipmi/ipmi_ssif.c | 15 +++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c index 08c208cc64c5..a705aae29867 100644 --- a/drivers/char/ipmi/ipmi_si_intf.c +++ b/drivers/char/ipmi/ipmi_si_intf.c @@ -168,6 +168,9 @@ struct smi_info { OEM2_DATA_AVAIL) unsigned char msg_flags; + /* When requesting events and messages, don't do it forever. */ + unsigned int num_requests_in_a_row; + /* Does the BMC have an event buffer? */ bool has_event_buffer; @@ -410,6 +413,7 @@ static void start_getting_msg_queue(struct smi_info *smi_info) start_new_msg(smi_info, smi_info->curr_msg->data, smi_info->curr_msg->data_size); + smi_info->num_requests_in_a_row = 0; smi_info->si_state = SI_GETTING_MESSAGES; } @@ -421,6 +425,7 @@ static void start_getting_events(struct smi_info *smi_info) start_new_msg(smi_info, smi_info->curr_msg->data, smi_info->curr_msg->data_size); + smi_info->num_requests_in_a_row = 0; smi_info->si_state = SI_GETTING_EVENTS; } @@ -646,6 +651,11 @@ static void handle_transaction_done(struct smi_info *smi_info) } else { smi_inc_stat(smi_info, events); + smi_info->num_requests_in_a_row++; + if (smi_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + smi_info->msg_flags &= ~EVENT_MSG_BUFFER_FULL; + /* * Do this before we deliver the message * because delivering the message releases the @@ -684,6 +694,11 @@ static void handle_transaction_done(struct smi_info *smi_info) } else { smi_inc_stat(smi_info, incoming_messages); + smi_info->num_requests_in_a_row++; + if (smi_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + smi_info->msg_flags &= ~RECEIVE_MSG_AVAIL; + /* * Do this before we deliver the message * because delivering the message releases the diff --git a/drivers/char/ipmi/ipmi_ssif.c b/drivers/char/ipmi/ipmi_ssif.c index b49500a1bd36..547447f304ba 100644 --- a/drivers/char/ipmi/ipmi_ssif.c +++ b/drivers/char/ipmi/ipmi_ssif.c @@ -225,6 +225,9 @@ struct ssif_info { bool has_event_buffer; bool supports_alert; + /* When requesting events and messages, don't do it forever. */ + unsigned int num_requests_in_a_row; + /* * Used to tell what we should do with alerts. If we are * waiting on a response, read the data immediately. @@ -413,6 +416,7 @@ static void start_event_fetch(struct ssif_info *ssif_info, unsigned long *flags) } ssif_info->curr_msg = msg; + ssif_info->num_requests_in_a_row = 0; ssif_info->ssif_state = SSIF_GETTING_EVENTS; ipmi_ssif_unlock_cond(ssif_info, flags); @@ -436,6 +440,7 @@ static void start_recv_msg_fetch(struct ssif_info *ssif_info, } ssif_info->curr_msg = msg; + ssif_info->num_requests_in_a_row = 0; ssif_info->ssif_state = SSIF_GETTING_MESSAGES; ipmi_ssif_unlock_cond(ssif_info, flags); @@ -843,6 +848,11 @@ static void msg_done_handler(struct ssif_info *ssif_info, int result, ssif_info->msg_flags &= ~EVENT_MSG_BUFFER_FULL; handle_flags(ssif_info, flags); } else { + ssif_info->num_requests_in_a_row++; + if (ssif_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + ssif_info->msg_flags &= ~EVENT_MSG_BUFFER_FULL; + handle_flags(ssif_info, flags); ssif_inc_stat(ssif_info, events); deliver_recv_msg(ssif_info, msg); @@ -876,6 +886,11 @@ static void msg_done_handler(struct ssif_info *ssif_info, int result, ssif_info->msg_flags &= ~RECEIVE_MSG_AVAIL; handle_flags(ssif_info, flags); } else { + ssif_info->num_requests_in_a_row++; + if (ssif_info->num_requests_in_a_row > 10) + /* Stop if we do this too many times. */ + ssif_info->msg_flags &= ~RECEIVE_MSG_AVAIL; + ssif_inc_stat(ssif_info, incoming_messages); handle_flags(ssif_info, flags); deliver_recv_msg(ssif_info, msg); -- 2.43.0 |
|
From: Corey M. <co...@mi...> - 2026-04-21 13:26:03
|
The event message buffer response data size got checked later when processing, but check it right after the response comes back. It appears some BMCs may return an empty message instead of an error when fetching events. There are apparently some new BMCs that make this error, so we need to compensate. Reported-by: Matt Fleming <mfl...@cl...> Closes: https://lore.kernel.org/lkml/202...@re.../ Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: <st...@vg...> Signed-off-by: Corey Minyard <co...@mi...> --- drivers/char/ipmi/ipmi_si_intf.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c index 4a9e9de4d684..08c208cc64c5 100644 --- a/drivers/char/ipmi/ipmi_si_intf.c +++ b/drivers/char/ipmi/ipmi_si_intf.c @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info) */ msg = smi_info->curr_msg; smi_info->curr_msg = NULL; - if (msg->rsp[2] != 0) { + /* + * It appears some BMCs, with no event data, return no + * data in the message and not a 0x80 error as the + * spec says they should. Shut down processing if + * the data is not the right length. + */ + if (msg->rsp[2] != 0 || msg->rsp_size != 19) { /* Error getting event, probably done. */ msg->done(msg); -- 2.43.0 |
|
From: Dan C. <er...@gm...> - 2026-04-21 04:52:22
|
Hello Corey Minyard,
Commit 75c486cb1bca ("ipmi:ssif: Clean up kthread on errors") from
Apr 13, 2026 (linux-next), leads to the following Smatch static
checker warning:
drivers/char/ipmi/ipmi_ssif.c:1923 ssif_probe()
error: 'ssif_info->thread' dereferencing possible ERR_PTR()
drivers/char/ipmi/ipmi_ssif.c
1874 ssif_info->handlers.owner = THIS_MODULE;
1875 ssif_info->handlers.start_processing = ssif_start_processing;
1876 ssif_info->handlers.shutdown = shutdown_ssif;
1877 ssif_info->handlers.get_smi_info = get_smi_info;
1878 ssif_info->handlers.sender = sender;
1879 ssif_info->handlers.request_events = request_events;
1880 ssif_info->handlers.set_need_watch = ssif_set_need_watch;
1881
1882 thread_num = ((i2c_adapter_id(ssif_info->client->adapter) << 8) |
1883 ssif_info->client->addr);
1884 init_completion(&ssif_info->wake_thread);
1885 ssif_info->thread = kthread_run(ipmi_ssif_thread, ssif_info,
1886 "kssif%4.4x", thread_num);
1887 if (IS_ERR(ssif_info->thread)) {
^^^^^^^^^^^^^^^^^
1888 rv = PTR_ERR(ssif_info->thread);
1889 dev_notice(&ssif_info->client->dev,
1890 "Could not start kernel thread: error %d\n",
1891 rv);
1892 goto out;
1893 }
1894
1895 dev_set_drvdata(&ssif_info->client->dev, ssif_info);
1896 rv = device_add_group(&ssif_info->client->dev,
1897 &ipmi_ssif_dev_attr_group);
1898 if (rv) {
1899 dev_err(&ssif_info->client->dev,
1900 "Unable to add device attributes: error %d\n",
1901 rv);
1902 goto out;
1903 }
1904
1905 rv = ipmi_register_smi(&ssif_info->handlers,
1906 ssif_info,
1907 &ssif_info->client->dev,
1908 slave_addr);
1909 if (rv) {
1910 dev_err(&ssif_info->client->dev,
1911 "Unable to register device: error %d\n", rv);
1912 goto out_remove_attr;
1913 }
1914
1915 out:
1916 if (rv) {
1917 /*
1918 * If ipmi_register_smi() starts the interface, it will
1919 * call shutdown and that will free the thread and set
1920 * it to NULL. Otherwise it must be freed here.
1921 */
1922 if (ssif_info->thread) {
if (!IS_ERR_OR_NULL(ssif_info->thread))
--> 1923 kthread_stop(ssif_info->thread);
1924 ssif_info->thread = NULL;
1925 }
1926 if (addr_info)
1927 addr_info->client = NULL;
1928
1929 dev_err(&ssif_info->client->dev,
1930 "Unable to start IPMI SSIF: %d\n", rv);
1931 i2c_set_clientdata(client, NULL);
1932 kfree(ssif_info);
1933 }
1934 kfree(resp);
1935 mutex_unlock(&ssif_infos_mutex);
1936 return rv;
1937
1938 out_remove_attr:
1939 device_remove_group(&ssif_info->client->dev, &ipmi_ssif_dev_attr_group);
1940 dev_set_drvdata(&ssif_info->client->dev, NULL);
1941 goto out;
1942 }
This email is a free service from the Smatch-CI project [smatch.sf.net].
regards,
dan carpenter
|
|
From: Corey M. <co...@mi...> - 2026-04-20 18:11:46
|
On Mon, Apr 20, 2026 at 11:33:01AM -0500, Corey Minyard wrote: > On Sun, Apr 19, 2026 at 09:50:38PM +0100, Matt Fleming wrote: > > On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote: > > > > > > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful > > > READ_EVENT_MSG_BUFFER command completes. Getting data from the > > > BMC has higher priority than sending data to the BMC. > > > > > > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then > > > that would certainly wedge the driver. But it would have to continually > > > report success for that command, which would be strange as its supposed > > > to error out when the queue is empty. > > > > That does indeed appear to be what's happening. > > > > The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER > > handler does not fail when there is nothing to read, > > > > https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704 > > Actually, that is so clearly wrong that it's hard to imagine they made > this mistake. Anyway, a defect needs to be filed against it. It will > certainly break other software on other OSes. > > I think I have an easy workaround, though I'm guessing. I'm guessing > they are returning zero data bytes. There's no check on the size at that > point in the code (it's later). > > Can you try the following patch? Actually ignore that, I was thinking too much about the other stuff and didn't actually read my patch. Here's a working patch: diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c index 4a9e9de4d684..08c208cc64c5 100644 --- a/drivers/char/ipmi/ipmi_si_intf.c +++ b/drivers/char/ipmi/ipmi_si_intf.c @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info) */ msg = smi_info->curr_msg; smi_info->curr_msg = NULL; - if (msg->rsp[2] != 0) { + /* + * It appears some BMCs, with no event data, return no + * data in the message and not a 0x80 error as the + * spec says they should. Shut down processing if + * the data is not the right length. + */ + if (msg->rsp[2] != 0 || msg->rsp_size != 19) { /* Error getting event, probably done. */ msg->done(msg); > > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c > index 4a9e9de4d684..cf8674a93af1 100644 > --- a/drivers/char/ipmi/ipmi_si_intf.c > +++ b/drivers/char/ipmi/ipmi_si_intf.c > @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info) > */ > msg = smi_info->curr_msg; > smi_info->curr_msg = NULL; > - if (msg->rsp[2] != 0) { > + /* > + * It appears some BMCs, with no event data, return no > + * data in the message and not a 0x80 error as the > + * spec says they should. Shut down processing if > + * the data is not the right length. > + */ > + if (msg->rsp[2] != 0 || smi_info->curr_msg->rsp_size != 19) { > /* Error getting event, probably done. */ > msg->done(msg); > > > With your approval on that, I'll send it to Linus after letting it sit > in the next tree for a bit. Actually, I'll probably add it in any case, > as it's a good check to do. > > > > > > If it's really something like that, I could also look at adding limits > > > for those operations. > > > > That would be great. Me and Fred would be happy to test out any patch. > > > > I still think the original patch I sent is a worthwhile defense. > > > > Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to > > block behind one another when we hit these kinds of issues in the IPMI > > code. Untangling that across thousands of machines can be time > > consuming and a more explicit EIO or ETIMEDOUT would help with triage. > > Unfortunately, that might have other issues, similar to the ones the > people with the watchdog issue found. I'll look at it a bit, but those > sorts of things would have to be scattered all over the code, not just > in that one place. As you say, it would make debugging easier. > > I think adding a counter to the number of operations occuring from a > single flag fetch would be a way to avoid this issue. That's going to > take a little more time, but I'll definitely work on that. > > Thanks, > > -corey |
|
From: Corey M. <co...@mi...> - 2026-04-20 16:33:18
|
On Sun, Apr 19, 2026 at 09:50:38PM +0100, Matt Fleming wrote: > On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote: > > > > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful > > READ_EVENT_MSG_BUFFER command completes. Getting data from the > > BMC has higher priority than sending data to the BMC. > > > > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then > > that would certainly wedge the driver. But it would have to continually > > report success for that command, which would be strange as its supposed > > to error out when the queue is empty. > > That does indeed appear to be what's happening. > > The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER > handler does not fail when there is nothing to read, > > https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704 Actually, that is so clearly wrong that it's hard to imagine they made this mistake. Anyway, a defect needs to be filed against it. It will certainly break other software on other OSes. I think I have an easy workaround, though I'm guessing. I'm guessing they are returning zero data bytes. There's no check on the size at that point in the code (it's later). Can you try the following patch? diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c index 4a9e9de4d684..cf8674a93af1 100644 --- a/drivers/char/ipmi/ipmi_si_intf.c +++ b/drivers/char/ipmi/ipmi_si_intf.c @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info) */ msg = smi_info->curr_msg; smi_info->curr_msg = NULL; - if (msg->rsp[2] != 0) { + /* + * It appears some BMCs, with no event data, return no + * data in the message and not a 0x80 error as the + * spec says they should. Shut down processing if + * the data is not the right length. + */ + if (msg->rsp[2] != 0 || smi_info->curr_msg->rsp_size != 19) { /* Error getting event, probably done. */ msg->done(msg); With your approval on that, I'll send it to Linus after letting it sit in the next tree for a bit. Actually, I'll probably add it in any case, as it's a good check to do. > > > If it's really something like that, I could also look at adding limits > > for those operations. > > That would be great. Me and Fred would be happy to test out any patch. > > I still think the original patch I sent is a worthwhile defense. > > Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to > block behind one another when we hit these kinds of issues in the IPMI > code. Untangling that across thousands of machines can be time > consuming and a more explicit EIO or ETIMEDOUT would help with triage. Unfortunately, that might have other issues, similar to the ones the people with the watchdog issue found. I'll look at it a bit, but those sorts of things would have to be scattered all over the code, not just in that one place. As you say, it would make debugging easier. I think adding a counter to the number of operations occuring from a single flag fetch would be a way to avoid this issue. That's going to take a little more time, but I'll definitely work on that. Thanks, -corey |
|
From: Sasha L. <sa...@ke...> - 2026-04-20 13:31:05
|
From: Jian Zhang <zha...@by...>
[ Upstream commit 7fc3e2546cf3fa9a28a2acc92a512c779a8e5038 ]
The response timer can stay armed across device teardown. If it fires after
remove, the callback dereferences the SSIF context and the i2c client after
teardown has started.
Cancel the timer in remove so the callback cannot run after the device is
unregistered.
Signed-off-by: Jian Zhang <zha...@by...>
Message-ID: <202...@by...>
Signed-off-by: Corey Minyard <co...@mi...>
Signed-off-by: Sasha Levin <sa...@ke...>
---
LLM Generated explanations, may be completely bogus:
Now I have enough information to do the full analysis.
---
## Phase 1: Commit Message Forensics
**Step 1.1 — Subject parsing**
Record: Subsystem `ipmi: ssif_bmc:`; action verb `cancel` (equivalent to
"fix/prevent"); summary: cancel response timer in remove path so a
queued timer callback cannot dereference a torn-down context.
**Step 1.2 — Commit message tags**
Record:
- `Signed-off-by: Jian Zhang <zha...@by...>` (author)
- `Message-ID: <202...@by...>`
— lore msgid
- `Signed-off-by: Corey Minyard <co...@mi...>` — IPMI subsystem
maintainer applied it
- No `Fixes:` tag, no `Cc: stable`, no `Reported-by:`, no `Reviewed-by:`
/ `Acked-by:` in the applied version (absence of Fixes/Cc stable is
expected in this pipeline and not a negative signal)
- Related patches 2/5, 3/5, 4/5 in the same series DO carry `Fixes:
dd2bc5cc9e25 ("ipmi: ssif_bmc: Add SSIF BMC driver")`
**Step 1.3 — Body analysis**
Record: Bug description is explicit — "response timer can stay armed
across device teardown"; failure mode is "the callback dereferences the
SSIF context and the i2c client after teardown has started". This is a
classic **use-after-free** scenario. The author names the exact
mechanism (pending timer firing after `ssif_bmc_remove()`).
**Step 1.4 — Hidden bug detection**
Record: Not hidden. Clearly presented as a UAF-prevention fix even
without the word "fix" in the subject. The phrase "so the callback
cannot run after the device is unregistered" is unambiguous fix
language.
## Phase 2: Diff Analysis
**Step 2.1 — Inventory**
Record: 1 file (`drivers/char/ipmi/ssif_bmc.c`), +1/-0 lines, inside
`ssif_bmc_remove()`. Single-file surgical fix.
**Step 2.2 — Flow change**
Record: Before: `ssif_bmc_remove()` calls `i2c_slave_unregister()` then
`misc_deregister()` and returns, leaving `response_timer` possibly
pending. After: `timer_delete_sync(&ssif_bmc->response_timer)` is called
first, which both cancels a pending timer and waits for any in-flight
callback on another CPU to finish before continuing.
**Step 2.3 — Bug mechanism**
Record: Category (d) memory safety / (b) synchronization with teardown.
Root cause:
- `ssif_bmc` is `devm_kzalloc`'d at probe (line 809), so it is freed by
devres AFTER `.remove` returns.
- `response_timer` is armed via `mod_timer(&ssif_bmc->response_timer,
jiffies + 500 ms)` inside `handle_request()` (line 335).
- `response_timeout()` callback dereferences `ssif_bmc`
(`timer_container_of`), takes `ssif_bmc->lock` and touches several
fields (lines 300–315).
- Without `timer_delete_sync()` in remove, the timer can fire after
remove returns and after devres frees `ssif_bmc`, producing a UAF on
`ssif_bmc` and on `ssif_bmc->client`. On module unload the callback
address itself may also be in freed module text.
**Step 2.4 — Fix quality**
Record: 1 line, the canonical pattern (`timer_delete_sync()` in remove
for a driver-owned timer). V1 of the patch used the non-sync variant;
review led to v2 using `timer_delete_sync()`, which is the correct
choice because it also waits for a concurrent callback on another CPU.
Zero-initialized `timer_list` (never armed because `handle_request`
never ran) is safely handled by `timer_delete_sync()` —
`timer_pending()` returns false on a zeroed list entry. No regression
risk.
## Phase 3: Git History Investigation
**Step 3.1 — Blame**
Record: `git blame -L 820,870` shows `ssif_bmc_remove` untouched since
`dd2bc5cc9e2555` (Quan Nguyen, 2022-10-04), which first appears in
**v6.2**. So the bug has been present in every release from v6.2 onward.
**Step 3.2 — Fixes: tag follow-up**
Record: No explicit `Fixes:` tag on this patch, but companion patches
2/5, 3/5, 4/5 all point to `dd2bc5cc9e25`. The same commit introduces
both the timer and the buggy `ssif_bmc_remove()`. `dd2bc5cc9e25` is
present in stable branches from 6.6.y onward (verified via `git show
pending-6.6:drivers/char/ipmi/ssif_bmc.c`) but NOT in 6.1.y (`git show
pending-6.1:drivers/char/ipmi/ssif_bmc.c` reports the file does not
exist — driver was added after 6.1 branched).
**Step 3.3 — File history**
Record: Recent file commits: 41cb08555c416 (treewide
`timer_container_of()` rename, v6.16), 8fa7292fee5c5 (treewide
`timer_delete[_sync]()` rename, v6.15), plus prior IPMI fixes. No
prerequisites for this patch. It is self-contained.
**Step 3.4 — Author context**
Record: Jian Zhang has multiple prior kernel contributions (NCSI, MCTP,
i2c-aspeed, etc.). Not the maintainer, but the patch was applied by
`co...@mi...`, the IPMI maintainer, and was discussed with Quan
Nguyen, the original driver author.
**Step 3.5 — Dependencies**
Record: None. A single `timer_delete_sync()` call inside `.remove` does
not rely on any other patch in the series. For stable branches below
v6.15, the API is `del_timer_sync()` — a trivial rename that the stable
maintainers routinely handle as a backport adjustment.
## Phase 4: Mailing List Research
**Step 4.1 — Original submission**
Record: `b4 dig -c 7fc3e2546cf3f` found the thread at `https://lore.kern
el.org/all/202...@by.../`. This
is patch **1/5** in a 5-patch series of IPMI SSIF BMC fixes.
**`b4 dig -a` — revisions**: v1 (2026-04-02) and v2 (2026-04-03). v2
changelog: "use `timer_delete_sync()` to cancel the timer" — review
feedback upgraded v1's non-sync delete to the sync variant.
In-thread discussion from Corey Minyard (reply to v2 1/5): "Thanks for
the updates on this. I have the new version in my tree." — explicit
maintainer acceptance.
**Step 4.2 — Recipients**
Record: CC list on v2 was Corey Minyard (IPMI maintainer), Quan Nguyen
(original author of the driver / MAINTAINERS entry), openipmi-developer
list, linux-kernel. Correct audience; reviewed by the right people.
**Step 4.3 — Bug report**
Record: No syzbot / bugzilla / user report cited. The bug was apparently
found through code review or internal testing at Bytedance. Unverified
whether there is a field occurrence.
**Step 4.4 — Series context**
Record: Siblings in the same series (all for
`drivers/char/ipmi/ssif_bmc.c`, all with `Fixes: dd2bc5cc9e25`):
copy_to_user partial failure fix (2/5), message desynchronization after
truncated response (3/5), log-level change (4/5), and a kunit test
(5/5). This patch is independent and applies on its own.
**Step 4.5 — Stable-list discussion**
Record: None visible. Anubis anti-bot blocked direct lore searches, but
the archived mbox I pulled via `b4 dig -m` contained no stable-list
cross-post or stable nomination.
## Phase 5: Code Semantic Analysis
**Step 5.1 — Modified function**: `ssif_bmc_remove()`.
**Step 5.2 — Callers**
Record: `ssif_bmc_remove` is assigned to `i2c_driver.remove` and invoked
by the i2c bus core when the device is unbound (rmmod, device unbind via
sysfs, or driver removal).
**Step 5.3 — Callees/related**
Record: The fix target is `response_timer`. Arming site is
`handle_request()` (`mod_timer(... RESPONSE_TIMEOUT=500 ms)`). Callback
is `response_timeout()` which locks `ssif_bmc->lock` and touches
`ssif_bmc->busy`, `response_timer_inited`, `aborting` — all of which are
inside a `devm_kzalloc`'d struct.
**Step 5.4 — Reachability**
Record: Triggered every time a SSIF IPMI request is handled from the
host side; the window to teardown can be up to 500 ms per request.
Remove path is reached when the module is unloaded or the i2c device is
unbound — a real, not theoretical, code path in BMC firmware
environments (OpenBMC) that allow driver rebinding.
**Step 5.5 — Similar patterns**
Record: `timer_delete_sync()` in driver `.remove` is the standard idiom
for any driver-owned timer with a pointer to devres-managed state. The
same pattern is also added in the kunit test (patch 5/5) for correctness
of the test fixture.
## Phase 6: Stable-tree Analysis
**Step 6.1 — Exists in stable?**
Record: Buggy `ssif_bmc_remove()` exists identically in 6.6.y, 6.12.y,
6.15.y, 6.16.y and later (verified by `git show
pending-6.X:drivers/char/ipmi/ssif_bmc.c`). Not present in 6.1.y (driver
added after 6.1 branched).
**Step 6.2 — Backport complications**
Record: Mainline (7.0) uses `timer_delete_sync()`; stable branches <
v6.15 expose it as `del_timer_sync()`. This is a trivial rename that
stable maintainers handle routinely. Otherwise the patch applies cleanly
— the 3 surrounding lines (`struct ssif_bmc_ctx *ssif_bmc = ...;
i2c_slave_unregister(client); misc_deregister(&ssif_bmc->miscdev);`) are
identical in all stable branches I checked.
**Step 6.3 — Prior stable fixes**
Record: No earlier fix for this UAF window has been applied to any
stable branch I checked.
## Phase 7: Subsystem Context
**Step 7.1 — Criticality**: `drivers/char/ipmi/ssif_bmc.c` — PERIPHERAL
(BMC-side SSIF; used on Linux-running BMCs, e.g. OpenBMC). Real users,
narrow hardware scope.
**Step 7.2 — Activity**: Low-frequency but active; Corey Minyard
actively maintains; Jian Zhang's 5-patch series is a recent hardening
round.
## Phase 8: Impact and Risk
**Step 8.1 — Affected users**: Users of Linux BMC firmware that exposes
SSIF to a host CPU (OpenBMC et al.).
**Step 8.2 — Trigger**: Requires device unbind/rmmod while a timer from
a recent IPMI request is still armed (≤500 ms window per request). Not
routinely triggered at runtime on production BMCs that never unbind, but
reachable on development / test / field-service paths and by privileged
userspace actions.
**Step 8.3 — Failure mode**: UAF on `ssif_bmc` (and potentially on
`client`), plus potential call into freed module text on module unload.
Severity: HIGH — potential kernel crash / memory corruption; the
callback also takes a spinlock on the freed struct.
**Step 8.4 — Risk/benefit**
Record: Benefit = eliminates a real UAF window on device removal in a
driver present in many live stable trees. Risk = near-zero; one line,
canonical idiom, no control-flow change, safe on a never-initialized
`timer_list`. Ratio is strongly favorable.
## Phase 9: Synthesis
**Evidence for**: Classic UAF-prevention pattern; minimal 1-line diff;
obviously correct; review feedback already incorporated (v1 → v2
upgraded non-sync to sync); IPMI maintainer applied it; original driver
author was on CC; self-contained (no dependencies); bug present since
driver introduction (v6.2).
**Evidence against**: No explicit `Fixes:` tag on this specific patch
(though companion patches in the series tag dd2bc5cc9e25); no
syzbot/user report cited; backport to < 6.15 needs `del_timer_sync()`
rename.
**Stable-rules checklist**
1. Obviously correct and tested — YES (1 line, canonical, in maintainer
tree)
2. Real bug — YES (UAF on timer callback after teardown)
3. Important issue — YES (UAF / possible kernel panic)
4. Small and contained — YES (+1/-0 line, single function)
5. No new features/APIs — YES (adds a cleanup call only)
6. Applies to stable — YES, with a trivial API rename for < v6.15
**Exception categories**: Not applicable; stands on bug-fix merit alone.
**Decision**: YES. Small UAF fix in a never-canceled timer on device
teardown, accepted by the subsystem maintainer, safe and trivial to
backport to all stable trees that carry the driver (6.6.y and later).
## Verification
- [Phase 1] Parsed subject / body / tags directly from the commit and
companion series — confirmed no `Fixes:`/`Cc: stable`, but sibling
patches in series carry `Fixes: dd2bc5cc9e25`.
- [Phase 2] Read `ssif_bmc.c` lines 77–100 (struct fields including
`response_timer`), 200–228 (`timer_delete()` in write path), 298–336
(`response_timeout` callback + `mod_timer` arming site), 804–848
(probe/remove) — confirmed `ssif_bmc` is `devm_kzalloc`'d (line 809)
and that `response_timeout` dereferences fields inside it.
- [Phase 3] `git blame -L 820,870 drivers/char/ipmi/ssif_bmc.c` — buggy
`ssif_bmc_remove` unchanged since `dd2bc5cc9e2555`.
- [Phase 3] `git show --stat dd2bc5cc9e255` — initial driver add (Oct
2022).
- [Phase 3] `git tag --contains dd2bc5cc9e255` — earliest release is
**v6.2**.
- [Phase 3] `git tag --contains 8fa7292fee5c5` — `timer_delete_sync()`
rename lands in v6.15.
- [Phase 3] `git tag --contains 41cb08555c416` — `timer_container_of()`
rename lands in v6.16.
- [Phase 4] `b4 dig -c 7fc3e2546cf3f` — matched series at `lore.kernel.o
rg/all/202...@by.../`.
- [Phase 4] `b4 dig -c 7fc3e2546cf3f -a` — two revisions (v1 and v2);
applied version is the latest (v2).
- [Phase 4] `b4 dig -m /tmp/ssif_thread.mbx` — full thread saved; Corey
Minyard acknowledges acceptance: "I have the new version in my tree."
v2 changelog: "use timer_delete_sync() to cancel the timer".
- [Phase 5] `Grep response_timer` in the file — arming site in
`handle_request`, callback `response_timeout` dereferences `ssif_bmc`
(line 300), uses `ssif_bmc->lock`.
- [Phase 6] `git show pending-6.6:drivers/char/ipmi/ssif_bmc.c` and
`pending-6.12`, `pending-6.15`, `pending-6.16` — confirmed identical
buggy `ssif_bmc_remove()` in all of them; confirmed older trees use
`del_timer_sync()` / `from_timer()`.
- [Phase 6] `git show pending-6.1:drivers/char/ipmi/ssif_bmc.c` — file
does not exist (driver added after 6.1).
- [Phase 8] Failure mode: timer callback on freed devres memory → UAF /
possible panic on module unload (inferred from code inspection of the
callback plus `devm_kzalloc` semantics).
- UNVERIFIED: no reachable syzbot/bugzilla/user report cited; impact in
field is plausible but I did not locate a concrete crash report.
- UNVERIFIED: did not retrieve v1 body directly (Anubis blocked lore
fetch), but v2 changelog and the maintainer reply unambiguously
document the v1 → v2 change.
**YES**
drivers/char/ipmi/ssif_bmc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/char/ipmi/ssif_bmc.c b/drivers/char/ipmi/ssif_bmc.c
index 7a52e3ea49ed8..dc1d5bb4a4604 100644
--- a/drivers/char/ipmi/ssif_bmc.c
+++ b/drivers/char/ipmi/ssif_bmc.c
@@ -843,6 +843,7 @@ static void ssif_bmc_remove(struct i2c_client *client)
{
struct ssif_bmc_ctx *ssif_bmc = i2c_get_clientdata(client);
+ timer_delete_sync(&ssif_bmc->response_timer);
i2c_slave_unregister(client);
misc_deregister(&ssif_bmc->miscdev);
}
--
2.53.0
|
|
From: Matt F. <ma...@re...> - 2026-04-19 20:50:47
|
On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote: > > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful > READ_EVENT_MSG_BUFFER command completes. Getting data from the > BMC has higher priority than sending data to the BMC. > > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then > that would certainly wedge the driver. But it would have to continually > report success for that command, which would be strange as its supposed > to error out when the queue is empty. That does indeed appear to be what's happening. The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER handler does not fail when there is nothing to read, https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704 > If it's really something like that, I could also look at adding limits > for those operations. That would be great. Me and Fred would be happy to test out any patch. I still think the original patch I sent is a worthwhile defense. Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to block behind one another when we hit these kinds of issues in the IPMI code. Untangling that across thousands of machines can be time consuming and a more explicit EIO or ETIMEDOUT would help with triage. |
|
From: <pr-...@ke...> - 2026-04-18 18:30:03
|
The pull request you sent on Fri, 17 Apr 2026 20:24:34 -0500: > https://github.com/cminyard/linux-ipmi.git tags/for-linus-7.1-1 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/1e769656963e0329b91d32ec76955e077966b603 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html |
|
From: Corey M. <co...@mi...> - 2026-04-18 01:24:51
|
The following changes since commit af4e9ef3d78420feb8fe58cd9a1ab80c501b3c08: uaccess: Fix scoped_user_read_access() for 'pointer to const' (2026-03-02 09:24:32 -0800) are available in the Git repository at: https://github.com/cminyard/linux-ipmi.git tags/for-linus-7.1-1 for you to fetch changes up to 75c486cb1bcaa1a3ec3a6438498176a3a4998ae4: ipmi:ssif: Clean up kthread on errors (2026-04-17 06:47:40 -0500) ---------------------------------------------------------------- ipmi: Small updates and fixes Mostly fixes to the BMC software. Fix one issue in the host side driver where a kthread can be left running on a specific memory allocation failre at probe time. Replace system_wq with system_percpu_wq so system_wq can eventually go away. ---------------------------------------------------------------- Corey Minyard (2): ipmi:ssif: Remove unnecessary indention ipmi:ssif: Clean up kthread on errors Jian Zhang (6): ipmi: ssif_bmc: cancel response timer on remove ipmi: ssif_bmc: fix missing check for copy_to_user() partial failure ipmi: ssif_bmc: fix message desynchronization after truncated response ipmi: ssif_bmc: change log level to dbg in irq callback ipmi: ssif_bmc: add unit test for state machine ipmi: ssif_bmc: Fix KUnit test link failure when KUNIT=m Marco Crivellari (1): ipmi: Replace use of system_wq with system_percpu_wq drivers/char/ipmi/Kconfig | 10 + drivers/char/ipmi/ipmi_msghandler.c | 10 +- drivers/char/ipmi/ipmi_ssif.c | 41 ++-- drivers/char/ipmi/ssif_bmc.c | 405 +++++++++++++++++++++++++++++++++++- 4 files changed, 435 insertions(+), 31 deletions(-) |
|
From: Corey M. <co...@mi...> - 2026-04-17 23:54:11
|
On Fri, Apr 17, 2026 at 11:23:03PM +0100, Matt Fleming wrote: > On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote: > > > > The lower level driver should never not return an answer, it is supposed > > to guarantee that it returns an error if the BMC doesn't respond. > > > > So the bug is not here, the bug is elsewhere. My guess is that there > > is some new failure mode where a BMC is not working but it responds well > > enough that it sort of works and fools the driver. But that's only a > > guess. > > I can now reproduce this pretty reliably by running concurrent > ipmitool commands (sensor/sel/mc info) + sysfs readers + periodic > ipmitool mc reset cold. It wedges in a few minutes. Hmm. If you are sending cold resets, then the driver is going into reset maintenance mode and it should be rejecting messages for 30 seconds after you send that command. You can disable that by changing is_maintenance_mode_cmd() in ipmi_msghandler.c to always return false. > > My working theory is handle_flags() in ipmi_si_intf.c can loop on > flag-driven commands (e.g. READ_EVENT_MSG_BUFFER) without ever calling > start_next_msg(), starving waiting_msg indefinitely. > > Captured state at wedge: > > si_state=SI_GETTING_EVENTS msg_flags=0x02 > si_curr cycling cmd=0x35 (READ_EVENT_MSG_BUFFER) > si_wait frozen cmd=0x08 (GET_DEVICE_GUID, never promoted) > > The cold reset makes the BMC report EVENT_MSG_BUFFER_FULL during > re-init, which drives the flag loop. The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful READ_EVENT_MSG_BUFFER command completes. Getting data from the BMC has higher priority than sending data to the BMC. If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then that would certainly wedge the driver. But it would have to continually report success for that command, which would be strange as its supposed to error out when the queue is empty. If it's really something like that, I could also look at adding limits for those operations. To debug things like this I often add module_params that let me see what is going on. But you can look at the "invalid_events" counter to see if the data is bogus. Or there should be an "Event queue full, discarding incoming events" log coming out once at the beginning of when this happens. -corey > > Thanks, > Matt |
|
From: Matt F. <ma...@re...> - 2026-04-17 22:23:17
|
On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote: > > The lower level driver should never not return an answer, it is supposed > to guarantee that it returns an error if the BMC doesn't respond. > > So the bug is not here, the bug is elsewhere. My guess is that there > is some new failure mode where a BMC is not working but it responds well > enough that it sort of works and fools the driver. But that's only a > guess. I can now reproduce this pretty reliably by running concurrent ipmitool commands (sensor/sel/mc info) + sysfs readers + periodic ipmitool mc reset cold. It wedges in a few minutes. My working theory is handle_flags() in ipmi_si_intf.c can loop on flag-driven commands (e.g. READ_EVENT_MSG_BUFFER) without ever calling start_next_msg(), starving waiting_msg indefinitely. Captured state at wedge: si_state=SI_GETTING_EVENTS msg_flags=0x02 si_curr cycling cmd=0x35 (READ_EVENT_MSG_BUFFER) si_wait frozen cmd=0x08 (GET_DEVICE_GUID, never promoted) The cold reset makes the BMC report EVENT_MSG_BUFFER_FULL during re-init, which drives the flag loop. Thanks, Matt |
|
From: Matt F. <ma...@re...> - 2026-04-17 16:11:08
|
On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote:
>
> I've seen this before in several scenarios, including a system that put
> IPMI in the ACPI tree and it sort of worked but there was no BMC
> present. I had to disable that particular device.
>
> What hardware is involved here?
I'm fairly sure we've seen this across a bunch of different BMCs, so
it's not a BMC hardware thing. Almost certainly a driver issue.
> Can you give a more detailed example of what's happening in the
> low-level hardware? If it's KCS there's a debug flag in the
> drivers/char/ipmi/ipmi_kcs_sm.c file that should help.
Yep, it's KCS. Unfortunately I haven't found a way to reproduce this
reliably yet.
Looking at a wedged machine (cat /sys/class/ipmi/.../firmware_revision)
with drgn I can see that there's 99,846 messages sat on intf->xmit_msgs
and the KCS SM is idle (it's responding to internal traffic like Get
Global Enables and Get Msg Flags). So it looks like completions are
getting dropped.
We're running a 6.18.18 kernel which includes c08ec55617cb ("ipmi: Fix
use-after-free and list corruption on sender error"), so it's not that.
Here's a dump of some of the data structures.
intf = 0xffff9d2f4a5a0000
intf->curr_msg = 0xffff9d34f21a9c00
intf->xmit_msgs.next = 0xffff9d30c28e3c80
intf->waiting_rcv_msgs = empty
intf->maintenance_mode = 0
intf->maintenance_mode_state = 0
intf->in_shutdown = false
intf->seq_table = 0/64 slots used
intf->smi_work.pending = 0
The stuck message itself — intf->curr_msg:
msg @ 0xffff9d34f21a9c00
.data = { 0x18, 0x01 } # NetFn 0x06 (App), cmd 0x01 = Get Device ID
.data_size = 2
.rsp_size = 38
.rsp[0..7] = 2c 01 00 00 ...
.done = free_smi_msg
.user_data = NULL
.msgid = (internal GDI poll)
.type = IPMI_SMI_MSG_TYPE_NORMAL
smi_info = 0xffff9d2f4a010000
smi_info->si_state = SI_NORMAL (0)
smi_info->curr_msg = 0xffff9d2f48c7b800
smi_info->waiting_msg = NULL
smi_info->interrupt_disabled = false
smi_info->supports_event_msg_buff = true
smi_info->io.irq = 0
smi_info->run_to_completion = false
smi_info->in_maintenance_mode = 0
Let me know if you want any other info. I'll try to trace this and
catch it reproducing.
|
|
From: Matt F. <ma...@re...> - 2026-04-17 16:09:54
|
On Thu, Apr 16, 2026 at 10:28:50AM -0400, Tony Camuso wrote: > > In my testing with updates from the Linus tree, after a BMC cold reset: > 1. The KCS driver returned -EBUSY to callers (good) > 2. The watchdog daemon received the error and initiated shutdown > 3. No D-state hang > > My tests, conducted on a Dell PER640, verified that Corey's upstream fixes > cause the driver to properly return errors instead of blocking. > At least on that platform. > > Which hich low-level driver are you using (KCS, BT, SSIF)? > The PER640 uses KCS. > # cat /sys/class/ipmi/ipmi0/device/params 2>/dev/null > kcs,i/o,0xca8,rsp=4,rsi=1,rsh=0,irq=10,ipmb=32 $ cat /sys/class/ipmi/ipmi0/device/params kcs,i/o,0xca2,rsp=1,rsi=1,rsh=0,irq=0,ipmb=32 attentions 3 complete_transactions 7080342 events 3 flag_fetches 0 hosed_count 1 idles 25359147 incoming_messages 0 interrupts 0 long_timeouts 264790 short_timeouts 13723711 watchdog_pretimeouts 0 > Actually, no. The 54 commits I backported simply bring my RHEL-9 test kernel > to parity with the Linus tree, which includes [2] and ... > cae66f1a1dcd 2026-02-13 co...@mi... ipmi:si: Fix check for a misbehaving BMC Ah, I see we have some machines on v6.18.20 which includes this and they're still triggering this problem. |
|
From: Tony C. <tc...@re...> - 2026-04-16 14:29:05
|
On 4/15/26 5:22 PM, Frederick Lawler wrote: > Hi Corey & Tony, > > On Wed, Apr 15, 2026 at 11:46:27AM -0400, 'Tony Camuso' via kernel-team wrote: >> On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote: >>> From: Matt Fleming <mfl...@cl...> >>> >>> When the BMC does not respond to a "Get Device ID" command, the >>> wait_event() in __get_device_id() blocks forever in >>> TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent >>> sysfs reader then piles up in D state. Replace with >>> wait_event_timeout() to return -EIO after 1 second. >> >> On Wed, Apr 15, 2026 at 12:17:04PM, Corey Minyard wrote: >>> This is the second report I have of something like this. So >>> something is up. I'm adding Tony, who reported something like this >>> dealing with the watchdog. >>> >>> The lower level driver should never not return an answer, it is >>> supposed to guarantee that it returns an error if the BMC doesn't >>> respond. So the bug is not here, the bug is elsewhere. > > This is a bit of a throwback to our previous discussions around [1]. > > I did end up applying [2] based on that discussion, and had limited > success, but we still have external resets that cause us to enter > this undesirable state :( In my testing with updates from the Linus tree, after a BMC cold reset: 1. The KCS driver returned -EBUSY to callers (good) 2. The watchdog daemon received the error and initiated shutdown 3. No D-state hang My tests, conducted on a Dell PER640, verified that Corey's upstream fixes cause the driver to properly return errors instead of blocking. At least on that platform. Which hich low-level driver are you using (KCS, BT, SSIF)? The PER640 uses KCS. # cat /sys/class/ipmi/ipmi0/device/params 2>/dev/null kcs,i/o,0xca8,rsp=4,rsi=1,rsh=0,irq=10,ipmb=32 > [1]: https://lore.kernel.org/all/aJU...@ma.../ > [2]: https://lore.kernel.org/all/202...@mi.../ >> >> I've been tracking a related issue (RHEL customer case) where BMC >> reset while the IPMI watchdog is active causes D-state hangs. This >> appears to be the same root cause Matt is hitting. >> >> I backported the recent upstream KCS/SI fixes to a RHEL 9 test kernel >> (54 patches bringing it to mainline parity) and tested today on a >> Dell R640. > > I assume this patch series: "ipmi:watchdog: Fix panic, D-state hang, and > lost protection on BMC reset" [3]? > > [3]: https://lore.kernel.org/all/202...@re.../ > Actually, no. The 54 commits I backported simply bring my RHEL-9 test kernel to parity with the Linus tree, which includes [2] and ... cae66f1a1dcd 2026-02-13 co...@mi... ipmi:si: Fix check for a misbehaving BMC |