
#48 getevt exits with return code (rv = -3)

ipmiutil-3.1.x
closed
None
1
2024-08-20
2022-02-02
Jyothi
No

Hi Team,
I am trying to use the getevt option of ipmiutil (v3.15) as shown below:
ipmiutil getevt -t 0 -s -b -r /usr/local/bin/evt.sh
or
ipmiutil getevt -t 0 -s

The getevt daemon starts successfully but exits shortly afterwards, as shown below. The same command works properly with an older version of ipmiutil (v2.79).

e.g.,
NS9100_35# ipmiutil getevt -t 0 -s
ipmiutil getevent ver 3.15
-- BMC version 1.15, IPMI version 2.0
event receiver sa = 20 lun = 00
bmc enables = 0f
igetevent reading sensors ...
Get IPMI SEL events after ID 003a
igetevent waiting for events via method 1 (SEL_events)
Waiting 0 seconds for an event ...
get_event error: ret = 0xfffffffa
ipmiutil getevent exiting.

With the debug option (-x) enabled, this is what I got:
get_sel(779) rv=0 cc=0 id=779 next=ffff
sel ok, id=779 next=ffff

get_sel(779) rv=0 cc=0 id=779 next=ffff
sel ok, id=779 next=ffff

get_sel(779) rv=-3 cc=0 id=0 next=0
sel recid 779 error, rv = -3

Any help on how to keep getevt from exiting would be appreciated.

Thanks!

Discussion

  • Andy Cress

    Andy Cress - 2022-02-02

    That error means that the receive failed (LAN_ERR_RECV_FAIL = -3 in ipmicmd.h).
    There are several methods to get the events, mainly the SEL_events method and the GetMessage method.
    This by default uses the SEL_events method, and should continue to read the last SEL event until a new event occurs. The fact that it fails to read the SEL event that it was able to read twice before implies some firmware anomaly. Which firmware vendor is this?
    It may work better in this case to use -m instead of -s.
    If not, perhaps we could add special-case handling of -3 in igetevent.c for this firmware vendor to keep waiting if it gets this error.

    Info about these two methods, from the igetevent.c header comment:

    There are several methods to do this which are implemented here.

    The SEL method (-s):
      This method polls the SEL once a second, keeps track of the last
      SEL event read, and only new events are processed. This ensures
      that in a series of rapid events, all events are received in order,
      however, some transition-to-OK events may not be configured to
      write to the SEL on certain platforms.
      This method is used if getevent -s is specified.

    The ReadEventMessageBuffer method (-m getmessage option):
      This uses an IPMI Message Buffer in the BMC firmware to read
      each new event. This receives any event, but if two events
      occur nearly simultaneously, only the most recent of the two
      will be returned with this method. An example of simultaneous
      events might be, if a fan stops/fails, both the non-critical
      and critical fan threshold events would occur at that time.
      This is the default method for getevent. It would be used
      locally with the Intel IMB driver or with direct/driverless.
     
  • Andy Cress

    Andy Cress - 2022-02-02
    • status: open --> accepted
     
  • Jyothi

    Jyothi - 2022-02-03

    Hi Andy Cress,

    Thank you very much for your quick response.

    I forgot to mention that I also tried the GetMessage method, but it crashed as shown below.

    NS9100_35# ipmiutil getevt -t 0
    ipmiutil getevent ver 3.15
    -- BMC version 1.15, IPMI version 2.0
    event receiver sa = 20 lun = 00
    bmc enables = 0f
    igetevent reading sensors ...
    Get IPMI events from kcs driver
    igetevent waiting for events via method 2 (GetMessage)
    Waiting 0 seconds for an event ...
    *** buffer overflow detected ***: terminated
    Aborted (core dumped)

    [Jyothi] Where can I find the coredump?

    NS9100_35# ipmiutil getevt -t 0 -s
    ipmiutil getevent ver 3.15
    -- BMC version 1.15, IPMI version 2.0
    event receiver sa = 20 lun = 00
    bmc enables = 0f
    igetevent reading sensors ...
    Get IPMI SEL events after ID 0019
    igetevent waiting for events via method 1 (SEL_events)
    Waiting 0 seconds for an event ...
    get_event error: ret = 0xfffffffd
    ipmiutil getevent exiting.

    Which firmware vendor is this?
    [Jyothi] ipmiutil getevent ver 3.15
    -- BMC version 1.15, IPMI version 2.0

    uname -r

    5.4.0-42-generic

     

    Last edit: Jyothi 2022-02-03
  • Jyothi

    Jyothi - 2022-02-03

    If not, perhaps we could add special-case handling of -3 in igetevent.c for this firmware vendor to keep waiting if it gets this error.
    [Jyothi] I modified the code to continue the loop in case of return code -3. Now ipmiutil getevt exits with a different error code, '0xc7'.

    got event id 002b, sensor_type = 13
    event data: 2b 00 02 a7 96 fb 61 33 00 04 13 05 71 a0 03 18
    002b 02/03/22 08:47:35 MIN Bios Critical Interrupt #05 PCIe Cor Sensor PCIe Warn Receiver Error on (03:03.0) 71 [a0 03 18]
    Waiting 0 seconds for an event ...
    got event id 002c, sensor_type = 13
    event data: 2c 00 02 f3 96 fb 61 33 00 04 13 05 71 a0 03 18
    002c 02/03/22 08:48:51 MIN Bios Critical Interrupt #05 PCIe Cor Sensor PCIe Warn Receiver Error on (03:03.0) 71 [a0 03 18]
    Waiting 0 seconds for an event ...
    get_event error: ret = 0xc7
    ipmiutil getevent exiting.

    I enabled debug logs (-x option) and now the return code is different.

    sel ok, id=2c next=ffff
    ipmidir Cmd=43 NetFn=0a Lun=00 Sa=20 Data(6): 00 00 2c 00 00 ff
    Send Netfn=0a Cmd=43, raw: 00 20 28 43 00 00 2c 00 00 ff
    ipmidir Resp(b,43): status=0 cc=00, Data(18): ff ff 2c 00 02 f3 96 fb 61 33 00 04 13 05 71 a0 03 18
    get_sel(2c) rv=0 cc=0 id=2c next=ffff
    sel ok, id=2c next=ffff
    ipmidir Cmd=43 NetFn=0a Lun=00 Sa=20 Data(6): 00 00 2c 00 00 ff
    Send Netfn=0a Cmd=43, raw: 00 20 28 43 00 00 2c 00 00 ff
    ipmidir Resp(2,43): status=-6 cc=ca, Data(0):
    get_sel(2c) rv=-6 cc=0 id=0 next=0
    sel recid 2c error, rv = -6
    get_event ret = -6
    get_event error: ret = 0xfffffffa
    sync: recid=2c time=61fb96f3
    ipmiutil getevent exiting.
    clear_lock rv = -1

    It looks like the issue is somewhere else, but we have not been able to pinpoint it.
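
For reading the traces above: `get_event error: ret = 0xfffffffa` is the same value as `get_event ret = -6`, printed as unsigned 32-bit hex, and `0xfffffffd` from the earlier run is -3 (LAN_ERR_RECV_FAIL). The values `0xc7` and `0xca` are small positive numbers in the range of IPMI completion codes (the trace shows `cc=ca`). A minimal shell sketch of that decoding follows. The function names `to_signed32` and `cc_text` are made up for illustration, and the two descriptions are taken from the generic completion code table of the IPMI v2.0 specification, not from ipmiutil.

```shell
#!/bin/sh
# Reinterpret a 32-bit value (e.g. 0xfffffffa) as a signed integer.
to_signed32() {
    echo $(( (($1 & 0xffffffff) ^ 0x80000000) - 0x80000000 ))
}

# Text for the two completion codes seen in this thread
# (IPMI v2.0 generic completion codes; other codes are not covered here).
cc_text() {
    case "$1" in
        0xc7|0xC7) echo "Request data length invalid" ;;
        0xca|0xCA) echo "Cannot return number of requested data bytes" ;;
        *)         echo "not covered by this sketch" ;;
    esac
}

to_signed32 0xfffffffa   # prints -6
to_signed32 0xfffffffd   # prints -3
cc_text 0xca
```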

     
    • Andy Cress

      Andy Cress - 2022-02-03

      For this function, it looks like the right approach is fixing the firmware
      on this motherboard.
      The command 'ipmiutil health -x' will show the make and firmware version
      from the get_deviceid function as hex codes, and will show known makes.
      The firmware version is 1.15 as shown above, but the make/manufacturer
      isn't shown by default.
      However, to enter a support request you will probably have to go through
      the server manufacturer.
      The problem is that the firmware doesn't handle more than 2 requests to get
      a given SEL entry in the same session.

      As an alternative in the meantime, you could create a shell script to run
      'ipmiutil sel' in a loop and take action when a new event occurs.
      This would use a fresh session each time.
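
A minimal sketch of a retry helper that such a loop could use, under the assumption (not confirmed in this thread) that `ipmiutil sel` exits with a non-zero status when the read fails. Each invocation is a separate process and so a fresh session. The helper name `retry` and its arguments are illustrative only.

```shell
#!/bin/sh
# usage: retry MAX_TRIES DELAY_SECONDS command [args...]
# Runs the command until it succeeds or MAX_TRIES attempts have failed;
# returns 0 on success, otherwise the exit status of the last attempt.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        rc=$?
        [ "$n" -ge "$max" ] && return "$rc"
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (assumes ipmiutil is installed):
#   retry 3 1 ipmiutil sel
```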

      Andy

       

      Last edit: Andy Cress 2022-04-05
  • Jyothi

    Jyothi - 2022-02-04

    Hi Andy Cress,

    Thank you for your inputs.

    I went through the sample 'ipmiutil_evt' script; it exits if the mode is driverless, and our appliance uses ipmiutil in driverless mode:

    NS9100_35# ipmiutil cmd -k
    IPMI access is ok, driver type = kcs
    Using driverless method
    ipmiutil cmd, completed successfully

    I have uploaded the 'ipmiutil health -x' output file 'health_out' to the thread.

    As an alternative in the meantime, you could create a shell script to run
    'ipmiutil sel' in a loop and take action when a new event occurs.
    This would use a fresh session each time.
    [Jyothi] Okay, but as "ipmiutil sel" returns all the existing logs, there is no way to figure out which ones are the latest, right?
    I have tried the "ipmiutil sel -w" option, but the '/var/log/message' file does not get updated.
    I can try "ipmiutil sel -l5" or so, but that just displays the last 5 records, not the events that are new since the previous read.
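
One way to get only the events that are new since the previous read is to remember the last record ID handled and print only the records listed after it. A minimal sketch, assuming that `ipmiutil sel` prints one record per line starting with a 4-digit hex record ID (as the getevt lines such as `002c 02/03/22 08:48:51 MIN ...` do) and lists records oldest first. `new_since` is a hypothetical helper name.

```shell
#!/bin/sh
# usage: ipmiutil sel | new_since LAST_ID
# Prints the records that follow the record whose ID equals LAST_ID.
# If LAST_ID is empty or not found (e.g. the SEL was cleared),
# every record is printed.
new_since() {
    awk -v last="$1" '
        # Keep only lines that start with a 4-hex-digit record ID.
        /^[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F] / {
            rec[++n] = $0
            if ($1 == last) start = n
        }
        END { for (i = start + 1; i <= n; i++) print rec[i] }
    '
}

# Example (assumes ipmiutil is installed):
#   ipmiutil sel | new_since 002b
```

Matching on position rather than comparing IDs numerically avoids relying on record IDs always increasing.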

     

    Last edit: Jyothi 2022-02-04
  • Andy Cress

    Andy Cress - 2022-02-04

    Hmmm. I'm surprised. This is Intel firmware on the Intel S4600LH motherboard.

    BMC manufacturer = 000157 (Intel), product = 005c (S4600LH)
    BMC version = 1.15.4159 (Boot 1.13), IPMI v2.0
    BIOS Version = SE5C600.86B.01.07.0002.030620132047

    That means this isn't a one-off behavior, so ipmiutil needs to handle it more robustly.
    I guess the best approach then is to modify igetevent.c to initiate a new session for each pass.

     
  • Jyothi

    Jyothi - 2022-02-07

    Okay! Thanks for your input.

    History:
    We recently upgraded the Linux kernel and all packages. After the upgrade, ipmiutil access via the OpenIPMI driver became extremely slow. To speed things up, we experimented with 'driverless' mode, which was faster.

    OLD:
    Linux kernel 4.4.110
    ipmiutil ver 2.79 (driver type = open)

    NEW:
    Linux kernel: 5.4.0-42-generic
    ipmiutil ver 3.15 ( driver type = kcs, Using driverless method)

    If the 'driverless' method does not have multi-user support, using the OpenIPMI driver looks safer.
    But the OpenIPMI driver seems to be extremely slow, and our application APIs get stuck or crash with the latest kernel version.

    Do you happen to have any information on a special configuration or mode that we need to enable in the latest kernel? We could not find much info online.

     
  • Andy Cress

    Andy Cress - 2022-04-05

    Coming back to this, you have probably already handled it, but you have a few alternatives:
    1) Contact the OpenIPMI driver project to resolve the slowness with kernel 5.4.0.
    2) Create a bash script to run the ipmiutil sel via driverless with only one session, which would need to save the last event, and only take action if the last event does not match the saved event (i.e. one or more new events).
    3) In ipmiutil, we could create an option to igetevent.c which would cause it to create a fresh session for each pass, and try again if return code is 0xc7 or 0xca.
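
Alternative 2 could look roughly like the sketch below: each pass runs `ipmiutil sel` once, compares the newest record with the one saved in a state file, and reports it only when the two differ. This assumes that SEL records are printed one per line starting with a 4-digit hex record ID, oldest first. The state file path, the helper names and the `handle_event` hook are placeholders, not part of ipmiutil.

```shell
#!/bin/sh
STATE_FILE=${STATE_FILE:-/var/tmp/sel_last_event}   # placeholder path

# usage: ipmiutil sel | check_new STATE_FILE
# Prints the newest SEL record if it differs from the saved one and
# saves it; returns 1 when nothing changed or no record was found.
check_new() {
    cur=$(grep '^[0-9a-fA-F]\{4\} ' | tail -n 1)
    prev=$(cat "$1" 2>/dev/null || true)
    [ -n "$cur" ] && [ "$cur" != "$prev" ] || return 1
    printf '%s\n' "$cur" > "$1"
    printf '%s\n' "$cur"
}

# Placeholder action; replace with whatever should happen on a new event.
handle_event() {
    echo "new SEL event: $1"
}

# Poll forever, one ipmiutil process (and so one session) per pass.
watch_sel() {
    while :; do
        if ev=$(ipmiutil sel | check_new "$STATE_FILE"); then
            handle_event "$ev"
        fi
        sleep "${1:-5}"
    done
}

# Start with (assumes ipmiutil is installed):
#   watch_sel 5
```

As written, this reports only the newest record, so if several events arrive between two passes the earlier ones are not passed to `handle_event`.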

     
  • Andy Cress

    Andy Cress - 2022-06-17
    • assigned_to: Andy Cress
     
  • Andy Cress

    Andy Cress - 2022-12-24
    • status: accepted --> pending
     
  • Andy Cress

    Andy Cress - 2024-08-20
    • status: pending --> closed
     
