
#56 libOpenIPMI core dumping w/ blade removal/insertion on ATCA

Group: v1.0_(example)
Status: closed-fixed
Labels: Library (36)
Priority: 7
Updated: 2017-08-03
Created: 2010-06-09
Private: No

Using an ATCA shelf with Kontron (AT8050) blades (3 FRUs: board, RTM & RTM/HDD), I regularly get a core dump when I re-insert a blade that was previously removed. This happens even when I use ipmi_ui.

After doing some tracing, here are my observations:
- When the blade is first found, a sensor update handler (atca_sensor_update_handler) is added using the fru (atca_fru_t*) as callback data.
- When the blade is removed, atca_ipmc_removal_handler is called, which frees the frus (atca_fru_t*)
- The entities are never removed, so the sensor update handler is still in the locked_list of the entity
- When the blade is re-inserted, all the sensor update handlers are called, so atca_sensor_update_handler is called with an invalid callback data pointer
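
The sequence above can be reproduced in miniature. The following is only a sketch of the failure mode, not OpenIPMI code: `fru_t`, `entity_t`, `entity_add_handler`, `entity_call_handlers` and `simulate_remove_and_reinsert` are hypothetical stand-ins, and the "free" is modelled with an `alive` flag so that the example stays well-defined when run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT the OpenIPMI types. */
typedef struct fru {
    int alive;              /* 0 models "memory already freed" */
} fru_t;

typedef void (*update_cb)(void *cb_data);

#define MAX_HANDLERS 4

typedef struct entity {
    update_cb handlers[MAX_HANDLERS];
    void     *cb_data[MAX_HANDLERS];
    int       count;
} entity_t;

static int stale_calls;     /* handler invocations that saw a dead FRU */

/* Handler registered with the FRU as callback data. */
static void sensor_update(void *cb_data)
{
    fru_t *fru = cb_data;
    if (!fru->alive)
        stale_calls++;      /* with a real free() this would be a use-after-free */
}

static void entity_add_handler(entity_t *e, update_cb cb, void *cb_data)
{
    assert(e->count < MAX_HANDLERS);
    e->handlers[e->count] = cb;
    e->cb_data[e->count] = cb_data;
    e->count++;
}

/* Called when a sensor is added: every registered handler runs. */
static void entity_call_handlers(entity_t *e)
{
    for (int i = 0; i < e->count; i++)
        e->handlers[i](e->cb_data[i]);
}

/* Blade found, removed, re-inserted; returns the number of stale calls. */
static int simulate_remove_and_reinsert(void)
{
    entity_t ent = { 0 };
    fru_t fru = { 1 };

    stale_calls = 0;
    entity_add_handler(&ent, sensor_update, &fru);  /* blade first found */
    fru.alive = 0;          /* blade removed: FRU "freed", handler left behind */
    entity_call_handlers(&ent);                     /* blade re-inserted */
    return stale_calls;
}
```

Calling `simulate_remove_and_reinsert()` reports one stale invocation, which corresponds to the handler being entered with an invalid callback data pointer.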

Here is a stack trace of ipmi_ui when the core occurs:
(gdb) bt
#0 0xf7f0a9de in _ipmi_entity_set_fru (ent=0x0, fru=0x1) at entity.c:5700
#1 0xf7f72965 in setup_fru_hot_swap (finfo=0x81ad750, sensor=0x828cee0) at oem_atca.c:982
#2 0xf7f755d7 in atca_sensor_update_handler (op=IPMI_ADDED, entity=0x819cdd0, sensor=0x828cee0, cb_data=0x81ad750) at oem_atca.c:2774
#3 0xf7f0566c in ipmi_entity_remove_sensor_update_handler_cl (ent=0xffffd180, handler=0xf7f7550e <atca_find_mc_fru_info+155>, cb_data=0x81ad750) at entity.c:2759
#4 0xf7fc105b in locked_list_iterate_prefunc_nolock (ll=0x819d4d0, prefunc=0, handler=0xf7f05641 <ipmi_entity_remove_sensor_update_handler_cl+38>, cb_data=0xffffd180) at locked_list.c:330
#5 0xf7fc1176 in locked_list_iterate (ll=0x819d4d0, handler=0xf7f05641 <ipmi_entity_remove_sensor_update_handler_cl+38>, cb_data=0xffffd180) at locked_list.c:370
#6 0xf7f0570c in _ipmi_entity_call_sensor_handlers (ent=0x819cdd0, sensor=0x828cee0, op=IPMI_ADDED) at entity.c:2792
#7 0xf7f39412 in _ipmi_sensor_put (sensor=0x828cee0) at sensor.c:252
#8 0xf7f3df47 in ipmi_sensor_handle_sdrs (domain=0x817ed78, source_mc=0x81a6a60, sdrs=0x81a7840) at sensor.c:2019
#9 0xf7f210e7 in sdrs_fetched_mc_cb (mc=0x81a6a60, cb_data=0x8265f50) at mc.c:2995
#10 0xf7f207c0 in mc_ptr_cb (domain=0x817ed78, cb_data=0xffffd320) at mc.c:2584
#11 0xf7f18281 in ipmi_domain_pointer_cb (id={domain = 0x817ed78}, handler=0xf7f206d6 <ipmi_mc_convert_to_id+77>, cb_data=0xffffd320) at domain.c:4024
#12 0xf7f2082b in ipmi_mc_pointer_cb (id={domain_id = {domain = 0x817ed78}, mc_num = 140 '\214', channel = 0 '\0', seq = 46}, handler=0xf7f2102d <sdr_reread_done+21>, cb_data=0x8265f50) at mc.c:2602
#13 0xf7f21177 in sdrs_fetched (sdrs=0x81a7840, err=0, changed=1, count=138, cb_data=0x8265f50) at mc.c:3017
#14 0xf7f2815e in handle_fetch_done (cb_data=0x828edb8, shutdown=0) at sdr.c:1841
#15 0xf7f34c27 in opq_op_done (opq=0x82654c8) at opq.c:320
#16 0xf7f26154 in fetch_complete (sdrs=0x81a7840, err=0) at sdr.c:651
#17 0xf7f26378 in handle_reservation_check (mc=0x81a6a60, rsp=0x8231c4c, rsp_data=0x81a7840) at sdr.c:742
#18 0xf7f20a63 in addr_rsp_handler (domain=0x817ed78, rspi=0x8231c20) at mc.c:2696
#19 0xf7f11e73 in get_con_num (domain=0x817ed78, ipmi=0xf7f209e9) at domain.c:425
#20 0xf7f14f9f in ll_rsp_handler (ipmi=0x817b0a0, orspi=0x82f6790) at domain.c:2059
#21 0xf7f11c29 in ipmi_handle_rsp_item_copymsg (ipmi=0x817b0a0, rspi=0x82f6790, msg=0xf7f14e58, rsp_handler=0xffffd557) at ipmi.c:1764
#22 0xf7f7e33d in handle_payload (ipmi=0x817b0a0, lan=0x817b168, addr_num=0, payload_type=0, tmsg=0xffffd62e "\201\024k ▒!", payload_len=11) at ipmi_lan.c:3016
#23 0xf7f7f0cd in handle_lan15_recv (ipmi=0x817b0a0, lan=0x817b168, addr_num=0, data=0xffffd620 "\006", len=25) at ipmi_lan.c:3330
#24 0xf7f7f7cf in data_handler (fd=3, cb_data=0x817eac8, id=0x817eb90) at ipmi_lan.c:3530
#25 0xf7fdf1b0 in fd_handler (fd=3, data=0x817eb90) at ui_os.c:72
#26 0xf7edb40c in process_fds (sel=0x804bfa8, send_sig=0, thread_id=0, cb_data=0x0, timeout=0xffffd928) at selector.c:636
#27 0xf7edbb0f in sel_select_loop (sel=0x804bfa8, send_sig=0, thread_id=0, cb_data=0x0) at selector.c:771
#28 0x080491cb in main (argc=8, argv=0xffffda64) at basic_ui.c:320

Discussion

  • Stephane Blain

    Stephane Blain - 2010-06-09

    Since this is an important problem for us, I am able to diagnose, debug, trace, and fix the problem. But because I don't know all the internals of OpenIPMI, I would appreciate help, because I would like this bug to be fixed in the official release.

     
  • Stephane Blain

    Stephane Blain - 2010-06-09
    • priority: 5 --> 7
     
  • Corey Minyard

    Corey Minyard - 2010-06-09

    I'd be happy to take a patch and give you credit for it. If I understand you correctly, you have a fix for this. If you could send a diff of the fix, I can look at it. Internals can be subtle, so it may not be right, but if not it will probably point me in the right direction.

     
  • Stephane Blain

    Stephane Blain - 2010-06-09

    Thanks.

    Yes, I made a really quick and really dirty fix, but it is probably not good.

    Before I go further, I have a question on the OpenIPMI internals: when a blade is removed, the MC is removed. In that case, should every entity on that blade also be removed? In the traces that I have done, that is not the case: only one entity gets removed (an entity for which I didn't register any callback).

     
  • Corey Minyard

    Corey Minyard - 2010-06-09

    MCs and entities are orthogonal for the most part. An entity is created by an SDR entry in some MC someplace (perhaps in the shelf manager in this case). Entities can also be created automatically, like slot entities in an ATCA system. Entities are not "on" MCs, an entity is a thing in the system, the instruments that measure the things reside on MCs.

     
  • Stephane Blain

    Stephane Blain - 2010-06-10

    The file OpenIPMI-3013755.patch is my proposed patch.

    Here are the reasons:
    - When an entity is created, the atca_sensor_update_handler is added to it using an atca_fru_t* as callback data
    - When the MC is removed, all the atca_fru_t* related to it are freed, leaving the entity with an invalid pointer

     
  • Stephane Blain

    Stephane Blain - 2010-06-10

    First version of proposed patch

     
  • Corey Minyard

    Corey Minyard - 2017-08-03
    • status: open --> closed-fixed
    • assigned_to: Corey Minyard
    • Group: --> v1.0_(example)
     
  • Corey Minyard

    Corey Minyard - 2017-08-03

    I know this is old and it seems I dropped the ball here, but I have committed a proper fix for this. The proposed fix is not correct; it would leave incorrect information lying around. I changed it to not pass finfo into the callback functions, instead getting it from the entity OEM info and re-creating it in the presence handling if it was deleted.
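
The general shape of that approach can be sketched as follows. This is an assumption-laden illustration and not the committed change: `fru_info_t`, `entity_t`, `mc_removed`, `presence_changed`, `sensor_update` and `simulate_cycle` are all hypothetical names, and the real code works through the OpenIPMI entity interfaces rather than a bare struct field.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are NOT the OpenIPMI types. */
typedef struct fru_info {
    int hot_swap_ready;
} fru_info_t;

typedef struct entity {
    fru_info_t *oem_info;   /* per-entity data, owned by the entity */
} entity_t;

/* MC removal: free the per-FRU data and clear the entity's pointer to it. */
static void mc_removed(entity_t *ent)
{
    free(ent->oem_info);
    ent->oem_info = NULL;
}

/* Presence handling: re-create the per-FRU data if it was deleted. */
static int presence_changed(entity_t *ent, int present)
{
    if (present && !ent->oem_info) {
        ent->oem_info = calloc(1, sizeof(*ent->oem_info));
        if (!ent->oem_info)
            return -1;
    }
    return 0;
}

/* Sensor update: no cb_data; the FRU info is looked up from the entity. */
static int sensor_update(entity_t *ent)
{
    fru_info_t *finfo = ent->oem_info;
    if (!finfo)
        return -1;          /* nothing to update, and nothing stale to touch */
    finfo->hot_swap_ready = 1;
    return 0;
}

/* Insert, remove, re-insert; returns 0 if every step behaved as expected. */
static int simulate_cycle(void)
{
    entity_t ent = { NULL };
    int ok;

    ok = presence_changed(&ent, 1) == 0 && sensor_update(&ent) == 0;
    mc_removed(&ent);
    ok = ok && sensor_update(&ent) == -1;       /* safely rejected */
    ok = ok && presence_changed(&ent, 1) == 0;  /* re-created */
    ok = ok && sensor_update(&ent) == 0;
    mc_removed(&ent);
    return ok ? 0 : 1;
}
```

Because the handler no longer holds its own copy of the pointer, there is only one place where the per-FRU data can go stale, and that place is cleared on removal.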

     
