#1960 openhpid process is consuming more than 10G

3.8.0
closed-fixed
Shyamala
HP c-Class Plugin
5
2017-09-15
2017-06-13
Sathish
No

Hi,

The openhpid process running on my machine keeps accumulating memory (more than 10G). Along with the memory growth, the following log messages are observed, and the process never comes out of the re-discovery loop.

Jun 8 00:23:10 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55
Jun 8 00:24:02 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55

The configuration file contains the following:
OPENHPI_LOG_ON_SEV = "DEBUG"
OPENHPI_EVT_QUEUE_LIMIT = 10000
OPENHPI_DEL_SIZE_LIMIT = 10000
OPENHPI_DEL_SAVE = "NO"
OPENHPI_DAT_SIZE_LIMIT = 0
OPENHPI_DAT_USER_LIMIT = 0
OPENHPI_DAT_SAVE = "NO"
OPENHPI_PATH = "openhpi path"
OPENHPI_VARPATH = "lib_path"

OPENHPI_AUTOINSERT_TIMEOUT = 0
OPENHPI_AUTOINSERT_TIMEOUT_READONLY = "NO"

Any pointers/help would be great.

Thanks,
Sathish

Discussion

1 2 > >> (Page 1 of 2)
  • dr_mohan

    dr_mohan - 2017-06-13
    • labels: memory --> HP c-Class Plugin
    • Subsystem: OpenHPI Daemon --> HP c-Class Plugin
    • 3.7.0: Future --> 3.8.0
     
  • dr_mohan

    dr_mohan - 2017-06-13

    Thank you for filing this bug, Sathish. We need to fix this memory leak. It looks like there is some problem with the blade. More information would help:
    a. Which version of OpenHPI are you using?
    b. What is the version of the OA firmware?
    c. If you know more details about the blade, please let us know. It looks like it has some unique problem that we are not handling well.

     
  • Sathish

    Sathish - 2017-06-14

    Dr Mohan,

    1. OpenHPI version we are using is 3.6.1
    2. Pasting the output from ./hpiinv
      SAF HPI Version B.03.02 (Is this the one?)
      Some more info:
      FieldId[1] Product Name : TEXT : BladeSystem c7000 Enclosure G3
      FieldId[2] Manufacturer : TEXT : HP
      FieldId[3] Product Versn : TEXT : 1.00
      FieldId[4] Custom Field : TEXT : Telco_Status: TRUE
      FieldId[5] Custom Field : TEXT : Power_Type: EXTERNAL_DC
      FieldId[6] Custom Field : TEXT : ENCLOSURE STATUS: OK

    3. Along with the memory growth, we have seen errors like:
      a. Management processor status flipping (Screen shot attached for reference) - on one chassis
      b. Power related issue on another chassis, as in the below logs


    Jun 8 10:22:13 cz2 openhpid: oa_soap: oa_soap_discover.c:1616: Unknown Power State 1 detected for Blade at bay 5
    Jun 8 10:22:13 cz2 openhpid: oa_soap: oa_soap_discover.c:2325: Failed to get Server rpt for bay 5.
    Jun 8 10:22:13 cz2 openhpid: oa_soap: oa_soap_discover.c:453: Failed to discover Server Blade
    Jun 8 10:22:13 cz2 openhpid: oa_soap: oa_soap_discover.c:381: Discovery failed for active OA 10.193.69.5:443
    Jun 8 10:25:34 cz2 openhpid: oa_soap: oa_soap_utils.c:753: OA 10.193.69.5 has become Active
    Jun 8 10:26:29 cz2 openhpid: oa_soap: oa_soap_discover.c:1616: Unknown Power State 1 detected for Blade at bay 5
    Jun 8 10:26:29 cz2 openhpid: oa_soap: oa_soap_discover.c:2325: Failed to get Server rpt for bay 5.


    4. Could it be the same issue as https://sourceforge.net/p/openhpi/bugs/1854/ ?

    Just thinking aloud:
    Do we delete the resource ID from the RPTable when the management processor is unresponsive? If so, the infinite re-discovery loop might have been entered at that point, which would explain the log repeating in a loop.

    "Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL"

    Shouldn't the loop be exited when the rpt is NULL?

    Thanks,
    Sathish
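
    To make the question above concrete, here is a minimal sketch of a re-discovery retry loop that gives up after a fixed number of attempts and releases its scratch memory on every exit path. It is a toy model, not the oa_soap plugin code: toy_rediscover_once(), toy_rediscover_with_limit(), and the live_allocs counter are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Outstanding allocations, tracked only so the demo can be checked. */
static int live_allocs = 0;

/* Hypothetical single re-discovery pass. It fails while *failures_left > 0,
 * standing in for a pass that hits "resource RPT is NULL". */
static int toy_rediscover_once(int *failures_left)
{
    char *scratch = malloc(64);
    if (scratch == NULL)
        return -1;
    live_allocs++;

    int rv = 0;
    if (*failures_left > 0) {
        (*failures_left)--;
        rv = -1;
    }

    /* Single exit path: scratch is released on success and on failure. */
    free(scratch);
    live_allocs--;
    return rv;
}

/* Retry up to max_attempts times. Returns the number of attempts used,
 * or -1 if every attempt failed and the loop gave up. */
int toy_rediscover_with_limit(int failures_left, int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (toy_rediscover_once(&failures_left) == 0)
            return attempt;
    }
    return -1;
}
```

    With this shape, a blade that never recovers costs a bounded number of passes, and a failed pass leaves no allocation behind. Whether the real plugin can safely stop retrying is a separate design question.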

     
  • Shyamala

    Shyamala - 2017-06-15

    Hi Sathish,

    Thanks for the information provided.

    From the logs, it looks like there is some problem with that particular blade. I need some more information regarding the server blade.
    1) Which model of server blade is placed at bay 5, and what is the iLO firmware version?
    2) Did you do any iLO firmware update on that server blade recently?
    3) Since when, or after what specific event, has this problem been seen?

    I suggest restarting openhpid and checking whether the problem persists.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-15

    Shyamala,

    1. Copying the info from ./hpiinv
      FieldId[1] Product Name : TEXT : ProLiant BL460c Gen8
      FieldId[2] Manufacturer : TEXT : HP
      FieldId[3] Product Versn : TEXT : 2.40 Dec 02 2015
      FieldId[4] Custom Field : TEXT : MP Model name = iLO4

    2. The last iLO firmware update was on Dec 02, 2015.

    3. We have been seeing this issue since April.

    As a temporary recovery mechanism, we restart the openhpid process if its memory usage crosses a limit. But the problem keeps recurring.

    Also, can you help me compile the code? I am getting the following error:

    make[4]: Leaving directory '/home/platform/openhpi-3.6.1/src'
    /bin/bash: LIBTOOL@: command not found
    Makefile:618: recipe for target 'sahpi_enum_utils.lo' failed
    make[3]: *** [sahpi_enum_utils.lo] Error 127
    make[3]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils'
    Makefile:658: recipe for target 'all-recursive' failed
    make[2]: *** [all-recursive] Error 1
    make[2]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils'
    Makefile:603: recipe for target 'all-recursive' failed
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory '/home/platform/openhpi-3.6.1/src'
    Makefile:496: recipe for target 'all' failed
    make: *** [all] Error 2

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-06-16

    Hi Sathish,

    It looks like the problem is with that particular server blade. To pinpoint the exact cause, please follow the steps below and share the results.
    1) Remove the server blade and run openhpid. See the results.
    2) After some time, insert the server blade back. Confirm that the blade is powered on, then check whether the problem still persists.

    As for the compilation error, it looks like libtool is not installed. Install libtool and then try compiling the code again. Which OS are you using?

    Regards,
    Shyamala

     
  • Shyamala

    Shyamala - 2017-06-16

    Sathish,

    What is the version of the OA firmware? If possible, please attach the hpiinv output.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-16

    Shyamala,

    It will take some time to provide the results of pulling out the blade and restarting openhpid, for various reasons.

    The hpiinv output contains a lot of private information. Kindly share your email ID so that I can send it privately.

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-06-16

    Sure.

    My e-mail id is shyamala.hirepatt@hpe.com.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-20

    Shyamala,

    Did you get a chance to go through the hpiinv output? Did you find anything interesting?

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-06-20

    Hi Sathish,

    I could not get much information from the hpiinv log. For the repeated re-discovery issue, you will have to follow the steps mentioned above and share the results so that we can figure out the exact cause.

    For the memory leak issue, I have created and attached a patch. Please apply the patch, test it, and let me know the results.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-20

    SaErrorT oh_add_resource(RPTable *table, SaHpiRptEntryT *entry, void *data, int owndata)
    {
        ..
        if (rptentry->data && rptentry->data != data && !rptentry->owndata)
            g_free(rptentry->data);
        rptentry->data = data;
        ..
    }

    rv = oh_add_resource(oh_handler->rptcache, &rpt, hotswap_state, 0);

    In oh_add_resource(), the hotswap_state pointer is stored in the RPTable as the resource data. Won't freeing it after a successful add cause issues, since the table still points to it?
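
    The ownership concern can be illustrated with a small toy model. This is only a sketch built from the condition in the excerpt above (the table frees the stored data when owndata is 0); struct toy_entry, toy_add_resource(), toy_remove_resource(), toy_discover(), and the toy_frees counter are hypothetical stand-ins, not OpenHPI APIs, and the removal behavior is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an RPT entry; not the real OpenHPI structure. */
struct toy_entry {
    void *data;
    int owndata;   /* 0: the table frees data; nonzero: the caller keeps it */
};

static int toy_frees = 0;   /* number of frees, tracked for the demo */

static void toy_free(void *p)
{
    if (p != NULL) {
        free(p);
        toy_frees++;
    }
}

/* Mirrors the condition quoted from oh_add_resource(): previously stored
 * data is released only when the table owns it (owndata == 0). */
int toy_add_resource(struct toy_entry *e, void *data, int owndata)
{
    if (e == NULL)
        return -1;
    if (e->data && e->data != data && !e->owndata)
        toy_free(e->data);
    e->data = data;
    e->owndata = owndata;
    return 0;
}

/* Assumed removal behavior: the table frees data it owns. */
void toy_remove_resource(struct toy_entry *e)
{
    assert(e != NULL);
    if (e->data && !e->owndata)
        toy_free(e->data);
    e->data = NULL;
}

/* Caller-side pattern: free the state only when the add fails. */
int toy_discover(struct toy_entry *e)
{
    int *state = malloc(sizeof *state);
    if (state == NULL)
        return -1;
    *state = 1;
    if (toy_add_resource(e, state, 0) != 0) {
        toy_free(state);   /* add failed: the pointer is still ours */
        return -1;
    }
    return 0;              /* success: the table owns state now */
}
```

    Under these assumptions, freeing state after a successful add would leave e->data dangling and lead to a double free on the next add or removal, while skipping the free on the failure path would leak.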

     
  • Shyamala

    Shyamala - 2017-06-20

    From the logs provided, it looks like the issue is caused by memory not being freed in discover_server() in some error-handling cases. While going through the code, I found more places where memory was not freed on success. They are not exactly the cause of this issue, but I covered them too.

    Did you get a chance to apply the patch and see the results?

     
  • Sathish

    Sathish - 2017-06-21

    Shyamala,

    First of all, thank you for your support.

    I have tested the patch multiple times to be sure. Every time there is an OA switchover, the memory usage of openhpid increases by approximately 3 MB. The following logs are printed during the switchover.

    Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:370: could not find enum value "EVENT_NETWORK_LINK_STATUS" in "EVENT_HEARTBEAT, EVENT_ENC_STATUS, EVENT_ENC_UID, EVENT_ENC_SHUTDOWN, EVENT_ENC_INFO, EVENT_ENC_NAMES, EVENT_USER_PERMISSION, EVENT_ADMIN_RIGHTS_CHANGED, EVENT_ENC_SHUTDOWN_PENDING, EVENT_ENC_TOPOLOGY, EVENT_FAN_STATUS, EVENT_FAN_INSERTED, EVENT_FAN_REMOVED, EVENT_FAN_GROUP_STATUS, EVENT_THERMAL_STATUS, EVENT_COOLING_STATUS, EVENT_FAN_ZONE_STATUS, EVENT_PS_STATUS, EVENT_PS_INSERTED, EVENT_PS_REMOVED, EVENT_PS_REDUNDANT, EVENT_PS_OVERLOAD, EVENT_AC_FAILURE, EVENT_PS_INFO, EVENT_PS_SUBSYSTEM_STATUS, EVENT_SERVER_POWER_REDUCTION_STATUS, EVENT_INTERCONNECT_STATUS, EVENT_INTERCONNECT_RESET, EVENT_INTERCONNECT_UID, EVENT_INTERCONNECT_INSERTED, EVENT_INTERCONNECT_REMOVED, EVENT_INTERCONNECT_INFO, EVENT_INTERCONNECT_HEALTH_LED, EVENT_INTERCONNECT_THERMAL, EVENT_INTERCONNECT_CPUFAULT, EVENT_INTERCONNECT_POWER, EVENT_INTERCONNECT_PORTMAP, EVENT_BLADE_PORTMAP, EVENT_INTERCONNECT_VENDOR_BLOCK, EVENT_INTERCONNECT_HEALTH_STATE, EVENT_DEMO_MODE, EVENT_BLADE_STATUS, EVENT_BLADE_INSERTED, EVENT_BLADE_REMOVED, EVENT_BLADE_POWER_STATE, EVENT_BLADE_POWER_MGMT, EVENT_BLADE_UID, EVENT_BLADE_SHUTDOWN, EVENT_BLADE_FAULT, EVENT_BLADE_THERMAL, EVENT_BLADE_INFO, EVENT_BLADE_MP_INFO, EVENT_ILO_READY, EVENT_LCD_BUTTON, EVENT_KEYING_ERROR, EVENT_ILO_HAS_IPADDRESS, EVENT_POWER_INFO, EVENT_LCD_STATUS, EVENT_LCD_INFO, EVENT_REDUNDANCY, EVENT_ILO_DEAD, EVENT_RACK_SERVICE_STARTED, EVENT_LCD_SCREEN_REFRESH, EVENT_ILO_ALIVE, EVENT_PERSONALITY_CHECK, EVENT_BLADE_POST_COMPLETE, EVENT_BLADE_SIGNATURE_CHANGED, EVENT_BLADE_PERSONALITY_CHANGED, EVENT_BLADE_TOO_LOW_POWER, EVENT_VIRTUAL_MEDIA_STATUS, EVENT_MEDIA_DRIVE_INSERTED, EVENT_MEDIA_DRIVE_REMOVED, EVENT_MEDIA_INSERTED, EVENT_MEDIA_REMOVED, EVENT_OA_NAMES, EVENT_OA_STATUS, EVENT_OA_UID, EVENT_OA_INSERTED, EVENT_OA_REMOVED, EVENT_OA_INFO, EVENT_OA_FAILOVER, EVENT_OA_TRANSITION_COMPLETE, EVENT_OA_VCM, EVENT_NETWORK_INFO_CHANGED, 
EVENT_SNMP_INFO_CHANGED, EVENT_SYSLOG_CLEARED, EVENT_
    Jun 21 01:45:22 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
    Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
    Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
    Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:771: No bytes read
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:708: oh_ssl_read() part 1 failed
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
    Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:549: SSL connection failed
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
    Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
    Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
    Jun 21 01:45:26 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
    Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
    Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
    Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
    Jun 21 01:45:33 cz1 openhpid: oa_soap: oa_soap_oa_event.c:187: OA switching started
    Jun 21 01:47:35 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection timed out
    Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
    Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
    Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
    Jun 21 01:48:21 cz1 openhpid: oa_soap: oa_soap_event.c:491: OA <ip_address> is accessible
    Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
    Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed
    Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_oa_event.c:339: OA switching completed

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-06-21

    Hi Sathish,

    From the logs, this looks like a connection problem with the OA. Is the OA accessible? If you are able to ping the OA IP, try running openhpid without my patch too. The patch does not actually change anything related to OA switchover.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-21

    Shyamala,

    The OA is accessible. OA calls fail during switchover, which as per my understanding is expected. My point is that the memory used by openhpid is still growing.

    At the end of every switchover we see these logs:
    Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
    Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-06-21

    Hi Sathish,

    The error messages seen now are completely different from the earlier ones; as you mentioned, a switchover is now happening every time. The patch I gave was based on the earlier error messages. I need some more information.
    1) Why is OA switchover happening every time now? Were there any changes in the setup or connection?
    2) Are you able to see the messages related to the server blade RPT (such as "Failed to get Server rpt" or the power state message)? If yes, is there any difference in memory consumption with the patch?

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-21

    The reason we are not seeing the same errors is that the logs I initially shared with you were from production, and I am not allowed to test patches in production. I am applying your code changes on my local machine and checking whether the memory leak is gone.

    1. Switching over the OA is my test case, to check that no memory leak happens during switchover. OA switchover is one of the cases where re-discovery happens, if I am not mistaken.

    I am observing an increase in the memory usage of openhpid whenever I force an OA switchover.

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-06-21

    Ok got it. Yes you are right, re-discovery happens when there is OA Switchover. I will look into this further and update you.

    Regards,
    Shyamala

     
  • Shyamala

    Shyamala - 2017-06-29

    Hi Sathish,

    We are still working on the issue. Will get back to you with updates soon.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-06-30

    Thank you Shyamala, we are waiting for the fix.
    We are observing this issue in all the openhpid processes whenever there is a switchover.

    Thanks,
    Sathish

     
  • Shyamala

    Shyamala - 2017-07-25

    Hi Sathish,

    Sorry for the delay in providing the patch.
    Please find attached the patch for the memory leak issue. Please apply the patch, test it on your machine, and let me know your observations.

    Regards,
    Shyamala

     
  • Sathish

    Sathish - 2017-07-26

    Hi Shyamala,

    Thank you for the patch.

    These are the steps I followed to test the patch.
    1. svn checkout http://svn.code.sf.net/p/openhpi/code/openhpi/trunk openhpid-test
    2. cd openhpid-test && ./bootstrap
    3. cd plugins && patch -p1 -i 1960.patch (the patch applied without any errors)
    4. ./configure --enable-oa_soap --enable-encryption --enable-watchdog && make
    5. Used the binary in openhpid-test/openhpid/.libs and replaced it on the destination machine. Confirmed with md5sums
    6. Restarted the openhpid
    7. Switched over OA multiple times.

    However, the memory increase is still observed, and it is in the same range: about 3 MB per OA switchover.
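
    One way to put a number on the growth per switchover is to sample the resident set size of openhpid before and after each switchover. The sketch below assumes a Linux host, where /proc/<pid>/status exposes a VmRSS line; parse_vm_rss_kb() and read_vm_rss_kb() are hypothetical helper names, not part of OpenHPI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.
 * Returns -1 if no VmRSS line is present or it cannot be parsed. */
long parse_vm_rss_kb(const char *status_text)
{
    assert(status_text != NULL);
    const char *line = status_text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            long kb;
            if (sscanf(line + 6, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Read the current RSS of a process in kB (Linux only).
 * Returns -1 if the status file cannot be read. */
long read_vm_rss_kb(int pid)
{
    char path[64];
    char buf[8192];

    snprintf(path, sizeof path, "/proc/%d/status", pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return parse_vm_rss_kb(buf);
}
```

    Calling read_vm_rss_kb() with the openhpid PID before and after a forced switchover and subtracting the two values gives the per-switchover growth in kB. RSS also moves for reasons other than leaks, so a steady increase across many switchovers is more telling than a single delta.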

    Thanks,
    Sathish

     

    Last edit: Sathish 2017-07-26
  • Anirban Das

    Anirban Das - 2017-07-31

    Hi Shyamala,

    With 1960.patch applied, we ran openhpid under valgrind.

    Command used: valgrind --leak-check=full --show-reachable=yes --error-limit=no --undef-value-errors=no --log-file=/tmp/memcheck.log /opt/bbee/thirdparty/sbin/openhpid -v -d -c /etc/openhpi.conf -p 4743 -f /opt/bbee-5.0.0-5/thirdparty/var/run/openhpid.pid.4743

    The valgrind logs and process trace (verbose mode) logs have already been shared. Each OA switchover still increases memory usage by about 3 MB, even with this patch.

    Thanks and Regards,

     
  • Shyamala

    Shyamala - 2017-09-15
    • status: open --> closed-fixed
    • assigned_to: Shyamala
     