Open HPI / Bugs / #1960 openhpid process is consuming more than 10G

dr_mohan - 2017-06-13

labels: memory --> HP c-Class Plugin

Subsystem: OpenHPI Daemon --> HP c-Class Plugin

3.7.0: Future --> 3.8.0

dr_mohan - 2017-06-13

Thank you for filing this bug, Sathish. We need to fix this memory leak. It looks like there is some problem with the blade. More information on the bug would help.
a. Which version of OpenHPI are you using?
b. What is the version of the OA firmware?
c. If you know more details about the blade, please let us know. It looks like it has some unique problem that we are not handling well.

Sathish - 2017-06-14

Dr Mohan,

OpenHPI version we are using is 3.6.1

Pasting the output from ./hpiinv
SAF HPI Version B.03.02 (Is this the one?)
Some more info:
FieldId[1] Product Name : TEXT : BladeSystem c7000 Enclosure G3
FieldId[2] Manufacturer : TEXT : HP
FieldId[3] Product Versn : TEXT : 1.00
FieldId[4] Custom Field : TEXT : Telco_Status: TRUE
FieldId[5] Custom Field : TEXT : Power_Type: EXTERNAL_DC
FieldId[6] Custom Field : TEXT : ENCLOSURE STATUS: OK

Along with the memory growth, we have seen errors such as:
a. Management processor status flipping (Screen shot attached for reference) - on one chassis
b. Power related issue on another chassis, as in the below logs

Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5.
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:453: Failed to discover Server Blade
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:381: Discovery failed for active OA 10.193.69.5:443
Jun 8 10:25:34 cz2 openhpid: oasoap: oasoaputils.c:753: OA 10.193.69.5 has become Active
Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5
Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5.

Could it be the same as https://sourceforge.net/p/openhpi/bugs/1854/ ?

Just thinking aloud:
Do we delete the resource ID from the RPTable when the management processor is unresponsive? If so, re-discovery might have entered an infinite loop at that point, which would explain this message being logged in a loop:

"Jun 8 00:23:57 cz2 openhpid: oasoap: oasoaprediscover.c:923: resource RPT is NULL"

Shouldn't the loop be exited when the rpt is NULL?
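
To make the question concrete, here is a minimal, self-contained sketch of a bay-scanning loop that logs a missing entry once and moves on instead of retrying the same bay. Every name in it (`toy_rpt_entry`, `toy_lookup`, `scan_bays`) is hypothetical and not taken from the oa_soap plugin. Whether the real code around oasoaprediscover.c:923 should skip, break, or return an error depends on logic that is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an RPT entry; not the SaHpiRptEntryT type. */
struct toy_rpt_entry {
    int bay;
    int powered_on;
};

/* Hypothetical lookup: returns NULL when no entry exists for the bay. */
static struct toy_rpt_entry *toy_lookup(struct toy_rpt_entry *table,
                                        size_t count, int bay)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (table[i].bay == bay)
            return &table[i];
    }
    return NULL;
}

/*
 * Visit bays 1..max_bay exactly once each. A missing entry is logged and
 * skipped, so the loop always terminates. Returns the number of bays
 * that had no entry.
 */
static int scan_bays(struct toy_rpt_entry *table, size_t count, int max_bay)
{
    int missing = 0;

    for (int bay = 1; bay <= max_bay; bay++) {
        struct toy_rpt_entry *rpt = toy_lookup(table, count, bay);
        if (rpt == NULL) {
            fprintf(stderr, "resource RPT is NULL for bay %d, skipping\n", bay);
            missing++;
            continue;   /* do not retry the same bay */
        }
        /* ... normal per-bay re-discovery work would go here ... */
    }
    return missing;
}
```

The point of the sketch is only that the loop variable advances on the NULL path too, so a permanently missing entry cannot keep the loop spinning.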

Thanks,
Sathish

vista_oa_error1.JPG

Shyamala - 2017-06-15

Hi Sathish,

Thanks for the information provided.

From the logs, it looks like there is some problem with that particular blade. I need some more information about the server blade.
1) Which model of server blade is placed at bay 5, and what is the iLO firmware version?
2) Did you do any iLO firmware update on that server blade recently?
3) Since when, or after what specific event, has this problem been seen?

I suggest restarting openhpid to see whether the problem persists.

Regards,
Shyamala

Sathish - 2017-06-15

Shyamala,

Copying the info from ./hpiinv
FieldId[1] Product Name : TEXT : ProLiant BL460c Gen8
FieldId[2] Manufacturer : TEXT : HP
FieldId[3] Product Versn : TEXT : 2.40 Dec 02 2015
FieldId[4] Custom Field : TEXT : MP Model name = iLO4

The last iLO firmware update was on Dec 02, 2015.

We are seeing this issue from April.

As a temporary recovery mechanism, we restart the openhpid process when its memory usage crosses a limit. But the problem keeps recurring.
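
For anyone wanting a similar stopgap, the following is a minimal sketch of the check such a watchdog might perform on Linux: read `VmRSS` from `/proc/<pid>/status` and compare it with a limit. The function names and the limit are assumptions for illustration. The thread does not say how the actual restart mechanism was implemented or what limit it used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.
 * Returns -1 if no VmRSS line is found or it cannot be parsed.
 */
static long parse_vmrss_kb(const char *status_text)
{
    const char *p = status_text;

    assert(status_text != NULL);
    while (p != NULL && *p != '\0') {
        if (strncmp(p, "VmRSS:", 6) == 0) {
            long kb;
            if (sscanf(p + 6, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        p = strchr(p, '\n');
        if (p != NULL)
            p++;    /* step past the newline to the next line */
    }
    return -1;
}

/* Read /proc/<pid>/status and return the process RSS in kB, or -1. */
static long read_vmrss_kb(int pid)
{
    char path[64];
    char buf[8192];
    size_t n;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_vmrss_kb(buf);
}

/* Nonzero when the RSS is known and above the limit. */
static int rss_over_limit(long rss_kb, long limit_kb)
{
    return rss_kb >= 0 && rss_kb > limit_kb;
}
```

A watchdog would call `read_vmrss_kb()` with the openhpid PID at intervals and trigger the restart when `rss_over_limit()` returns nonzero. This only contains the symptom; it does not address the leak itself.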

Also, can you help me compile the code? I am getting the following error:

make[4]: Leaving directory '/home/platform/openhpi-3.6.1/src'
/bin/bash: LIBTOOL@: command not found
Makefile:618: recipe for target 'sahpi_enum_utils.lo' failed
make[3]: [sahpi_enum_utils.lo] Error 127
make[3]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils'
Makefile:658: recipe for target 'all-recursive' failed
make[2]: [all-recursive] Error 1
make[2]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils'
Makefile:603: recipe for target 'all-recursive' failed
make[1]: [all-recursive] Error 1
make[1]: Leaving directory '/home/platform/openhpi-3.6.1/src'
Makefile:496: recipe for target 'all' failed
make: [all] Error 2

Thanks,
Sathish

Shyamala - 2017-06-16

Hi Sathish,

It looks like the problem is with that particular server blade. To pinpoint the exact cause, please follow the steps below and share the results.
1) Remove the server blade and run openhpid. See the results.
2) After some time, insert the server blade back. Confirm the blade is powered on, then check whether the problem still persists.

As for the compilation error, it looks like libtool is not installed. Install libtool and then try compiling the code again. Which OS are you using?

Regards,
Shyamala

Shyamala - 2017-06-16

Sathish,

What is the version of the OA firmware? If possible, please attach the hpiinv output.

Regards,
Shyamala

Sathish - 2017-06-16

Shyamala,

For various reasons, it will take some time to pull out the blade, restart openhpid, and provide that information.

The hpiinv output contains a lot of private information. Kindly share your email ID so that I can send it privately.

Thanks,
Sathish

Shyamala - 2017-06-16

Sure.

My e-mail id is shyamala.hirepatt@hpe.com.

Regards,
Shyamala

Sathish - 2017-06-20

Shyamala,

Did you get a chance to go through the hpiinv output? Did you find anything interesting?

Thanks,
Sathish

Shyamala - 2017-06-20

Hi Sathish,

I could not get much information from the hpiinv log. For the repeated re-discovery issue, you will have to follow the steps mentioned above and share the results so we can figure out the exact cause.

For the memory leak issue, I have created and attached a patch. Please apply the patch, test it, and let me know the results.

Regards,
Shyamala

1960.patch

Sathish - 2017-06-20

SaErrorT oh_add_resource(RPTable *table, SaHpiRptEntryT *entry, void *data, int owndata)
{
        ..
        if (rptentry->data && rptentry->data != data && !rptentry->owndata)
                g_free(rptentry->data);
        rptentry->data = data;
        ..
}

rv = oh_add_resource(oh_handler->rptcache, &rpt, hotswap_state, 0);

In oh_add_resource(), the hotswap_state pointer is stored as the data in the RPTable. Won't deallocating it on success cause issues?
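
Going only by the excerpt quoted here, the table keeps the pointer itself rather than a copy, and frees a previously stored pointer when the entry's `owndata` is 0. The sketch below models that rule with a hypothetical one-slot table (`toy_slot`, `toy_add`, `toy_clear`, `toy_discover` are all made-up names, and `free` stands in for `g_free`) to show why a caller that frees the pointer after a successful add would leave the table holding a dangling pointer. It is an illustration of the concern, not the plugin's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical one-slot table that mimics the excerpt's ownership rule. */
struct toy_slot {
    void *data;     /* pointer stored by toy_add(), not a copy */
    int owndata;    /* 0: the slot frees data; nonzero: the caller does */
};

/* Store data in the slot, releasing a previous slot-owned pointer. */
static int toy_add(struct toy_slot *slot, void *data, int owndata)
{
    assert(slot != NULL);
    if (slot->data && slot->data != data && !slot->owndata)
        free(slot->data);
    slot->data = data;
    slot->owndata = owndata;
    return 0;   /* 0 means success in this sketch */
}

/* Release whatever the slot still owns. */
static void toy_clear(struct toy_slot *slot)
{
    assert(slot != NULL);
    if (slot->data && !slot->owndata)
        free(slot->data);
    slot->data = NULL;
}

/*
 * Caller side: allocate a state value and hand it to the slot with
 * owndata == 0. On success the slot owns the pointer, so the caller must
 * not free it; it is freed here only when the add fails.
 */
static int toy_discover(struct toy_slot *slot, int state_value)
{
    int *state = malloc(sizeof(*state));
    if (state == NULL)
        return -1;
    *state = state_value;

    if (toy_add(slot, state, 0) != 0) {
        free(state);    /* failure: ownership never transferred */
        return -1;
    }
    return 0;           /* success: do NOT free(state) here */
}
```

Under this model the allocation is released exactly once, either by the caller on failure or by the table later, which is the distinction the question is getting at.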

Shyamala - 2017-06-20

From the logs provided, it looks like the issue is caused by memory not being freed in discover_server() in some error-handling path. While going through the code, I found some more places where memory was not freed on success. They are not exactly the cause of this issue, but I covered those too.
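
As an illustration of the kind of fix described, here is a minimal sketch of the single-exit cleanup pattern in C, where every error path releases what was allocated so far. The function `toy_discover_blade`, its stages, and the allocation counter are hypothetical and are not the contents of 1960.patch or of discover_server(). The buffers here are temporaries, so they are freed on success as well; that would not apply to a pointer whose ownership has been handed to another structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Live-allocation counter so the sketch can show that nothing leaks. */
static int live_allocs = 0;

static void *counted_alloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_allocs++;
    return p;
}

static void counted_free(void *p)
{
    if (p != NULL) {
        assert(live_allocs > 0);
        live_allocs--;
        free(p);
    }
}

/*
 * Hypothetical discovery step. fail_at selects which stage reports an
 * error (0 = none). Every exit goes through the cleanup label, so the
 * temporary buffers are released on success and on each failure path.
 */
static int toy_discover_blade(int fail_at)
{
    int rv = -1;
    char *name = NULL;
    int *state = NULL;

    name = counted_alloc(32);
    if (name == NULL || fail_at == 1)
        goto cleanup;

    state = counted_alloc(sizeof(*state));
    if (state == NULL || fail_at == 2)
        goto cleanup;

    if (fail_at == 3)   /* e.g. an unknown power state is reported */
        goto cleanup;

    rv = 0;

cleanup:
    counted_free(state);    /* counted_free(NULL) is a no-op */
    counted_free(name);
    return rv;
}
```

If an early `return` replaced any of the `goto cleanup` statements, each failed discovery would leak the buffers allocated before it, and a blade that fails discovery repeatedly would make the leak grow without bound.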

Did you get a chance to apply the patch and see the results?

Sathish - 2017-06-21

Shyamala,

First of all, thank you for your support.

I have tested the patch multiple times to be sure. Every time there is an OA switchover, the memory usage of openhpid increases by approximately 3 MB. The following logs are printed during the switchover.

Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:370: could not find enum value "EVENT_NETWORK_LINK_STATUS" in "EVENT_HEARTBEAT, EVENT_ENC_STATUS, EVENT_ENC_UID, EVENT_ENC_SHUTDOWN, EVENT_ENC_INFO, EVENT_ENC_NAMES, EVENT_USER_PERMISSION, EVENT_ADMIN_RIGHTS_CHANGED, EVENT_ENC_SHUTDOWN_PENDING, EVENT_ENC_TOPOLOGY, EVENT_FAN_STATUS, EVENT_FAN_INSERTED, EVENT_FAN_REMOVED, EVENT_FAN_GROUP_STATUS, EVENT_THERMAL_STATUS, EVENT_COOLING_STATUS, EVENT_FAN_ZONE_STATUS, EVENT_PS_STATUS, EVENT_PS_INSERTED, EVENT_PS_REMOVED, EVENT_PS_REDUNDANT, EVENT_PS_OVERLOAD, EVENT_AC_FAILURE, EVENT_PS_INFO, EVENT_PS_SUBSYSTEM_STATUS, EVENT_SERVER_POWER_REDUCTION_STATUS, EVENT_INTERCONNECT_STATUS, EVENT_INTERCONNECT_RESET, EVENT_INTERCONNECT_UID, EVENT_INTERCONNECT_INSERTED, EVENT_INTERCONNECT_REMOVED, EVENT_INTERCONNECT_INFO, EVENT_INTERCONNECT_HEALTH_LED, EVENT_INTERCONNECT_THERMAL, EVENT_INTERCONNECT_CPUFAULT, EVENT_INTERCONNECT_POWER, EVENT_INTERCONNECT_PORTMAP, EVENT_BLADE_PORTMAP, EVENT_INTERCONNECT_VENDOR_BLOCK, EVENT_INTERCONNECT_HEALTH_STATE, EVENT_DEMO_MODE, EVENT_BLADE_STATUS, EVENT_BLADE_INSERTED, EVENT_BLADE_REMOVED, EVENT_BLADE_POWER_STATE, EVENT_BLADE_POWER_MGMT, EVENT_BLADE_UID, EVENT_BLADE_SHUTDOWN, EVENT_BLADE_FAULT, EVENT_BLADE_THERMAL, EVENT_BLADE_INFO, EVENT_BLADE_MP_INFO, EVENT_ILO_READY, EVENT_LCD_BUTTON, EVENT_KEYING_ERROR, EVENT_ILO_HAS_IPADDRESS, EVENT_POWER_INFO, EVENT_LCD_STATUS, EVENT_LCD_INFO, EVENT_REDUNDANCY, EVENT_ILO_DEAD, EVENT_RACK_SERVICE_STARTED, EVENT_LCD_SCREEN_REFRESH, EVENT_ILO_ALIVE, EVENT_PERSONALITY_CHECK, EVENT_BLADE_POST_COMPLETE, EVENT_BLADE_SIGNATURE_CHANGED, EVENT_BLADE_PERSONALITY_CHANGED, EVENT_BLADE_TOO_LOW_POWER, EVENT_VIRTUAL_MEDIA_STATUS, EVENT_MEDIA_DRIVE_INSERTED, EVENT_MEDIA_DRIVE_REMOVED, EVENT_MEDIA_INSERTED, EVENT_MEDIA_REMOVED, EVENT_OA_NAMES, EVENT_OA_STATUS, EVENT_OA_UID, EVENT_OA_INSERTED, EVENT_OA_REMOVED, EVENT_OA_INFO, EVENT_OA_FAILOVER, EVENT_OA_TRANSITION_COMPLETE, EVENT_OA_VCM, EVENT_NETWORK_INFO_CHANGED, 
EVENT_SNMP_INFO_CHANGED, EVENT_SYSLOG_CLEARED, EVENT_
Jun 21 01:45:22 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:771: No bytes read
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:708: oh_ssl_read() part 1 failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:549: SSL connection failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:26 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:33 cz1 openhpid: oa_soap: oa_soap_oa_event.c:187: OA switching started
Jun 21 01:47:35 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection timed out
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:48:21 cz1 openhpid: oa_soap: oa_soap_event.c:491: OA <ip_address> is accessible
Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed
Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_oa_event.c:339: OA switching completed

Thanks,
Sathish

Shyamala - 2017-06-21

Hi Sathish,

From the logs, this looks like a connection problem with the OA. Is the OA accessible? If you are able to ping the OA IP, try running openhpid without my patch too. The patch actually does not change anything related to OA switchover.

Regards,
Shyamala

Sathish - 2017-06-21

Shyamala,

The OA is accessible. OA calls fail during switchover, which as per my understanding is expected. My point was that the memory taken by openhpid is still growing.

At the end of every switchover we see these logs:
Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed

Thanks,
Sathish

Shyamala - 2017-06-21

Hi Sathish,

The error messages seen now are completely different from the earlier ones; as you mentioned, a switchover is happening every time. The patch I gave was based on the earlier error messages. I need some more information.
1) Why is the OA switchover happening every time now? Were there any changes in the setup or connection?
2) Do you still see the messages related to the server blade RPT (such as "Failed to get Server rpt" or the unknown power state)? If yes, is there any difference in memory consumption with the patch?

Regards,
Shyamala

Sathish - 2017-06-21

The reason we are not seeing the same errors is that the logs I initially shared with you were from production, and I am not allowed to test patches in production. I am applying your code changes on my local machine and checking whether the memory leak is gone.

Switching over the OA is my test case, to check that no memory leak happens during switchover. If I am not mistaken, OA switchover is one of the cases where re-discovery happens.

I observe an increase in openhpid memory usage whenever I force an OA switchover.
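
A rough way to quantify this kind of test is to record the process RSS after each forced switchover and compute the average growth per switchover. The sketch below does that arithmetic; the function names and the sample values are assumptions for illustration, not measurements from this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Average growth in kB between consecutive RSS samples, one sample taken
 * after each forced switchover. Returns 0 when fewer than two samples.
 */
static long avg_growth_kb(const long *rss_kb, size_t count)
{
    assert(rss_kb != NULL || count == 0);
    if (count < 2)
        return 0;
    return (rss_kb[count - 1] - rss_kb[0]) / (long)(count - 1);
}

/*
 * Number of further switchovers until RSS reaches limit_kb at the given
 * growth rate, rounded up. Returns -1 if there is no growth.
 */
static long switchovers_until(long current_kb, long limit_kb, long growth_kb)
{
    if (growth_kb <= 0)
        return -1;
    if (current_kb >= limit_kb)
        return 0;
    return (limit_kb - current_kb + growth_kb - 1) / growth_kb;
}
```

For scale, at a growth of 3072 kB (3 MB) per switchover, `switchovers_until(0, 10L * 1024 * 1024, 3072)` gives 3414 switchovers to reach 10 GB starting from zero.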

Thanks,
Sathish

Shyamala - 2017-06-21

Ok got it. Yes you are right, re-discovery happens when there is OA Switchover. I will look into this further and update you.

Regards,
Shyamala

Shyamala - 2017-06-29

Hi Sathish,

We are still working on the issue. Will get back to you with updates soon.

Regards,
Shyamala

Sathish - 2017-06-30

Thank you Shyamala, we are waiting for the fix.
We are observing this issue in all the openhpid processes whenever there is a switchover.

Thanks,
Sathish

Shyamala - 2017-07-25

Hi Sathish,

Sorry for the delay in providing the patch.
Please find attached the patch for the memory leak issue. Please apply the patch, test it on your machine, and let me know your observations.

Regards,
Shyamala

1960.patch

Sathish - 2017-07-26

Hi Shyamala,

Thank you for the patch.

These are the steps I followed to test the patch.
1. svn checkout http://svn.code.sf.net/p/openhpi/code/openhpi/trunk opehpid-test
2. cd opehpid-test && ./bootstrap
3. cd plugins && patch -p1 -i 1960.patch -- Patch applied without any errors
4. ./configure --enable-oa_soap --enable-encryption --enable-watchdog && make
5. Used the binary in openhpid-test/openhpid/.libs and replaced it on the destination machine. Confirmed with md5sums
6. Restarted the openhpid
7. Switched over OA multiple times.

However, the memory increase is still observed, and it is in the same range: about 3 MB per OA switchover.

Thanks,
Sathish

Last edit: Sathish 2017-07-26

Anirban Das - 2017-07-31

Hi Shyamala,

With 1960.patch applied, we tested openhpid by running it under valgrind.

Command used: valgrind --leak-check=full --show-reachable=yes --error-limit=no --undef-value-errors=no --log-file=/tmp/memcheck.log /opt/bbee/thirdparty/sbin/openhpid -v -d -c /etc/openhpi.conf -p 4743 -f /opt/bbee-5.0.0-5/thirdparty/var/run/openhpid.pid.4743

The valgrind logs and process trace (verbose mode) logs have already been shared. Each OA switchover still increases memory usage by about 3 MB, even with this patch.

Thanks and Regards,

Shyamala - 2017-09-15

status: open --> closed-fixed

assigned_to: Shyamala