Hi,
The memory of openhpid process running on my machine is building up a lot of memory (more than 10G). Along with the memory consumption, the following log is observed and the process is never coming out of the re-discovery loop.
Jun 8 00:23:10 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed
Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55
Jun 8 00:24:02 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed
Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55
The configuration file contains the following:
OPENHPI_LOG_ON_SEV = "DEBUG"
OPENHPI_EVT_QUEUE_LIMIT = 10000
OPENHPI_DEL_SIZE_LIMIT = 10000
OPENHPI_DEL_SAVE = "NO"
OPENHPI_DAT_SIZE_LIMIT = 0
OPENHPI_DAT_USER_LIMIT = 0
OPENHPI_DAT_SAVE = "NO"
OPENHPI_PATH = "openhpi path"
OPENHPI_VARPATH = "lib_path"
OPENHPI_AUTOINSERT_TIMEOUT = 0
OPENHPI_AUTOINSERT_TIMEOUT_READONLY = "NO"
Any pointers/help would be great.
Thanks,
Sathish
Thank you for filing this bug Sathish. We need to fix this memory leak problem. Looks like there is some problem with the blade. More information on the bug could help.
a. Which version of openhpi you are using?
b. What is the version of OA firmware?
c. If you know more details on the blade, please let's know. Looks like it has some unique problem that we are not handling well.
Dr Mohan,
Pasting the output from ./hpiinv
SAF HPI Version B.03.02 (Is this the one?)
Some more info:
FieldId[1] Product Name : TEXT : BladeSystem c7000 Enclosure G3
FieldId[2] Manufacturer : TEXT : HP
FieldId[3] Product Versn : TEXT : 1.00
FieldId[4] Custom Field : TEXT : Telco_Status: TRUE
FieldId[5] Custom Field : TEXT : Power_Type: EXTERNAL_DC
FieldId[6] Custom Field : TEXT : ENCLOSURE STATUS: OK
Along the memory we have seen errors like,
a. Management processor status flipping (Screen shot attached for reference) - on one chassis
b. Power related issue on another chassis, as in the below logs
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5.
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:453: Failed to discover Server Blade
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:381: Discovery failed for active OA 10.193.69.5:443
Jun 8 10:25:34 cz2 openhpid: oasoap: oasoaputils.c:753: OA 10.193.69.5 has become Active
Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5
Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5.
Just thinking aloud:
Do we delete the resource-id from the RPTable, when the management processor is unresponsive? If so, the re-discovery infinite loop might have been entered just then, and thus the log in loop.
"Jun 8 00:23:57 cz2 openhpid: oasoap: oasoaprediscover.c:923: resource RPT is NULL"
Shouldn't the loop be exited when the rpt is NULL?
Thanks,
Sathish
Hi Sathish,
Thanks for the information provided.
From the logs, it looks like there is some problem with that particular blade. I need some more information regarding the server blade.
1) Which model of server blade is placed at bay 5 and what is the iLo Firmwrae version?
2) Did you do any iLo firmaware update on that server blade recently?
3) From when or any specific point, where this problem is seen?
I can suggest you to restart the openhpid and see.
Regards,
Shyamala
Shyamala,
Copying the info from ./hpiinv
FieldId[1] Product Name : TEXT : ProLiant BL460c Gen8
FieldId[2] Manufacturer : TEXT : HP
FieldId[3] Product Versn : TEXT : 2.40 Dec 02 2015
FieldId[4] Custom Field : TEXT : MP Model name = iLO4
The last iLo FW update was on Dec 02, 2015
As a temperory recovery mechanism we are restarting openhpid process if it crosses some limit. But, the process keeps recurring.
Also, can you help me in compiling the code, I am getting the following error:
make[4]: Leaving directory '/home/platform/openhpi-3.6.1/src'
/bin/bash: LIBTOOL@: command not found
Makefile:618: recipe for target 'sahpi_enum_utils.lo' failed
make[3]: [sahpi_enum_utils.lo] Error 127
make[3]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils'
Makefile:658: recipe for target 'all-recursive' failed
make[2]: [all-recursive] Error 1
make[2]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils'
Makefile:603: recipe for target 'all-recursive' failed
make[1]: [all-recursive] Error 1
make[1]: Leaving directory '/home/platform/openhpi-3.6.1/src'
Makefile:496: recipe for target 'all' failed
make: [all] Error 2
Thanks,
Sathish
Hi Sathish,
Looks like this is the problem with the particular server blade. Just for pointing out exact reason, please help me with results by following below steps.
1) Remove server blade and run openhpid. See the results.
2) After some time, insert server blade back. Confirm blade is powered on & then see the problem still persists.
Coming to compilation part, from the error, looks like there is no libtool installed. Install libtool and then try to compile code. Which OS are you using?
Regards,
Shyamala
Sathish,
What is the version of OA firmware? If possible attach hpiinv result.
Regards,
Shyamala
Shyamala,
It will take some time for providing the information, by pulling out the blade and restarting openhpid due to various reasons.
For hpiinv result, it has got a lot of private info, kindly share me your email-id so that can send the info privately.
Thanks,
Sathish
Sure.
My e-mail id is shyamala.hirepatt@hpe.com.
Regards,
Shyamala
Shyamala,
Did you get a chance to go through the hpiinv output? You found anything interesting?
Thanks,
Sathish
Hi Sathish,
I could not get much info from hpiinv log. For repeated re-discovery issue, you have to follow the steps mentioned above and get the results to figure out the exact reason.
For the memory leak issue, I have created & attached the patch. Please apply patch, test it and let me know the results.
Regards,
Shyamala
SaErrorT oh_add_resource(RPTable table, SaHpiRptEntryT entry, void *data, int owndata){
..
if (rptentry->data && rptentry->data != data && !rptentry->owndata)
g_free(rptentry->data);
rptentry->data = data;
..
}
rv = oh_add_resource(oh_handler->rptcache, &rpt, hotswap_state, 0);
In oh_add_resource(), the hot_swap_state pointer is stored as the data, in the RPTable. Won't de-allocing on success cause issues?
From the logs provided, looks like an issue is caused by not freeing the memory in discover_server() in some error handling case. While going through code I found some more places where memory was not freed on success, not exactly causing this issue. So I covered those too.
Did you get chance to apply patch and see the results?
Shyamala,
First of all, thank you for your support.
I have tested the patch multiple time to be sure. Every time there is a switchover of OA, the memory usage of openhpid is approximately increased by 3 MB. Following logs are printed during the same.
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:370: could not find enum value "EVENT_NETWORK_LINK_STATUS" in "EVENT_HEARTBEAT, EVENT_ENC_STATUS, EVENT_ENC_UID, EVENT_ENC_SHUTDOWN, EVENT_ENC_INFO, EVENT_ENC_NAMES, EVENT_USER_PERMISSION, EVENT_ADMIN_RIGHTS_CHANGED, EVENT_ENC_SHUTDOWN_PENDING, EVENT_ENC_TOPOLOGY, EVENT_FAN_STATUS, EVENT_FAN_INSERTED, EVENT_FAN_REMOVED, EVENT_FAN_GROUP_STATUS, EVENT_THERMAL_STATUS, EVENT_COOLING_STATUS, EVENT_FAN_ZONE_STATUS, EVENT_PS_STATUS, EVENT_PS_INSERTED, EVENT_PS_REMOVED, EVENT_PS_REDUNDANT, EVENT_PS_OVERLOAD, EVENT_AC_FAILURE, EVENT_PS_INFO, EVENT_PS_SUBSYSTEM_STATUS, EVENT_SERVER_POWER_REDUCTION_STATUS, EVENT_INTERCONNECT_STATUS, EVENT_INTERCONNECT_RESET, EVENT_INTERCONNECT_UID, EVENT_INTERCONNECT_INSERTED, EVENT_INTERCONNECT_REMOVED, EVENT_INTERCONNECT_INFO, EVENT_INTERCONNECT_HEALTH_LED, EVENT_INTERCONNECT_THERMAL, EVENT_INTERCONNECT_CPUFAULT, EVENT_INTERCONNECT_POWER, EVENT_INTERCONNECT_PORTMAP, EVENT_BLADE_PORTMAP, EVENT_INTERCONNECT_VENDOR_BLOCK, EVENT_INTERCONNECT_HEALTH_STATE, EVENT_DEMO_MODE, EVENT_BLADE_STATUS, EVENT_BLADE_INSERTED, EVENT_BLADE_REMOVED, EVENT_BLADE_POWER_STATE, EVENT_BLADE_POWER_MGMT, EVENT_BLADE_UID, EVENT_BLADE_SHUTDOWN, EVENT_BLADE_FAULT, EVENT_BLADE_THERMAL, EVENT_BLADE_INFO, EVENT_BLADE_MP_INFO, EVENT_ILO_READY, EVENT_LCD_BUTTON, EVENT_KEYING_ERROR, EVENT_ILO_HAS_IPADDRESS, EVENT_POWER_INFO, EVENT_LCD_STATUS, EVENT_LCD_INFO, EVENT_REDUNDANCY, EVENT_ILO_DEAD, EVENT_RACK_SERVICE_STARTED, EVENT_LCD_SCREEN_REFRESH, EVENT_ILO_ALIVE, EVENT_PERSONALITY_CHECK, EVENT_BLADE_POST_COMPLETE, EVENT_BLADE_SIGNATURE_CHANGED, EVENT_BLADE_PERSONALITY_CHANGED, EVENT_BLADE_TOO_LOW_POWER, EVENT_VIRTUAL_MEDIA_STATUS, EVENT_MEDIA_DRIVE_INSERTED, EVENT_MEDIA_DRIVE_REMOVED, EVENT_MEDIA_INSERTED, EVENT_MEDIA_REMOVED, EVENT_OA_NAMES, EVENT_OA_STATUS, EVENT_OA_UID, EVENT_OA_INSERTED, EVENT_OA_REMOVED, EVENT_OA_INFO, EVENT_OA_FAILOVER, EVENT_OA_TRANSITION_COMPLETE, EVENT_OA_VCM, EVENT_NETWORK_INFO_CHANGED, EVENT_SNMP_INFO_CHANGED, EVENT_SYSLOG_CLEARED, EVENT_
Jun 21 01:45:22 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:771: No bytes read
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:708: oh_ssl_read() part 1 failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:549: SSL connection failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:26 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:33 cz1 openhpid: oa_soap: oa_soap_oa_event.c:187: OA switching started
Jun 21 01:47:35 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection timed out
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:48:21 cz1 openhpid: oa_soap: oa_soap_event.c:491: OA <ip_address> is accessible
Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed
Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_oa_event.c:339: OA switching completed
Thanks,
Sathish
Hi Sathish,
Form the logs, look like this is connection problem with OA. Is OA accessible? If you are able to ping OA IP, try to run openhpid without my patch too. Actually patch does not do anything with OA switchover.
Regards,
Shyamala
Shyamala,
OA is accessible. OA calls are failing during switchover, which as per my understanding is expected. My point was memory taken by openhpid is still growing.
At the end of all the switchover we see the logs:
Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started
Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed
Thanks,
Sathish
Hi Sathish,
The thing is earlier whatever error messages were seen, now it is totally different, like you mentioned that every time switchover is happening. The patch given was based on earlier error messages. I need some more infomration.
1) Why OA switchover is happening every time now? Any changes in setup or connection?
2) Are you able to see the messages related to server blade RPT (like Failed to get server RPT or power state etc)? If yes, is there any difference in memory consumption with patch?
Regards,
Shyamala
The reason we are not seeing the same errors is, the logs I initially shared with you were from the production and I am not allowed to test patches in production. I am patching your code changes in my local machine and checking if there is no memory leak.
I am observing memory usage increase of openhpid, when ever I force a switch over of OA.
Thanks,
Sathish
Ok got it. Yes you are right, re-discovery happens when there is OA Switchover. I will look into this further and update you.
Regards,
Shyamala
Hi Sathish,
We are still working on the issue. Will get back to you with updates soon.
Regards,
Shyamala
Thank you Shyamala, we are waiting for the fix.
We are observing this issue in all the openhpid processes whenever there is a switchover.
Thanks,
Sathish
Hi Satish,
Sorry for the delay in providing patch.
PFA the patch for memory leak issue. Please apply patch, test it on your machine and let me know your observations.
Regards,
Shyamala
Hi Shyamala,
Thank you for the patch.
These are the steps I followed to test the patch.
1. svn checkout http://svn.code.sf.net/p/openhpi/code/openhpi/trunk opehpid-test
2. cd opehpid-test && ./bootstrap
3. cd plugins && patch -p1 -i 1960.patch -- Patch applied without any errors
4. ./configure --enable-oa_soap --enable-encryption --enable-watchdog && make
5. Used the binary in openhpid-test/openhpid/.libs and replaced it on the destination machine. Confirmed with md5sums
6. Restarted the openhpid
7. Switched over OA multiple times.
However, the memory increase is observed and it is in the same range as 3MB every OA switchover.
Thanks,
Sathish
Last edit: Sathish 2017-07-26
Hi Shyamala,
With 1960.patch, we have tested and ran openhpid with valgrind.
Command used: valgrind --leak-check=full --show-reachable=yes --error-limit=no --undef-value-errors=no --log-file=/tmp/memcheck.log /opt/bbee/thirdparty/sbin/openhpid -v -d -c /etc/openhpi.conf -p 4743 -f /opt/bbee-5.0.0-5/thirdparty/var/run/openhpid.pid.4743
The valgrind logs and process trace (verbose mode) logs are already shared. Each OA switchover is increasing 3 MB memory usage even with this patch.
Thanks and Regards,