Changeset: 4811:eb57695a171b
Host: Windows 7, guest: VirtualBox, cluster: LXC
MDS over TCP (important, more on that later)
First core dump:
(gdb) bt
mbc_inst=mbc_inst@entry=0x890aa0) at mbcsv_util.c:486
deallocate=deallocate@entry=false) at immd_evt.c:283
at immd_evt.c:2325
With export MALLOC_CHECK_=2 in immd.conf I instead get:
(gdb) bt
mbc_inst=mbc_inst@entry=0xc12340) at mbcsv_util.c:486
deallocate=deallocate@entry=false) at immd_evt.c:283
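A plausible reason the two backtraces differ (an assumption, not verified against the glibc sources on this platform): MALLOC_CHECK_=2 makes glibc run extra consistency checks on every malloc/free and call abort() as soon as an error such as a double free is detected, so the core then points at the faulty free() itself rather than at some later allocation that trips over the already-corrupted heap. A tiny stand-alone demonstration (the double free below is deliberate; it is the thing being demonstrated, not a pattern to copy):

/* Deliberate double free, only to show where MALLOC_CHECK_=2 aborts.
 * Build: gcc demo.c -o demo
 * Run:   MALLOC_CHECK_=2 ./demo   (glibc aborts at the second free)   */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(160);   /* 160 bytes, like the MDS payload above */
    if (buf == NULL)
        return 1;
    memset(buf, 0, 160);
    free(buf);                 /* first, legitimate free                */
    free(buf);                 /* second free: with MALLOC_CHECK_=2 the
                                  abort (and the core) happens here     */
    return 0;
}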
At mbcsv_util.c:486 the memory is freed after the MDS send has failed. This is most likely the problem: the memory has already been freed by MDS.
The MDS send fails due to a timeout; why that happens is a DTM/MDS issue, more on that later.
This leads to an active controller reboot and, in the worst case, a cluster restart.
This is probably because the legacy memory manager has been partially removed: reference counting of objects no longer works. This area of "base" needs a major cleanup.
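For clarity, a minimal sketch of the ownership rule that would avoid the suspected double free. This is not the MBCSv/MDS code; send_to_peer() and send_ckpt_update() are hypothetical names, and the sketch simply assumes that the send layer frees the buffer on every path, including the timeout path:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the MDS sync send: it always consumes
 * (frees) the buffer, whether the send succeeds or times out.         */
static bool send_to_peer(char *buf, size_t len)
{
    bool timed_out = true;     /* pretend the peer never answered      */
    (void)len;                 /* ... encode and transmit buf ...      */
    free(buf);                 /* send layer releases the buffer       */
    return !timed_out;         /* report the timeout to the caller     */
}

static bool send_ckpt_update(const char *payload, size_t len)
{
    char *buf = malloc(len);
    if (buf == NULL)
        return false;
    memcpy(buf, payload, len);

    if (!send_to_peer(buf, len)) {
        /* The suspected mbcsv_util.c:486 pattern would be a second
         * free(buf) here; with ownership handed to send_to_peer() the
         * caller only reports the error and never frees again.        */
        return false;
    }
    return true;
}

int main(void)
{
    if (!send_ckpt_update("async update", 13))
        fprintf(stderr, "checkpoint send failed, no double free\n");
    return 0;
}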
Trace from immd:
Jan 19 10:43:45.442248 osafimmd [400:immd_evt.c:0235] >> immd_evt_proc_fevs_req
Jan 19 10:43:45.442270 osafimmd [400:immd_evt.c:0271] T5 immd_evt_proc_fevs_req send_count:625 size:111
Jan 19 10:43:45.442304 osafimmd [400:immd_mbcsv.c:0045] >> immd_mbcsv_sync_update
Jan 19 10:43:45.442326 osafimmd [400:mbcsv_api.c:0773] >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
Jan 19 10:43:45.442346 osafimmd [400:mbcsv_api.c:0803] TR svc_id:42, pwe_hdl:65549
Jan 19 10:43:45.442366 osafimmd [400:mbcsv_util.c:0344] >> mbcsv_send_ckpt_data_to_all_peers
Jan 19 10:43:45.442385 osafimmd [400:mbcsv_util.c:0388] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Jan 19 10:43:45.442404 osafimmd [400:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 42, pwe_hdl: 65549
Jan 19 10:43:45.442424 osafimmd [400:mbcsv_util.c:0400] TR calling encode callback
Jan 19 10:43:45.442444 osafimmd [400:immd_mbcsv.c:0399] >> immd_mbcsv_callback
Jan 19 10:43:45.442463 osafimmd [400:immd_mbcsv.c:0790] >> immd_mbcsv_encode_proc
Jan 19 10:43:45.442482 osafimmd [400:immd_mbcsv.c:0798] T5 MBCSV_MSG_ASYNC_UPDATE
Jan 19 10:43:45.442501 osafimmd [400:immd_mbcsv.c:0455] >> mbcsv_enc_async_update
Jan 19 10:43:45.442519 osafimmd [400:immd_mbcsv.c:0463] T5 **ENC SYNC COUNT 300
Jan 19 10:43:45.442539 osafimmd [400:immd_mbcsv.c:0482] T5 ENCODE IMMD_A2S_MSG_FEVS: send count: 625 handle: 85899477263
Jan 19 10:43:45.442564 osafimmd [400:immd_mbcsv.c:0605] << mbcsv_enc_async_update
Jan 19 10:43:45.442583 osafimmd [400:immd_mbcsv.c:0843] << immd_mbcsv_encode_proc
Jan 19 10:43:45.442602 osafimmd [400:immd_mbcsv.c:0428] T5 IMMD - MBCSv Callback Success
Jan 19 10:43:45.442621 osafimmd [400:immd_mbcsv.c:0429] << immd_mbcsv_callback
Jan 19 10:43:45.442639 osafimmd [400:mbcsv_util.c:0439] TR send the encoded message to any other peer with same s/w version
Jan 19 10:43:45.442658 osafimmd [400:mbcsv_util.c:0442] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Jan 19 10:43:45.442676 osafimmd [400:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 42, pwe_hdl: 65549
Jan 19 10:43:45.442697 osafimmd [400:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:d
Jan 19 10:43:45.442717 osafimmd [400:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
Jan 19 10:43:45.442739 osafimmd [400:mds_log.c:0192] TR INFO |MDS_SND_RCV: creating sync entry with xch_id=289
Jan 19 10:43:45.442978 osafimmd [400:mds_log.c:0192] TR INFO |MDS_SND_RCV: Msg Destination is on off node or diff process
Jan 19 10:43:45.443076 osafimmd [400:mds_log.c:0192] TR INFO |MDS_SND_RCV: Sending the data to MDTM layer
Jan 19 10:43:45.443148 osafimmd [400:mds_log.c:0192] TR INFO |MDTM: User Sending Data lenght=160 Fr_svc=19 to_svc=19
Jan 19 10:43:46.444955 osafimmd [400:mds_log.c:0192] TR ERR |MDS_SND_RCV: Timeout or Error occured
Jan 19 10:43:46.445272 osafimmd [400:mds_log.c:0192] TR ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id=19, to svc_id=19
Jan 19 10:43:46.445360 osafimmd [400:mds_log.c:0192] TR ERR |MDS_SND_RCV: Adest=<0x00000000,13>
Jan 19 10:43:46.445429 osafimmd [400:mds_log.c:0192] TR ERR |MDS_SND_RCV: Anchor=<0x0002020f,397>
Jan 19 10:43:46.445501 osafimmd [400:mds_log.c:0192] TR INFO |MDS_SND_RCV: Await active entry doesnt exists
Jan 19 10:43:46.445568 osafimmd [400:mds_log.c:0192] TR INFO |MDS_SND_RCV: Deleting the sync send entry with xch_id=289
Jan 19 10:43:46.445634 osafimmd [400:mds_log.c:0192] TR INFO |MDS_SND_RCV: Successfully Deleted the sync send entry with xch_id=289, fr_svc_id=19
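What the MDS_SND_RCV lines above describe is a synchronous send with a bounded wait: a sync entry keyed by an exchange id (xch_id=289) is created, the sender waits about one second for the red sndrsp reply, and on timeout the entry is deleted and an error is returned to MBCSv. A rough sketch of such a wait-with-timeout entry using POSIX threads; this is illustrative only, not the MDS implementation, and all names are made up:

/* Build: gcc -pthread sketch.c -o sketch                              */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct sync_entry {
    unsigned xch_id;            /* exchange id, e.g. xch_id=289        */
    bool answered;              /* set by the receive path on a reply  */
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

/* Wait up to timeout_ms for the peer's reply; false means timeout,
 * mirroring "Timeout occured on red sndrsp message" in the trace.     */
static bool sync_send_wait(struct sync_entry *e, long timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&e->lock);
    int rc = 0;
    while (!e->answered && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&e->cond, &e->lock, &deadline);
    bool ok = e->answered;
    pthread_mutex_unlock(&e->lock);
    return ok;                  /* caller then deletes the sync entry  */
}

int main(void)
{
    struct sync_entry e = { .xch_id = 289, .answered = false,
                            .lock = PTHREAD_MUTEX_INITIALIZER,
                            .cond = PTHREAD_COND_INITIALIZER };
    printf("creating sync entry with xch_id=%u\n", e.xch_id);
    if (!sync_send_wait(&e, 1000))      /* no reply ever arrives       */
        printf("timeout, deleting sync entry with xch_id=%u\n", e.xch_id);
    return 0;
}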
Can someone tell me how to add some markup/markdown that makes this readable?
The segfault occurs because of the patch for https://sourceforge.net/p/opensaf/tickets/712/.
In the failure case MDS is freeing the buffer somewhere (needs further analysis).
The same segfault is observed for lckd and evtd in our tests.
Planning to revert #712 in this ticket and reopen #712 for further analysis.
changeset:   4822:659384705601
tag:         tip
parent:      4818:4ba0f7b4a0d4
user:        Neelakanta Reddy <reddy.neelakanta@oracle.com>
date:        Tue Jan 21 18:43:09 2014 +0530
summary:     base: Reverted the fixed memory leaks in base [#731]

changeset:   4821:47c02a4ab823
branch:      opensaf-4.4.x
parent:      4817:55cdaee9338b
user:        Neelakanta Reddy <reddy.neelakanta@oracle.com>
date:        Tue Jan 21 18:43:09 2014 +0530
summary:     base: Reverted the fixed memory leaks in base [#731]

changeset:   4820:4eb3a5cd3bab
branch:      opensaf-4.3.x
parent:      4812:a2d481559173
user:        Neelakanta Reddy <reddy.neelakanta@oracle.com>
date:        Tue Jan 21 18:43:09 2014 +0530
summary:     base: Reverted the fixed memory leaks in base [#731]

changeset:   4819:c6106939fc4c
branch:      opensaf-4.2.x
parent:      4813:4d5fd061131e
user:        Neelakanta Reddy <reddy.neelakanta@oracle.com>
date:        Tue Jan 21 18:43:09 2014 +0530
summary:     base: Reverted the fixed memory leaks in base [#731]
Related tickets: #731