Currently observed between saNtfFinalize() with saNtfNotificationUnsubscribe () and also with saNtfStateChangeNotificationAllocate().
Generally any API called after finalize returns with BAD_HANDLE.
But some times finalize() hangs when it is timed in between the processing of Allocate() or Unsubscribe() as some resources will be in use in these APIs.
1) bt1 between Finalize() and Allocate:
Thread 4 (Thread 0x41082d40 (LWP 13508)):
#0 0x00000039ef6cb2e6 in poll () from /lib64/libc.so.6
#1 0x00007f34c3440754 in osaf_poll_no_timeout (io_fds=0x410824f0, i_nfds=1) at osaf_poll.c:31
#2 0x00007f34c3440825 in osaf_ppoll (io_fds=0x410824f0, i_nfds=1, i_timeout_ts=0x0, i_sigmask=0x0)
at osaf_poll.c:78
#3 0x00007f34c344a681 in ncs_tmr_wait () at sysf_tmr.c:411
#4 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#5 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x41389d40 (LWP 13509)):
#0 0x00000039ef6cb2e6 in poll () from /lib64/libc.so.6
#1 0x00007f34c3490410 in mdtm_process_recv_events () at mds_dt_tipc.c:665
#2 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#3 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x41b8b940 (LWP 13510)):
#0 0x00000039f020d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000039f0208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000039f0208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f34c36bd068 in saNtfStateChangeNotificationAllocate (ntfHandle=4290772993, notification=0x601680,
numCorrelatedNotifications=0, lengthAdditionalText=19, numAdditionalInfo=1, numStateChanges=1,
variableDataSize=-1) at ntfa_api.c:1936
#4 0x0000000000400b7d in allocate ()
#5 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#6 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f34c34256e0 (LWP 13507)):
#0 0x00000039f020cd01 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007f34c34563c1 in hm_block_me (cell=0x1ff9ce0, pool_id=0 '\000') at hj_hdl.c:696
#2 0x00007f34c3455750 in ncshm_destroy_hdl (id=NCS_SERVICE_ID_NTFA, uhdl=4290772993) at hj_hdl.c:366
#3 0x00007f34c36c21a4 in ntfa_hdl_rec_del (list_head=0x7f34c38d4b28, rm_node=0x200b780) at ntfa_util.c:795
#4 0x00007f34c36bafd5 in saNtfFinalize (ntfHandle=4290772993) at ntfa_api.c:1195
#5 0x0000000000400df6 in main ()
(gdb)
2)BT2 between Finalize() and Unsubscribe()
Thread 4 (Thread 0x402c5d40 (LWP 13776)):
#0 0x00000039ef6cb2e6 in poll () from /lib64/libc.so.6
#1 0x00007fc506988754 in osaf_poll_no_timeout (io_fds=0x402c54f0, i_nfds=1) at osaf_poll.c:31
#2 0x00007fc506988825 in osaf_ppoll (io_fds=0x402c54f0, i_nfds=1, i_timeout_ts=0x0, i_sigmask=0x0)
at osaf_poll.c:78
#3 0x00007fc506992681 in ncs_tmr_wait () at sysf_tmr.c:411
#4 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#5 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x40bbed40 (LWP 13777)):
#0 0x00000039ef6cb2e6 in poll () from /lib64/libc.so.6
#1 0x00007fc5069d8410 in mdtm_process_recv_events () at mds_dt_tipc.c:665
#2 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#3 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x413c0940 (LWP 13778)):
#0 0x00000039f020d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000039f0208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000039f0208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fc506c04379 in subscriberListItemRemove (subscriptionId=111) at ntfa_api.c:1635
#4 0x00007fc506c077ab in saNtfNotificationUnsubscribe (subscriptionId=111) at ntfa_api.c:2785
#5 0x0000000000400ec5 in unsubscribe ()
#6 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#7 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fc50696d6e0 (LWP 13775)):
#0 0x00000039f020cd01 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fc50699e3c1 in hm_block_me (cell=0xfcece0, pool_id=0 '\000') at hj_hdl.c:696
#2 0x00007fc50699d750 in ncshm_destroy_hdl (id=NCS_SERVICE_ID_NTFA, uhdl=4290772993) at hj_hdl.c:366
#3 0x00007fc506c0a1a4 in ntfa_hdl_rec_del (list_head=0x7fc506e1cb28, rm_node=0xfe0780) at ntfa_util.c:795
#4 0x00007fc506c02fd5 in saNtfFinalize (ntfHandle=4290772993) at ntfa_api.c:1195
#5 0x0000000000400fa3 in main ()
Attached is the sample application ( unsubscribe.c ) to simulate for Unsubscribe() and Finalize().
Diff:
Also observed between PtrValAllocate() and Finalize().
Thread 4 (Thread 0x40a6bd40 (LWP 15671)):
#0 0x00000039ef6cb2e6 in poll () from /lib64/libc.so.6
#1 0x00007fcaaa3a6754 in osaf_poll_no_timeout (io_fds=0x40a6b4f0, i_nfds=1) at osaf_poll.c:31
#2 0x00007fcaaa3a6825 in osaf_ppoll (io_fds=0x40a6b4f0, i_nfds=1, i_timeout_ts=0x0, i_sigmask=0x0)
at osaf_poll.c:78
#3 0x00007fcaaa3b0681 in ncs_tmr_wait () at sysf_tmr.c:411
#4 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#5 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x409aed40 (LWP 15672)):
#0 0x00000039ef6cb2e6 in poll () from /lib64/libc.so.6
#1 0x00007fcaaa3f6410 in mdtm_process_recv_events () at mds_dt_tipc.c:665
#2 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#3 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x41bc9940 (LWP 15674)):
#0 0x00000039f020d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000039f0208e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000039f0208cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fcaaa623756 in saNtfPtrValAllocate (notificationHandle=4288675841, dataSize=8, dataPtr=0x41bc9120,
value=0x191bc48) at ntfa_api.c:2076
#4 0x0000000000400ccd in ptrallocate ()
#5 0x00000039f020673d in start_thread () from /lib64/libpthread.so.0
#6 0x00000039ef6d40cd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fcaaa38b6e0 (LWP 15670)):
#0 0x00000039f020cd01 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fcaaa3bc3c1 in hm_block_me (cell=0x1909ce0, pool_id=0 '\000') at hj_hdl.c:696
#2 0x00007fcaaa3bb750 in ncshm_destroy_hdl (id=NCS_SERVICE_ID_NTFA, uhdl=4290772993) at hj_hdl.c:366
#3 0x00007fcaaa6281a4 in ntfa_hdl_rec_del (list_head=0x7fcaaa83ab28, rm_node=0x191b780) at ntfa_util.c:795
#4 0x00007fcaaa620fd5 in saNtfFinalize (ntfHandle=4290772993) at ntfa_api.c:1195
#5 0x0000000000400e22 in main ()
agent trace:
Oct 6 17:29:53.427642 ntfa [15670:ntfa_api.c:1976] << saNtfStateChangeNotificationAllocate
Oct 6 17:29:53.427787 ntfa [15670:ntfa_api.c:2055] >> saNtfPtrValAllocate
Oct 6 17:29:53.427808 ntfa [15670:ntfsv_mem.c:0814] TR realloc base=(nil), size=0, data_size=8
Oct 6 17:29:53.429007 ntfa [15670:ntfa_api.c:1143] >> saNtfFinalize
Oct 6 17:29:53.429068 ntfa [15670:ntfa_mds.c:1102] >> ntfa_mds_msg_sync_send
Oct 6 17:29:53.429130 ntfa [15670:ntfa_mds.c:0457] >> ntfa_mds_enc
Oct 6 17:29:53.429146 ntfa [15670:ntfa_mds.c:0487] T2 msgtype: 1
Oct 6 17:29:53.429155 ntfa [15670:ntfa_mds.c:0499] T2 api_info.type: 2
Oct 6 17:29:53.429163 ntfa [15670:ntfa_mds.c:0079] >> ntfa_enc_finalize_msg
Oct 6 17:29:53.429196 ntfa [15670:ntfa_mds.c:0092] << ntfa_enc_finalize_msg
Oct 6 17:29:53.429206 ntfa [15670:ntfa_mds.c:0544] << ntfa_mds_enc
Oct 6 17:29:53.429818 ntfa [15670:ntfa_mds.c:0767] >> ntfa_mds_dec
Oct 6 17:29:53.429862 ntfa [15670:ntfa_mds.c:0797] T2 NTFSV_NTFA_API_RESP_MSG rc = 1
Oct 6 17:29:53.429869 ntfa [15670:ntfa_mds.c:0872] << ntfa_mds_dec
Oct 6 17:29:53.429925 ntfa [15670:ntfa_mds.c:1129] << ntfa_mds_msg_sync_send
Oct 6 17:29:53.429951 ntfa [15670:ntfa_api.c:2055] >> saNtfPtrValAllocate
Oct 6 17:29:53.429970 ntfa [15670:ntfa_util.c:0782] >> ntfa_hdl_rec_del
Oct 6 17:29:53.430000 ntfa [15670:ntfa_api.c:2083] << saNtfPtrValAllocate
Oct 6 17:29:53.430055 ntfa [15670:ntfa_api.c:2055] >> saNtfPtrValAllocate
Oct 6 17:29:53.430096 ntfa [15670:ntfa_api.c:2071] TR ncshm_take_hdl client_handle failed
Oct 6 17:29:53.430104 ntfa [15670:ntfa_api.c:2083] << saNtfPtrValAllocate
"/tmp/ntf.trace" 25062L, 1931341C 25062,1 B
It seems that the problem could be generalized:
. Thread B takes ncshm_hdl, enter lock region of ntfa_cb.cb_lock
. Thread A is waiting for ntfa_cb.cb_lock to be released
. Thread B releases ntfa_cb.cb_lock, but hasn't given ncshm_hdl. Meanwhile thread A enters lock region of ntfa_cb.cb_lock, but it's blocked by ncshm_destroy_hdl() because the thread B hasn't given ncshm_hdl
So there could be many interfering lock region of cb_lock and ncshm_hdl in agent codes on sorts of handles: NTF handles (finalize vs ...), notification handle (send notif vs notif free), filter handle (subscribe vs filter free), reader handle (read next vs read finalize)
All this happening because saNtfFinalize() has taken unnecessary lock on ntfa_cb.cb_lock. It is not required in saNtfFinalize() because after getting successful response from the server, finalize() takes lock and tries to free all the resources. Here it gets block in ntfa_hdl_rec_del() ---->ncshm_destroy_hdl() (actually in in ncshm_destroy_hdl()--->hm_block_me())) as client_handle is in use by other API and waits for it to get released. Thus this leap API wants other threads to complete thier functionality and release the handle and it decrements the count whenever the handle is released. At the same time, new APIs will not get the handle and will get BAD_HANDLE. Only existing APIS which has taken the client_handle will be able to execute. So if the lock is removed from Finalize(), other APIS will be able to complete their call and will return handle one by one which eventually will unblock finalize(). I am chechking if all these three APIs are returning BAD_HANDLE or not or atleast some proper SAF error code for other cases.
Published v2:
https://sourceforge.net/p/opensaf/mailman/message/34538541/
changeset: 7019:f378a750ff34
tag: tip
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:40:30 2015 +0530
summary: tests/ntf: add test cases for saNtfFinalize [#1521].
changeset: 7018:7963c6f42db3
parent: 7011:5b29434dd9a1
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:40:30 2015 +0530
summary: ntfa: fix deadlock situation due to saNtfFinalize [#1521].
changeset: 7017:fa54167cfe42
branch: opensaf-4.7.x
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:40:16 2015 +0530
summary: tests/ntf: add test cases for saNtfFinalize [#1521].
changeset: 7016:a64ae24627e6
branch: opensaf-4.7.x
parent: 7010:ed4c93647e3c
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:40:16 2015 +0530
summary: ntfa: fix deadlock situation due to saNtfFinalize [#1521].
changeset: 7015:a4a8aa4c4298
branch: opensaf-4.6.x
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:39:45 2015 +0530
summary: tests/ntf: add test cases for saNtfFinalize [#1521].
changeset: 7014:02458309eed8
branch: opensaf-4.6.x
parent: 7007:b92fa9ce57ee
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:39:45 2015 +0530
summary: ntfa: fix deadlock situation due to saNtfFinalize [#1521].
changeset: 7013:44f1a282922a
branch: opensaf-4.5.x
user: praveen.malviya@oracle.com
date: Mon Oct 19 10:39:04 2015 +0530
summary: tests/ntf: add test cases for saNtfFinalize [#1521].
https://sourceforge.net/p/opensaf/mailman/message/34538560/
Related
Tickets:
#1521