From: SourceForge.net <no...@so...> - 2011-03-12 06:01:59
|
Patches item #3202833, was opened at 2011-03-08 04:51 Message generated for change (Comment added) made by rstory You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=312694&aid=3202833&group_id=12694 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Santanu (santanusen) Assigned to: Nobody/Anonymous (nobody) Summary: Fix: Race between snmpd::receive() and snmp_sess_close() Initial Comment: This patch fixes a race between snmpd::receive() and snmp_sess_close() in a multi-threaded application. Consider the following scenario for example. Assume snmpd::receive is running on thread 'T1' and a second thread, 'T2', is trying to close a session 's1' with fd 'f1'. The following sequence of events take place in order of time. 1. T1: calls snmp_select_info2(). 'f1' gets added to the read fd-set for select. 2. T2: calls snmp_sess_close() for 's1'. 'f1' is closed. 3. T1: calls select(). select returns -1 with 'errno' set to EBADF as 'f1' is already closed. snmpd is shut down. One may propose to ignore select failures with EBADF errors. But, consider the following, more complicated, scenario. This, along with the above two threads, involves a third thread T3 trying to open a socket/file. 1. T1: calls snmp_select_info2(). 'f1' is added to the read fd-set for select. 2. T2: calls snmp_sess_close() for 's1'. 'f1' is closed. 3. T3: opens a new socket/file and gets back the fd 'f1'. 4. T1: does select() and recvfrom() on an unrelated fd 'f1'. The root of the problem is in calling snmp_sess_close() from a different thread than the thread running the snmpd::run code (agent thread, say). This patch fixes the issue by getting 'snmp_sess_close()' called in the context of the agent thread. The agent thread maintains a queue of jobs posted by other threads. A job is represented by a combination of a function pointer and an opaque data pointer to be passed as an argument to the function. A FIFO is maintained for notifying the agent thread about new job requests. The agent thread adds this FIFO fd to its select read fd-set to receive notifications. A thread calling snmp_sess_close() now actually posts a job request to close the snmp session and notifies the agent thread. The fix has been tested with ucd-snmp-4.2/net-snmp-5.6.1 on Linux platforms running on ppc32 and i586 processors. The patch provided is applicable to net-snmp-5.6.1. ---------------------------------------------------------------------- >Comment By: Robert Story (rstory) Date: 2011-03-12 01:01 Message: Some thoughts on this patch. - multi-threaded agents are not supported. thus, at best, this might could go into trunk. - why are you eliminating the return code from the new _snmp_sess_close()? - this job-request dohickey looks/feels very similar to the callback transport. maybe use (possilby enhance/tweak) that instead of something completely new? - probably should use the fd event manager (eg register_readfd()) instead of a local var in snmpd ---------------------------------------------------------------------- Comment By: (bvassche) Date: 2011-03-10 15:14 Message: How about passing 0 instead of 1 to agent_check_and_process() such that it doesn't block and to add the necessary code to avoid busy-waiting too ? ---------------------------------------------------------------------- Comment By: Santanu (santanusen) Date: 2011-03-10 00:17 Message: Yes. But the problem again is how to notify the agent thread if it is already blocked in select with infinite timeout and does not wake up soon (worse, does not wake up at all). If too many sessions are opened and are not cleaned up soon, we may end up having too many open file descriptors or running out of memory. Is there any mechanism (apart from delivering a signal, because we are using the signals for some other purpose) to unblock the agent thread blocked in select? ---------------------------------------------------------------------- Comment By: (bvassche) Date: 2011-03-09 14:53 Message: Have you already considered to set sess->transport->sock to -1 (see also the "This session was marked for deletion." comment in snmp_api.c) ? ---------------------------------------------------------------------- Comment By: Santanu (santanusen) Date: 2011-03-09 00:24 Message: In case the agent thread is already blocked in select with infinite timeout, how can it be woken up and informed about the new alarm handler registered by some other thread? May be, we can leave the snmp_sess_close() function as it is and make use of the notification mechanism provided in this patch from the app code. ---------------------------------------------------------------------- Comment By: (bvassche) Date: 2011-03-08 06:56 Message: I'm not sure the Net-SNMP source code has to be modified to fix that race. All you have to do is to register a single-shot alarm handler with timeout set to zero seconds. You can even unregister that alarm handler from inside its the function that handles it. See also http://www.net-snmp.org/dev/agent/group__snmp__alarm.html. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=312694&aid=3202833&group_id=12694 |