Menu

#1126 Fix: Race between snmpd::receive() and snmp_sess_close()

open
nobody
None
4
2012-11-09
2011-03-08
Santanu
No

This patch fixes a race between snmpd::receive() and snmp_sess_close() in a multi-threaded application. Consider the following scenario for example. Assume snmpd::receive is running on thread 'T1' and a second thread, 'T2', is trying to close a session 's1' with fd 'f1'. The following sequence of events take place in order of time.

1. T1: calls snmp_select_info2(). 'f1' gets added to the read fd-set for select.
2. T2: calls snmp_sess_close() for 's1'. 'f1' is closed.
3. T1: calls select(). select returns -1 with 'errno' set to EBADF as 'f1' is already closed. snmpd is shut down.

One may propose to ignore select failures with EBADF errors. But, consider the following, more complicated, scenario. This, along with the above two threads, involves a third thread T3 trying to open a socket/file.

1. T1: calls snmp_select_info2(). 'f1' is added to the read fd-set for select.
2. T2: calls snmp_sess_close() for 's1'. 'f1' is closed.
3. T3: opens a new socket/file and gets back the fd 'f1'.
4. T1: does select() and recvfrom() on an unrelated fd 'f1'.

The root of the problem is in calling snmp_sess_close() from a different thread than the thread running the snmpd::run code (agent thread, say). This patch fixes the issue by getting 'snmp_sess_close()' called in the context of the agent thread. The agent thread maintains a queue of jobs posted by other threads. A job is represented by a combination of a function pointer and an opaque data pointer to be passed as an argument to the function. A FIFO is maintained for notifying the agent thread about new job requests. The agent thread adds this FIFO fd to its select read fd-set to receive notifications. A thread calling snmp_sess_close() now actually posts a job request to close the snmp session and notifies the agent thread.

The fix has been tested with ucd-snmp-4.2/net-snmp-5.6.1 on Linux platforms running on ppc32 and i586 processors. The patch provided is applicable to net-snmp-5.6.1.

Discussion

  • Santanu

    Santanu - 2011-03-08

    Patch: Fixes receive() and snmp_sess_close() race

     
  • Bart Van Assche

    Bart Van Assche - 2011-03-08

    I'm not sure the Net-SNMP source code has to be modified to fix that race. All you have to do is to register a single-shot alarm handler with timeout set to zero seconds. You can even unregister that alarm handler from inside its the function that handles it. See also http://www.net-snmp.org/dev/agent/group__snmp__alarm.html.

     
  • Santanu

    Santanu - 2011-03-09

    In case the agent thread is already blocked in select with infinite timeout, how can it be woken up and informed about the new alarm handler registered by some other thread?

    May be, we can leave the snmp_sess_close() function as it is and make use of the notification mechanism provided in this patch from the app code.

     
  • Bart Van Assche

    Bart Van Assche - 2011-03-09

    Have you already considered to set sess->transport->sock to -1 (see also the "This session was marked for deletion." comment in snmp_api.c) ?

     
  • Santanu

    Santanu - 2011-03-10

    Yes. But the problem again is how to notify the agent thread if it is already blocked in select with infinite timeout and does not wake up soon (worse, does not wake up at all). If too many sessions are opened and are not cleaned up soon, we may end up having too many open file descriptors or running out of memory.

    Is there any mechanism (apart from delivering a signal, because we are using the signals for some other purpose) to unblock the agent thread blocked in select?

     
  • Bart Van Assche

    Bart Van Assche - 2011-03-10

    How about passing 0 instead of 1 to agent_check_and_process() such that it doesn't block and to add the necessary code to avoid busy-waiting too ?

     
  • Robert Story

    Robert Story - 2011-03-12

    Some thoughts on this patch.

    - multi-threaded agents are not supported. thus, at best, this might could go into trunk.

    - why are you eliminating the return code from the new _snmp_sess_close()?

    - this job-request dohickey looks/feels very similar to the callback transport. maybe use (possilby enhance/tweak) that instead of something completely new?

    - probably should use the fd event manager (eg register_readfd()) instead of a local var in snmpd

     
  • Santanu

    Santanu - 2011-03-16

    _snmp_session_close() is requested from one thread and is called from another thread. Hence the requesting thread does not have a direct access to the return value of _snmp_session_close(). If it has to be communicated back to the requesting thread, we need to employ some kind of condition_variable to get the requesting thread blocked for the results. Anyways, currently the snmp_session_close() fails only if the session pointer is NULL. This check is performed in snmp_session_close() before requesting _snmp_session_close(). Hence, the return code is omitted.

    The callback domain code seems to be too specific to parsing packets. I did not want to touch it so that it remains compatible to other transport domain modules and does not do anything outside its responsibilities.

    Should we call register_readfd() from the receive() function itself? Doing it from other threads will still have the problems mentioned in my earlier comments.

     

Log in to post a comment.