#2285 net-snmp-5.7.1.pre2 does not respect session timeout

Status: open
Owner: nobody
Labels: library (262)
Priority: 9
Updated: 2012-11-08
Created: 2011-09-15
Private: No

I'm using net-snmp-5.7.1.pre2, which I built and installed manually on CentOS 4.8 Final (32-bit). We had been struggling with easily reproducible lockups in older versions, which hung in recvmsg, so I moved to this latest version, which is supposed to fix that. Now it hangs just as easily somewhere else. Here's the stack trace at the time our client freezes:

#0 0x00bfa7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00cd89f1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2 0x0027e01d in snmp_synch_response_cb (ss=0x8a29798, pdu=0x8908f08, response=0x0, pcb=0x27d650 <snmp_synch_input>)
at snmp_client.c:1060
#3 0x0027e132 in snmp_synch_response (ss=0x8a29798, pdu=0x8908f08, response=0xb7faee58) at snmp_client.c:1108
#4 0x0806c234 in SnmpTools::SnmpGet ()
#5 0x0806c792 in SnmpTools::SnmpGet ()
#6 0x0806c887 in snmpget ()
#7 0x080583c2 in proxyStr::get_request ()
#8 0x001bb6bd in Agentpp::Mib::process_request (this=0x8614848, req=0x87d6070, reqind=0) at mib.cpp:3286
#9 0x001bd890 in Agentpp::Mib::do_process_request (this=0x8614848, req=0x87d6070) at mib.cpp:3546
#10 0x001d0bcd in Agentpp::MibTask::run (this=0x89d3a08) at threads.cpp:1032
#11 0x001cf9ff in Agentpp::TaskManager::run (this=0x8672cd0) at threads.cpp:854
#12 0x001ced8e in Agentpp::thread_starter (t=0x8672d1c) at threads.cpp:563
#13 0x00d765cc in start_thread () from /lib/tls/libpthread.so.0
#14 0x00cdffae in clone () from /lib/tls/libc.so.6

This thread holds a lock, which is blocking another thread, which holds a lock, which is blocking another thread, etc. The result is a deadlock.

At the time we get into this state, we're also seeing invalid stack frames for some other threads. Instead of the stack bottom being clone(), as it normally is, when the deadlock occurs, we see things like this:

#0 0x00bfa7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00d7b5de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00d7820b in _L_mutex_lock_35 () from /lib/tls/libpthread.so.0
#3 0xb6bbc038 in ?? ()
#4 0x00257494 in ?? () from /usr/lib/libagent++.so
#5 0x00000000 in ?? ()

It takes us about 10 or 15 minutes to reproduce a deadlock.

Discussion

  • Bart Van Assche

    Bart Van Assche - 2011-09-16

    snmpd is single-threaded, so how could there be a deadlock between different threads ?

     
  • Timothy Miller

    Timothy Miller - 2011-09-16

    This isn't snmpd. This is libnetsnmp.so.30. My application is deadlocking because libnetsnmp is blocking here, which is very similar to the older blocking-in-recvmsg bug that people complained about in the past.

     
  • Bart Van Assche

    Bart Van Assche - 2011-09-16

    Sorry, but I don't see what's wrong with a thread that issues a blocking call in a multithreaded application. That shouldn't cause any problem for your application. And of course, you'll have to follow the usual rules that apply to multithreaded applications, e.g. not triggering lock inversion.

     
  • Timothy Miller

    Timothy Miller - 2011-09-19

    I've continued to investigate this problem, and it appears that the problem here is that snmp_synch_response blocks forever on the select() despite the fact that we have set a timeout of 50000.

     
  • Timothy Miller

    Timothy Miller - 2011-09-19

    Yup. That's the problem. Timeouts are not being handled properly.

    We have done this:

    snmp_sess_init( pStSession );                   /* set up defaults */
    ...
    pStSession->timeout = 50000; // us
    pStSession->retries = 1;

    So, the first timeout should be 50 milliseconds. I can't find where you do this, but I saw a comment about an exponential backoff. I don't know whether that applies here, but even assuming the backoff goes in powers of 10, we should be seeing timeouts in the half-second range.

    However, stopping in the debugger on that select(), this is the timeout it's using:

    print *tvp
    $3 = {tv_sec = 2, tv_usec = 874000}

    That's definitely not right. So, while our app may or may not technically hang there, it definitely appears to do so because of a timeout miscalculation.
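
To make the mismatch concrete, here is a minimal sketch in C of the arithmetic, assuming (as the `// us` comment above states) that the session timeout is expressed in microseconds. The helper names `us_to_timeval` and `timeval_to_us` are hypothetical and are not part of net-snmp.

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper: convert a timeout given in microseconds into
 * the struct timeval form that select() takes. */
static struct timeval us_to_timeval(long usec)
{
    struct timeval tv;

    assert(usec >= 0);
    tv.tv_sec  = usec / 1000000L;   /* whole seconds */
    tv.tv_usec = usec % 1000000L;   /* remaining microseconds */
    return tv;
}

/* Hypothetical helper: flatten a struct timeval back to microseconds
 * so that two timeouts can be compared directly. Only intended for
 * small values such as the ones discussed in this thread. */
static long timeval_to_us(struct timeval tv)
{
    return (long)tv.tv_sec * 1000000L + (long)tv.tv_usec;
}
```

With these helpers, a 50000 us request corresponds to {tv_sec = 0, tv_usec = 50000}, whereas the observed {tv_sec = 2, tv_usec = 874000} is 2,874,000 us, roughly 57 times longer than requested.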

     
  • Timothy Miller

    Timothy Miller - 2011-09-20

    I made a dirty hack down deep in snmp_synch_response_cb to force it to give me the timeout I had asked for, and now my program is working fine. IMHO, I've found a legitimate bug.
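
The hack itself is not shown in this thread. Purely as an illustration, one way such an override might look is to clamp the timeval just before it is handed to select(). The function name `clamp_timeval` is hypothetical, and this sketch makes no claim about the actual change or about how snmp_client.c is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: make sure *tvp never exceeds max_us microseconds.
 * Assumes *tvp is normalized (0 <= tv_usec < 1000000).
 * Returns 1 if the value was reduced, 0 if it was left unchanged. */
static int clamp_timeval(struct timeval *tvp, long max_us)
{
    long max_sec, max_usec;

    assert(tvp != NULL && max_us >= 0);
    max_sec  = max_us / 1000000L;
    max_usec = max_us % 1000000L;

    /* Compare field by field to avoid overflowing a 32-bit long. */
    if (tvp->tv_sec < max_sec ||
        (tvp->tv_sec == max_sec && tvp->tv_usec <= max_usec))
        return 0;

    tvp->tv_sec  = max_sec;
    tvp->tv_usec = max_usec;
    return 1;
}
```

Applied to the value seen in the debugger, {tv_sec = 2, tv_usec = 874000} with a 50000 us cap becomes {tv_sec = 0, tv_usec = 50000}. A timeout that is already shorter than the cap is left alone.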

     
  • Magnus Fromreide

    I do not think you will like this but...

    The problem is that you are misusing an interface.
    The snmp_synch_* functions exist as a replacement for a main loop in simple applications that only issue a request and then wait for the return value. The "synch" stands for synchronized, but in the sense of "only this call will happen", not in a multithreading (MT) sense.

    Your application apparently is not that simple and so you have to use a full event loop (to handle this you should use one of the snmp_sess_select_info functions).
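
As a rough illustration of the event-loop idea, the sketch below shows only the generic POSIX part: waiting on a descriptor with a timeout that the caller controls. It deliberately uses no net-snmp calls. `wait_readable` is a hypothetical helper, and in a real client the descriptor set and timeout would presumably come from the library's select-info function rather than being passed in by hand.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_us
 * microseconds have passed.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, long timeout_us)
{
    fd_set readfds;
    struct timeval tv;
    int rc;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_us >= 0);

    do {
        /* select() may modify both arguments, so rebuild them on
         * every attempt. For simplicity, an interrupted wait is
         * restarted with the full timeout. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec  = timeout_us / 1000000L;
        tv.tv_usec = timeout_us % 1000000L;
        rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```

Because the caller owns the timeout, the wait is bounded by the value the application chose. After a return of 0 the loop can run its own timeout handling, and after a return of 1 it can hand the descriptor to whatever reads the response.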

     
  • Timothy Miller

    Timothy Miller - 2011-09-21

    I've tried spawning a separate worker thread for each session. According to documentation I found, libnetsnmp is not thread-safe in general, but it is for independent sessions. However, I'm getting crashes, so I'm not sure this is true.

     
  • Timothy Miller

    Timothy Miller - 2011-09-21

    Scratch that last comment. The thread-safe API isn't complete, so I was using some of the traditional calls. I had to make a hack in my code that would give me access to the active session pointer from the opaque (session list) pointer. This is obviously not a good idea, because it could break with future versions. But there's no way around it. I filed a separate bug report.

     
