#2411 snmpd crashes/hangs when AgentX subagent times out

Milestone: agentx
Status: open
Owner: nobody
Labels: agent (1104)
Priority: 7
Updated: 2014-09-13
Created: 2012-09-05
Creator: Ken Farnen
Private: No

snmpd is acting as master agent, with an AgentX subagent registered to handle a MIB and processing GETNEXT requests. When the subagent is under heavy load (and so responds slowly), requests start to pile up in the queue, replies from the subagent arrive too late (per the log messages), and eventually the subagent is timed out. When the timeout occurs there is a high probability of either a crash (segfault) or a hang (100% CPU utilisation, tight loop in the snmpd code), depending on the version of snmpd under test. This also happens when the subagent dies unexpectedly with outstanding transactions unserviced.
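
A rough way to picture the pile-up described above is a toy single-server queue. This is only an illustrative model, not snmpd's actual logic: the function name, the fixed arrival interval, and the fixed service time are all assumptions made for the sketch, and the "timeout" here is simply measured from a request's arrival to its reply.

```python
def simulate_backlog(arrival_interval, service_time, timeout, n_requests):
    """Toy FIFO model of a master agent feeding one slow subagent.

    Returns (late_replies, max_pending): how many replies took longer
    than `timeout`, and the deepest backlog seen at any arrival.
    All parameters are hypothetical; times are in seconds.
    """
    finish_times = []
    late = 0
    max_pending = 0
    server_free_at = 0.0
    for i in range(n_requests):
        arrived = i * arrival_interval
        # The subagent handles one request at a time, in order.
        start = max(arrived, server_free_at)
        done = start + service_time
        server_free_at = done
        finish_times.append(done)
        if done - arrived > timeout:
            late += 1
        # Requests (including this one) still unanswered at this arrival.
        pending = sum(1 for f in finish_times if f > arrived)
        max_pending = max(max_pending, pending)
    return late, max_pending


# Fast subagent: nothing is late and the backlog never exceeds one request.
print(simulate_backlog(0.5, 0.1, 1.0, 20))   # -> (0, 1)
# Slow subagent: every reply is late and the backlog keeps growing.
print(simulate_backlog(0.5, 2.0, 1.0, 10))   # -> (10, 8)
```

Once the service time exceeds the arrival interval the backlog grows without bound, which matches the "replies arrive too late" pattern seen in the logs before the subagent is timed out.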

Tested with net-snmp-5.7.1 (segfaults), net-snmp-5.7.1 plus the "subagent-free-cache" patch (basically patch 1633670) (infinite loop), and current trunk (infinite loop).

Our systems are Linux 2.6 based, Montavista CGL V4 and V5 on x86 and x86-64 platforms. glibc 2.3.3.

Attached is a stripped-down test subagent that exercises the bug (by forcing a long delay between servicing the AgentX requests), together with a script that throws traffic at snmpd that will make it crash quite quickly. These assume the default snmpd/AgentX config, with a 1 second timeout, though our testing indicates it will crash eventually with longer timeouts, especially in the situation where a subagent crashes.
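
For anyone wanting to repeat the test with a different timeout, the relevant master-agent settings can be written out explicitly. The sketch below just renders a minimal snmpd.conf fragment; the helper name is made up, and the directive names (`master agentx`, `agentXTimeout`, `agentXRetries`) are as I recall them from snmpd.conf(5), so check them against the man page for your version.

```python
def render_snmpd_conf(timeout_s=1, retries=None):
    """Render a minimal AgentX master config fragment (hypothetical helper).

    timeout_s: AgentX request timeout in seconds (1 is the default
               assumed by the attached test case).
    retries:   optional retry count; omitted from the output if None.
    """
    lines = ["master agentx", f"agentXTimeout {timeout_s}"]
    if retries is not None:
        lines.append(f"agentXRetries {retries}")
    return "\n".join(lines) + "\n"


# Example: a longer timeout, to check whether the crash still occurs.
print(render_snmpd_conf(timeout_s=5, retries=2), end="")
```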

Transactions are based on those we've seen in the field, and are GETNEXT requests for multiple OIDs, all from the MIB provided by the subagent, but with some OIDs numbered such that the response is in the adjacent MIB (i.e. the GETNEXT is walking off the end of the subagent MIB). This kind of transaction appears to exercise the bug very effectively.
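
To illustrate the shape of such a request (this is not the attached script), here is a small sketch that builds a multi-OID `snmpgetnext` command line. The subtree OID, the suffixes, the host, and the community string are all placeholders chosen for the example; substitute whatever your subagent actually registers.

```python
def build_getnext_cmd(subtree, in_tree_suffixes, past_end_suffixes,
                      host="localhost", community="public"):
    """Build an snmpgetnext argv mixing in-tree and walk-off-the-end OIDs.

    subtree:            OID prefix registered by the subagent (placeholder).
    in_tree_suffixes:   suffixes whose GETNEXT answer stays in the subtree.
    past_end_suffixes:  suffixes beyond the last object, so the answer
                        comes from the adjacent MIB (assumed layout).
    """
    oids = [f"{subtree}.{s}" for s in in_tree_suffixes + past_end_suffixes]
    return ["snmpgetnext", "-v", "2c", "-c", community, host, *oids]


# Hypothetical subtree under the net-snmp playpen area.
cmd = build_getnext_cmd(".1.3.6.1.4.1.8072.9999.9999", ["1.0"], ["9999"])
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` in a loop to generate load against the master agent.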

Some more details on the debugging we've done so far are on the net-snmp-coders list.

Also attached is a core dump from 5.7.1 segfault, and a log extract from 5.7.1 looping.

Discussion

  • Ken Farnen
    2012-09-05

    Example subagent code

     
  • Ken Farnen
    2012-09-05

    Script to send queries to demonstrate crash

     
  • Ken Farnen
    2012-09-05

    Log output from 5.7.1 unpatched

     
  • Ken Farnen
    2012-09-05

    Core file from 5.7.1 unpatched

     
  • Ken Farnen
    2012-09-05

    Log file from 5.7.1 (patched) - 25766 repeated lines removed from end...

     
  • Ken Farnen
    2012-09-05

    Log file from 5.8-dev hanging - many duplicate lines from end removed....

     
  • Jiri Cervenka
    2012-10-26

    I have submitted patch 3580458 which fixes looping (and another crash situation) for me. It would be great if you could let me know whether the patch passes your test cases. Thanks.

     
  • Martin East
    2014-07-22

    Thanks for the useful posting, and test harness.

    Here is some info on a fix for Red Hat Enterprise Linux v5.x, which uses net-snmp version 5.3.2.2.

    The bug is captured in Red Hat Bugzilla 1038007 (https://bugzilla.redhat.com/show_bug.cgi?id=1038007). There is also a CVE reference for this: CVE-2012-6151.

    Using the harness, I ran a stress test on net-snmp-*-5.3.2.2-20.el5 under RHEL5.9 (2.6.18-348.3.1.el5), and found it reproduced reliably after ~1 hour of running snmp-crashme.sh and looped agentofdeath subagent runs. The core backtrace was:

    Core was generated by `/usr/sbin/snmpd -LS0-7d -Lf /var/log/snmpd.log -p /var/run/snmpd.pid'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x00002b7b03010ab9 in netsnmp_add_varbind_to_cache () from /usr/lib64/libnetsnmpagent.so.10
    (gdb) bt
    #0  0x00002b7b03010ab9 in netsnmp_add_varbind_to_cache () from /usr/lib64/libnetsnmpagent.so.10
    #1  0x00002b7b0301105c in netsnmp_reassign_requests () from /usr/lib64/libnetsnmpagent.so.10
    #2  0x00002b7b030110e8 in handle_getnext_loop () from /usr/lib64/libnetsnmpagent.so.10
    #3  0x00002b7b03012bb9 in check_delayed_request () from /usr/lib64/libnetsnmpagent.so.10
    #4  0x00002b7b03012c76 in netsnmp_check_outstanding_agent_requests () from /usr/lib64/libnetsnmpagent.so.10
    #5  0x00002b7b03012f27 in netsnmp_remove_delegated_requests_for_session () from /usr/lib64/libnetsnmpagent.so.10
    #6  0x00002b7b030388ce in close_agentx_session () from /usr/lib64/libnetsnmpagent.so.10
    #7  0x00002b7b0301f064 in agentx_got_response () from /usr/lib64/libnetsnmpagent.so.10
    #8  0x00002b7b036acf28 in snmp_sess_timeout () from /usr/lib64/libnetsnmp.so.10
    #9  0x00002b7b036ad088 in snmp_timeout () from /usr/lib64/libnetsnmp.so.10
    #10 0x00002b7b027e50d5 in main ()
    

    Then I upgraded to net-snmp-*-5.3.2.2-22.el5, and ran the same test for 24 hours. No crashes occurred.

    So the net-snmp-*-5.3.2.2-22.el5 set of packages seems to fix this bug, although I have not analysed the source.

    Red Hat RHSA reference: https://rhn.redhat.com/errata/RHSA-2014-0322.html