snmpd is acting as the master agent, with an AgentX subagent registered to handle a MIB and processing GETNEXT requests. When the subagent is under heavy load (and so responds slowly), requests start to pile up in the queue, replies from the subagent arrive too late (per the log messages), and eventually the subagent is timed out. When the timeout occurs there is a high probability of either a crash (segfault) or a hang (100% CPU utilisation, tight loop in the snmpd code), depending on the version of snmpd under test. The same thing happens when the subagent dies unexpectedly with outstanding transactions unserviced.
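To illustrate why a slow subagent ends up timed out, here is a toy single-queue model. It is only a sketch and not net-snmp code: the function name and the millisecond figures are made up for illustration, and it assumes requests arrive at a fixed interval and are serviced strictly one at a time.

```python
def first_timed_out_request(arrival_ms, service_ms, timeout_ms, max_requests=1000):
    """Return the index of the first request whose reply arrives after the
    timeout, or None if none of the first max_requests requests is late.

    Toy model: request i arrives at i * arrival_ms and is serviced in order.
    """
    busy_until = 0  # time at which the (hypothetical) subagent is next free
    for i in range(max_requests):
        arrived = i * arrival_ms
        started = max(arrived, busy_until)   # wait behind queued requests
        busy_until = started + service_ms    # time the reply is sent
        if busy_until - arrived > timeout_ms:
            return i
    return None
```

With requests arriving every 200 ms, 500 ms of service time per request and a 1000 ms timeout, the third request (index 2) is already late, because each request waits behind the ones queued before it.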
Tested with net-snmp-5.7.1 (segfaults), net-snmp-5.7.1 plus the "subagent-free-cache" patch (basically patch 1633670) (infinite loop), and current trunk (infinite loop).
Our systems are Linux 2.6 based: MontaVista CGL V4 and V5 on x86 and x86-64 platforms, with glibc 2.3.3.
Attached is a stripped-down test subagent that exercises the bug (by forcing a long delay between servicing the AgentX requests), together with a script that throws traffic at snmpd and makes it crash quite quickly. Both assume the default snmpd/AgentX config, with a 1 second timeout, though our testing indicates it will crash eventually with longer timeouts too, especially when a subagent crashes.
The transactions are based on those we've seen in the field: GETNEXT requests for multiple OIDs, all from the MIB provided by the subagent, but with some OIDs numbered such that the response is in the adjacent MIB (i.e. the GETNEXT is walking off the end of the subagent MIB). This kind of transaction appears to exercise the bug very effectively.
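For anyone without the attachment to hand, a request of this shape can be sketched roughly as follows. This is not the attached script: the subtree OID, the sub-identifiers, the host and the community string are all placeholders, and it assumes the net-snmp `snmpgetnext` command-line tool is on the PATH.

```python
import subprocess

# Hypothetical subtree registered by the test subagent (placeholder OID).
SUBTREE = "1.3.6.1.4.1.99999.1"


def make_oids(subtree, in_range=3, past_end=2):
    """Mix OIDs inside the subtree with ones numbered past its last object."""
    oids = [f"{subtree}.1.{i}" for i in range(1, in_range + 1)]
    # Assumption: the subagent has no objects this high, so GETNEXT on these
    # walks off the end of the subagent MIB into the adjacent one.
    oids += [f"{subtree}.9999.{i}" for i in range(1, past_end + 1)]
    return oids


def build_getnext_command(host, community, oids, timeout_s=1, retries=0):
    """Build one multi-OID snmpgetnext invocation as an argument list."""
    return ["snmpgetnext", "-v", "2c", "-c", community,
            "-t", str(timeout_s), "-r", str(retries), host, *oids]


def run_load(count, host="localhost", community="public"):
    """Send `count` requests back to back, ignoring output and errors."""
    cmd = build_getnext_command(host, community, make_oids(SUBTREE))
    for _ in range(count):
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
```

Calling something like `run_load(1000)` from several processes in parallel should approximate the kind of load described, though the real OIDs and rates need to match the subagent under test.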
Some more details on the debugging we've done so far are on the net-snmp-coders list.
Also attached are a core dump from the 5.7.1 segfault and a log extract from 5.7.1 looping.