net-snmp / Patches / #1237 snmpd crashes/hangs when subagent disconnects during GetNext

Jiri Cervenka - 2012-10-26

walk-array-and-remove-on-close-pdu.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ken Farnen - 2012-10-29

Thanks for the patch!

Having problems applying it to 5.7.2 though, as the patch fails when applied to the released 5.7.2 tarball after applying the final (v3) 1633670 patch.

Failed chunk is:

more master.c.rej
***************
*** 219,228 ****
if (!cache) {
DEBUGMSGTL(("agentx/master", "response too late on session %8p\n",
session));
- /* response is too late, free the cache */
if (magic)
netsnmp_free_delegated_cache((netsnmp_delegated_cache*) magic);
- return 0;
}
requests = cache->requests;

--- 219,231 ----
if (!cache) {
DEBUGMSGTL(("agentx/master", "response too late on session %8p\n",
session));
+ /*
+ * Response is too late, free the cache and return 1
+ * so that the session pending request list item can be deleted
+ */
if (magic)
netsnmp_free_delegated_cache((netsnmp_delegated_cache*) magic);
+ return 1;
}
requests = cache->requests;

Fails because:

if $magic$ netsnmp\_free\_delegated\_cache$\(netsnmp\_delegated\_cache\*$ magic\);

isn't present, am I missing a patch to be applied to 5.7.2? Or am I working with the wrong version of the original patch?

Meanwhile, I've manually applied the change and I'm testing with that - has not crashed yet...
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2012-10-29

final version of patch 1633670

patch-1633670-commit.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2012-10-29

Tha patch is meant to be applied against the final version of the patch 1633670 committed on trunk,
i.e. the 43cd9db4 commit 'CHANGES: PATCH 1633670: fixed snmpd crashing when an AgentX subagent disconnect in the middle of processing of a request.'.

The final version contains two additional delegated cache memory leaks fixes.

I have added the final version to this tracker artifact as 'patch-1633670-commit.patch'.

Sorry for confusion.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ken Farnen - 2012-10-29

Thanks for that, I'll re-test with the updated 1633670 and 5.7.2 stock - testing with my hacked together version stayed up for about 2 hours, but then started hanging up for long periods and chewing up memory.

Much better than anything that we achieved, but not quite there yet - I'll tell you how testing went after it has had an overnight soak.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ken Farnen - 2012-11-01

OK, with the right version of the 1633670 patch, that works a lot better!

Passes my tests (flat out traffic generation into both our production application and the test subagent at the same time) and also a looped SNMP-SET test (which was the original problem that got us started down the 1633670 patch road).

With the inevitable large queues of pending transactions generated by these tests, the SET test does cause the snmpd to pause a bit, but it never falls over, and always comes back and starts responding, with SETs still operational. It also seems to pass valgrind pretty much.

We're going to deploy the patched 5.7.2 into automated test, and create a formal patch of our system for further testing next week.

Overall though, this looks like a winner.

Thanks very much, you've succeeded where we had failed miserably!

Ken.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2012-11-06

draft of an alternative patch, against plain 5.7.2

alt-cancel-next-walk.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2012-11-06

While running some additional tests I have hit another inifinite loop
around delegated requests removal (this one triggered by requests
containing OIDs delegated to multiple different subagents).

This got me thinking whether the iterate-next-loop approach taken by
patch 1633670 is really necessary (it seems to me that -- apart from
the infinite loop risk -- there is also the effect of pausing new
request processing by favouring next loop iteration for a subset of
requests).

I am currently testing an alternative approach (uploaded as
alt-cancel-next-walk.patch). The basic idea is to a) mark the whole
asp explicitly as canceled (we cannot rely on the 'delegated' flag
alone), and b) break out of the next walk when cancel is in progress
(the asp is destined to fail with genError anyway).

While this patch has so far passed my crash/leak tests, I consider it
to be of preliminary status (I am aware of several open questions,
e.g. I don't like the idea of adding a new variable to
netsnmp_agent_session structure).

The patch is created against plain 5.7.2.

Any feedback gladly welcome.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2012-11-07

alt patch draft v2, against plain 5.7.2

alt-cancel-next-walk-v2.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2012-11-07

Uploaded alt-cancel-next-walk-v2.patch.

We don't want to start handling queued requests (from
netsnmp_agent_queued_list) when on the close session path. Decoupled
the delegated list check into a new function.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jan Safranek - 2013-04-23

The patch looks fine to me, or at least to my valgrind :), but adding new item to struct netsnmp_agent_session_s is IMHO ABI breaker. It can be submitted to git master, but not to stable branches.

There must be either separate list of cancelled sessions or other existing struct netsnmp_agent_session_s item must be misused to mark the cancelled ones.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jan Safranek - 2013-06-25

Jiri, you mention some 'additional tests' and 'my crash/leak tests', could you please elaborate on what these tests do and/or even share the tests here?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jiri Cervenka - 2013-06-27

Attaching tools-p1237.tar.gz -- tools I have used to reproduce/test the problem.

$ cat README
./subagent/suba
Subagent derived from 'agentofdeath.tar' attached to bug#2411.
( agentx socket: "/tmp/snmp-test/var/agentx/master";
./runsub.sh pause-len sub-oid-1 sub-oid-2 )

./subagent/subwgt
Subagent with SetReq support derived from OpenHPI subagent tutorial
@ http://openhpi.sourceforge.net/subagent-manual/x649.html .
( agentx socket: "/tmp/snmp-test/var/agentx/master";
./runsub.sh pause-len)

./client/walk
GetNext, GetBulk, Get traffic mix generators, see e.g. crash1-n-v2c-bulkmix.sh.

./client/set
SetReq traffic mix example.

./misc
Some GDB macros printing relevant lists and structures (e.g. "padl"
(agent_delegated_list), "ptp" (netsnmp_subtree), "preq" (netsnmp_request_info),
"pasp" (netsnmp_agent_session), "ps" (netsnmp_session), "ppdu" (netsnmp_pdu) etc.)

An easy way to reproduce the original problem is to start an instance of the 'pausing'
suba agent (./runsub.sh 2 4 1) and generate some GetNext traffic (e.g. crash1-n-v2c-bulkmix.sh).
Adding more subagents responsible for adjacent subtrees increase the likelihood
of hitting the problem (e.g. 2 suba's with the same generator reproduce the snmpd loop
in patch 1633670).

My basic setup for soak tests was to run snmpd with 2-3 instances of suba's attached
to adjacent subtrees, plus an instance of subwgt; and generate walk and set traffic
(with some additional script repeatedly shutting down subagents (to test agentx
initiated Close)). I repeated the tests with snmpd being run under valgrind
(valgrind --tool=memcheck --leak-check=full).

(alt-cancel-next-walk-v2.patch against 5.7.2, RHEL 5.5, x86_64).

tools-p1237.tar.gz

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jan Safranek - 2013-07-02

Thanks a lot for the scripts. I pushed the patch into git master, commit 793d596838ff7cb48a73b675d62897c56c9e62df, as it breaks ABI.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bill Fenner - 2013-07-02

Did the patch to master.c get missed in 793d596838ff7cb48a73b675d62897c56c9e62df?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jan Safranek - 2013-07-03

Master.c changes are already there, IMO.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Junling Zheng - 2015-10-30

Hi, Jan:
I can't find this fixing code in 5.7.3.
Does 5.7.3 release version have this vulnerability?
Thanks!

Last edit: Junling Zheng 2015-10-30

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jan Safranek - 2015-10-30

The patch was applied to master only, as it breaks API a bit.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Joan Landry - 2017-02-08

I am trying to run agentx in 5.7.3 and having no luck - first I get the crash - after I apply the patch I get a result of generic error to any subagent get request.

Any ideas what I need to do to get agentx to work?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Joan Landry - 2017-02-08

also in my setup the subagent does not fail to respond - as I have verified that the handler is being called and the response sent.

Last edit: Joan Landry 2017-02-08

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

snmpd crashes/hangs when subagent disconnects during GetNext

Group

Searches

Help

#1237 snmpd crashes/hangs when subagent disconnects during GetNext

Discussion