This is my attempt to extend patch 1633670. This patch is created
against 5.7.2 plus patch 1633670.
Problems addressed:
1) Inifinite loop in delegated requests removal
In my opionion, the for-loop iteration on asp->requests in
netsnmp_remove_delegated_requests_for_session() is incorrect, as it
may end up flagging only a subset of asp's requests (depending on how
many different subtrees (and hence treecaches) this asp's requests
OIDs fall in).
The loop situation occurs when the asp->requests[0] request has
already been answered (and is not part of any treecache), whereas some
other requests on this asp still have the delegated flag on
(reproducible by GetNext requests which contain both a variable
pointing to subtree belonging to the closing session, and another one
pointing to a range preceding that very subtree (but not fulfilled by
any preceding handler).
I think this for-loop should walk all asp->vbcount items on the
asp->requests array instead.
2) Crashes on dangling subtree pointers after AgentX Close PDU
The same crash patterns as those described in patch 1633670 appear
when a session gets closed via an explicit Close PDU from a subagent.
So, I think we need to go through the remove delegated requests loop
also when a valid sessid has been passed to close_agentx_session().
Adding this change caused double frees on delegated caches in
agentx_got_response() ('response is too late') -- apparently caused by
responses to retries, and also by callback from snmp_sess_close(). I
think this can be fixed by returning 1 from agentx_got_response() on
this occasion, thus letting _sess_process_packet() delete the request
from the session pending requests list.
Tested with 5.7.2 w/ patch 1633670, on x86_64 Linux, RHEL 5.5.
I tested the patch also with the subagent and traffic generating
script attached to bug 3565004 -- and it worked for me.
Thanks for the patch!
Having problems applying it to 5.7.2 though, as the patch fails when applied to the released 5.7.2 tarball after applying the final (v3) 1633670 patch.
Failed chunk is:
more master.c.rej
***************
*** 219,228 ****
if (!cache) {
DEBUGMSGTL(("agentx/master", "response too late on session %8p\n",
session));
- /* response is too late, free the cache */
if (magic)
netsnmp_free_delegated_cache((netsnmp_delegated_cache*) magic);
- return 0;
}
requests = cache->requests;
--- 219,231 ----
if (!cache) {
DEBUGMSGTL(("agentx/master", "response too late on session %8p\n",
session));
+ /*
+ * Response is too late, free the cache and return 1
+ * so that the session pending request list item can be deleted
+ */
if (magic)
netsnmp_free_delegated_cache((netsnmp_delegated_cache*) magic);
+ return 1;
}
requests = cache->requests;
Fails because:
isn't present, am I missing a patch to be applied to 5.7.2? Or am I working with the wrong version of the original patch?
Meanwhile, I've manually applied the change and I'm testing with that - has not crashed yet...
final version of patch 1633670
Tha patch is meant to be applied against the final version of the patch 1633670 committed on trunk,
i.e. the 43cd9db4 commit 'CHANGES: PATCH 1633670: fixed snmpd crashing when an AgentX subagent disconnect in the middle of processing of a request.'.
The final version contains two additional delegated cache memory leaks fixes.
I have added the final version to this tracker artifact as 'patch-1633670-commit.patch'.
Sorry for confusion.
Thanks for that, I'll re-test with the updated 1633670 and 5.7.2 stock - testing with my hacked together version stayed up for about 2 hours, but then started hanging up for long periods and chewing up memory.
Much better than anything that we achieved, but not quite there yet - I'll tell you how testing went after it has had an overnight soak.
OK, with the right version of the 1633670 patch, that works a lot better!
Passes my tests (flat out traffic generation into both our production application and the test subagent at the same time) and also a looped SNMP-SET test (which was the original problem that got us started down the 1633670 patch road).
With the inevitable large queues of pending transactions generated by these tests, the SET test does cause the snmpd to pause a bit, but it never falls over, and always comes back and starts responding, with SETs still operational. It also seems to pass valgrind pretty much.
We're going to deploy the patched 5.7.2 into automated test, and create a formal patch of our system for further testing next week.
Overall though, this looks like a winner.
Thanks very much, you've succeeded where we had failed miserably!
Ken.
draft of an alternative patch, against plain 5.7.2
While running some additional tests I have hit another inifinite loop
around delegated requests removal (this one triggered by requests
containing OIDs delegated to multiple different subagents).
This got me thinking whether the iterate-next-loop approach taken by
patch 1633670 is really necessary (it seems to me that -- apart from
the infinite loop risk -- there is also the effect of pausing new
request processing by favouring next loop iteration for a subset of
requests).
I am currently testing an alternative approach (uploaded as
alt-cancel-next-walk.patch). The basic idea is to a) mark the whole
asp explicitly as canceled (we cannot rely on the 'delegated' flag
alone), and b) break out of the next walk when cancel is in progress
(the asp is destined to fail with genError anyway).
While this patch has so far passed my crash/leak tests, I consider it
to be of preliminary status (I am aware of several open questions,
e.g. I don't like the idea of adding a new variable to
netsnmp_agent_session structure).
The patch is created against plain 5.7.2.
Any feedback gladly welcome.
alt patch draft v2, against plain 5.7.2
Uploaded alt-cancel-next-walk-v2.patch.
We don't want to start handling queued requests (from
netsnmp_agent_queued_list) when on the close session path. Decoupled
the delegated list check into a new function.
The patch looks fine to me, or at least to my valgrind :), but adding new item to struct netsnmp_agent_session_s is IMHO ABI breaker. It can be submitted to git master, but not to stable branches.
There must be either separate list of cancelled sessions or other existing struct netsnmp_agent_session_s item must be misused to mark the cancelled ones.
Jiri, you mention some 'additional tests' and 'my crash/leak tests', could you please elaborate on what these tests do and/or even share the tests here?
Attaching tools-p1237.tar.gz -- tools I have used to reproduce/test the problem.
$ cat README
./subagent/suba
Subagent derived from 'agentofdeath.tar' attached to bug#2411.
( agentx socket: "/tmp/snmp-test/var/agentx/master";
./runsub.sh pause-len sub-oid-1 sub-oid-2 )
./subagent/subwgt
Subagent with SetReq support derived from OpenHPI subagent tutorial
@ http://openhpi.sourceforge.net/subagent-manual/x649.html .
( agentx socket: "/tmp/snmp-test/var/agentx/master";
./runsub.sh pause-len)
./client/walk
GetNext, GetBulk, Get traffic mix generators, see e.g. crash1-n-v2c-bulkmix.sh.
./client/set
SetReq traffic mix example.
./misc
Some GDB macros printing relevant lists and structures (e.g. "padl"
(agent_delegated_list), "ptp" (netsnmp_subtree), "preq" (netsnmp_request_info),
"pasp" (netsnmp_agent_session), "ps" (netsnmp_session), "ppdu" (netsnmp_pdu) etc.)
An easy way to reproduce the original problem is to start an instance of the 'pausing'
suba agent (./runsub.sh 2 4 1) and generate some GetNext traffic (e.g. crash1-n-v2c-bulkmix.sh).
Adding more subagents responsible for adjacent subtrees increase the likelihood
of hitting the problem (e.g. 2 suba's with the same generator reproduce the snmpd loop
in patch 1633670).
My basic setup for soak tests was to run snmpd with 2-3 instances of suba's attached
to adjacent subtrees, plus an instance of subwgt; and generate walk and set traffic
(with some additional script repeatedly shutting down subagents (to test agentx
initiated Close)). I repeated the tests with snmpd being run under valgrind
(valgrind --tool=memcheck --leak-check=full).
(alt-cancel-next-walk-v2.patch against 5.7.2, RHEL 5.5, x86_64).
Thanks a lot for the scripts. I pushed the patch into git master, commit 793d596838ff7cb48a73b675d62897c56c9e62df, as it breaks ABI.
Did the patch to master.c get missed in 793d596838ff7cb48a73b675d62897c56c9e62df?
Master.c changes are already there, IMO.
Hi, Jan:
I can't find this fixing code in 5.7.3.
Does 5.7.3 release version have this vulnerability?
Thanks!
Last edit: Junling Zheng 2015-10-30
The patch was applied to master only, as it breaks API a bit.
I am trying to run agentx in 5.7.3 and having no luck - first I get the crash - after I apply the patch I get a result of generic error to any subagent get request.
Any ideas what I need to do to get agentx to work?
also in my setup the subagent does not fail to respond - as I have verified that the handler is being called and the response sent.
Last edit: Joan Landry 2017-02-08