From: <ken...@no...> - 2012-08-28 10:36:02
Hi All,

I'm currently trying to chase down a nasty bug in Net-SNMP for my current client, and I've pretty much hit the brick wall of my own understanding of the way things are supposed to work, so I'm hoping things may make a little more sense to those who know the code better than I do.

The Scenario:

We've got an application that registers as an AgentX subagent in order to answer queries for a private MIB related to the application's state. Platform is MontaVista Linux on x86 (specifically, glibc 2.3.3, kernel 2.6.10-x86 and 2.6.21-x86-64).

We've been experiencing random crashes in the field for some time now, which seemed to be load related, and after much tracing and head-scratching, we've found the culprit to be snmpd. Specifically, the problem appears to be that under load on our app, the AgentX queries sometimes time out (the application prioritises its primary function over SNMP, so sometimes AgentX queries get queued up a bit), and the situation where snmpd disconnects the session due to time-out is not handled well. Worse, shutting down our app is very likely to kill snmpd if there are requests outstanding at the point of shutdown (quite possible if the request load is high).

I've built a test environment that can exercise this bug, so I've been able to do some investigation:

5.6.1 and 5.7.1 "stock" builds dump core (segfault) when the AgentX connection times out or disconnects.

We've tried the "subagent_free_cache" patch (which is the same as the patch in 1633670) on both 5.6.1 and 5.7.1, and this results in an infinite loop in the following code in "agent/mibgroup/agentx/master_admin.c", function "close_agentx_session()":

    if (session->subsession != NULL) {
        netsnmp_session *subsession = session->subsession;
        for(; subsession; subsession = subsession->next) {
            while (netsnmp_remove_delegated_requests_for_session(subsession)) {
                DEBUGMSGTL(("agentx/master", "Continue removing delegated subsession reqests\n"));

It loops forever on the while, with the return value never decreasing (log message and spelling mistake repeated ad infinitum, 100% CPU load for snmpd).

I've also tried the current trunk version, which has the 1633670 patch already applied, and get the same behaviour.

After lots of additional debugging, the culprit behaviour appears to be that "netsnmp_remove_delegated_requests_for_session()" removes (or, more correctly, uses "netsnmp_request_set_error()" on) everything in the agent_delegated_list that matches the target session, then calls "netsnmp_check_outstanding_agent_requests()", which walks the agent_delegated_list and de-queues anything that passes "netsnmp_check_for_delegated()". However, there appear to be requests in the subsession list that don't match, and thus are still marked as delegated, and thus don't pass check_for_delegated and..... Repeat until bored......

I've tried making (and using) a more aggressive flavour of "netsnmp_remove_delegated_requests_for_session()" that doesn't have the:

    if(request->subtree->session != sess)
        continue;

test, but that doesn't fix it! Note that "..check_for_delegated()" checks in asp->treecache, but "..remove_delegated_requests.." removes the requests from [agent_delegated_list]->requests, and it appears that in our case the two don't quite meet up...

I've tried writing an even more aggressive version of "netsnmp_remove_delegated_requests_for_session()" that eats every delegated request in the treecache, which, to be fair, stops the infinite loop above, but just causes snmpd to go catatonic elsewhere...

...and that's where my understanding of these inter-related data structures stops, I'm afraid! I'm sort of hoping that those who live, eat and breathe this code will have some suggestions.

Other info that may help:

My test SNMP query set is a set of SNMP GETs and GETNEXTs taken from a customer network capture. They all hit the MIB that is delegated to our AgentX subagent; however, some of the GETNEXTs walk off the end of our MIB and into the next enterprise along (which happens to be the NET-SNMP MIB, in our particular case).

Ken Farnen.

Agilent don't authorise me to order paperclips, much less speak on their behalf. I'm just a freelance consultant who happens to sit at one of their desks at the moment; anything I say is my opinion only, and nothing to do with my Client!