#1238 Disman Monitoring over Net-SNMP Proxy fails

Milestone: disman
Status: open
Owner: nobody
Labels: agent (1105)
Priority: 7
Updated: 2012-11-08
Created: 2004-11-01
Creator: icy006
Private: No

Problem: Cannot use the Net-SNMP Disman/Event
implementation to monitor OIDs that are delegated
through a Net-SNMP proxy.
- - - - -
Software Version: Net-SNMP v5.1.2
Operating System: (uname -a)
Gentoo Linux. Kernel: v2.6.7-ck5
Linux 2.6.7-ck5 #2 Mon Aug 9 16:39:52 EDT 2004 i686
Intel(R) Pentium(R) 4 CPU 2.40GHz Genuine Intel
GNU/Linux
Compiler: (gcc -v)
gcc version 3.3.4 20040623 (Gentoo Linux 3.3.4-r1,
ssp-3.3.2-2, pie-8.7.6)
Configure:
./configure --with-cflags="-g" \
    --with-mib-modules="ucd-snmp/dlmod" \
    --with-mib-modules="disman/event-mib" \
    --with-default-snmp-version="2" \
    --with-sys-location="unknown" \
    --with-logfile="none" \
    --with-persistent-directory="/opt/net-snmp/cfg"
- - - - -

A better explanation of this problem requires some
background. I am trying to set up Net-SNMP to act as a
master agent in a system with any number of other SNMP
agents under it. For the sake of example, let's say
that there is a single CISCO router in the system with
its own SNMP daemon.

The outside world only sees Net-SNMP. Internally,
Net-SNMP is configured with a command to proxy any
CISCO requests through to the router. The configuration
looks like this:

proxy -v 2c -c public <router ipaddr:port> .1.3.6.1.4.1.9

Now, any CISCO requests (aimed at our router) sent to
Net-SNMP from an outside source, or via command-line
tools locally, work fine. But now, I want to have
Net-SNMP do a little monitoring of the router, such
that certain values are checked for failure now and
then. To do this requires DISMAN/EVENT.

The Net-SNMP implementation of Disman allows monitoring
of any OID that Net-SNMP knows about. With the proxy
configured for the CISCO router, Net-SNMP should be
able to query router information. Say that the OID
1.3.6.1.4.1.9.1.1 points to the INTEGER routerStatus.
If the value 1 means UP (router is working) and the
value 2 means DOWN (router is broken), here's my
monitor configuration:

monitor -r 5 "router is UP!" 1.3.6.1.4.1.9.1.1 == 1
monitor -r 5 "router is DOWN!" 1.3.6.1.4.1.9.1.1 == 2

I should note that Disman monitoring works fine if used
on something non-proxied, such as values in MIB-II. In
the described case, however, things go wrong.

How it goes wrong:
- - - - -
Since I am working on a test install, snmpd is started
with the following line:
./snmpd -f -V -C -c <path-to-snmpd.conf> \
    -M <path-to-my-mib-dir>

Everything starts as normal. Once five seconds pass,
the first monitor timer goes off, and Net-SNMP checks
the routerStatus value.

Received SNMP packet(s) from callback: 1 on fd 3
GETNEXT message
-- SOME-CISCO-MIB::routerStatus

That's all she wrote. No traps are fired. Here's a
little extra pertinent debug info if we add a -Dresults
to snmpd:

results: request results (status = 0):
results: SOME-CISCO-MIB::routerStatus =
Wrong Type (should be INTEGER): NULL

This is showing that a response was returned for the
Disman query, but it appears to be NULL. There are
many other debug output options that add some pieces to
this puzzle, but let me instead present the program
flow and failure I observed.

Detailed program flow
- - - - -
STARTUP: Trigger configuration is read.
mte_run_trigger() is registered with an alarm to run
every 5 seconds.

<5 seconds pass>

1.) Trigger fires, running mte_run_trigger(). The
trigger requests the value of routerStatus via
mte_get_response().

2.) A request PDU is created, and snmp_synch_response()
is called to fetch a response PDU with the routerStatus
information.

3.) snmp_send() is called to send out an SNMP request
PDU. No callback function is registered. Returns.

4.) Enter the SELECT loop within
snmp_synch_response_cb(). That last send was just to
ourselves, so there is already a response waiting for us.

5.) Call snmp_read() to get the response

6.) We have a packet, so process it. Since the request
PDU that prompted this packet was an SNMP request
intended for internal processing, the main session
callback (handle_snmp_packet()) fires to deal with it.
It goes to handle_pdu() and from there to
netsnmp_call_handlers(), which passes it along to the
built-in handler functions.

7.) Pass things along to the appropriate handlers,
which are bulk_to_next and proxy. bulk_to_next does its
thing, so on to proxy.

8.) proxy_handler() sees that information is requested
that is not available locally, so it creates a "real"
SNMP data request to go out to the proxy'd CISCO
router. Registers proxy_got_response() as the request
callback, and marks this request as delegated. The
request PDU is sent out.

9.) Back through the handle_pdu() function, and passed
to handle_getnext_loop(). Since the request is
delegated, we bail out and return at this point.

10.) Back to the SELECT loop again to start this
process over. Once the CISCO router responds and we
have a packet waiting for us, we move on.

11.) Call snmp_read() to get the response

12.) We have a request callback function this time:
proxy_got_response(). Call it.

13.) The data is good! The request is de-delegated, and
a nice response PDU is packed up and ready to go.

14.) Return to the SELECT loop to wait FOREVER.
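
The steps above can be modeled as a toy wait loop. This
is a minimal sketch in plain C, not Net-SNMP code: every
flow_ name is hypothetical, and the non-proxied path is
simplified to a single "session response" packet. The
real loop blocks in select(); the sketch returns false
at the point where that block would happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kinds of packet the waiting loop can see (hypothetical names). */
enum flow_packet {
    FLOW_INTERNAL_REQUEST, /* steps 6-9: request is handed off and delegated */
    FLOW_PROXY_RESPONSE,   /* steps 12-13: consumed by the per-request callback */
    FLOW_SESSION_RESPONSE  /* reaches the session callback, ends the wait */
};

struct flow_wait {
    bool waiting;      /* loop condition, cleared only by the session callback */
    int  packets_read; /* how many times the toy read step ran */
};

/* Stand-in for one snmp_read() pass over a single packet. */
void flow_read(struct flow_wait *w, enum flow_packet kind)
{
    assert(w != NULL);
    w->packets_read++;
    if (kind == FLOW_SESSION_RESPONSE)
        w->waiting = false;
    /* The other two kinds are handled elsewhere and leave the flag alone. */
}

/*
 * Stand-in for the synchronous wait loop. Returns true if the wait
 * ended normally, false if the queue ran dry while still waiting
 * (the real code would be stuck in select() at that point).
 */
bool flow_synch_wait(struct flow_wait *w,
                     const enum flow_packet *queue, size_t n)
{
    size_t next = 0;

    assert(w != NULL);
    w->waiting = true;
    w->packets_read = 0;
    while (w->waiting) {
        if (next == n)
            return false;
        flow_read(w, queue[next++]);
    }
    return true;
}
```

Feeding the loop an internal request followed by a proxy
response leaves it waiting after both packets are read,
which matches step 14.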

Ack
- - - - -
Ack! So what went wrong? The key lies in the callback
mechanism. When snmp_read() is called, it calls through
a few functions to get its way over to
_sess_process_packet(). This is where packet data is
parsed and started toward encapsulation in a response
PDU. After this happens, a callback is fired. Which
callback depends on a few things.

As we saw in step 6, the request PDU was an SNMP
request for Net-SNMP. This goes straight to Net-SNMP's
special handle_snmp_packet() callback. Every other type
of request PDU, such as those from DISMAN and PROXY,
goes through a different code path. NOTE: I probably
explained this process a bit incorrectly, in
nomenclature if nothing else.

In the different code path, the request PDU is checked
to see if there is a request callback function
registered. In the case of DISMAN PROXY requests, there
is: proxy_got_response(). In the case of DISMAN
requests that are not over a proxy, there is no request
callback registered. The request callback is called if
it exists. If it doesn't, the session callback is
called instead. Herein lies precisely the difference
between DISMAN and DISMAN PROXY operation.

The session callback for normal DISMAN operation (and
anything that uses snmp_synch_response()) is
snmp_synch_input(). This is temporarily registered upon
entering the aforementioned SELECT loop. This function
is the key: it copies the received packet information
into the response PDU to be passed up to the original
snmp_synch_response() call, and toggles a flag that
allows the code to exit the SELECT loop.

Since the proxy has its own registered callback
function, it never calls the session callback
snmp_synch_input(), so it has no way to exit the SELECT
loop and return to program sanity.
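
The dispatch rule described above (request callback if
one exists, otherwise the session callback) can be shown
in a few lines. This is a simplified model, not the
actual _sess_process_packet() code: the toy_ structures
and functions are hypothetical stand-ins, and the
callbacks take a single opaque pointer rather than the
real Net-SNMP callback arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*toy_callback)(void *magic);

/* Hypothetical stand-in for an outstanding request. */
struct toy_request {
    toy_callback callback; /* per-request callback, NULL if none */
    void        *cb_data;
};

/* Hypothetical stand-in for a session. */
struct toy_session {
    toy_callback callback; /* session-wide callback */
    void        *cb_data;
};

/* State owned by the synchronous caller. */
struct toy_synch_state {
    bool waiting; /* the select loop spins while this is true */
};

/* Plays the role of snmp_synch_input(): lets the loop exit. */
int toy_synch_input(void *magic)
{
    struct toy_synch_state *st = magic;
    st->waiting = false;
    return 1;
}

/* Plays the role of proxy_got_response(): counts handled responses. */
int toy_proxy_got_response(void *magic)
{
    int *handled = magic;
    (*handled)++;
    return 1;
}

/* The rule itself: a request callback shadows the session callback. */
int toy_dispatch(struct toy_session *sess, struct toy_request *req)
{
    assert(sess != NULL && req != NULL);
    if (req->callback != NULL)
        return req->callback(req->cb_data);
    if (sess->callback != NULL)
        return sess->callback(sess->cb_data);
    return 0;
}
```

When the request carries toy_proxy_got_response(), the
session callback never runs and the waiting flag stays
set, which is the hang in miniature.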

Proposal
- - - - -
Though I have studied the code, I am not familiar with
all low-level details. My focus has only been code
related to this problem, so I may be missing a bigger
picture. Even so, it seems clear to me that the
snmp_synch_input() session callback needs to fire in
the case of DISMAN PROXY operation.

I believe that this may be done within the proxy
callback function proxy_got_response(). Since this is a
proxy-specific problem, it seems to make more sense
than a call buried in Net-SNMP Agent or API code
somewhere. Though I haven't considered other cases of
PROXY use, I have a feeling DISMAN isn't the only
extension capable of getting caught in this problem
(anything using snmp_synch_input() with the PROXY is
susceptible).

What I am not sure of is how to do this without
interrupting normal proxy operation (the proxy works
fine without using DISMAN). Perhaps some sort of status
information resides in the request PDU or session that
would facilitate calling the session callback only at
appropriate times? Is there another codepath to deal
with this?
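
One way the idea in this proposal could look is sketched
below. It is only an illustration in plain C, not
Net-SNMP code and not a tested fix: the sketch_ types,
the in_synch_wait flag, and the origin pointer are all
hypothetical. The flag stands in for the kind of status
information asked about above, so that the session
callback is chained only while a synchronous caller is
actually waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*sketch_callback)(void *magic);

/* Hypothetical session with an explicit "someone is waiting" flag. */
struct sketch_session {
    sketch_callback callback;      /* session callback */
    void           *cb_data;
    bool            in_synch_wait; /* assumed flag, not a real field */
};

/* State owned by the synchronous caller. */
struct sketch_synch_state {
    bool waiting;
};

/* Context handed to the proxy's per-request callback. */
struct sketch_proxy_ctx {
    struct sketch_session *origin; /* session that issued the request */
    int responses_handled;         /* normal proxy bookkeeping */
};

/* Plays the role of snmp_synch_input(). */
int sketch_synch_input(void *magic)
{
    struct sketch_synch_state *st = magic;
    st->waiting = false;
    return 1;
}

/*
 * Plays the role of proxy_got_response(). It does its usual work
 * first, then chains to the session callback only when the
 * originating session is marked as waiting synchronously.
 */
int sketch_proxy_got_response(void *magic)
{
    struct sketch_proxy_ctx *ctx = magic;
    struct sketch_session *s;

    assert(ctx != NULL);
    ctx->responses_handled++;

    s = ctx->origin;
    if (s != NULL && s->in_synch_wait && s->callback != NULL)
        return s->callback(s->cb_data);
    return 1;
}
```

With in_synch_wait clear, the proxy callback behaves as
it did before, which is the property that ordinary proxy
operation depends on.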

Thanks for your consideration of this problem, and for
an excellent SNMP suite! Take care,

-Kevin

Discussion

  • Robert Story

    Robert Story - 2004-11-01

    Wow! I think this is one of the finest bug reports I've ever
    seen!

    After step 13, after exiting from snmp_read, it should go
    into netsnmp_check_outstanding_agent_requests (before going
    back to the select), which should notice that the delegated
    request is done, and send a response. What happens inside
    check_outstanding_requests?

     
  • icy006

    icy006 - 2004-11-02

    Thanks :)

    I'm unfortunately a few weeks removed from when I did the
    debugging work, so my mental code model is a bit faded. I
    skimmed the code in this area, but I couldn't find a path
    that leads to netsnmp_check_outstanding_agent_requests(). I
    don't remember encountering that routine before when I was
    tracing the bug's flow.

    The snmp_read() call branches through snmp_sess_read(),
    _sess_read(), and _sess_process_packet() (among other
    things) before returning to the select loop, so this is
    where I poked around. Have any more specific pointers on
    where it might happen?

     
  • Robert Story

    Robert Story - 2004-11-03

    ding! I just realized the blocking select was in the cb
    routine, not the main agent. will go back and look some more...

     
