I am using net-snmp version 5.7.2 and have run into the following problem:
my process, running as a subagent, tries to connect to snmpd (the master agent) and hangs in a blocking read() call. Please see the call stack below:
#0 0x00007f1d4898d22d in read () at /lib64/libpthread.so.0
#1 0x00007f1d49437c80 in netsnmp_callback_recv () at /lib64/libnetsnmp.so.31
#2 0x00007f1d494256e9 in netsnmp_transport_recv () at /lib64/libnetsnmp.so.31
#3 0x00007f1d493f699f in _sess_read () at /lib64/libnetsnmp.so.31
#4 0x00007f1d493f7839 in snmp_sess_read2 () at /lib64/libnetsnmp.so.31
#5 0x00007f1d493f788b in snmp_read2 () at /lib64/libnetsnmp.so.31
#6 0x00007f1d474ba4af in agent_check_and_process () at /lib64/libnetsnmpagent.so.31
agent_check_and_process(0); /* 0 == don't block */
Internally, the large file descriptor handling uses select().
I found the following in 'man 2 select':
Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
So when select() reports readiness spuriously, the code falls through to a blocking read() and waits forever. The descriptor is never closed, so my subagent application cannot connect.
I have added a patch that sets O_NONBLOCK on the Unix pipes and Unix domain sockets used for AgentX.
We are also seeing the same issue repeatedly.
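For readers who want to see the general technique, here is a minimal sketch of setting O_NONBLOCK with fcntl(). It is not the attached patch and does not touch net-snmp internals. The helper names set_nonblocking() and read_if_ready() are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to the status flags of fd.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;                       /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: read from a non-blocking fd, retrying only on EINTR.
 * Returns the number of bytes read, 0 on end of file, or -1 on error.
 * When no data is available, it returns -1 with errno set to EAGAIN
 * (or EWOULDBLOCK) instead of sleeping inside read(). */
static ssize_t read_if_ready(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

With a descriptor prepared this way, a spurious "ready for reading" report from select() makes read() fail with EAGAIN rather than block, so the caller can simply go back to its event loop.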
I am trying to determine the context for these reported AgentX hangs. I have come across similar symptoms with FRRouting, but I am pretty confident that spurious Linux select() behavior is not the root cause in my case.
In FRRouting, the event loop sometimes executes poll(), run_alarms(), snmp_read() (in that order). Under some load conditions, run_alarms() consumes the file descriptor data, so the subsequent snmp_read() finds no data available and AgentX hangs forever. If I change the sequence to poll(), snmp_read(), run_alarms(), I have no problems.
Can you describe how the event loop is constructed in the setup where you encountered the AgentX hangs? I would prefer to follow up at https://github.com/net-snmp/net-snmp/issues/302 .
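The ordering problem can be illustrated without net-snmp at all. The sketch below is only a simulation: it uses a plain pipe, and drain_once() is a hypothetical stand-in for whatever step consumes the pending data between poll() and the read (the role run_alarms() plays in the sequence above). It shows that a readiness result from poll() becomes stale once another step has read the data.

```c
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd has data to read right now, 0 if not, -1 on error. */
static int readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, 0);          /* timeout 0: never wait */
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Hypothetical stand-in for an intermediate step that consumes
 * whatever is pending on fd. */
static ssize_t drain_once(int fd)
{
    char buf[64];
    return read(fd, buf, sizeof buf);
}

/* Simulates: poll() says "readable", an intermediate step drains the
 * data, then the reader checks again.
 * Returns the readiness seen by the final check (0 means a blocking
 * read() at that point would hang), or -1 on setup failure. */
static int stale_readiness_demo(void)
{
    int p[2];
    int before, after;

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "pdu", 3) != 3) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    before = readable_now(p[0]);        /* data is pending here */
    drain_once(p[0]);                   /* intermediate step eats it */
    after = readable_now(p[0]);         /* earlier result is now stale */

    close(p[0]);
    close(p[1]);
    return before == 1 ? after : -1;
}
```

In this simulation the final check reports that nothing is readable, which is the situation in which a blocking read() would wait indefinitely. Reading immediately after poll(), before any step that might consume the data, avoids acting on a stale result.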