net-snmp / Patches / #718 5.[34] master agent segfaults after 15-20h of repeated walk

Nobody/Anonymous - 2007-01-12

Logged In: NO

OS Information: RedHat Linux 9.0

Thomas Anders - 2007-01-25

Logged In: YES
user_id=848638
Originator: NO

Is this with a custom subagent? Can you give some more details about it, please? Can you reproduce similar behaviour with any other subagent (e.g. snmptrapd, snmpd -X)?

Zachary Mark - 2011-08-12

I am seeing this behaviour as well on 5.6.0. I can reproduce the problem fairly reliably with a nearly identical test:

1) Place the OS under heavy load.
2) Do a repeated snmpwalk.

A couple of the stack dumps we were able to produce:

[57574.927712] snmpd[7383]: segfault at 20000002b ip 00007f7d51fee979 sp 00007fffdc80c940 error 4 in libnetsnmpagent.so.25.0.1[7f7d51fce000+61000]
#0 netsnmp_add_varbind_to_cache (asp=0x655180, vbcount=2, varbind_ptr=0x94b330, tp=0x200000003) at snmp_agent.c:1981
#1 0x00007f7d51feef8d in netsnmp_reassign_requests (asp=0x655180) at snmp_agent.c:2490
#2 0x00007f7d51feefb8 in handle_getnext_loop (asp=0x655180) at snmp_agent.c:3073
#3 0x00007f7d51ff20a9 in check_delayed_request (asp=0x655180) at snmp_agent.c:2837
#4 0x00007f7d51ff21a9 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2732
#5 0x000000000040427a in main ()

[30638.694392] snmpd[15582] general protection ip:7f41ad92b8d5 sp:7fff6d987cf8 error:0 in libnetsnmp.so.25.0.1[7f41ad8fa000+9b000]
#0 netsnmp_oid_find_prefix (in_name1=0x5df37e989c7c2f84, len1=183, in_name2=0x13dd73f04a4ffb8b, len2=188) at snmp_api.c:6907
#1 0x00007f41af31598e in netsnmp_add_varbind_to_cache (asp=0x915680, vbcount=1, varbind_ptr=0x902510, tp=0x811190) at snmp_agent.c:1981
#2 0x00007f41af315f2c in netsnmp_reassign_requests (asp=0x915680) at snmp_agent.c:2477
#3 0x00007f41af315fb8 in handle_getnext_loop (asp=0x915680) at snmp_agent.c:3073
#4 0x00007f41af3190a9 in check_delayed_request (asp=0x915680) at snmp_agent.c:2837
#5 0x00007f41af3191a9 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2732
#6 0x00007f41af319607 in netsnmp_remove_delegated_requests_for_session (sess=0x848420) at snmp_agent.c:1548
#7 0x00007f41af32e904 in close_agentx_session (session=0x848420, sessid=-1) at mibgroup/agentx/master_admin.c:132
#8 0x00007f41af32ee1a in handle_master_agentx_packet (operation=5, session=0x848420, reqid=<value optimized out>, pdu=0x0, magic=<value optimized out>) at mibgroup/agentx/master_admin.c:479
#9 0x00007f41ad93ad43 in _sess_read (sessp=0x870d00, fdset=<value optimized out>) at snmp_api.c:5942
#10 0x00007f41ad93b6a9 in snmp_sess_read2 (sessp=0x5df37e989c7c2f84, fdset=0xb7) at snmp_api.c:6149
#11 0x00007f41ad93b763 in snmp_read2 (fdset=0x7fff6d988070) at snmp_api.c:5740
#12 0x000000000040484b in main ()

Valgrind report which most likely corresponds to the segfault:

==15092== Invalid read of size 8
==15092== at 0x4E485DB: netsnmp_remove_delegated_requests_for_session (snmp_agent.c:1531)
==15092== by 0x4E5D903: close_agentx_session (master_admin.c:132)
==15092== by 0x4E5DE19: handle_master_agentx_packet (master_admin.c:479)
==15092== by 0x67EED42: _sess_read (snmp_api.c:5942)
==15092== by 0x67EF6A8: snmp_sess_read2 (snmp_api.c:6149)
==15092== by 0x67EF762: snmp_read2 (snmp_api.c:5740)
==15092== by 0x40484A: main (in /usr/sbin/snmpd)
==15092== Address 0x769F078 is 72 bytes inside a block of size 120 free'd
==15092== at 0x4C2041E: free (vg_replace_malloc.c:233)
==15092== by 0x4E473C0: netsnmp_wrap_up_request (snmp_agent.c:1787)
==15092== by 0x4E47E56: handle_snmp_packet (snmp_agent.c:1952)
==15092== by 0x67ED6CD: _sess_process_packet (snmp_api.c:5677)
==15092== by 0x67EEDB4: _sess_read (snmp_api.c:6117)
==15092== by 0x67EF6A8: snmp_sess_read2 (snmp_api.c:6149)
==15092== by 0x67EF762: snmp_read2 (snmp_api.c:5740)
==15092== by 0x40484A: main (in /usr/sbin/snmpd)

I can provide many (over 20) coredumps if necessary.

Bart Van Assche - 2011-08-13

It would help a lot if you could provide a (minimal) subagent implementation that reproduces this issue.

Jan Safranek - 2011-08-15

I reproduced it with net-snmp-5.7.
Stack trace:
==21720== Process terminating with default action of signal 11 (SIGSEGV)
==21720== Access not within mapped region at address 0xA8
==21720== at 0x4C49D43: netsnmp_add_varbind_to_cache (snmp_agent.c:2017)
==21720== by 0x4C4B1FA: netsnmp_reassign_requests (snmp_agent.c:2521)
==21720== by 0x4C4C4D1: handle_getnext_loop (snmp_agent.c:3121)
==21720== by 0x4C4BD9E: check_delayed_request (snmp_agent.c:2883)
==21720== by 0x4C4B936: netsnmp_check_outstanding_agent_requests (snmp_agent.c:2776)
==21720== by 0x4C48F17: netsnmp_remove_delegated_requests_for_session (snmp_agent.c:1567)
==21720== by 0x4C66266: close_agentx_session (master_admin.c:142)
==21720== by 0x4C67187: handle_master_agentx_packet (master_admin.c:485)
==21720== by 0x530B982: _sess_read (snmp_api.c:5670)
==21720== by 0x530C43C: snmp_sess_read2 (snmp_api.c:5868)
==21720== by 0x530AF01: snmp_read2 (snmp_api.c:5470)
==21720== by 0x40502C: receive (snmpd.c:1311)

Valgrind tells me quite often:
==21720== Invalid read of size 8
==21720== at 0x530DA2D: snmp_oid_compare (snmp_api.c:6452)
==21720== by 0x4C4C136: check_getnext_results (snmp_agent.c:3006)
==21720== by 0x4C4C307: handle_getnext_loop (snmp_agent.c:3092)
==21720== by 0x4C4BD9E: check_delayed_request (snmp_agent.c:2883)
==21720== by 0x4C4B936: netsnmp_check_outstanding_agent_requests (snmp_agent.c:2776)
==21720== by 0x405161: receive (snmpd.c:1348)
==21720== by 0x4047BA: main (snmpd.c:1101)
==21720== Address 0x6935130 is 0 bytes inside a block of size 72 free'd
==21720== at 0x4A055FE: free (vg_replace_malloc.c:366)
==21720== by 0x4C3E95D: netsnmp_subtree_free (agent_registry.c:471)
==21720== by 0x4C41BAB: unregister_mibs_by_session (agent_registry.c:1982)
==21720== by 0x4C66285: close_agentx_session (master_admin.c:146)
==21720== by 0x4C67187: handle_master_agentx_packet (master_admin.c:485)
==21720== by 0x530B982: _sess_read (snmp_api.c:5670)
==21720== by 0x530C43C: snmp_sess_read2 (snmp_api.c:5868)
==21720== by 0x530AF01: snmp_read2 (snmp_api.c:5470)
==21720== by 0x40502C: receive (snmpd.c:1311)
==21720== by 0x4047BA: main (snmpd.c:1101)
I think this could be the cause of the SIGSEGV.

Jan Safranek - 2011-08-15

Simple agentx subagent.

example.c

Jan Safranek - 2011-08-15

I've uploaded a very simple AgentX subagent that I use to reproduce the bug.

All it does is exit while processing a GETNEXT request for UCD-SNMP-MIB::ucdavis.255.2, so snmpd has to clean up the 'delegated requests' to this subagent.

Usage:
1. net-snmp-config --compile-subagent example.c
2. Enable the AgentX protocol in snmpd.conf:
master agentx
3. Start snmpd (optionally under valgrind).
4. Start my subagent in an endless loop:
$ while true; do ./example -f -Lo ; done
5. Check that the subagent works: it should die and a new one should be started.
$ snmpgetnext -v2c -c public localhost -t 1 -r 0 UCD-SNMP-MIB::ucdavis.255.2
6. Finally, load snmpd with the above getnext requests:
$ while true; do snmpgetnext -v2c -c public localhost -t 1 -r 0 UCD-SNMP-MIB::ucdavis.255.2; done
(try it in several terminals in parallel)

With two parallel snmpgetnext loops, valgrind reports 'Invalid read' quite often, about once per minute. Crashes come irregularly: sometimes within 2-3 minutes, sometimes only after an hour.

Jan Safranek - 2011-08-15

My hypothesis so far:
close_agentx_session() first calls netsnmp_remove_delegated_requests_for_session() and then unregister_mibs_by_session(). The latter function frees the session's registered subtrees (see the netsnmp_subtree_free frame in the valgrind report above), so no delegated request may reference them once it returns.

I.e. netsnmp_remove_delegated_requests_for_session() should somehow finish all of these requests, and it does not do so properly: some requests stay queued. On a second call to close_agentx_session() (when another AgentX subagent exits), request->subtree of these leftover requests is invalid.

However, I am not able to tell what is really wrong here; I got lost in all the handlers involved. Please note this is only a hypothesis, and I may be chasing a red herring.
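
As a rough illustration of the invariant this hypothesis implies, the sketch below models the queue of delegated requests with made-up types. `struct request`, `struct subtree` and both helper functions are hypothetical and are not Net-SNMP code. It only checks whether any queued request still references a subtree owned by the closing session, which is the condition that would have to be false before the registrations are freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real Net-SNMP structures. */
struct subtree {
    int session_id;            /* session that registered this subtree */
};

struct request {
    struct subtree *subtree;   /* subtree the request is delegated to */
    struct request *next;      /* next entry in the delegated queue */
};

/* Count queued requests that still point at a subtree owned by session_id. */
static int count_requests_for_session(const struct request *queue, int session_id)
{
    int n = 0;

    assert(session_id > 0);    /* model assumption: valid ids are positive */
    for (; queue != NULL; queue = queue->next) {
        if (queue->subtree != NULL && queue->subtree->session_id == session_id)
            n++;
    }
    return n;
}

/* Freeing the session's subtrees is only safe once nothing references them. */
static int safe_to_unregister(const struct request *queue, int session_id)
{
    return count_requests_for_session(queue, session_id) == 0;
}
```

In this model, freeing a session's subtrees while `safe_to_unregister()` returns 0 would leave exactly the kind of dangling request->subtree pointer described in the hypothesis.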

Leonardo Chiquitto - 2011-08-24

Here (5.7.1-pre2) the agent always crashes immediately after starting the getnext loop:

Connection from UDP: [127.0.0.1]:55107->[127.0.0.1]:161
snmp_agent: agent_sesion 0x7f4810918290 created
snmp_agent: add_vb_to_cache(0x7f4810918290, 1, UCD-SNMP-MIB::ucdavis.255.2, 0x7f481092d750)
snmp_agent: tp->start UCD-SNMP-MIB::ucdavis.255.2, tp->end UCD-SNMP-MIB::ucdavis.255.3,
agentx/master: agentx master handler starting, mode = 0xa1
agentx/master: request for variable (UCD-SNMP-MIB::ucdavis.255.2)
agentx/master: EXCLUSIVE varbind UCD-SNMP-MIB::ucdavis.255.2 scoped to UCD-SNMP-MIB::ucdavis.255.3
agentx/master: sending pdu (req=0x2,trans=0x1,sess=0x7)
snmp_agent: delegate session == 0x7f4810918290
snmp_agent: end of handle_snmp_packet, asp = 0x7f4810918290
agentx/master: transport disconnect on session 0x7f4810918080
agentx/master: close 0x7f4810918080, -1
agentx/master: timeout on session 0x7f4810918080 req=0x2
agentx/master: NULL sess_pointer??
snmp_agent: REMOVE session == 0x7f4810918080
agentx/master: transport connect on session 0x7f4810918080
snmp_agent: processing delegated request, asp = 0x7f4810918290
snmp_agent: add_vb_to_cache(0x7f4810918290, 1, UCD-SNMP-MIB::ucdavis.255.2, 0x4545454545454545)
Segmentation fault (core dumped)

Jan Safranek - 2011-08-25

highly experimental patch

net-snmp-agentx.patch

Jan Safranek - 2011-08-25

Further investigation:
1. snmpd gets a GETNEXT request and passes it to the AgentX subagent (-> the request is queued in agent_delegated_list).
2. The AgentX subagent disconnects without responding to the request.
3. As a result, snmpd goes through the list of delegated requests and tries to resolve them (close_agentx_session calls netsnmp_remove_delegated_requests_for_session).
4. Since the queued delegated request is a GETNEXT and the subagent has registered more consecutive OIDs, the request is sent *back to the same AgentX subagent* and is queued back to agent_delegated_list. (That's the bug!)
5. The registration of the AgentX subagent is freed. But the queued request still points to it!

Attached is a very stupid patch that loops in netsnmp_remove_delegated_requests_for_session() until all delegated requests for the AgentX session being removed are resolved. I.e. it keeps sending the GETNEXT requests to the AgentX subagent until they get out of the registered tree. The patch is really a proof of concept: the GETNEXT request should skip to the end of the subagent's tree, but I do not know how to do it. Anyway, valgrind is happy with the patch and snmpd no longer crashes.
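
The effect of the proof-of-concept loop can be sketched with a toy model. Everything here is an assumption made for illustration: OIDs are reduced to integers, a session owns the range [start, end), and `resolve_once()` and `drain_session()` are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a subagent session owns the OID range [start, end). */
struct session {
    int id;
    int start;
    int end;
};

/* A delegated GETNEXT request, reduced to a single integer "OID". */
struct request {
    int oid;
    int session_id;            /* 0 = no longer delegated to any session */
};

/* One resolution pass: advance every request delegated to `s` by one OID.
 * A request that lands inside the range again is re-delegated to `s`
 * (step 4 of the analysis); otherwise it leaves the session.
 * Returns the number of requests still delegated to `s` after the pass. */
static int resolve_once(struct request *reqs, size_t n, const struct session *s)
{
    int remaining = 0;

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].session_id != s->id)
            continue;
        reqs[i].oid++;
        if (reqs[i].oid >= s->start && reqs[i].oid < s->end)
            remaining++;               /* re-queued to the same session */
        else
            reqs[i].session_id = 0;    /* left the subagent's subtree */
    }
    return remaining;
}

/* Proof-of-concept drain: repeat until nothing points at `s` any more.
 * Returns the number of passes, which grows with the size of the subtree. */
static int drain_session(struct request *reqs, size_t n, const struct session *s)
{
    int passes = 0;
    int remaining;

    assert(s != NULL);
    do {
        remaining = resolve_once(reqs, n, s);
        passes++;
    } while (remaining > 0);
    return passes;
}
```

Only after `drain_session()` returns would it be safe, in this model, to free the session's registration. The pass count also shows why a loop like this gets slow for a subagent with a large subtree.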

Leonardo Chiquitto - 2011-08-25

Jan, I tested the patch here (on 5.4.x and 5.7.x) and confirm it resolves the crashes. Thanks!

Bart Van Assche - 2011-08-29

Untested patch that might fix this issue

net-snmp-agentx-2.patch

Bart Van Assche - 2011-08-29

Does net-snmp-agentx-2.patch help? Note: that patch hasn't been tested yet.

Leonardo Chiquitto - 2011-08-29

Bart, net-snmp-agentx-2.patch doesn't seem to help. I've got the same crash:

Program terminated with signal 11, Segmentation fault.
#0 netsnmp_add_varbind_to_cache (asp=0x7fe410c1b210, vbcount=1, varbind_ptr=0x7fe410c30ee0,
tp=0x4545454545454545) at snmp_agent.c:1990
1990 prefix_len = netsnmp_oid_find_prefix(tp->start_a,
(gdb) bt
#0 netsnmp_add_varbind_to_cache (asp=0x7fe410c1b210, vbcount=1, varbind_ptr=0x7fe410c30ee0,
tp=0x4545454545454545) at snmp_agent.c:1990
#1 0x00007fe41031d9ec in netsnmp_reassign_requests (asp=0x7fe410c1b210) at snmp_agent.c:2475
#2 0x00007fe41031e178 in handle_getnext_loop (asp=0x7fe410c1b210) at snmp_agent.c:3127
#3 0x00007fe41031e732 in check_delayed_request (asp=0x7fe410c1b210) at snmp_agent.c:2889
#4 0x00007fe41031efb5 in netsnmp_check_outstanding_agent_requests2 (process_queue=1) at snmp_agent.c:2730
#5 0x00007fe41077a63a in receive () at snmpd.c:1352
#6 main (argc=<optimized out>, argv=<optimized out>) at snmpd.c:1105

Let me know if you want to see the logs or if I should upload a core dump.

Jan Safranek - 2011-09-05

v3 - without API change

net-snmp-agentx-3.patch

Jan Safranek - 2011-09-05

I attached v3, which moves the loop to close_agentx_session(), so netsnmp_remove_delegated_requests_for_session() is untouched in case some application uses it. Still, the loop can take some time when the AgentX subagent has a large subtree.

Jan Safranek - 2011-09-07

Another issue: even with the patch, a request sent to an AgentX subagent that disconnects without answering it is not processed properly and leaks memory:
==8326== 9,549,100 (6,847,104 direct, 2,701,996 indirect) bytes in 142,648 blocks are definitely lost in loss record 6,303 of 6,303
==8326== at 0x4A04A28: calloc (vg_replace_malloc.c:467)
==8326== by 0x4C33E6A: netsnmp_create_delegated_cache (agent_handler.c:713)
==8326== by 0x4C36BC9: agentx_master_handler (master.c:591)
==8326== by 0x4C3642E: netsnmp_call_handlers (agent_handler.c:440)

The 'delegated cache' is freed in agentx_got_response(), but this function is not executed when the subagent disconnects, only when the subagent returns a real response.
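
One way to picture the leak and a possible remedy, under the assumption that every delegated request passes through a single completion routine: release the cache there regardless of why the request ended. The types and functions below (`struct cache`, `request_done()`, the `live_caches` counter) are hypothetical and are not taken from any actual patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-request 'delegated cache'. */
struct cache {
    int reqid;
};

static int live_caches = 0;    /* outstanding allocations, for illustration */

static struct cache *cache_create(int reqid)
{
    struct cache *c = calloc(1, sizeof(*c));

    if (c != NULL) {
        c->reqid = reqid;
        live_caches++;
    }
    return c;
}

static void cache_free(struct cache *c)
{
    if (c != NULL) {
        assert(live_caches > 0);
        free(c);
        live_caches--;
    }
}

enum outcome { GOT_RESPONSE, SUBAGENT_DISCONNECTED };

/* Single completion path: whatever ended the request, release its cache.
 * Freeing only on GOT_RESPONSE would reproduce the leak described above. */
static void request_done(struct cache *c, enum outcome why)
{
    (void)why;   /* a real handler would also set an error on disconnect */
    cache_free(c);
}
```

With both outcomes routed through `request_done()`, the counter returns to zero even when the subagent never answers.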

Robert Story - 2011-09-07

re: getnext sending requests to the same agent... there is a flag (SUBTREE_ATTACHED) in the subtree struct that should be cleared when a subagent disconnects... if that's true, then I'd define new subtree search/find functions that optionally respect that flag, have the old functions call the new ones w/flag to ignore it, and then change the get-next code to call the new function w/the flag to ignore detached subtrees...
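
A minimal sketch of that suggestion, assuming a sorted singly linked list of subtrees with integer OIDs. `EX_SUBTREE_ATTACHED`, `subtree_find_next_ex()` and `subtree_find_next()` are illustrative names and values, not the real Net-SNMP registry API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag value; the real SUBTREE_ATTACHED may differ. */
#define EX_SUBTREE_ATTACHED 0x01u

/* Hypothetical subtree covering the integer OID range [start, end). */
struct subtree {
    int start;
    int end;
    unsigned flags;
    struct subtree *next;      /* list is sorted by start */
};

/* New lookup: first subtree that covers or follows `oid`.
 * With attached_only != 0, subtrees whose subagent has gone are skipped. */
static struct subtree *subtree_find_next_ex(struct subtree *head, int oid,
                                            int attached_only)
{
    assert(oid >= 0);          /* model assumption: OIDs are non-negative */
    for (struct subtree *tp = head; tp != NULL; tp = tp->next) {
        if (tp->end <= oid)
            continue;          /* lies entirely before the requested OID */
        if (attached_only && !(tp->flags & EX_SUBTREE_ATTACHED))
            continue;          /* detached: not a valid GETNEXT target */
        return tp;
    }
    return NULL;
}

/* Old entry point keeps its behaviour by ignoring the flag. */
static struct subtree *subtree_find_next(struct subtree *head, int oid)
{
    return subtree_find_next_ex(head, oid, 0);
}
```

In this sketch the GETNEXT path would call `subtree_find_next_ex(head, oid, 1)`, so a request can never be handed back to a subagent that has just disconnected, while existing callers of `subtree_find_next()` see no change.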

Jan Safranek - 2012-02-07

I pushed the patch, git commit f9304c83f76202db0e684269ca1af32e43cd9db4

I also fixed the memory leaks I reported earlier. I hope I did not screw anything up; it's quite a fragile piece of code.

Jan Safranek - 2012-02-07

Thanks for the patch! It has been applied to the current
development code in git, and will appear in the next major release
of the Net-SNMP package.

Damon Yang - 2013-10-30

We hit a similar problem after querying some tables frequently for about 2 hours.
After applying this patch, the problem disappeared.

Damon Yang - 2013-10-30

The SNMP version is 5.7.2, the core dump stack is:

#0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100bf668, vbcount=1,
varbind_ptr=0x10093e00, tp=0xa7) at snmp_agent.c:2015
2015 prefix_len = netsnmp_oid_find_prefix(tp->start_a,
(gdb) bt
#0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100bf668, vbcount=1,
varbind_ptr=0x10093e00, tp=0xa7) at snmp_agent.c:2015
#1 0x0ffb50ac in netsnmp_reassign_requests (asp=0x100bf668)
at snmp_agent.c:2521
#2 0x0ffb651c in handle_getnext_loop (asp=0x100bf668) at snmp_agent.c:3121
#3 0x0ffb5dac in check_delayed_request (asp=0x100bf668) at snmp_agent.c:2883
#4 0x0ffb5970 in netsnmp_check_outstanding_agent_requests ()
at snmp_agent.c:2776
#5 0x10004d84 in receive () at snmpd.c:1356
#6 0x100045fc in main (argc=2, argv=0xbffbea34) at snmpd.c:1108

#0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100e5d80, vbcount=1,
varbind_ptr=0x100c30e0, tp=0x6c5f6578) at snmp_agent.c:2009
2009 if (tp &&
(gdb) bt
#0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100e5d80, vbcount=1,
varbind_ptr=0x100c30e0, tp=0x6c5f6578) at snmp_agent.c:2009
#1 0x0ffb50ac in netsnmp_reassign_requests (asp=0x100e5d80)
at snmp_agent.c:2521
#2 0x0ffb651c in handle_getnext_loop (asp=0x100e5d80) at snmp_agent.c:3114
#3 0x0ffb5dac in check_delayed_request (asp=0x100e5d80) at snmp_agent.c:2879
#4 0x0ffb5970 in netsnmp_check_outstanding_agent_requests ()
at snmp_agent.c:2766
#5 0x10004d84 in receive () at snmpd.c:1356
#6 0x100045fc in main (argc=2, argv=0xbfcccaf4) at snmpd.c:1108

Jalindar - 2019-02-05

I got a segfault too, with 2 subagents.

Would locking open and close fix most multi-processor, multi-subagent situations?

agent/mibgroup/agentx/master_admin.c

init LockA

open_agentx_session() {
    getLockA();     // block until LockA is acquired
    open function
    releaseLockA();
}

close_agentx_session() {
    getLockA();     // block until LockA is acquired
    close function
    releaseLockA();
}

This way, open and close should help in multi-subagent or multi-processor setups.
Would it fix all situations?
Are there any better solutions?
