
#718 5.[34] master agent segfaults after 15-20h of repeated walk

closed
None
6
2019-02-05
2007-01-12
Anonymous
No

I reproduced the same problem in both Net-SNMP 5.3 and 5.4 releases.
Scenario: I created an SNMP subagent with a few scalars, some strings and some integers. I started the master agent and the subagent. A script runs snmpwalk against the master agent every 30 seconds to walk the subagent's tree. The master agent dumped core after around 15 to 20 hours of repeated walks, while the subagent was still running. The master agent's stack traces follow:
----------------------------------------------------
COREDUMP IN 5.3 - STACK TRACE
----------------------------------------------------
NET-SNMP version: 5.3.0.1
Web: http://www.net-snmp.org/
Email: net-snmp-coders@lists.sourceforge.net

#0 netsnmp_oid_find_prefix (in_name1=0x2f203a43, len1=118, in_name2=0x6e656761, len2=116) at snmp_api.c:6472
6472 if (in_name1[i] != in_name2[i])
(gdb) where
#0 netsnmp_oid_find_prefix (in_name1=0x2f203a43, len1=118, in_name2=0x6e656761, len2=116) at snmp_api.c:6472
#1 0x40127125 in netsnmp_add_varbind_to_cache (asp=0x815a798, vbcount=1, varbind_ptr=0x81817c8, tp=0x8152398) at snmp_agent.c:1862
#2 0x40127888 in netsnmp_reassign_requests (asp=0x815a798) at snmp_agent.c:2308
#3 0x4012844f in handle_getnext_loop (asp=0x815a798) at snmp_agent.c:2885
#4 0x40127fe9 in check_delayed_request (asp=0x815a798) at snmp_agent.c:2668
#5 0x40127c49 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2563
#6 0x40126111 in netsnmp_remove_delegated_requests_for_session (sess=0x81838b0) at snmp_agent.c:1433
#7 0x40144931 in close_agentx_session (session=0x81838b0, sessid=-1) at mibgroup/agentx/master_admin.c:147
#8 0x40145672 in handle_master_agentx_packet (operation=5, session=0x81838b0, reqid=0, pdu=0x0, magic=0x74) at mibgroup/agentx/master_admin.c:487
#9 0x4019e326 in _sess_read (sessp=0x825db28, fdset=0x81838b0) at snmp_api.c:5572
#10 0x4019e8a8 in snmp_sess_read (sessp=0x825db28, fdset=0xbfffd940) at snmp_api.c:5766
#11 0x4019d7ee in snmp_read (fdset=0xbfffd940) at snmp_api.c:5386
#12 0x0804bf6d in receive () at snmpd.c:1152
#13 0x0804b554 in main (argc=5, argv=0xbfffdb04) at snmpd.c:1001
#14 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
----------------------------------------------------
COREDUMP IN 5.4 - STACK TRACE
-----------------------------------------------------
NET-SNMP version: 5.4
Web: http://www.net-snmp.org/
Email: net-snmp-coders@lists.sourceforge.net

#0 netsnmp_add_varbind_to_cache (asp=0x8187958, vbcount=1, varbind_ptr=0x815bc90, tp=0x681e050a) at snmp_agent.c:1900
1900 prefix_len = netsnmp_oid_find_prefix(tp->start_a,
(gdb) where
#0 netsnmp_add_varbind_to_cache (asp=0x8187958, vbcount=1, varbind_ptr=0x815bc90, tp=0x681e050a) at snmp_agent.c:1900
#1 0x4013bb10 in netsnmp_reassign_requests (asp=0x8187958) at snmp_agent.c:2334
#2 0x4013c6db in handle_getnext_loop (asp=0x8187958) at snmp_agent.c:2923
#3 0x4013c275 in check_delayed_request (asp=0x8187958) at snmp_agent.c:2694
#4 0x4013bed1 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2589
#5 0x0804c3be in receive () at snmpd.c:1205
#6 0x0804b8ea in main (argc=7, argv=0xbfffd9a4) at snmpd.c:1016
#7 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6

Discussion

  • Nobody/Anonymous

    Logged In: NO

    OS Information: RedHat Linux 9.0

     
  • Thomas Anders

    Thomas Anders - 2007-01-25

    Logged In: YES
    user_id=848638
    Originator: NO

    Is this with a custom subagent? Can you give some more details about it, please? Can you reproduce similar behaviour with any other subagent (e.g. snmptrapd, snmpd -X)?

     
  • Zachary Mark

    Zachary Mark - 2011-08-12

    I am seeing this behaviour as well on 5.6.0. I can reproduce the problem fairly reliably with a nearly identical test:

    1) Place the OS under heavy load.
    2) Do a repeated snmpwalk.

    A couple of the stack dumps we were able to produce:

    [57574.927712] snmpd[7383]: segfault at 20000002b ip 00007f7d51fee979 sp 00007fffdc80c940 error 4 in libnetsnmpagent.so.25.0.1[7f7d51fce000+61000]
    #0 netsnmp_add_varbind_to_cache (asp=0x655180, vbcount=2, varbind_ptr=0x94b330, tp=0x200000003) at snmp_agent.c:1981
    #1 0x00007f7d51feef8d in netsnmp_reassign_requests (asp=0x655180) at snmp_agent.c:2490
    #2 0x00007f7d51feefb8 in handle_getnext_loop (asp=0x655180) at snmp_agent.c:3073
    #3 0x00007f7d51ff20a9 in check_delayed_request (asp=0x655180) at snmp_agent.c:2837
    #4 0x00007f7d51ff21a9 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2732
    #5 0x000000000040427a in main ()

    [30638.694392] snmpd[15582] general protection ip:7f41ad92b8d5 sp:7fff6d987cf8 error:0 in libnetsnmp.so.25.0.1[7f41ad8fa000+9b000]
    #0 netsnmp_oid_find_prefix (in_name1=0x5df37e989c7c2f84, len1=183, in_name2=0x13dd73f04a4ffb8b, len2=188) at snmp_api.c:6907
    #1 0x00007f41af31598e in netsnmp_add_varbind_to_cache (asp=0x915680, vbcount=1, varbind_ptr=0x902510, tp=0x811190) at snmp_agent.c:1981
    #2 0x00007f41af315f2c in netsnmp_reassign_requests (asp=0x915680) at snmp_agent.c:2477
    #3 0x00007f41af315fb8 in handle_getnext_loop (asp=0x915680) at snmp_agent.c:3073
    #4 0x00007f41af3190a9 in check_delayed_request (asp=0x915680) at snmp_agent.c:2837
    #5 0x00007f41af3191a9 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2732
    #6 0x00007f41af319607 in netsnmp_remove_delegated_requests_for_session (sess=0x848420) at snmp_agent.c:1548
    #7 0x00007f41af32e904 in close_agentx_session (session=0x848420, sessid=-1) at mibgroup/agentx/master_admin.c:132
    #8 0x00007f41af32ee1a in handle_master_agentx_packet (operation=5, session=0x848420, reqid=<value optimized out>, pdu=0x0, magic=<value optimized out>) at mibgroup/agentx/master_admin.c:479
    #9 0x00007f41ad93ad43 in _sess_read (sessp=0x870d00, fdset=<value optimized out>) at snmp_api.c:5942
    #10 0x00007f41ad93b6a9 in snmp_sess_read2 (sessp=0x5df37e989c7c2f84, fdset=0xb7) at snmp_api.c:6149
    #11 0x00007f41ad93b763 in snmp_read2 (fdset=0x7fff6d988070) at snmp_api.c:5740
    #12 0x000000000040484b in main ()

    Valgrind report which most likely corresponds to the segfault:

    ==15092== Invalid read of size 8
    ==15092== at 0x4E485DB: netsnmp_remove_delegated_requests_for_session (snmp_agent.c:1531)
    ==15092== by 0x4E5D903: close_agentx_session (master_admin.c:132)
    ==15092== by 0x4E5DE19: handle_master_agentx_packet (master_admin.c:479)
    ==15092== by 0x67EED42: _sess_read (snmp_api.c:5942)
    ==15092== by 0x67EF6A8: snmp_sess_read2 (snmp_api.c:6149)
    ==15092== by 0x67EF762: snmp_read2 (snmp_api.c:5740)
    ==15092== by 0x40484A: main (in /usr/sbin/snmpd)
    ==15092== Address 0x769F078 is 72 bytes inside a block of size 120 free'd
    ==15092== at 0x4C2041E: free (vg_replace_malloc.c:233)
    ==15092== by 0x4E473C0: netsnmp_wrap_up_request (snmp_agent.c:1787)
    ==15092== by 0x4E47E56: handle_snmp_packet (snmp_agent.c:1952)
    ==15092== by 0x67ED6CD: _sess_process_packet (snmp_api.c:5677)
    ==15092== by 0x67EEDB4: _sess_read (snmp_api.c:6117)
    ==15092== by 0x67EF6A8: snmp_sess_read2 (snmp_api.c:6149)
    ==15092== by 0x67EF762: snmp_read2 (snmp_api.c:5740)
    ==15092== by 0x40484A: main (in /usr/sbin/snmpd)

    I can provide many (over 20) coredumps if necessary.

     
  • Bart Van Assche

    Bart Van Assche - 2011-08-13

    It would help a lot if you could provide a (minimal) subagent implementation that reproduces this issue.

     
  • Jan Safranek

    Jan Safranek - 2011-08-15

    I reproduced it with net-snmp-5.7.
    Stack trace:
    ==21720== Process terminating with default action of signal 11 (SIGSEGV)
    ==21720== Access not within mapped region at address 0xA8
    ==21720== at 0x4C49D43: netsnmp_add_varbind_to_cache (snmp_agent.c:2017)
    ==21720== by 0x4C4B1FA: netsnmp_reassign_requests (snmp_agent.c:2521)
    ==21720== by 0x4C4C4D1: handle_getnext_loop (snmp_agent.c:3121)
    ==21720== by 0x4C4BD9E: check_delayed_request (snmp_agent.c:2883)
    ==21720== by 0x4C4B936: netsnmp_check_outstanding_agent_requests (snmp_agent.c:2776)
    ==21720== by 0x4C48F17: netsnmp_remove_delegated_requests_for_session (snmp_agent.c:1567)
    ==21720== by 0x4C66266: close_agentx_session (master_admin.c:142)
    ==21720== by 0x4C67187: handle_master_agentx_packet (master_admin.c:485)
    ==21720== by 0x530B982: _sess_read (snmp_api.c:5670)
    ==21720== by 0x530C43C: snmp_sess_read2 (snmp_api.c:5868)
    ==21720== by 0x530AF01: snmp_read2 (snmp_api.c:5470)
    ==21720== by 0x40502C: receive (snmpd.c:1311)

    Valgrind tells me quite often:
    ==21720== Invalid read of size 8
    ==21720== at 0x530DA2D: snmp_oid_compare (snmp_api.c:6452)
    ==21720== by 0x4C4C136: check_getnext_results (snmp_agent.c:3006)
    ==21720== by 0x4C4C307: handle_getnext_loop (snmp_agent.c:3092)
    ==21720== by 0x4C4BD9E: check_delayed_request (snmp_agent.c:2883)
    ==21720== by 0x4C4B936: netsnmp_check_outstanding_agent_requests (snmp_agent.c:2776)
    ==21720== by 0x405161: receive (snmpd.c:1348)
    ==21720== by 0x4047BA: main (snmpd.c:1101)
    ==21720== Address 0x6935130 is 0 bytes inside a block of size 72 free'd
    ==21720== at 0x4A055FE: free (vg_replace_malloc.c:366)
    ==21720== by 0x4C3E95D: netsnmp_subtree_free (agent_registry.c:471)
    ==21720== by 0x4C41BAB: unregister_mibs_by_session (agent_registry.c:1982)
    ==21720== by 0x4C66285: close_agentx_session (master_admin.c:146)
    ==21720== by 0x4C67187: handle_master_agentx_packet (master_admin.c:485)
    ==21720== by 0x530B982: _sess_read (snmp_api.c:5670)
    ==21720== by 0x530C43C: snmp_sess_read2 (snmp_api.c:5868)
    ==21720== by 0x530AF01: snmp_read2 (snmp_api.c:5470)
    ==21720== by 0x40502C: receive (snmpd.c:1311)
    ==21720== by 0x4047BA: main (snmpd.c:1101)
    I think this could be the cause of the SIGSEGV.

     
  • Jan Safranek

    Jan Safranek - 2011-08-15

    Simple agentx subagent.

     
  • Jan Safranek

    Jan Safranek - 2011-08-15

    I've uploaded a very simple AgentX subagent which I use to reproduce the bug.

    All it does is exit while processing a GETNEXT request for UCD-SNMP-MIB::ucdavis.255.2, so snmpd has to clean up the 'delegated requests' to this subagent.

    Usage:
    1. net-snmp-config --compile-subagent example example.c
    2. enable the AgentX protocol in snmpd.conf:
    master agentx
    3. start snmpd (optionally under valgrind)
    4. start my subagent in an endless loop:
    $ while true; do ./example -f -Lo ; done
    5. check that the subagent works - it should die and a new one should be started
    $ snmpgetnext -v2c -c public localhost -t 1 -r 0 UCD-SNMP-MIB::ucdavis.255.2
    6. finally, load snmpd with the above getnext requests
    $ while true; do snmpgetnext -v2c -c public localhost -t 1 -r 0 UCD-SNMP-MIB::ucdavis.255.2; done
    (try it in several terminals in parallel)

    With two parallel snmpgetnext loops, valgrind reports 'Invalid read' quite often, about once per minute. The crash itself is irregular: sometimes it happens within 2-3 minutes, sometimes it takes an hour.

     
  • Jan Safranek

    Jan Safranek - 2011-08-15

    My hypothesis so far:
    close_agentx_session() first calls netsnmp_remove_delegated_requests_for_session() and then unregister_mibs_by_session(). The latter function frees the session's subtree registrations, so no delegated request may use them after it returns.

    I.e. netsnmp_remove_delegated_requests_for_session() should somehow finish all these requests, and it does not do so properly: some requests remain queued. On the second call to close_agentx_session() (when another AgentX subagent exits), request->subtree of these leftover requests is invalid.

    However, I cannot tell what is really wrong here; I got lost in all the handlers involved. Please note this is only a hypothesis, and I may be chasing a red herring.

     
  • Leonardo Chiquitto

    Here (5.7.1-pre2) the agent always crashes immediately after starting the getnext loop:

    Connection from UDP: [127.0.0.1]:55107->[127.0.0.1]:161
    snmp_agent: agent_sesion 0x7f4810918290 created
    snmp_agent: add_vb_to_cache(0x7f4810918290, 1, UCD-SNMP-MIB::ucdavis.255.2, 0x7f481092d750)
    snmp_agent: tp->start UCD-SNMP-MIB::ucdavis.255.2, tp->end UCD-SNMP-MIB::ucdavis.255.3,
    agentx/master: agentx master handler starting, mode = 0xa1
    agentx/master: request for variable (UCD-SNMP-MIB::ucdavis.255.2)
    agentx/master: EXCLUSIVE varbind UCD-SNMP-MIB::ucdavis.255.2 scoped to UCD-SNMP-MIB::ucdavis.255.3
    agentx/master: sending pdu (req=0x2,trans=0x1,sess=0x7)
    snmp_agent: delegate session == 0x7f4810918290
    snmp_agent: end of handle_snmp_packet, asp = 0x7f4810918290
    agentx/master: transport disconnect on session 0x7f4810918080
    agentx/master: close 0x7f4810918080, -1
    agentx/master: timeout on session 0x7f4810918080 req=0x2
    agentx/master: NULL sess_pointer??
    snmp_agent: REMOVE session == 0x7f4810918080
    agentx/master: transport connect on session 0x7f4810918080
    snmp_agent: processing delegated request, asp = 0x7f4810918290
    snmp_agent: add_vb_to_cache(0x7f4810918290, 1, UCD-SNMP-MIB::ucdavis.255.2, 0x4545454545454545)
    Segmentation fault (core dumped)

     
  • Jan Safranek

    Jan Safranek - 2011-08-25

    highly experimental patch

     
  • Jan Safranek

    Jan Safranek - 2011-08-25

    Further investigation:
    1. snmpd gets a GETNEXT request and passes it to the AgentX subagent (-> the request is queued in agent_delegated_list)
    2. the AgentX subagent disconnects without responding to the request
    3. as a result, snmpd goes through the list of delegated requests and tries to resolve them (close_agentx_session calls netsnmp_remove_delegated_requests_for_session)
    4. since the queued delegated request is a GETNEXT and the subagent has registered more consecutive OIDs, the request is sent *back to the same AgentX subagent* and is queued back to agent_delegated_list. (That's the bug!)
    5. the registration of the AgentX subagent is freed. But the queued request still points to it!

    Attached is a very crude patch which loops in netsnmp_remove_delegated_requests_for_session() until all delegated requests for the AgentX session being removed are resolved. I.e. it keeps sending the GETNEXT requests to the AgentX subagent until they get out of the registered tree. The patch is really a proof of concept; the GETNEXT request should skip to the end of the subagent's tree, but I do not know how to do that. Anyway, valgrind is happy with the patch and snmpd no longer crashes.
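
The sequence in steps 1-5 and the "loop until resolved" idea can be modeled in a few lines of C. This is only a minimal sketch under simplifying assumptions, not the attached patch and not Net-SNMP code: `struct subagent`, `struct request`, `resolve_once` and `drain_session` are hypothetical names, and an OID is reduced to a single integer so that a GETNEXT is just "advance by one".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a subagent owns the OID indexes [first, last]. */
struct subagent {
    int first;
    int last;
};

/* Hypothetical delegated GETNEXT request. */
struct request {
    int oid;                      /* current position of the walk */
    const struct subagent *owner; /* NULL once the request left the subagent */
};

/*
 * One resolution pass. The disconnected subagent cannot answer, so the
 * GETNEXT moves on to the next OID. If that OID still belongs to the same
 * subagent, the request is delegated to it again (step 4 above).
 * Returns 1 if the request still refers to `sa`, 0 otherwise.
 */
static int resolve_once(struct request *req, const struct subagent *sa)
{
    assert(req != NULL && sa != NULL);
    if (req->owner != sa)
        return 0;
    req->oid++;
    if (req->oid > sa->last) {
        req->owner = NULL;   /* walked out of the subagent's subtree */
        return 0;
    }
    return 1;                /* re-queued against the same subagent */
}

/*
 * Proof-of-concept behaviour: keep resolving until no request refers to
 * `sa`, so that its registrations can be freed safely afterwards.
 * Returns the number of passes that were needed.
 */
static int drain_session(struct request *reqs, size_t n,
                         const struct subagent *sa)
{
    int passes = 0;
    int pending;

    do {
        pending = 0;
        for (size_t i = 0; i < n; i++)
            pending += resolve_once(&reqs[i], sa);
        passes++;
    } while (pending > 0);
    return passes;
}
```

In this model a single call to `resolve_once()` can leave a request still pointing at the subagent, which corresponds to the dangling pointer of step 5. `drain_session()` needs roughly as many passes as there are registered OIDs left to walk, which is why a real skip-to-end-of-subtree would be preferable.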

     
  • Leonardo Chiquitto

    Jan, I tested the patch here (on 5.4.x and 5.7.x) and confirm it resolves the crashes. Thanks!

     
  • Bart Van Assche

    Bart Van Assche - 2011-08-29

    Untested patch that might fix this issue

     
  • Bart Van Assche

    Bart Van Assche - 2011-08-29

    Does net-snmp-agentx-2.patch help ? Note: that patch hasn't been tested yet.

     
  • Leonardo Chiquitto

    Bart, net-snmp-agentx-2.patch doesn't seem to help. I've got the same crash:

    Program terminated with signal 11, Segmentation fault.
    #0 netsnmp_add_varbind_to_cache (asp=0x7fe410c1b210, vbcount=1, varbind_ptr=0x7fe410c30ee0, tp=0x4545454545454545) at snmp_agent.c:1990
    1990 prefix_len = netsnmp_oid_find_prefix(tp->start_a,
    (gdb) bt
    #0 netsnmp_add_varbind_to_cache (asp=0x7fe410c1b210, vbcount=1, varbind_ptr=0x7fe410c30ee0, tp=0x4545454545454545) at snmp_agent.c:1990
    #1 0x00007fe41031d9ec in netsnmp_reassign_requests (asp=0x7fe410c1b210) at snmp_agent.c:2475
    #2 0x00007fe41031e178 in handle_getnext_loop (asp=0x7fe410c1b210) at snmp_agent.c:3127
    #3 0x00007fe41031e732 in check_delayed_request (asp=0x7fe410c1b210) at snmp_agent.c:2889
    #4 0x00007fe41031efb5 in netsnmp_check_outstanding_agent_requests2 (process_queue=1) at snmp_agent.c:2730
    #5 0x00007fe41077a63a in receive () at snmpd.c:1352
    #6 main (argc=<optimized out>, argv=<optimized out>) at snmpd.c:1105

    Let me know if you want to see the logs or if I should upload a core dump.

     
  • Jan Safranek

    Jan Safranek - 2011-09-05

    v3 - without API change

     
  • Jan Safranek

    Jan Safranek - 2011-09-05

    I attached v3, which moves the loop to close_agentx_session(), so netsnmp_remove_delegated_requests_for_session() is untouched in case some application uses it. Still, the loop can take some time when the AgentX subagent has a large subtree.

     
  • Jan Safranek

    Jan Safranek - 2011-09-07

    Another issue: even with the patch, a request sent to an AgentX subagent which disconnects without answering it is not processed properly and leaks memory:
    ==8326== 9,549,100 (6,847,104 direct, 2,701,996 indirect) bytes in 142,648 blocks are definitely lost in loss record 6,303 of 6,303
    ==8326== at 0x4A04A28: calloc (vg_replace_malloc.c:467)
    ==8326== by 0x4C33E6A: netsnmp_create_delegated_cache (agent_handler.c:713)
    ==8326== by 0x4C36BC9: agentx_master_handler (master.c:591)
    ==8326== by 0x4C3642E: netsnmp_call_handlers (agent_handler.c:440)

    The 'delegated cache' is freed in agentx_got_response(), but this function is not executed when the subagent disconnects, only when the subagent returns a real response.
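
The shape of this leak can be illustrated with a small, self-contained C model. It is a sketch under assumptions and does not reproduce the Net-SNMP code paths: `struct cache`, `cache_create`, `cache_free`, `finish_leaky`, `finish_fixed` and the `live_caches` counter are all hypothetical, and the "fixed" variant simply shows one way to release the cache on every terminal outcome.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-request delegated cache. */
struct cache {
    int transaction_id;
};

static int live_caches = 0;   /* allocation counter, for illustration only */

static struct cache *cache_create(int transaction_id)
{
    struct cache *c = calloc(1, sizeof(*c));

    if (c != NULL) {
        c->transaction_id = transaction_id;
        live_caches++;
    }
    return c;
}

static void cache_free(struct cache *c)
{
    if (c != NULL) {
        free(c);
        live_caches--;
    }
}

/* How a delegated request can end in this model. */
enum outcome { GOT_RESPONSE, PEER_DISCONNECTED };

/* Leaky variant: the cache is released only when a real response arrives. */
static void finish_leaky(struct cache *c, enum outcome why)
{
    if (why == GOT_RESPONSE)
        cache_free(c);
    /* PEER_DISCONNECTED: nothing frees c here, so the block is lost. */
}

/* Alternative: every terminal outcome releases the cache. */
static void finish_fixed(struct cache *c, enum outcome why)
{
    (void)why;
    cache_free(c);
}
```

With `finish_leaky()`, each request that ends in `PEER_DISCONNECTED` leaves one allocation behind, which matches the steadily growing "definitely lost" count in the valgrind report when the reproducer runs in a loop.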

     
  • Robert Story

    Robert Story - 2011-09-07

    re: getnext sending requests to the same agent... there is a flag (SUBTREE_ATTACHED) in the subtree struct that should be cleared when a subagent disconnects... if that's true, then I'd define new subtree search/find functions that optionally respect that flag, have the old functions call the new ones w/flag to ignore it, and then change the get-next code to call the new function w/the flag to ignore detached subtrees...
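
The suggestion above might look roughly like the following. This is a hedged sketch only: `struct subtree` here is a simplified hypothetical type with integer OIDs, `SKETCH_ATTACHED` and `FIND_SKIP_DETACHED` are made-up constants standing in for the real flag and a new search option, and `subtree_find_next_ex` is an invented name, not an existing Net-SNMP function.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ATTACHED    0x01  /* hypothetical: set while the subagent is connected */
#define FIND_SKIP_DETACHED 0x01  /* hypothetical search option */

/* Simplified registration entry; the list is sorted by ascending start. */
struct subtree {
    int start;               /* simplified OID: a single integer */
    int flags;
    struct subtree *next;
};

/*
 * New search function: return the first subtree that starts after `oid`.
 * With FIND_SKIP_DETACHED set, subtrees whose subagent has disconnected
 * (attached flag cleared) are passed over.
 */
static struct subtree *subtree_find_next_ex(struct subtree *head, int oid,
                                            int options)
{
    for (struct subtree *tp = head; tp != NULL; tp = tp->next) {
        if (tp->start <= oid)
            continue;
        if ((options & FIND_SKIP_DETACHED) && !(tp->flags & SKETCH_ATTACHED))
            continue;
        return tp;
    }
    return NULL;
}

/* Old entry point keeps its behaviour by ignoring the attached flag. */
static struct subtree *subtree_find_next(struct subtree *head, int oid)
{
    return subtree_find_next_ex(head, oid, 0);
}
```

Existing callers would keep using `subtree_find_next()`, while the GETNEXT reassignment path would call `subtree_find_next_ex(..., FIND_SKIP_DETACHED)` so that a request is never handed back to a subagent that has already gone away.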

     
  • Jan Safranek

    Jan Safranek - 2012-02-07

    I pushed the patch, git commit f9304c83f76202db0e684269ca1af32e43cd9db4

    I also fixed the memory leaks I reported earlier. I hope I did not break anything; it is quite a fragile piece of code.

     
  • Jan Safranek

    Jan Safranek - 2012-02-07

    Thanks for the patch! It has been applied to the current
    development code in git, and will appear in the next major release
    of the Net-SNMP package.

     
  • Damon Yang

    Damon Yang - 2013-10-30

    We hit a similar problem after querying some tables frequently for about 2 hours.
    After applying this patch, the problem disappeared.

     
  • Damon Yang

    Damon Yang - 2013-10-30

    The SNMP version is 5.7.2, the core dump stack is:

    #0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100bf668, vbcount=1, varbind_ptr=0x10093e00, tp=0xa7) at snmp_agent.c:2015
    2015 prefix_len = netsnmp_oid_find_prefix(tp->start_a,
    (gdb) bt
    #0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100bf668, vbcount=1, varbind_ptr=0x10093e00, tp=0xa7) at snmp_agent.c:2015
    #1 0x0ffb50ac in netsnmp_reassign_requests (asp=0x100bf668) at snmp_agent.c:2521
    #2 0x0ffb651c in handle_getnext_loop (asp=0x100bf668) at snmp_agent.c:3121
    #3 0x0ffb5dac in check_delayed_request (asp=0x100bf668) at snmp_agent.c:2883
    #4 0x0ffb5970 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2776
    #5 0x10004d84 in receive () at snmpd.c:1356
    #6 0x100045fc in main (argc=2, argv=0xbffbea34) at snmpd.c:1108

    #0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100e5d80, vbcount=1, varbind_ptr=0x100c30e0, tp=0x6c5f6578) at snmp_agent.c:2009
    2009 if (tp &&
    (gdb) bt
    #0 0x0ffb3b34 in netsnmp_add_varbind_to_cache (asp=0x100e5d80, vbcount=1, varbind_ptr=0x100c30e0, tp=0x6c5f6578) at snmp_agent.c:2009
    #1 0x0ffb50ac in netsnmp_reassign_requests (asp=0x100e5d80) at snmp_agent.c:2521
    #2 0x0ffb651c in handle_getnext_loop (asp=0x100e5d80) at snmp_agent.c:3114
    #3 0x0ffb5dac in check_delayed_request (asp=0x100e5d80) at snmp_agent.c:2879
    #4 0x0ffb5970 in netsnmp_check_outstanding_agent_requests () at snmp_agent.c:2766
    #5 0x10004d84 in receive () at snmpd.c:1356
    #6 0x100045fc in main (argc=2, argv=0xbfcccaf4) at snmpd.c:1108

     
  • Jalindar

    Jalindar - 2019-02-05

    I got a segfault too, with 2 subagents.

    Would locking open and close fix most multi-processor, multi-subagent situations?

    agent/mibgroup/agentx/master_admin.c

    init LockA

    open_agentx_session() {
        getLockA();      // block until LockA is acquired
        open function
        releaseLockA();
    }

    close_agentx_session() {
        getLockA();      // block until LockA is acquired
        close function
        releaseLockA();
    }

    Serializing open and close this way should help with multiple subagents or processors.
    Will it fix all situations?
    Any better solutions?

     
