#177 OpenDKIM goes into loop, eats up RAM

Milestone: 2.8.0
Status: closed
Labels: None
Priority: 5
Updated: 2014-12-28
Created: 2013-08-27
Private: No

One client has found that on one of their two MTAs, OpenDKIM will go into a state where it is consuming massive amounts of CPU and RAM. An strace shows it just continually going through:

select(6, [5], NULL, [5], {5, 0})       = 1 (in [5], left {4, 629862})
accept(5, {sa_family=AF_INET, sin_port=htons(60738), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7fbd54b0bfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbd54b0c9d0, tls=0x7fbd54b0c700, child_tidptr=0x7fbd54b0c9d0) = 13945
select(6, [5], NULL, [5], {5, 0})       = 1 (in [5], left {3, 320555})
accept(5, {sa_family=AF_INET, sin_port=htons(60742), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7fbd54b0bfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbd54b0c9d0, tls=0x7fbd54b0c700, child_tidptr=0x7fbd54b0c9d0) = 13947
select(6, [5], NULL, [5], {5, 0}^C <unfinished ...>
Process 16717 detached

Discussion

  • Murray S. Kucherawy

    • status: open --> pending
     
  • Murray S. Kucherawy

    This isn't a very tight loop given the times involved, and these calls are pretty cheap in terms of system impact. There must be something else going on that this trace doesn't show.

    The first select() returns after 0.37 seconds with a descriptor hot. It calls accept() right away which allocates descriptor 7, and assigns it to the new connection. It then sets SO_KEEPALIVE on it and calls clone() which spawns a new thread to manage the connection, and returns to select(), waiting up to five seconds for the next connection. During the select(), this thread is blocked, so it can't consume any resources.

    The second select() returns after 1.68 seconds with a descriptor hot. That doesn't strike me as a particularly tight loop. Anyway, it does the same thing: accept() to get a new descriptor, set SO_KEEPALIVE, and hand it off to a new thread. You interrupted it when it was back in select() again.
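    The elapsed times quoted above can be read straight off the strace output: select() was given a {5, 0} timeout, and the "left {sec, usec}" value is what remained when it returned. A minimal sketch of that arithmetic (the helper name is mine, not OpenDKIM's):

    ```c
    #include <stdio.h>

    /* Elapsed wait = select() timeout minus the remaining time that
     * strace reports as "left {sec, usec}".  (Illustrative helper.) */
    static double elapsed(double timeout_sec, long left_sec, long left_usec)
    {
        return timeout_sec - ((double)left_sec + (double)left_usec / 1e6);
    }

    int main(void)
    {
        /* First select():  timeout {5, 0}, left {4, 629862} */
        printf("%.2f\n", elapsed(5.0, 4, 629862));   /* 0.37 */
        /* Second select(): timeout {5, 0}, left {3, 320555} */
        printf("%.2f\n", elapsed(5.0, 3, 320555));   /* 1.68 */
        return 0;
    }
    ```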

    Try running strace with the -f flag so it follows into child processes and threads; maybe there will be more information available.

    The really curious thing is that the accept() call returns the same value every time. That means descriptor 7 was handed to the child, which for some reason closed it pretty quickly. Is there anything in the logs showing opendkim complaining about something?
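    The descriptor-reuse point is worth spelling out: POSIX requires accept() (like open()) to return the lowest-numbered free descriptor, so seeing 7 on every accept() means the thread closed it before the next connection arrived. A minimal illustration of the rule, using /dev/null instead of sockets:

    ```c
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int a = open("/dev/null", O_RDONLY);  /* lowest free descriptor */
        int b = open("/dev/null", O_RDONLY);  /* next one up */
        close(a);                             /* free the lower number... */
        int c = open("/dev/null", O_RDONLY);  /* ...and it is reused: c == a */
        printf("a=%d b=%d c=%d\n", a, b, c);
        close(b);
        close(c);
        return 0;
    }
    ```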

     
  • limed

    limed - 2013-08-28

    I ran an strace with the -f option and got the following output, which I have attached to this bug.

    Our opendkim log just has the following repeating over and over again:

    Aug 28 12:08:19 zmmta2 opendkim[16717]: 307491020F5: no signing table match for '<redacted>'
    Aug 28 12:08:19 zmmta2 opendkim[16717]: 76FDE1020F5: no signing table match for '<redacted>'
    Aug 28 12:08:20 zmmta2 opendkim[16717]: 873131020F5: no signing table match for '<redacted>'
    

    Note that I have removed the addresses in the log snippets just in case.

     
  • limed

    limed - 2013-08-28

    Here is a gdb stack trace

    (gdb) thr apply all bt
    
    Thread 4 (Thread 0x7fbd57b6d700 (LWP 16719)):
    #0  0x000000386020f2a5 in sigwait () from /lib64/libpthread.so.0
    #1  0x00000000004085a8 in dkimf_reloader (vp=0x0) at opendkim.c:5924
    #2  0x0000003860207851 in start_thread () from /lib64/libpthread.so.0
    #3  0x000000385fae890d in clone () from /lib64/libc.so.6
    
    Thread 3 (Thread 0x7fbd5550d700 (LWP 16720)):
    #0  0x000000386020f2a5 in sigwait () from /lib64/libpthread.so.0
    #1  0x000000000042c7b1 in mi_signal_thread ()
    #2  0x0000003860207851 in start_thread () from /lib64/libpthread.so.0
    #3  0x000000385fae890d in clone () from /lib64/libc.so.6
    
    Thread 2 (Thread 0x7fbd4d7fb700 (LWP 6347)):
    #0  0x000000385fae14f3 in select () from /lib64/libc.so.6
    #1  0x000000000042eb3c in mi_rd_cmd ()
    #2  0x000000000042e324 in mi_engine ()
    #3  0x000000000042b4b8 in mi_handle_session ()
    #4  0x000000000042a179 in mi_thread_handle_wrapper ()
    #5  0x0000003860207851 in start_thread () from /lib64/libpthread.so.0
    #6  0x000000385fae890d in clone () from /lib64/libc.so.6
    
    Thread 1 (Thread 0x7fbd5571d720 (LWP 16717)):
    #0  0x000000385fae14f3 in select () from /lib64/libc.so.6
    #1  0x000000000042ae27 in mi_listener ()
    #2  0x0000000000429db1 in smfi_main ()
    #3  0x000000000041d628 in main (argc=5, argv=0x7fffe24d94b8) at opendkim.c:18274
    (gdb) quit
    
     
  • Murray S. Kucherawy

    The stack trace shows four threads doing nothing. The first two are blocked waiting for a signal, the third is waiting for I/O from a client, and the fourth is waiting for a new connection to arrive.

    The trace output looks more interesting. I'll take a look there.

     
  • Murray S. Kucherawy

    • status: pending --> open
    • assigned_to: Murray S. Kucherawy
     
  • Murray S. Kucherawy

    • status: open --> pending
     
  • Murray S. Kucherawy

    The trace just shows a bunch of normal transactions. I don't see any tight loops or anything like that. It's just processing messages as they arrive.

     
  • limed

    limed - 2013-08-30

    So the odd thing here is that even though it seems to be doing what it normally does, top shows this:

     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    16717 zimbra    20   0 3423m 2.2g  932 S  0.0 57.6  57:53.48 opendkim
    

    We have two MTAs running basically the same version of opendkim, with the same kernel and everything, yet one of them is misbehaving. Is there any data I could provide that would help you look at this more?

     
  • limed

    limed - 2013-09-09

    Bump on this: is there any info or debugging output I can provide to help? Opendkim's memory usage seems to have slowly crept up over the week and caused our monitoring server to alert.

     
  • David Miller

    David Miller - 2013-10-29

    Bump again (I work with limed): this happened again yesterday and today on two different servers. Anything we can do to help debug?

     
  • Mike

    Mike - 2013-11-18

    We're seeing this same issue across 4 instances of opendkim 2.8.0. One instance is using 3GB of resident memory. Interestingly, it's signing very few messages. We have 500-1000 domains on the system and it's only signing for 2 very low volume domains.

     
  • Mike

    Mike - 2013-11-21

    I have attached a graph showing resident memory usage for 4 instances of opendkim across 4 MTAs. Neither mta1 nor mta2 ever signs messages during normal usage.

    The daemons were restarted around 12 hours ago; data collection on this graph only started at ~10:40.

    Messages signed since opendkim restart
    MTA3 ~35 messages
    MTA4 ~36 messages

    It looks like mta3/4 have handled ~33% more mail volume (each) than mta1/2 have.

    Could this leak be related to LDAP use?

     
  • Murray S. Kucherawy

    Th "top" output shows it using a lot of memory, but not much CPU time. It's possible there's a leak.

    Any chance you could get valgrind to run on it to show leaks and where they might've come from?

     
  • Murray S. Kucherawy

    Yes, it's possible it's specific to LDAP. A dump from valgrind of outstanding allocations would be useful in targeting such a leak. I'll focus on that code and see if there's something we're doing that might cause a leak; it may come down to my unfamiliarity with which parts of OpenLDAP allocate memory that we're responsible for freeing.

     
  • Murray S. Kucherawy

    Note that this would explain a spinning CPU, but not the kind of memory consumption being reported here.

     
  • Quanah Gibson-Mount

    One of our clients has apparently found the cause. In opendkim-db.c, if the LDAP db is used and there is no signature, it exits without freeing "result".

    In the OpenDKIM 2.9.2 source, this is the code block starting at line 4944 and ending at line 4955. In addition to its other cleanup, it should also free "result", as is done at line 4987.
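    For readers following along, the leak class is an allocation that is released on the success path but not on the miss path; here the allocation is the LDAPMessage *result from the search, and the release call is OpenLDAP's ldap_msgfree(). A self-contained sketch of the pattern (the function, names, and the malloc stand-in are all illustrative, not the actual opendkim-db.c source):

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stand-in for the LDAP lookup: "result" plays the role of the
     * LDAPMessage * that must be freed on every exit path. */
    static int signing_table_lookup(const char *domain, char *out, size_t outlen)
    {
        char *result = malloc(64);        /* stands in for the search result */
        if (result == NULL)
            return -1;

        if (strcmp(domain, "example.com") != 0)
        {
            free(result);                 /* the fix: free on the miss path too */
            return 1;                     /* "no signing table match" */
        }

        snprintf(out, outlen, "selector._domainkey.%s", domain);
        free(result);                     /* the success path already freed it */
        return 0;
    }

    int main(void)
    {
        char buf[128];
        if (signing_table_lookup("example.com", buf, sizeof buf) == 0)
            printf("match: %s\n", buf);
        if (signing_table_lookup("unsigned.example", buf, sizeof buf) == 1)
            printf("no signing table match\n");
        return 0;
    }
    ```

    The "no signing table match" log lines earlier in this thread are exactly the miss path, which is why the leak only shows up on MTAs that mostly relay unsigned domains.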

     
  • Murray S. Kucherawy

    That would explain the leak, but not the spin. Either way, the fix is in for the next release.

     
  • Murray S. Kucherawy

    • status: pending --> open
     
  • Murray S. Kucherawy

    2.9.3 released.

     
  • Murray S. Kucherawy

    • status: open --> closed
     
