One client has found that on one of their two MTAs, OpenDKIM will get into a state where it consumes massive amounts of CPU and RAM. An strace shows it continually looping through:
select(6, [5], NULL, [5], {5, 0}) = 1 (in [5], left {4, 629862})
accept(5, {sa_family=AF_INET, sin_port=htons(60738), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7fbd54b0bfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbd54b0c9d0, tls=0x7fbd54b0c700, child_tidptr=0x7fbd54b0c9d0) = 13945
select(6, [5], NULL, [5], {5, 0}) = 1 (in [5], left {3, 320555})
accept(5, {sa_family=AF_INET, sin_port=htons(60742), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7fbd54b0bfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbd54b0c9d0, tls=0x7fbd54b0c700, child_tidptr=0x7fbd54b0c9d0) = 13947
select(6, [5], NULL, [5], {5, 0}^C <unfinished ...>
Process 16717 detached
This isn't a very tight loop given the times involved, and these calls are all cheap in terms of system impact. There must be something else going on that this trace doesn't show.
The first select() returns after 0.37 seconds with a descriptor hot. It calls accept() right away, which allocates descriptor 7 and assigns it to the new connection. It then sets SO_KEEPALIVE on it and calls clone(), which spawns a new thread to manage the connection, and returns to select(), waiting up to five seconds for the next connection. While blocked in select(), this thread consumes no CPU.
The second select() returns after 1.68 seconds with a descriptor hot. That doesn't strike me as a particularly tight loop. Anyway, it does the same thing: accept() to get a new descriptor, set SO_KEEPALIVE, and hand it off to a new thread. You interrupted it when it was back in select() again.
Try running strace with the -f flag so it follows into child processes and threads; there may be more information there.
The really curious thing is that accept() returns the same descriptor every time. That means descriptor 7 was handed to the child, which for some reason closed it fairly quickly. Is there anything in the logs showing opendkim complaining about something?
I ran an strace with the -f option and got the following output, which I have attached to this bug.
Our opendkim logs just have the following repeating over and over again:
Note that I have removed the addresses in the log snippets just in case.
Here is a gdb stack trace
The stack trace shows four threads doing nothing. The first two are blocked waiting for a signal, the third is waiting for I/O from a client, and the fourth is waiting for a new connection to arrive.
The trace output looks more interesting. I'll take a look there.
The trace just shows a bunch of normal transactions. I don't see any tight loops or anything like that. It's just processing messages as they arrive.
So the odd thing here is that even though it seems to be doing what it normally does, top shows this:
We have two MTAs running essentially the same version of opendkim, with the same kernel and everything, yet one of the MTAs is misbehaving. Is there any data I could provide that would help you look into this further?
Bump on this: is there any info or debugging output I can provide to help? OpenDKIM's memory usage slowly crept up over the week and caused our monitoring server to alert.
Bump again (I work with limed). This happened again yesterday and today on two different servers. Anything we can do to help debug?
We're seeing this same issue across 4 instances of opendkim 2.8.0. One instance is using 3GB of resident memory. Interestingly, it's signing very few messages. We have 500-1000 domains on the system and it's only signing for 2 very low volume domains.
I have attached a graph showing resident memory usage for 4 instances of opendkim across 4 MTAs. Neither mta1 nor mta2 ever signs messages during normal usage.
The daemons were restarted around 12 hours ago in this graph. Data collection only started at ~10:40 on this graph.
Messages signed since opendkim restart
MTA3 ~35 messages
MTA4 ~36 messages
It looks like mta3/4 have handled ~33% more mail volume (each) than mta1/2 have.
Could this leak be related to LDAP use?
Th "top" output shows it using a lot of memory, but not much CPU time. It's possible there's a leak.
Any chance you could get valgrind to run on it to show leaks and where they might've come from?
Yes, it's possible it's specific to LDAP. A dump from valgrind of outstanding allocations would be useful in targeting such a leak. I'll focus on that code and see if maybe there's something we're doing that might cause a leak. This might be caused by my unfamiliarity with what parts of OpenLDAP allocate memory that we might be responsible for freeing, etc.
Note that this would explain a spinning CPU, but not the kind of memory consumption being reported here.
One of our clients has apparently found the cause. In opendkim-db.c, when the LDAP db is used and there is no signature, the code returns without freeing "result".
In the OpenDKIM 2.9.2 source, this is the code block starting at line 4944 and ending at line 4955. In addition to its other cleanup, it should also free result, as is done at line 4987.
That would explain the leak, but not the spin. Either way, that's in for next release.
2.9.3 released.