One client has found that on one of their two MTAs, OpenDKIM will get into a state where it consumes massive amounts of CPU and RAM. An strace shows it continually looping through:
select(6, [5], NULL, [5], {5, 0}) = 1 (in [5], left {4, 629862})
accept(5, {sa_family=AF_INET, sin_port=htons(60738), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7fbd54b0bfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbd54b0c9d0, tls=0x7fbd54b0c700, child_tidptr=0x7fbd54b0c9d0) = 13945
select(6, [5], NULL, [5], {5, 0}) = 1 (in [5], left {3, 320555})
accept(5, {sa_family=AF_INET, sin_port=htons(60742), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7fbd54b0bfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fbd54b0c9d0, tls=0x7fbd54b0c700, child_tidptr=0x7fbd54b0c9d0) = 13947
select(6, [5], NULL, [5], {5, 0}^C <unfinished ...>
Process 16717 detached
This isn't a very tight loop given the times involved, and these calls are all cheap in terms of system impact. There must be something else going on that this trace doesn't show.
The first select() returns after 0.37 seconds with a descriptor hot. It calls accept() right away, which allocates descriptor 7 and assigns it to the new connection. It then sets SO_KEEPALIVE on it and calls clone(), which spawns a new thread to manage the connection, and returns to select(), waiting up to five seconds for the next connection. While blocked in select(), this thread consumes no CPU.
The second select() returns after 1.68 seconds with a descriptor hot. That doesn't strike me as a particularly tight loop. Anyway, it does the same thing: accept() to get a new descriptor, set SO_KEEPALIVE, and hand it off to a new thread. You interrupted it when it was back in select() again.
Try running strace with the -f flag so it follows into child processes and threads; there may be more information there.
The really curious thing is that accept() returns the same descriptor every time. That means descriptor 7 was handed to the child, which for some reason closed it fairly quickly. Is there anything in the logs showing opendkim complaining about something?
I ran an strace with the -f option and got the following output, which I have attached to this bug.
Our opendkim logs just have the following repeating over and over again:
Note that I have removed the addresses in the log snippets just in case.
Here is a gdb stack trace
The stack trace shows four threads doing nothing. The first two are blocked waiting for a signal, the third is waiting for I/O from a client, and the fourth is waiting for a new connection to arrive.
The trace output looks more interesting. I'll take a look there.
The trace just shows a bunch of normal transactions. I don't see any tight loops or anything like that. It's just processing messages as they arrive.
So the odd thing here is that even though it seems to be doing what it normally does, top shows this:
We have two MTAs running essentially the same version of opendkim, with the same kernel and everything, yet one of the MTAs is misbehaving. Is there any data I could provide that would help you look into this further?
Bump on this: is there any info or debugging output I can provide to help? OpenDKIM's memory usage slowly crept up over the week and caused our monitoring server to alert.
Bump again (I work with limed). This happened again yesterday and today on two different servers. Anything we can do to help debug?
We're seeing this same issue across 4 instances of opendkim 2.8.0. One instance is using 3GB of resident memory. Interestingly, it's signing very few messages. We have 500-1000 domains on the system and it's only signing for 2 very low volume domains.
I have attached a graph showing resident memory usage for 4 instances of opendkim across 4 MTAs. Neither mta1 nor mta2 ever signs messages during normal usage.
The daemons were restarted around 12 hours ago in this graph. Data collection only started at ~10:40 on this graph.
Messages signed since opendkim restart
MTA3 ~35 messages
MTA4 ~36 messages
It looks like mta3/4 have handled ~33% more mail volume (each) than mta1/2 have.
Could this leak be related to LDAP use?
Th "top" output shows it using a lot of memory, but not much CPU time. It's possible there's a leak.
Any chance you could get valgrind to run on it to show leaks and where they might've come from?
Yes, it's possible it's specific to LDAP. A dump from valgrind of outstanding allocations would be useful in targeting such a leak. I'll focus on that code and see if maybe there's something we're doing that might cause a leak. This might be caused by my unfamiliarity with what parts of OpenLDAP allocate memory that we might be responsible for freeing, etc.
Note that this would explain a spinning CPU, but not the kind of memory consumption being reported here.
One of our clients has apparently found the cause. In opendkim-db.c, when the LDAP db is used and there is no signature, the code returns without freeing "result".
In the OpenDKIM 2.9.2 source, this is the code block starting at line 4944 and ending at line 4955. In addition to its other cleanup, it should also free result, as is done at line 4987.
That would explain the leak, but not the spin. Either way, that's in for next release.
2.9.3 released.