From: Murray S. K. <ms...@se...> - 2008-07-08 23:03:34
On Fri, 23 May 2008, Jukka Salmi wrote:
> Sure, thanks. With this patch applied, I see mi_wr_cmd() fail with
> EBADF. Log is [1]available.
>
> Regards, Jukka
>
> [1] http://salmi.ch/~jukka/dkim-milter/maillog_20080522

Yep, this confirms the previous findings. Everything is fine with the
I/O between postfix and the filter until:

- postfix sends SMFIC_BODYEOM (end-of-message) to the filter and waits
  for a reply

- postfix immediately decides the wait for the reply has failed (though
  the "why" remains a mystery), shuts down its connection to the filter
  and temp-fails the message

- dkim-filter still thinks the connection is there, so it tries to send
  an SMFIR_INSHEADER (insert header) request, which fails because the
  socket is actually no longer open

- since the insert header request fails, it replies with SMFIR_TEMPFAIL
  to try to get the message to temp-fail, but this also fails since the
  socket is no longer open

We know the second write returns with EBADF, meaning the descriptor has
been closed from the filter side. If it were the postfix side closing
the connection, we'd be seeing EPIPE instead of EBADF.

It looks a lot like fd 8 in the dkim-filter process has suddenly become
invalid for no apparent reason. There's no path that I can see in
libmilter's source code to having that descriptor closed and yet
continuing to try to use it. dkim-filter doesn't have access to the
milter context structure in order to get access to that descriptor
number, so for it to be the problem it would have to call close()
someplace on the wrong descriptor number. However, neither libdkim nor
dkim-filter ever close() anything in normal operation because there's no
need to do so. libdkim only creates (and later closes, via libcrypto)
temporary files for certain special circumstances, and your
configuration doesn't appear to be using any of those.

So, for the moment, I'm stumped. My best guesses now are a bug in the
underlying socket handling code (i.e. libc or the kernel) or something
in libcrypto which is causing BIO_free() to close the wrong descriptor
from time to time.

Someone said this doesn't happen if you change from UNIX domain sockets
to TCP sockets. Has this also been tried?
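
For anyone who wants to check the EBADF/EPIPE distinction above on their own platform, here is a minimal sketch. It is not taken from libmilter or dkim-filter: the helper names are made up for illustration, and it assumes a POSIX system with UNIX domain socketpair() support. One helper writes to a descriptor that the writing process itself has already closed, the other writes to a socket whose peer has gone away.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: write one byte to fd and return the resulting
 * errno, or 0 if the write succeeded. */
int write_errno(int fd)
{
    char byte = 'x';

    errno = 0;
    if (write(fd, &byte, 1) == -1)
        return errno;
    return 0;
}

/* Case 1: the writer's own descriptor was closed behind its back.
 * This mimics what the filter side appears to be seeing. */
int errno_after_local_close(void)
{
    int sv[2];
    int rc;
    int err;

    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void) rc;

    close(sv[0]);               /* our end is closed locally */
    err = write_errno(sv[0]);   /* descriptor number is no longer valid */
    close(sv[1]);
    return err;
}

/* Case 2: the peer closed its end while ours is still open.
 * This is what a shutdown from the other side would look like. */
int errno_after_peer_close(void)
{
    int sv[2];
    int rc;
    int err;

    /* Ignore SIGPIPE so the failed write returns an error instead of
     * killing the process. */
    signal(SIGPIPE, SIG_IGN);

    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void) rc;

    close(sv[1]);               /* the "other side" goes away */
    err = write_errno(sv[0]);   /* our descriptor is still valid */
    close(sv[0]);
    return err;
}
```

Called from a small main(), errno_after_local_close() should return EBADF and errno_after_peer_close() should return EPIPE. The log shows the former, which is why the close appears to be happening inside the filter process rather than on the postfix side.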