Murray S. Kucherawy --> dkim-milter-discuss (2008-07-08 16:03:28 -0700):
> On Fri, 23 May 2008, Jukka Salmi wrote:
> > Sure, thanks. With this patch applied, I see mi_wr_cmd() fail with
> > EBADF. Log is available at
> >  http://salmi.ch/~jukka/dkim-milter/maillog_20080522
> > Regards, Jukka
> Yep, this confirms the previous findings. Everything is fine with the I/O
> between postfix and the filter until:
> - postfix sends SMFIC_BODYEOM (end-of-message) to the filter and waits for
> a reply
> - postfix immediately decides the wait for the reply has failed (though
> the "why" remains a mystery), shuts down its connection to the filter and
> temp-fails the message
> - dkim-filter still thinks the connection is there, so it tries to send an
> SMFIR_INSHEADER (insert header) request, which fails because the socket is
> actually no longer open
> - since the insert header request fails, it replies with SMFIR_TEMPFAIL to
> try to get the message to temp-fail, but this also fails since the socket
> is no longer open
> We know the second write returns with EBADF, meaning the descriptor has
> been closed from the filter side. If it were the postfix side closing the
> connection, we'd be seeing EPIPE instead of EBADF.
> It looks a lot like fd 8 in the dkim-filter process has suddenly become
> invalid for no apparent reason. There's no path that I can see in
> libmilter's source code to having that descriptor closed and yet
> continuing to try to use it. dkim-filter doesn't have access to the
> milter context structure in order to get access to that descriptor number,
> so for it to be the problem it would have to call close() someplace on the
> wrong descriptor number. However, neither libdkim nor dkim-filter ever
> close() anything in normal operation because there's no need to do so.
> libdkim only creates (and later closes, via libcrypto) temporary files for
> certain special circumstances, and your configuration doesn't appear to be
> using any of those.
> So, for the moment, I'm stumped. My best guesses now are a bug in the
> underlying socket handling code (i.e. libc or the kernel) or something in
> libcrypto which is causing BIO_free() to close the wrong descriptor from
> time to time.
The systems in question will get an OS upgrade in the next few weeks or
months. Let's see if the milter problem disappears with that upgrade...
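
For anyone following along, the EBADF-versus-EPIPE reasoning quoted above
can be reproduced outside of libmilter. The following is only a rough
sketch, assuming a UNIX domain socketpair on a POSIX system; it is not
libmilter or dkim-filter code, and the helper name write_errno() is made
up for this illustration. It writes once to a descriptor that the local
process has already closed, and once to a socket whose peer has closed.

```python
import errno
import os
import socket


def write_errno(fd, data=b"x"):
    """Hypothetical helper: return the errno name raised by os.write(),
    or None if the write succeeded."""
    try:
        os.write(fd, data)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    return None


# Case 1: the descriptor is closed on OUR side, then used anyway.
# This mimics something in the filter process closing the fd.
ours, theirs = socket.socketpair()
fd = ours.detach()          # take ownership of the raw descriptor number
os.close(fd)                # local close, as a stray close() would do
local_close_result = write_errno(fd)
theirs.close()

# Case 2: the PEER closes its end while our descriptor stays valid.
# This mimics the MTA shutting down its side of the connection.
ours, theirs = socket.socketpair()
theirs.close()
peer_close_result = write_errno(ours.fileno())
ours.close()

print("local close:", local_close_result)
print("peer close: ", peer_close_result)
```

Python ignores SIGPIPE by default, so the second write reports an error
instead of killing the process. With TCP sockets the peer-closed case may
surface differently (for example as ECONNRESET), so the sketch only covers
the UNIX domain case.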
> Someone said this doesn't happen if you change from UNIX domain sockets to
> TCP sockets. Has this also been tried?
Yes; I was using UNIX domain sockets when I first saw the problem, but
then switched to TCP sockets so that I could capture the packets. I
haven't switched back since...
$ ((RANDOM%6)) || rm -rf ~