From: Murray S. K. <ms...@se...> - 2008-03-27 21:23:11
|
To the people who were having this problem with Postfix (Jukka, etc.): Are you still having this problem with 2.5.1? Apparently the fix in dkim-canon.c in that version cleared up at least one problem that showed similar symptoms. If not, I think we'll have to add additional logging to libmilter and/or try harder to capture a copy of a message that trips the problem and/or extract a coredump from the filter at the time the failure occurs. The log information I saw suggested that mlfi_eom() was somehow returning using an integer that wasn't a valid SMIFS_* reply code. I haven't been able to trace a code path that would cause this, but it seems possible from the evidence that one exists. I may provide patches to libmilter for you to run that would catch this case if the problem is still occurring. So, let's start here: Is the problem still occurring? -MSK |
From: Jukka S. <j+d...@20...> - 2008-03-27 23:20:05
|
Hi, Murray S. Kucherawy --> dkim-milter-discuss (2008-03-27 14:23:03 -0700): > To the people who were having this problem with Postfix (Jukka, etc.): > > Are you still having this problem with 2.5.1? Apparently the fix in > dkim-canon.c in that version cleared up at least one problem that showed > similar symptoms. I upgraded to 2.5.1 some minutes ago. With 2.5.0 I saw the problem every few days. I'll report as soon as I see it with 2.5.1, or in a few weeks if I don't... Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-03-27 23:22:57
|
On Fri, 28 Mar 2008, Jukka Salmi wrote: > I upgraded to 2.5.1 some minutes ago. With 2.5.0 I saw the problem every > few days. I'll report as soon as I see it with 2.5.1, or in a few weeks > if I don't... OK. Just in case, I'll prepare some libmilter patches if you're willing to try them. All they will do is log unusual returns from the filter. |
From: Jukka S. <j+d...@20...> - 2008-03-29 09:51:02
|
Jukka Salmi --> dkim-milter-discuss (2008-03-28 00:19:57 +0100): > Hi, > > Murray S. Kucherawy --> dkim-milter-discuss (2008-03-27 14:23:03 -0700): > > To the people who were having this problem with Postfix (Jukka, etc.): > > > > Are you still having this problem with 2.5.1? Apparently the fix in > > dkim-canon.c in that version cleared up at least one problem that showed > > similar symptoms. > > I upgraded to 2.5.1 some minutes ago. With 2.5.0 I saw the problem > every few days. I'll report as soon as I see it with 2.5.1, or in a > few weeks if I don't... Today the problem reoccurred while running 2.5.1 (and Postfix 2.4.6): Mar 29 02:28:39 grouper postfix/smtpd[15006]: 37D0D24C7F: client=mail.netbsd.org[204.152.190.11] Mar 29 02:28:39 grouper postfix/cleanup[13747]: 37D0D24C7F: message-id=<200...@ft...> Mar 29 02:28:39 grouper postfix/cleanup[13747]: warning: milter inet:localhost:1026: can't read SMFIC_BODYEOB reply packet header: Undefined error: 0 Mar 29 02:28:39 grouper postfix/cleanup[13747]: 37D0D24C7F: milter-reject: END-OF-MESSAGE from mail.netbsd.org[204.152.190.11]: 4.7.1 Service unavailable - try again later; from=<bounces-tech-pkg-owner-j+nbsd=salmi.ch@NetBSD.org> to=<j+...@sa...> proto=ESMTP helo=<mail.netbsd.org> Mar 29 02:28:39 grouper postfix/smtpd[15006]: disconnect from mail.netbsd.org[204.152.190.11] Mar 29 02:28:39 grouper dkim-milter[22868]: 37D0D24C7F "X-DKIM" header add failed Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-03-30 05:43:18
|
A search of Google for "postfix" and "SMFIC_BODYEOB" report the same condition with mimedefang (admittedly a bit of an old report): http://readlist.com/lists/lists.roaringpenguin.com/mimedefang/0/2946.html More later when I've had time to prepare some debugging patches. |
From: Jukka S. <j+d...@20...> - 2008-03-30 10:54:27
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-03-29 22:43:10 -0700): > A search of Google for "postfix" and "SMFIC_BODYEOB" report the same > condition with mimedefang (admittedly a bit of an old report): > http://readlist.com/lists/lists.roaringpenguin.com/mimedefang/0/2946.html > > More later when I've had time to prepare some debugging patches. I should probably try to reproduce the problem with the latest stable Postfix... I'll see if I find some time to upgrade. Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Jukka S. <j+d...@20...> - 2008-03-30 13:34:17
|
Jukka Salmi --> dkim-milter-discuss (2008-03-30 12:40:41 +0200): > Murray S. Kucherawy --> dkim-milter-discuss (2008-03-29 22:43:10 -0700): > > A search of Google for "postfix" and "SMFIC_BODYEOB" report the same > > condition with mimedefang (admittedly a bit of an old report): > > http://readlist.com/lists/lists.roaringpenguin.com/mimedefang/0/2946.html > > > > More later when I've had time to prepare some debugging patches. > > I should probably try to reproduce the problem with the latest stable > Postfix... I'll see if I find some time to upgrade. I'm running Postfix 2.5.1 (and dkim-milter 2.5.2) now, let's see if the problem still occurs... Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Jukka S. <j+d...@20...> - 2008-04-10 06:37:37
|
Jukka Salmi --> dkim-milter-discuss (2008-03-30 15:34:13 +0200): > Jukka Salmi --> dkim-milter-discuss (2008-03-30 12:40:41 +0200): > > Murray S. Kucherawy --> dkim-milter-discuss (2008-03-29 22:43:10 -0700): > > > A search of Google for "postfix" and "SMFIC_BODYEOB" report the same > > > condition with mimedefang (admittedly a bit of an old report): > > > http://readlist.com/lists/lists.roaringpenguin.com/mimedefang/0/2946.html > > > > > > More later when I've had time to prepare some debugging patches. > > > > I should probably try to reproduce the problem with the latest stable > > Postfix... I'll see if I find some time to upgrade. > > I'm running Postfix 2.5.1 (and dkim-milter 2.5.2) now, let's see if > the problem still occurs... It does: after running without any problem for more than a week, yesterday the problem occurred on two different servers, both running Postfix 2.5.1 and dkim-milter 2.5.2. Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-04-10 17:29:23
|
On Thu, 10 Apr 2008, Jukka Salmi wrote: > It does: after running without any problem for more than a week, > yesterday the problem occurred on two different servers, both running > Postfix 2.5.1 and dkim-milter 2.5.2. I'll come up with some libmilter patches for you to try, which produce additional logging. Unfortunately that'll mean you get the extra logging for a week or more waiting for the problem to reappear. Will that be a problem in your environment? |
From: Jukka S. <j+d...@20...> - 2008-04-10 17:58:34
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-04-10 10:29:01 -0700): > On Thu, 10 Apr 2008, Jukka Salmi wrote: > > It does: after running without any problem for more than a week, > > yesterday the problem occurred on two different servers, both running > > Postfix 2.5.1 and dkim-milter 2.5.2. > > I'll come up with some libmilter patches for you to try, which produce > additional logging. Unfortunately that'll mean you get the extra logging > for a week or more waiting for the problem to reappear. Will that be a > problem in your environment? No, that would be fine. TIA, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-04-10 23:27:30
Attachments:
PATCH
|
Make the attached patch to libmilter, rebuild your library and then relink dkim-filter, and let 'er rip. This will log a message for every piece of I/O between the filter and the MTA so the increase in your mail log could be substantial, and will also log a message if a milter callback returns a non-standard value (which skips the reply, which may be the problem). Unfortunately it looks like this is the best next step to isolating the problem. Talk to you in about a week, right? :-) Thanks for trying this out. -MSK |
From: Jukka S. <j+d...@20...> - 2008-04-18 09:33:06
Attachments:
maillog.mx2
maillog.mx1
|
Hello, Murray S. Kucherawy --> dkim-milter-discuss (2008-04-10 16:27:17 -0700): > Make the attached patch to libmilter, rebuild your library and then > relink dkim-filter, and let 'er rip. This will log a message for every > piece of I/O between the filter and the MTA so the increase in your mail > log could be substantial, and will also log a message if a milter > callback returns a non-standard value (which skips the reply, which may > be the problem). Unfortunately it looks like this is the best next step > to isolating the problem. > > Talk to you in about a week, right? :-) a week and a day later the (probably same) problem happened on my secondary MX (mx2); previously it only happened on my primary MX (mx1). The first attachment (maillog.mx2) contains the relevant part of mx2's mail log. The message in question failed to be delivered, and 20 minutes later the sending MTA tried again and successfully delivered the message to mx1. The relevant part of that communication is what's in the second attachment (mx1). Both times dkim-milter fails to parse an Authentication-Results header. The message finally delivered to mx1 had the following headers: X-DomainKeys: Sendmail DomainKeys Filter v1.0.0 mx1.salmi.ch 0BCA624C7E Authentication-Results: mx1.salmi.ch; domainkeys=pass (testing) header.from=av...@av... I'm running dk-milter just before running dkim-milter on both system... Please tell if I should provide more information. Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-08-26 20:34:55
|
On Thu, 24 Jul 2008, Jukka Salmi wrote: > OK, I'll try to get a trace of the milter process ASAP. But as you > said, this could easily take some weeks... Hi Jukka. Were you able to capture a trace of the filter encountering this failure? |
From: Jukka S. <j+d...@20...> - 2008-09-11 12:46:35
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-08-26 13:34:47 -0700): > On Thu, 24 Jul 2008, Jukka Salmi wrote: > > OK, I'll try to get a trace of the milter process ASAP. But as you > > said, this could easily take some weeks... > > Hi Jukka. Were you able to capture a trace of the filter encountering > this failure? Sorry for the long delay, I've been AFK for some time. I wanted to upgrade dkim-milter to the latest release before trying to capture a trace, but haven't found enough time to do this so far. Thus I enabled tracing on the milter process only now, still running 2.6.0. Waiting for the next `"X-DKIM" header add failed' message to appear in the log... Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-07-08 23:16:13
|
The only thing I can think of asking for now is a truss/strace/ktrace of the process around the time of the error to see if we can spot the errant close() call. Unfortunately if it can take hours or days or weeks for the problem to appear, such a log could be enormous. ...but if you have a way to get one, it might prove valuable. |
From: Jukka S. <j+d...@20...> - 2008-07-24 15:47:53
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-07-08 16:16:10 -0700): > The only thing I can think of asking for now is a truss/strace/ktrace of > the process around the time of the error to see if we can spot the errant > close() call. Unfortunately if it can take hours or days or weeks for the > problem to appear, such a log could be enormous. > > ...but if you have a way to get one, it might prove valuable. OK, I'll try to get a trace of the milter process ASAP. But as you said, this could easily take some weeks... Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-05-23 17:46:06
|
Can you post your configuration file someplace? And are you using the stock resolver or libar? If we're catching EBADF, that means something on the filter side is improperly closing the file descriptor which connects the filter to the MTA. I'll try to find code that could do something like that. |
From: Jukka S. <j+d...@20...> - 2008-05-23 21:19:06
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-05-23 10:45:58 -0700): > Can you post your configuration file someplace? Sure, it's available [1]here. > And are you using the stock resolver or libar? Stock resolver. Regards, Jukka [1] http://salmi.ch/~jukka/dkim-milter/dkim-milter_vrfy.conf -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-07-08 23:03:34
|
On Fri, 23 May 2008, Jukka Salmi wrote: > Sure, thanks. With this patch applied, I see mi_wr_cmd() fail with > EBADF. Log is [1]available. > > Regards, Jukka > > [1] http://salmi.ch/~jukka/dkim-milter/maillog_20080522 Yep, this confirms the previous findings. Everything is fine with the I/O between postfix and the filter until: - postfix sends SMFIC_BODYEOM (end-of-message) to the filter and waits for a reply - postfix immediately decides the wait for the reply has failed (though the "why" remains a mystery), shuts down its connection to the filter and temp-fails the message - dkim-filter still thinks the connection is there, so it tries to send an SMFIR_INSHEADER (insert header) request, which fails because the socket is actually no longer open - since the insert header request fails, it replies with SMFIR_TEMPFAIL to try to get the message to temp-fail, but this also fails since the socket is no longer open We know the second write returns with EBADF, meaning the descriptor has been closed from the filter side. If it were the postfix side closing the connection, we'd be seeing EPIPE instead of EBADF. It looks a lot like fd 8 in the dkim-filter process has suddenly become invalid for no apparent reason. There's no path that I can see in libmilter's source code to having that descriptor closed and yet continuing to try to use it. dkim-filter doesn't have access to the milter context structure in order to get access to that descriptor number, so for it to be the problem it would have to call close() someplace on the wrong descriptor number. However, neither libdkim nor dkim-filter ever close() anything in normal operation because there's no need to do so. libdkim only creates (and later closes, via libcrypto) temporary files for certain special circumstances, and your configuration doesn't appear to be using any of those. So, for the moment, I'm stumped. My best guesses now are a bug in the underlying socket handling code (i.e. libc or the kernel) or something in libcrypto which is causing BIO_free() to close the wrong descriptor from time to time. Someone said this doesn't happen if you change from UNIX domain sockets to TCP sockets. Has this also been tried? |
From: Jukka S. <j+d...@20...> - 2008-07-24 15:46:38
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-07-08 16:03:28 -0700): > On Fri, 23 May 2008, Jukka Salmi wrote: > > Sure, thanks. With this patch applied, I see mi_wr_cmd() fail with > > EBADF. Log is [1]available. > > > > Regards, Jukka > > > > [1] http://salmi.ch/~jukka/dkim-milter/maillog_20080522 > > Yep, this confirms the previous findings. Everything is fine with the I/O > between postfix and the filter until: > > - postfix sends SMFIC_BODYEOM (end-of-message) to the filter and waits for > a reply > > - postfix immediately decides the wait for the reply has failed (though > the "why" remains a mystery), shuts down its connection to the filter and > temp-fails the message > > - dkim-filter still thinks the connection is there, so it tries to send an > SMFIR_INSHEADER (insert header) request, which fails because the socket is > actually no longer open > > - since the insert header request fails, it replies with SMFIR_TEMPFAIL to > try to get the message to temp-fail, but this also fails since the socket > is no longer open > > We know the second write returns with EBADF, meaning the descriptor has > been closed from the filter side. If it were the postfix side closing the > connection, we'd be seeing EPIPE instead of EBADF. > > It looks a lot like fd 8 in the dkim-filter process has suddenly become > invalid for no apparent reason. There's no path that I can see in > libmilter's source code to having that descriptor closed and yet > continuing to try to use it. dkim-filter doesn't have access to the > milter context structure in order to get access to that descriptor number, > so for it to be the problem it would have to call close() someplace on the > wrong descriptor number. However, neither libdkim nor dkim-filter ever > close() anything in normal operation because there's no need to do so. > libdkim only creates (and later closes, via libcrypto) temporary files for > certain special circumstances, and your configuration doesn't appear to be > using any of those. > > So, for the moment, I'm stumped. My best guesses now are a bug in the > underlying socket handling code (i.e. libc or the kernel) or something in > libcrypto which is causing BIO_free() to close the wrong descriptor from > time to time. The systems in question will get an OS upgrade in the next weeks / months. Let's see if the milter problem disappears with that upgrade... > Someone said this doesn't happen if you change from UNIX domain sockets to > TCP sockets. Has this also been tried? Yes; I was originally using UNIX domain sockets when I first saw the problem but switched to TCP sockets then to be able to capture the packets. I haven't switched back since... Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Murray S. K. <ms...@se...> - 2008-05-21 18:22:12
Attachments:
PATCH
|
Another diagnostics patch to libmilter. This one will also log I/O errors it encounters. This will (hopefully) determine whether or not the descriptor connecting the filter to the MTA is being prematurely closed and, if so, which end is doing it. Jukka, if you're still out there and willing, please give this one a try. |
From: Jukka S. <j+d...@20...> - 2008-05-23 16:37:19
|
Murray S. Kucherawy --> dkim-milter-discuss (2008-05-21 11:22:04 -0700): > Another diagnostics patch to libmilter. This one will also log I/O > errors it encounters. > > This will (hopefully) determine whether or not the descriptor connecting > the filter to the MTA is being prematurely closed and, if so, which end > is doing it. > > Jukka, if you're still out there and willing, please give this one a try. Sure, thanks. With this patch applied, I see mi_wr_cmd() fail with EBADF. Log is [1]available. Regards, Jukka [1] http://salmi.ch/~jukka/dkim-milter/maillog_20080522 -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: Jukka S. <j+d...@20...> - 2008-04-24 10:51:18
|
Jukka Salmi --> dkim-milter-discuss (2008-04-18 11:32:59 +0200): > Hello, > > Murray S. Kucherawy --> dkim-milter-discuss (2008-04-10 16:27:17 -0700): > > Make the attached patch to libmilter, rebuild your library and then > > relink dkim-filter, and let 'er rip. This will log a message for every > > piece of I/O between the filter and the MTA so the increase in your mail > > log could be substantial, and will also log a message if a milter > > callback returns a non-standard value (which skips the reply, which may > > be the problem). Unfortunately it looks like this is the best next step > > to isolating the problem. > > > > Talk to you in about a week, right? :-) > > a week and a day later the (probably same) problem happened on my > secondary MX (mx2); previously it only happened on my primary MX (mx1). Yesterday I saw the problem on my primary MX. The relevant parts of the MTA logs are available [1]here. Chronological order of events: - connect from first client (guinness.omniscient.com, connection 1 in the following) - connect from second client (suckless.org, connection 2) one second later - dkim-milter fails to parse message from connection 1 but successfully verifies it: failed to parse Authentication-Results: header DKIM verification successful - connect from third client (mail.netbsd.org, connection 3) about a minute later - immediate milter-reject for message from connection 3: can't read SMFIC_BODYEOB reply packet header: Undefined error: 0 "X-DKIM" header add failed - reception of that rejected message via secondary MX (mx2.salmi.ch) - some minutes later: milter-reject for messages from connections 1 and 2: can't read SMFIC_BODYEOB reply packet header: Operation timed out - reception of those rejected messages via secondary MX > Please tell if I should provide more information. Still true... Regards, Jukka [1] http://salmi.ch/~jukka/dkim-milter/maillog.libmilterdebug -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |
From: SM <sm...@re...> - 2008-04-24 19:17:21
|
Hi Jukka, At 03:51 24-04-2008, Jukka Salmi wrote: >Yesterday I saw the problem on my primary MX. The relevant parts of >the MTA logs are available [1]here. > >Chronological order of events: > >- connect from first client (guinness.omniscient.com, connection 1 in > the following) >- connect from second client (suckless.org, connection 2) one second > later >- dkim-milter fails to parse message from connection 1 but successfully > verifies it: > failed to parse Authentication-Results: header You can ignore that warning. > DKIM verification successful >- connect from third client (mail.netbsd.org, connection 3) about a > minute later >- immediate milter-reject for message from connection 3: > can't read SMFIC_BODYEOB reply packet header: Undefined error: 0 > "X-DKIM" header add failed I'm not sure whether this fix would work. You could patch libmilter to return a reply and see whether it avoids the above condition. Regards, -sm |
From: Jukka S. <j+d...@20...> - 2008-04-11 14:23:14
|
Hello, Murray S. Kucherawy --> dkim-milter-discuss (2008-04-10 16:27:17 -0700): > Make the attached patch to libmilter, rebuild your library and then > relink dkim-filter, and let 'er rip. done. I'm already seeing lots of "milter returned bogus status m in state n" lines. Is this expected behaviour? > Talk to you in about a week, right? :-) Given that your patch doesn't mask the problem, yes ;-) Regards, Jukka -- bashian roulette: $ ((RANDOM%6)) || rm -rf ~ |