From: Philippe C. <sql...@pa...> - 2009-11-19 11:25:04

On 11/18/09 8:41 PM, Len Conrad wrote:
> out of 100Ks of msgs, I have no way of knowing which message, if any,
> triggered the error.

But you said that it's a disconnect followed by a basically immediate
reconnect. I'm assuming you're not getting 100Ks of msgs *per second*.

But I also just noticed the domain in your e-mail address; if you get
"accents" regularly but not the disconnects, that would disprove the
"accents" possibility I mentioned before.

More information, just in case: I think we've talked before (a year or
two ago) about the "accent" disconnect coming from your DB having the
"wrong" language set as its backing/collating language. But I don't
recall how one checks that per database or what the suggested fix
was....

-- Philippe Chaintreuil

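For reference, one way to perform the per-database check Philippe mentions is to query information_schema for the default character set and collation. The sketch below assumes MySQL and placeholder connection details; it is not part of SQLgrey itself:

#!/usr/bin/perl
# Sketch: report the default character set / collation of the sqlgrey
# database on MySQL. Host, user and password below are placeholders.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=information_schema;host=xxx',
                       'user', 'password', { RaiseError => 1 });

my ($charset, $collation) = $dbh->selectrow_array(
    'SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME '
  . 'FROM SCHEMATA WHERE SCHEMA_NAME = ?', undef, 'sqlgrey');

print "sqlgrey database: charset=$charset collation=$collation\n";
$dbh->disconnect;

If the reported collation cannot represent accented sender addresses, changing it (for example with ALTER DATABASE ... CHARACTER SET ... COLLATE ...) is one possible fix, though the right remedy isn't confirmed here.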
From: Len C. <LC...@Go...> - 2009-11-19 02:06:25

> On 11/18/2009 05:00 PM, Len Conrad wrote:
>> every few hours, and with immediate recovery, we get this error:
>>
>> SQLgrey encountered an SQL error and triggered a reconnection to:
>> DBI:mysql:database=sqlgrey;host=xxx
>
> Does this occur whenever you get an e-mail with a non-ASCII character
> in one of the e-mail addresses? Which is to say anytime you get an
> e-mail from someone with an accent in the from address?

Out of 100Ks of msgs, I have no way of knowing which message, if any,
triggered the error.

btw, sqlgrey on localhost and the remote sqlgrey both go over tcp (not
the localhost socket).

Len

From: Philippe C. <sql...@pa...> - 2009-11-18 22:56:25

On 11/18/2009 05:00 PM, Len Conrad wrote:
> every few hours, and with immediate recovery, we get this error:
>
> SQLgrey encountered an SQL error and triggered a reconnection to:
> DBI:mysql:database=sqlgrey;host=xxx

Does this occur whenever you get an e-mail with a non-ASCII character
in one of the e-mail addresses? Which is to say anytime you get an
e-mail from someone with an accent in the from address?

-- Philippe Chaintreuil

From: Len C. <lc...@Go...> - 2009-11-18 22:29:28

MySQL 5.0.77, FreeBSD port mysql-server-5.0.77_1
FreeBSD 7.2
latest sqlgrey

On the MX remote from the sql srvr, for 16 hours Wed:

egrep -ic "sqlgrey.*grey: new:" /var/log/maillog
201826

On the MX with the sqlgrey db:

mx6# egrep -ic "sqlgrey.*grey: new:" /var/log/maillog
287840

Every few hours, and with immediate recovery, we get this error:

SQLgrey encountered an SQL error and triggered a reconnection to:
DBI:mysql:database=sqlgrey;host=xxx

The hosts are on the same 1Gb switch.

It's not very "broken", but is it something that can be fixed?

thanks
Len

From: Michal L. <ml...@lo...> - 2009-09-28 22:15:59

Michal Ludvig wrote:
> Michal Ludvig wrote:
>> Lionel Bouton wrote:
>>
>>> You are right, the problem exists in the from_awl and (less often) in
>>> the domain_awl tables where there are constraints but not the connect
>>> table. I see from_awl errors even on a moderately loaded domain (less
>>> than 10_000 mails/day).
>>
>> The attached patch seems to fix the problem for artificially triggered
>> conflicts. I don't seem to experience the problem on my server so I
>> can't verify that it's the right fix. Could someone test it in real
>> world and let me know if it helped?
>
> Hey guys ... is anyone testing the patch? Did it help?

... a month later ...

Hey guys ... has anyone tested the patch? Did it help?

If nobody bothered to test it I guess the 'problem' doesn't really exist
or is not annoying enough and therefore a fix is not needed. Does that
sound like the right conclusion? ;-)

Michal

From: Michal L. <ml...@lo...> - 2009-09-02 05:06:01

Michal Ludvig wrote:
> Lionel Bouton wrote:
>
>> You are right, the problem exists in the from_awl and (less often) in
>> the domain_awl tables where there are constraints but not the connect
>> table. I see from_awl errors even on a moderately loaded domain (less
>> than 10_000 mails/day).
>
> The attached patch seems to fix the problem for artificially triggered
> conflicts. I don't seem to experience the problem on my server so I
> can't verify that it's the right fix. Could someone test it in real
> world and let me know if it helped?

Hey guys ... is anyone testing the patch? Did it help?

Michal

From: Michal L. <ml...@lo...> - 2009-08-25 00:53:52

Lionel Bouton wrote:
> You are right, the problem exists in the from_awl and (less often) in
> the domain_awl tables where there are constraints but not the connect
> table. I see from_awl errors even on a moderately loaded domain (less
> than 10_000 mails/day).

The attached patch seems to fix the problem for artificially triggered
conflicts. I don't seem to experience the problem on my server so I
can't verify that it's the right fix. Could someone test it in real
world and let me know if it helped?

Michal

From: Lionel B. <lio...@bo...> - 2009-08-24 14:32:02

Michal Ludvig wrote on 08/24/2009 03:10 PM:
> Lionel Bouton wrote:
>
>> We should fix the case where a spammer sends a message to all MX of a
>> domain at the same time. The problem is that several SQLgrey instances
>> sharing the same database do the following (simplifying a bit):
>> - check that the message doesn't match a whitelist (no),
>> - check if it is already in the connect table (it isn't),
>> - try to create an entry in the "connect" table.
>> Only one instance can complete the last step, all others get an SQL
>> error (and it's completely normal).
>
> I don't think this can happen with current SQLgrey as the connect table
> doesn't have a unique or primary index:

I keep forgetting that! Damn MySQL...

> sub create_connect_table {
>     # Note: no primary key, Mysql can't handle 500+ byte primary keys
>     # connect should not become big enough to make it a problem
>     $self->do("CREATE TABLE $tablename " .
>               '(sender_name varchar(64) NOT NULL, ' .
>               'sender_domain varchar(255) NOT NULL, ' .
>               'src varchar(39) NOT NULL, ' .
>               'rcpt varchar(255) NOT NULL, ' .
>               'first_seen timestamp NOT NULL)')
>         or $self->mydie(...);
> }
>
> sub create_connect_indexes($) {
>     my $self = shift;
>     $self->do("CREATE INDEX $connect" . '_idx ' .
>               "ON $connect (src, sender_domain, sender_name)")
>         or $self->mydie(...);
>     $self->do("CREATE INDEX $connect" . '_fseen ' .
>               "ON $connect (first_seen)")
>         or $self->mydie(...);
> }
>
> Does the problem actually exist at all?

You are right, the problem exists in the from_awl and (less often) in
the domain_awl tables where there are constraints but not the connect
table. I see from_awl errors even on a moderately loaded domain (less
than 10_000 mails/day).

The same logic can solve it though: if we ignore errors writing to the
awl tables, we can still detect DB errors when reading from them and
accessing the other tables.

Lionel

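To make the proposed logic concrete, here is a rough sketch of tolerating awl write conflicts while still treating everything else as a database failure; the subroutine, column names and error matching below are simplified placeholders, not SQLgrey's actual code:

# Sketch: ignore duplicate-key conflicts when two instances race to
# insert the same awl row, but keep dying on any other DB error.
# Table/column names and the error test are illustrative only.
sub put_in_from_awl_sketch {
    my ($self, $sender_name, $sender_domain, $src) = @_;
    my $sth = $self->{dbh}->prepare(
        'INSERT INTO from_awl (sender_name, sender_domain, src, first_seen) '
      . 'VALUES (?, ?, ?, NOW())');
    unless ($sth->execute($sender_name, $sender_domain, $src)) {
        my $err = $self->{dbh}->errstr;
        # Another instance already inserted the row: not a real failure.
        return if $err =~ /duplicate|unique/i;
        $self->mydie("from_awl insert failed: $err");
    }
}

Reads from the awl tables would keep their normal error handling, so a genuinely broken database is still detected there.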
From: Michal L. <ml...@lo...> - 2009-08-24 13:09:40

Lionel Bouton wrote:
> We should fix the case where a spammer sends a message to all MX of a
> domain at the same time. The problem is that several SQLgrey instances
> sharing the same database do the following (simplifying a bit):
> - check that the message doesn't match a whitelist (no),
> - check if it is already in the connect table (it isn't),
> - try to create an entry in the "connect" table.
> Only one instance can complete the last step, all others get an SQL
> error (and it's completely normal).

I don't think this can happen with current SQLgrey as the connect table
doesn't have a unique or primary index:

sub create_connect_table {
    # Note: no primary key, Mysql can't handle 500+ byte primary keys
    # connect should not become big enough to make it a problem
    $self->do("CREATE TABLE $tablename " .
              '(sender_name varchar(64) NOT NULL, ' .
              'sender_domain varchar(255) NOT NULL, ' .
              'src varchar(39) NOT NULL, ' .
              'rcpt varchar(255) NOT NULL, ' .
              'first_seen timestamp NOT NULL)')
        or $self->mydie(...);
}

sub create_connect_indexes($) {
    my $self = shift;
    $self->do("CREATE INDEX $connect" . '_idx ' .
              "ON $connect (src, sender_domain, sender_name)")
        or $self->mydie(...);
    $self->do("CREATE INDEX $connect" . '_fseen ' .
              "ON $connect (first_seen)")
        or $self->mydie(...);
}

Does the problem actually exist at all?

Michal

From: Lionel B. <lio...@bo...> - 2009-08-24 07:55:18

Michal Ludvig wrote on 08/24/2009 07:54 AM:
> Michal Ludvig wrote:
>
>> Lionel Bouton wrote:
>>
>>> I'm git all the way.
>>
>> Cool. I converted the repo to GIT but don't have enough permission on SF
>> to enable it. So it's your turn (or make me project admin and I'll do it).
>
> Hi Lionel, apparently you have enabled GIT for SQLgrey a couple of days
> ago.

Yes, I was surprised you didn't take advantage of my doing so :-)

Lionel

From: Michal L. <ml...@lo...> - 2009-08-24 05:53:14

Michal Ludvig wrote:
> Lionel Bouton wrote:
>
>> I'm git all the way.
>
> Cool. I converted the repo to GIT but don't have enough permission on SF
> to enable it. So it's your turn (or make me project admin and I'll do it).

Hi Lionel, apparently you have enabled GIT for SQLgrey a couple of days
ago. Since nothing else has happened since then I took the liberty of
populating it. Hope you don't mind.

I did some checkouts and commits and it seems to work fine. All the
history is there as well, including 1.6-branch. Let me know if you spot
any problems.

Michal

From: Lionel B. <lio...@bo...> - 2009-08-21 08:17:08

Phillip Smith wrote on 08/21/2009 04:07 AM:
> Hi Lionel,
>
> I couldn't find any mailing lists or forums on the website, only your
> e-mail address.

There is a user mailing list:
https://lists.sourceforge.net/lists/listinfo/sqlgrey-users (cc'ed).

> Let me know if I've missed something obvious and I'll ask this
> question there :)
>
> I have 2 MX servers both using sqlgrey with a common/shared database -
> ie, whitelists and optin/optout only have to be maintained in one place.
>
> Is the system supposed to ignore the 'reconnect_delay' setting in this
> kind of setup?

No, it should work correctly.

> I have an example message that was grey-listed on my primary MX server:
>
> Aug 18 22:11:28 dingo postfix/smtpd[18060]: NOQUEUE: reject: RCPT from
> relay04.mail-hub.dodo.com.au[123.2.6.239]: 450 4.7.1
> <te...@ry...>: Recipient address rejected: Greylisted for 5 minutes;
> from=<dan...@da...> to=<te...@ry...> proto=ESMTP
> helo=<relay04.mail-hub.dodo.com.au>
>
> Which made the remote MTA attempt delivery to the secondary MX only 2
> seconds later:
>
> Aug 18 22:11:30 platypus postfix/qmgr[26506]: 4A96A8B67:
> from=<dan...@da...>, size=15367, nrcpt=1 (queue active)

Are you certain this is the same message and not another from the same
user but to another one that was greylisted before? You should look at
SQLgrey's log to find out what happened.

> My secondary MX then accepted the message because (I assume) of the
> entry in the database from when the primary MX grey-listed it, however
> there was only a 2 second gap so it didn't respect the reconnect_delay
> setting. Both servers have '5' as the reconnect_delay. They are both
> NTP sync'ed correctly, so the relative timestamps are accurate.

NTP should always be used on mail servers but SQLgrey can work without
it because it always uses the clock of the database. Unless something
bad happens to the database's clock, SQLgrey should be fine.

Lionel

From: Michal L. <ml...@lo...> - 2009-08-21 03:31:49

Lionel Bouton wrote:
> I'm git all the way.

Cool. I converted the repo to GIT but don't have enough permission on SF
to enable it. So it's your turn (or make me project admin and I'll do it).

BTW it used to be so easy when CVS was the only choice. Now almost every
project uses a different SCM, and being an SVN or HG or Quilt guru
doesn't help much when faced with GIT or Darcs or Arch or what not. But
hey, concurrently developing in 5 different languages on 3 different
platforms and actively using 4 different SCMs keeps me entertained ;-)

Michal
--
* http://smtp-cli.logix.cz - the ultimate command line smtp client

From: Lionel B. <lio...@bo...> - 2009-08-20 23:21:28

Michal Ludvig wrote on 08/21/2009 12:54 AM:
> Lionel Bouton wrote:
>
>> I discouraged elaborate matchers for a reason: they are slow.
>> If you use full IP or class-c, you are looking up a hash entry. If you
>> do regex or prefix-based matching, this can become a full sequential
>> search on a list, and this part of the code is hit on each and every
>> mail so it better be fast.
>
> It needn't be slow if done smartly ;-)
>
> What I have already (mostly) ready for checkin is roughly:
> - instead of current $whitelist{IP}{ip} and $whitelist{C}{ip}
>   have $whitelist{prefixlen}{ip} (and similar for IPv6).
>   For example the entries would be:
>     $whitelist{24}{"1.2.3.0"}
>     $whitelist{24}{"10.20.30.0"}
>     $whitelist{16}{"123.123.0.0"}
> - Then for each incoming IP:
>     foreach $prefix (sort(keys %whitelist)) {
>         $masked_ip = apply_prefix($ip, $prefix);
>         # Whitelist match?
>         return 1 if (defined $whitelist{$prefix}{$masked_ip});
>     }

Hum, I must agree this looks good. The worst theoretical possible case
would be 31 apply_prefix calls and hash lookups.

> [...]
> I vote for SVN ;-) Then you can use git-svn, can't you? I never used it
> myself though.

There's no gain in using git-svn if we want to share our developments. A
distributed source control system encourages parallel development and
would make it far easier to exchange changesets between different
branches.

I'm git all the way. In fact I'm so used to it now that I was depressed
by CVS (centralized systems have lost much of their appeal to me now,
but CVS in particular should simply be buried and forgotten) the last
times I played with SQLgrey's sources and gave up working with it.

Lionel

From: Michal L. <ml...@lo...> - 2009-08-20 22:54:21

Karl O. Pinc wrote:
> On 08/20/2009 07:19:20 AM, Michal Ludvig wrote:
>> BTW Would you mind if I migrated the CVS repo to SVN or HG?
> Darcs?
> http://darcs.net/

Er.. not another one please! I already deal with projects in GIT, HG,
SVN and CVS. Don't make me add another VCS into the mix ;-)

Michal
--
* http://smtp-cli.logix.cz - the ultimate command line smtp client

From: Michal L. <ml...@lo...> - 2009-08-20 22:53:47

Lionel Bouton wrote:
> I discouraged elaborate matchers for a reason: they are slow.
> If you use full IP or class-c, you are looking up a hash entry. If you
> do regex or prefix-based matching, this can become a full sequential
> search on a list, and this part of the code is hit on each and every
> mail so it better be fast.

It needn't be slow if done smartly ;-)

What I have already (mostly) ready for checkin is roughly:
- instead of current $whitelist{IP}{ip} and $whitelist{C}{ip}
  have $whitelist{prefixlen}{ip} (and similar for IPv6).
  For example the entries would be:
    $whitelist{24}{"1.2.3.0"}
    $whitelist{24}{"10.20.30.0"}
    $whitelist{16}{"123.123.0.0"}
- Then for each incoming IP:
    foreach $prefix (sort(keys %whitelist)) {
        $masked_ip = apply_prefix($ip, $prefix);
        # Whitelist match?
        return 1 if (defined $whitelist{$prefix}{$masked_ip});
    }

Since there are not likely to be more than a handful of different
prefix-lengths in %whitelist, the overhead over the current
implementation is marginal. Instead of the current two hash lookups (in
->{IP} and ->{C}) it may do perhaps three or four lookups, depending on
the number of different prefix-lengths. Definitely nowhere near a
sequential search, don't worry ;-)

> Git (or Mercurial) would allow us to branch easily to prepare the fixes
> for 1.8.0 and the evolutions for 1.8.x at the same time. But Mercurial
> would be a speed bump for me :-(

I vote for SVN ;-) Then you can use git-svn, can't you? I never used it
myself though.

Michal
--
* http://smtp-cli.logix.cz - the ultimate command line smtp client

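As a self-contained illustration of the lookup sketched above, the following runs as-is; the sample entries and apply_prefix() here are one possible IPv4 implementation, not SQLgrey's actual code:

#!/usr/bin/perl
# Sketch of prefix-based whitelist matching. Entries and apply_prefix()
# are illustrative; SQLgrey's real implementation may differ.
use strict;
use warnings;

my %whitelist = (
    24 => { '1.2.3.0' => 1, '10.20.30.0' => 1 },
    16 => { '123.123.0.0' => 1 },
);

# Mask an IPv4 address down to its network part for a given prefix length.
sub apply_prefix {
    my ($ip, $prefixlen) = @_;
    my $addr = unpack('N', pack('C4', split(/\./, $ip)));
    my $mask = $prefixlen ? (~0 << (32 - $prefixlen)) & 0xFFFFFFFF : 0;
    return join('.', unpack('C4', pack('N', $addr & $mask)));
}

sub ip_whitelisted {
    my ($ip) = @_;
    # Only a handful of distinct prefix lengths are expected, so this
    # loop costs little more than the old pair of hash lookups.
    foreach my $prefix (sort { $a <=> $b } keys %whitelist) {
        my $masked_ip = apply_prefix($ip, $prefix);
        return 1 if defined $whitelist{$prefix}{$masked_ip};
    }
    return 0;
}

print ip_whitelisted('1.2.3.42')  ? "listed\n" : "not listed\n";   # listed
print ip_whitelisted('192.0.2.1') ? "listed\n" : "not listed\n";   # not listed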
From: Lionel B. <lio...@bo...> - 2009-08-20 15:35:14

Karl O. Pinc wrote on 08/20/2009 04:55 PM:
> On 08/20/2009 08:48:30 AM, Lionel Bouton wrote:
>
>> I discouraged elaborate matchers for a reason: they are slow.
>> If you use full IP or class-c, you are looking up a hash entry. If you
>> do regex or prefix-based matching, this can become a full sequential
>> search on a list, and this part of the code is hit on each and every
>> mail so it better be fast.
>
> FWIW, PostgreSQL has network data types and functions and operators
> that support/detect subnet membership etc.
>
> Probably means a separate code path....

I know, a very early SQLgrey version supported them and I dropped this
to support MySQL and SQLite.

Lionel

From: Karl O. P. <ko...@me...> - 2009-08-20 14:55:38

On 08/20/2009 08:48:30 AM, Lionel Bouton wrote:
> I discouraged elaborate matchers for a reason: they are slow.
> If you use full IP or class-c, you are looking up a hash entry. If you
> do regex or prefix-based matching, this can become a full sequential
> search on a list, and this part of the code is hit on each and every
> mail so it better be fast.

FWIW, PostgreSQL has network data types and functions and operators
that support/detect subnet membership etc.

Probably means a separate code path....

Karl <ko...@me...>

Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

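For the curious, PostgreSQL's inet containment operator can be exercised from Perl's DBI as in the sketch below; the connection details are placeholders and this is not something SQLgrey actually does:

# Sketch: test subnet membership with PostgreSQL's << operator via DBI.
# dbname/host/user/password are placeholders.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:Pg:dbname=sqlgrey;host=xxx',
                       'user', 'password', { RaiseError => 1 });

my ($contained) = $dbh->selectrow_array(
    'SELECT CAST(CAST(? AS inet) << CAST(? AS inet) AS int)',
    undef, '10.20.30.40', '10.20.0.0/16');

print $contained ? "in subnet\n" : "not in subnet\n";
$dbh->disconnect;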
From: Lionel B. <lio...@bo...> - 2009-08-20 13:48:45

Michal Ludvig wrote on 08/20/2009 02:19 PM:
> Frankly this is quite a major change to be implemented at the last
> minute before 1.8.0. I propose to postpone it until after 1.8.0 - it's
> been waiting for two years, it can wait a month more.

I agree.

> There's one thing I'd like to have for 1.8.0 - support for netmasks in
> client_ip_whitelist[.local]. At the moment it only accepts full IP
> (1.2.3.4) or class-c notation (1.2.3), what I'd like to have is a
> prefix-based notation (1.2.0.0/16 or 10.20.30.40/28).

I discouraged elaborate matchers for a reason: they are slow.
If you use full IP or class-c, you are looking up a hash entry. If you
do regex or prefix-based matching, this can become a full sequential
search on a list, and this part of the code is hit on each and every
mail so it better be fast.

For 1.8.0 I'd like to have the fix for the erroneous database down/up
emails. The rest would be a plus.

> Still accepting the current syntax as well of course. That's pretty
> easy to verify for correctness and should be safe to include.
>
> BTW Would you mind if I migrated the CVS repo to SVN or HG?

We can? That would be sweet. Git would be my choice, this is what I use
now and I just saw this is available on SourceForge.

Git (or Mercurial) would allow us to branch easily to prepare the fixes
for 1.8.0 and the evolutions for 1.8.x at the same time. But Mercurial
would be a speed bump for me :-(

Lionel

From: Karl O. P. <ko...@me...> - 2009-08-20 13:11:42

On 08/20/2009 07:19:20 AM, Michal Ludvig wrote:
> BTW Would you mind if I migrated the CVS repo to SVN or HG?

Darcs?
http://darcs.net/

Karl <ko...@me...>

Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

From: Michal L. <ml...@lo...> - 2009-08-20 12:51:21

Lionel Bouton wrote:
> Just init the system with a very high cleanup frequency: 30s between
> each cleanup for example. Then let it find the sweet spot from that
> point. On most domains it should lower the cleanup frequency gradually
> and only on very busy domains will it keep it around the initial value.
> Everyone would be able to use this without any configuration.
>
> To make that work we only have to store the db_cleandelay in the config
> table in addition to the current last cleanup timestamp.
> For each SQLgrey process, we wait for our own (adjusted) copy of the
> delay to expire, reload the values stored in the DB to make sure
> another process hasn't changed things behind our back (then we know the
> cleanup is already done) and update our own internal copy of the
> next_cleanup_time based on the DB values. We set a minimum db_cleandelay
> of 10s, add or remove a random number of seconds (between -5 and +5)
> and we are set. Even if two or more servers happen to clean at the same
> time, it's not a problem: they should all take roughly the same time
> doing so (because one is blocking all the others on blocking backends
> or they share the delete work on non-blocking backends) and will all
> put roughly the same values in the database -> no unexpected cleandelay
> shooting through the roof or the floor.

Frankly this is quite a major change to be implemented at the last
minute before 1.8.0. I propose to postpone it until after 1.8.0 - it's
been waiting for two years, it can wait a month more.

There's one thing I'd like to have for 1.8.0 - support for netmasks in
client_ip_whitelist[.local]. At the moment it only accepts full IP
(1.2.3.4) or class-c notation (1.2.3), what I'd like to have is a
prefix-based notation (1.2.0.0/16 or 10.20.30.40/28). Still accepting
the current syntax as well of course. That's pretty easy to verify for
correctness and should be safe to include.

BTW Would you mind if I migrated the CVS repo to SVN or HG?

Michal
--
* http://smtp-cli.logix.cz - the ultimate command line smtp client

From: Lionel B. <lio...@bo...> - 2009-08-18 23:43:03

Dan Faerch wrote on 08/19/2009 12:08 AM:
> Dan Faerch wrote:
>
>> [...]
>> Right. Completely killing off the "old" cleanup_delay and replacing it
>> with a fuzzy cleanup is a great idea. However "max_allowed_clean_time"
>> needs to be calculated,

I don't think so. In fact we can very well determine what the value
should look like.

If we assume SQLgrey is blocked while cleaning (sync mode), we have to
prevent 2 problems:
- Postfix aborting the connection because of a policy service timeout
  (100s),
- Postfix refusing incoming connections because it reached the maximum
  number of smtpd processes.

The first is easy, just set a value for target_clean_time
(max_clean_time was a poor name choice) well below 100s and everything
is OK (if you set it to 10s for example, only a 10x surge in connect
traffic can make it a problem, 2s => 50x surge, ...).

It seems we don't have the second problem (as we didn't get reports
describing errors about max simultaneous processes). My guess is that
smtpd processes are busy for several seconds (actual mail transfer, RBL
and policy queries, ...) for each transfer, which makes admins of busy
servers tune them for more simultaneous connections and pushes the
second problem behind the limit where we witness the first (policy
service timeouts). So I think that if we solve the first problem we
kill two birds with one stone.

I propose that we set a target of 5s for the actual cleanup execution
time and adjust the cleanup frequency to try to keep below this value.

>> since it is relative how much time it takes to cleanup, depending on
>> the system, the dbserver, the cpu & IO load, the filesize of the dbs,
>> etc. etc.

Yes it is, but assuming the rate of row deletes is more or less a
constant, on a given domain with traffic patterns that don't change by
more than 2 orders of magnitude, the cleanup time doesn't change much
from run to run (not by more than 2 orders of magnitude at least).

>> So you would still need to employ LIMIT to ensure a definitive max.

I think we won't have to. Just init the system with a very high cleanup
frequency: 30s between each cleanup for example. Then let it find the
sweet spot from that point. On most domains it should lower the cleanup
frequency gradually and only on very busy domains will it keep it
around the initial value. Everyone would be able to use this without
any configuration.

To make that work we only have to store the db_cleandelay in the config
table in addition to the current last cleanup timestamp.
For each SQLgrey process, we wait for our own (adjusted) copy of the
delay to expire, reload the values stored in the DB to make sure
another process hasn't changed things behind our back (then we know the
cleanup is already done) and update our own internal copy of the
next_cleanup_time based on the DB values. We set a minimum db_cleandelay
of 10s, add or remove a random number of seconds (between -5 and +5)
and we are set. Even if two or more servers happen to clean at the same
time, it's not a problem: they should all take roughly the same time
doing so (because one is blocking all the others on blocking backends
or they share the delete work on non-blocking backends) and will all
put roughly the same values in the database -> no unexpected cleandelay
shooting through the roof or the floor.

>> [...]
>
> Ohhh. I just realized that i was answering based on the assumption that
> max_allowed_clean_time would be the maximum number of seconds a cleanup
> is allowed to take.. And that it might not be what you meant?

It's not the maximum (the name wasn't right, sorry). It should be the
sweet spot where everything runs smoothly, and the system should be able
to cope with 10x this value when an occasional surge occurs.

Seems like it's time for me to go back to the source :-)

Lionel

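To make the adjustment loop easier to picture, here is a small sketch of how the delay could adapt to the measured cleanup time; the names, bounds and jitter values are illustrative, and reading/writing the shared values through the config table is left out:

# Sketch: adapt the cleanup delay so each run takes roughly
# $target_clean_time seconds. All names and constants are illustrative,
# not SQLgrey's actual code; persistence in the config table is omitted.
use strict;
use warnings;

my $target_clean_time = 5;     # desired duration of one cleanup (s)
my $min_delay         = 10;    # never clean more often than this (s)
my $max_delay         = 3600;  # nor less often than once an hour (s)

sub next_clean_delay {
    my ($current_delay, $last_clean_time) = @_;
    $last_clean_time = 0.1 if $last_clean_time < 0.1;  # avoid div by zero
    my $ratio = $target_clean_time / $last_clean_time;
    # Bound the adjustment so one calm or stormy period doesn't make the
    # delay shoot through the roof or the floor.
    $ratio = 2   if $ratio > 2;
    $ratio = 0.5 if $ratio < 0.5;
    my $delay = $current_delay * $ratio;
    $delay = $min_delay if $delay < $min_delay;
    $delay = $max_delay if $delay > $max_delay;
    # Fuzzy delay: +/- 5 seconds so instances sharing a DB desynchronize.
    return $delay + (rand(10) - 5);
}

# Example: a 60s delay whose last cleanup took 12s gets shortened.
printf "next delay: %.1fs\n", next_clean_delay(60, 12);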
From: Lionel B. <lio...@bo...> - 2009-08-18 17:27:55

Karl O. Pinc wrote on 08/18/2009 06:53 PM:
> [...]
>> delay * (last_clean_time / max_allowed_clean_time).
>
> this does not make sense to me.

My mistake, I meant:

  delay * (max_allowed_clean_time / last_clean_time).

It would make the cleanup time converge towards a value around
max_allowed_clean_time (which should really be called
target_cleanup_time). As said previously it's not as simple as that, but
I use this kind of algorithm to set up the refresh period of RSS feeds
in a custom RSS reader and it gets a very good compromise between not
refreshing them too often and not often enough, even with more or less
unpredictable content changes (which is very similar to our constraints:
don't clean up too often to avoid putting unnecessary load on the DB and
enough to avoid blocking too long, and also take the irregular traffic
pattern into account).

Lionel

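As a quick worked example of the corrected formula (with made-up numbers, not taken from the thread): with a current delay of 60s, a target of 5s and a last cleanup that took 12s, the next delay would be 60 * (5 / 12) = 25s, so cleanups run more often and each one has less garbage to delete; a cleanup that took only 1s would instead push the delay towards 60 * (5 / 1) = 300s, subject to whatever bounds are applied.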
From: Karl O. P. <ko...@me...> - 2009-08-18 16:53:24

On 08/18/2009 08:58:26 AM, Lionel Bouton wrote:
> Dan Faerch wrote on 08/18/2009 11:36 AM:
> Yes, but this can be avoided almost entirely with a better cleandelay
> algorithm (see my next mail).
>
>> [...]
>> I currently use 60 seconds. My idea was to make a LIMIT high enough to
>> minimize the timeout scenario, and if the limit is hit, cut cleanup time
>
> You mean the db_cleandelay value? Then we probably will agree.
>
>> down to eg. 30 seconds until the limit isn't hit any more. This will
>> give sqlgrey breathing room to continue to handle users. I could keep
>> doing '"delay" * 0.5' for every time the limit is hit, thus going from
>> 60 seconds, to 30, to 15, 7, 4, 2 then "if (delay < 2) Remove_LIMIT
>> clause".
>
> You can avoid that by not doing:
>   delay * 0.5
> but something like
>   delay * (last_clean_time / max_allowed_clean_time).

this does not make sense to me.

When there's more garbage to collect it seems this will collect garbage
less often. You'd want to collect garbage more often. Something like:

  delay * max(1 - last_clean_time / max_allowed_clean_time, 0)

If you wind up continuously collecting garbage, well, your system is
too small to handle the load.

> And protect against unwanted fluctuations of the cleandelay by bounding
> the value of (last_clean_time / max_allowed_clean_time).
> Ie: if you have a periodic botnet attack interleaved with calm periods,
> you don't want the delay to increase too fast but you want it to
> decrease fast.
> You may have a temporary problem when a more aggressive botnet comes
> (ie: the current cleandelay will be too high on the first run), but
> SQLgrey would be ready to handle the load on the next cleanup.

Karl <ko...@me...>

Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

From: Lionel B. <lio...@bo...> - 2009-08-18 14:07:48

Kenneth Marshall wrote on 08/18/2009 03:01 PM:
> On Tue, Aug 18, 2009 at 04:48:59PM +1200, Michal Ludvig wrote:
>
>> Michal Ludvig wrote:
>>
>>> Eventually we could rework the whole cleaning process. Instead of doing
>>> (potentially massive) cleanups in fixed time periods there could be a
>>> minor cleanup on every, say, 1000th call. No need to count the calls,
>>> simple 'if(int(rand(1000))==0) { .. cleanup .. }' should be sufficient.
>>
>> Actually... attached is a patch that implements it. I'm running it with
>> cleanup_chance=10 (ie 1:10 chance to perform cleanup) on one of my
>> light-loaded servers and it seems to do the job. So far I've seen 13
>> cleanups in 150 greylist checks which is quite on track.
>>
>> BTW How about dropping the async cleanup path completely? Given the
>> scary comment in sqlgrey.conf I don't believe anyone dares to use it
>> anyway.
>>
>> Michal
>
> I would rather get the async cleanup path working. This will allow
> non-locking backends to continue to process incoming mail connections
> without blocking during the cleanup. A synchronous cleanup process is
> pretty painful on a busy mail system, and will get more painful the
> busier it gets.

My thoughts exactly.

There are several facts that came up from the discussions on this list:
- currently db_cleandelay is arbitrary and doesn't suit all needs,
- on very busy domains (millions of mails/day) long cleanups are painful
  to the point of preventing legitimate mails from going in in a timely
  fashion when using blocking backends (SQLite and MyISAM),
- on the same very busy domains, multiple SQLgrey processes launch the
  cleanup concurrently, which is inefficient,
- we don't use non-blocking backends efficiently with the "sync"
  cleanup.

I think db_cleandelay should go away. If SQLgrey is clever enough it can
compute an appropriate value for it based on the DB behaviour. What we
want to avoid are calls to cleanup methods that block SQLgrey processes
long enough to make Postfix hit smtpd_policy_service_timeout (100s by
default). A good target may be a maximum of 5s for cleanup execution
(subject to testing). If db_cleandelay becomes dynamic (between a
minimum higher than the target above and 1h) and is allowed to adapt to
the cleanup execution time in this range (see earlier mail on the
algorithm), we'll solve the problems with SQLite and MyISAM blocking.

What's left is to prevent multiple SQLgrey processes from wasting time
on concurrent cleanups. That's not a big problem with MyISAM (it blocks
concurrent cleanups, which makes all but the first to execute no-ops),
but it could be a performance problem for InnoDB and PostgreSQL. There
are two ways:
- force the cleanup into a unique external process; there are drawbacks
  I don't like: we lose redundancy, make configuration more complex (add
  a "don't clean" flag) and add another process to attend to (check,
  launch, ...),
- make sure the cleanups from several processes don't overlap: this is
  more complex for the code, but makes things more robust and simple for
  the admins.
I'd prefer to use the second approach. Fuzzy delays (introducing
randomness in the actual delay used) should help with that but I've
still to think about it.

Then there's the async cleanup. After all I think we are able to solve
the problem with async. We can test it again and if it still fails,
there's another way to deal with this.

Instead of forking the SQLgrey process (which async does, and which
didn't work correctly last time we checked), we can simply fork+exec
another process dedicated to the cleanup. It's a little slower, but
we'll only lose maybe 10-20ms per cleanup and this will completely
isolate the main process from the database problems occurring when we
fork. Doing this would even provide the tools for people willing to
isolate the cleanup process from SQLgrey execution. I'm not sure async
is worth it if we solve the other problems, though.

Lionel

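The fork+exec variant described in the last paragraph could look roughly like the sketch below; "sqlgrey-cleanup" is a hypothetical helper program invented for illustration, not an existing tool shipped with SQLgrey:

# Sketch: run the cleanup in a dedicated process via fork+exec so the
# policy daemon never shares DBI handles or other state with the child.
# "sqlgrey-cleanup" and its options are made-up names for illustration.
use strict;
use warnings;
use POSIX qw(WNOHANG);

sub spawn_cleanup {
    my ($config_file) = @_;
    my $pid = fork();
    if (!defined $pid) {
        warn "cleanup fork failed: $!";
        return;
    }
    if ($pid == 0) {
        # Child: replace ourselves with the standalone cleanup program,
        # so nothing inherited from the parent (DB connections, buffers)
        # is ever reused here.
        exec('sqlgrey-cleanup', '--config', $config_file)
            or die "exec failed: $!";
    }
    return $pid;    # parent returns immediately to serving Postfix
}

# Reap any finished cleanup children without blocking the main loop.
sub reap_cleanups {
    1 while waitpid(-1, WNOHANG) > 0;
}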