From: Michal L. <ml...@lo...> - 2009-08-17 14:18:02
|
Hi guys,

Two years after 1.7.6 it's high time to put out a new release of SQLgrey. I wonder if there are any patches floating around that you'd like to have included in the upcoming release?

So far I have fixed some issues with IPv6 handling - namely the 'classc'/'smart' trimming of IPv6 addresses - and added support for v6 addresses in clients_ip_whitelist[.local]. I bet there has been some other work as well in the two years since the 1.7.6 release. Anyone?

And one more thing - are there any major issues that should be fixed for the next release? Something that's holding us back from calling the 1.7.x codebase "stable"? I ask because I'm tempted to fork a new stable branch for 1.8.x versions.

Speak up!

Michal
--
* http://smtp-cli.logix.cz - the ultimate command line smtp client |
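For context on what the v6 trimming has to mirror: for IPv4 the 'classc'/'smart' methods greylist on a trimmed network prefix (a /24 for 'classc') rather than on the exact source address, so retries coming from a neighbouring host in the same sending farm still match. Below is a purely illustrative Perl sketch of the equivalent idea for IPv6; the /64 prefix length and the helper name are assumptions made here for illustration, not necessarily what the new code does.

    use Socket qw(inet_pton inet_ntop AF_INET6);

    # Illustration only: keep the first $prefix_len bits of an IPv6 address
    # and zero the rest, by analogy with the /24 trimming done for IPv4.
    sub trim_v6_prefix {
        my ($addr, $prefix_len) = @_;
        my $bin = inet_pton(AF_INET6, $addr) or return $addr;   # 128 bits, packed
        my $keep = int($prefix_len / 8);
        # zero everything past the kept prefix (whole bytes only, enough for /64)
        substr($bin, $keep) = "\0" x (16 - $keep);
        return inet_ntop(AF_INET6, $bin);
    }

    print trim_v6_prefix('2001:db8:1:2:aaaa:bbbb:cccc:dddd', 64), "\n";
    # prints 2001:db8:1:2::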
From: Ralph S. <sql...@se...> - 2009-08-17 14:58:30
|
Michal Ludvig wrote:
> Speak up!

I don't have any specific requests, but I want to extend a brief "thank you" to you for still maintaining SQLgrey.

-R |
From: Lionel B. <lio...@bo...> - 2009-08-17 17:06:31
|
Michal Ludvig wrote on 08/17/2009 03:47 PM:
> Hi guys,
>
> Two years after 1.7.6 it's high time to put out a new release of
> SQLgrey. I wonder if there are any patches floating around that you'd
> like to have included in the upcoming release?
>
> So far I fixed some issues with IPv6 handling - namely fixed
> 'classc'/'smart' trimming of IPv6 addresses and added support for v6
> addrs for clients_ip_whitelist[.local].
>
> I bet there must be some other work as well that popped up in the two
> years since the 1.7.6 release. Anyone?
>

We should fix the case where a spammer sends a message to all MX of a domain at the same time. The problem is that several SQLgrey instances sharing the same database do the following (simplifying a bit):
- check that the message doesn't match a whitelist (no),
- check if it is already in the connect table (it isn't),
- try to create an entry in the "connect" table.

Only one instance can complete the last step; all the others get an SQL error (which is completely normal). SQLgrey doesn't know about this case and thinks there's a connection problem with the database. It then:
- sends an email to the admin,
- reconnects to the database,
- later, when another action is requested and the database reacts correctly, sends an email saying all is OK with the db again.

I've seen a patch around that tries to solve the problem, but IIRC it's MySQL specific so it won't do. What I'd like to do is simply ignore errors occurring when SQLgrey tries to write an entry in the connect table. If the database is really down or misbehaving, SQLgrey has plenty of other opportunities to detect this and warn the admin (reading from the database on the next email being processed, for example).

> And one more thing - are there any major issues that should be fixed for
> the next release? Something that's holding us back from calling the
> 1.7.x codebase "stable"? I ask because I'm tempted to fork a new stable
> branch for 1.8.x versions.
>

I agree with you on the 1.8.x naming. 1.7.6 has been battle-tested for a long time now.

Lionel |
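A minimal Perl/DBI sketch of the error-tolerant write Lionel describes above: swallow a failed INSERT instead of treating it as a database outage. The helper name is invented, it assumes RaiseError is set on the handle, and the column list follows the connect table definition quoted later in the thread; this is not the patch that was later posted to the list.

    # Sketch only: ignore a failed INSERT into the connect table instead of
    # reconnecting and mailing the admin. A real outage still gets noticed on
    # the next read from the database. Assumes RaiseError => 1 on $dbh.
    sub insert_connect_entry {
        my ($dbh, $sender_name, $sender_domain, $src, $rcpt) = @_;
        my $ok = eval {
            $dbh->do('INSERT INTO connect ' .
                     '(sender_name, sender_domain, src, rcpt, first_seen) ' .
                     'VALUES (?, ?, ?, ?, NOW())',
                     undef, $sender_name, $sender_domain, $src, $rcpt);
            1;
        };
        # If another MX instance won the race the INSERT fails; that is
        # harmless, so we deliberately neither reconnect nor warn here.
        return $ok ? 1 : 0;
    }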
From: Karl O. P. <ko...@me...> - 2009-08-17 22:03:45
|
On 08/17/2009 11:48:25 AM, Lionel Bouton wrote:
> We should fix the case where a spammer sends a message to all MX of a
> domain at the same time. The problem is that several SQLgrey instances
> sharing the same database do the following (simplifying a bit):
> - check that the message doesn't match a whitelist (no),
> - check if it is already in the connect table (it isn't),
> - try to create an entry in the "connect" table.
> Only one instance can complete the last step, all others get an SQL
> error (and it's completely normal).
>
> What I'd like to do is simply to ignore errors occurring when SQLgrey
> tries to write an entry in the connect table.

The "right" thing to do in the case of PostgreSQL is:
- check that the message doesn't match a whitelist (no),
- begin a transaction (BEGIN;),
- ensure the processes don't mess with each other
  (SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE;),
- check if it is already in the connect table
  -- if so, abort the transaction (ROLLBACK;) and done,
- try to create an entry in the "connect" table,
- commit the transaction (COMMIT;).

This approach uses the database features designed for just such a situation. I couldn't say whether this is the best choice for sqlgrey.

Regards,

Karl <ko...@me...>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein |
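Roughly what that sequence looks like from Perl through DBI against PostgreSQL - a sketch under assumptions rather than SQLgrey code: the DSN and credentials are placeholders, the column names follow the connect table definition quoted later in the thread, and any failure of the transaction is simply treated as "another instance got there first".

    use DBI;

    # Placeholder connection; DSN and credentials are illustrative.
    my $dbh = DBI->connect('dbi:Pg:dbname=sqlgrey', 'sqlgrey', '',
                           { AutoCommit => 1, RaiseError => 1, PrintError => 0 });

    sub record_first_attempt {
        my ($dbh, $sender_name, $sender_domain, $src, $rcpt) = @_;
        eval {
            $dbh->begin_work;    # BEGIN
            $dbh->do('SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE');
            my ($seen) = $dbh->selectrow_array(
                'SELECT 1 FROM connect WHERE sender_name = ? AND sender_domain = ? ' .
                'AND src = ? AND rcpt = ?',
                undef, $sender_name, $sender_domain, $src, $rcpt);
            if ($seen) {
                $dbh->rollback;  # already recorded: abort the transaction and done
            } else {
                $dbh->do('INSERT INTO connect ' .
                         '(sender_name, sender_domain, src, rcpt, first_seen) ' .
                         'VALUES (?, ?, ?, ?, NOW())',
                         undef, $sender_name, $sender_domain, $src, $rcpt);
                $dbh->commit;
            }
        };
        if ($@) {
            # Under SERIALIZABLE a concurrent insert by another MX shows up as a
            # serialization failure; roll back and treat it as already handled.
            eval { $dbh->rollback };
        }
    }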
From: Lionel B. <lio...@bo...> - 2009-08-17 23:45:03
|
Karl O. Pinc wrote on 08/18/2009 12:03 AM:
> The "right" thing to do in the case of PostgreSQL is:
> - check that the message doesn't match a whitelist (no),
> - begin a transaction (BEGIN;),
> - ensure the processes don't mess with each other
>   (SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ WRITE;),
> - check if it is already in the connect table
>   -- if so, abort the transaction (ROLLBACK;) and done,
> - try to create an entry in the "connect" table,
> - commit the transaction (COMMIT;).
>

This could work with MySQL with minor modifications (I'm not sure about the transaction isolation level support; we might have to lock the whole table explicitly). But I don't think it's a good path to follow for SQLgrey.

I'll argue that using transactions doesn't solve the problem in the most efficient way. Transactions have a higher cost than simply handling an SQL error (which happens quite rarely compared to all the SQL queries generated by SQLgrey).

For maintenance reasons, I'd also advise against having different code paths for different databases. There are already a few of them and I had problems with them at the time they were implemented (complex code is always more difficult to maintain).

So the cost/benefit ratio isn't in favor of transactions. Although they are the cleanest way from a theoretical point of view, in practice they would bring a lot of pain and SQLgrey can work just as well without them.

Lionel |
From: Michal L. <ml...@lo...> - 2009-08-24 13:09:40
|
Lionel Bouton wrote:
> We should fix the case where a spammer sends a message to all MX of a
> domain at the same time. The problem is that several SQLgrey instances
> sharing the same database do the following (simplifying a bit):
> - check that the message doesn't match a whitelist (no),
> - check if it is already in the connect table (it isn't),
> - try to create an entry in the "connect" table.
> Only one instance can complete the last step, all others get an SQL
> error (and it's completely normal).

I don't think this can happen with current SQLgrey as the connect table doesn't have a unique or primary index:

    sub create_connect_table {
        # Note: no primary key, Mysql can't handle 500+ byte primary keys
        # connect should not become big enough to make it a problem
        $self->do("CREATE TABLE $tablename " .
                  '(sender_name varchar(64) NOT NULL, ' .
                  'sender_domain varchar(255) NOT NULL, ' .
                  'src varchar(39) NOT NULL, ' .
                  'rcpt varchar(255) NOT NULL, ' .
                  'first_seen timestamp NOT NULL)')
            or $self->mydie(...);
    }

    sub create_connect_indexes($) {
        my $self = shift;
        $self->do("CREATE INDEX $connect" . '_idx ' .
                  "ON $connect (src, sender_domain, sender_name)")
            or $self->mydie(...);
        $self->do("CREATE INDEX $connect" . '_fseen ' .
                  "ON $connect (first_seen)")
            or $self->mydie(...);
    }

Does the problem actually exist at all?

Michal |
From: Lionel B. <lio...@bo...> - 2009-08-24 14:32:02
|
Michal Ludvig wrote on 08/24/2009 03:10 PM:
> Lionel Bouton wrote:
>
>> We should fix the case where a spammer sends a message to all MX of a
>> domain at the same time. The problem is that several SQLgrey instances
>> sharing the same database do the following (simplifying a bit):
>> - check that the message doesn't match a whitelist (no),
>> - check if it is already in the connect table (it isn't),
>> - try to create an entry in the "connect" table.
>> Only one instance can complete the last step, all others get an SQL
>> error (and it's completely normal).
>>
>
> I don't think this can happen with current SQLgrey as the connect table
> doesn't have a unique or primary index:
>

I keep forgetting that! Damn MySQL...

>     sub create_connect_table {
>         # Note: no primary key, Mysql can't handle 500+ byte primary keys
>         # connect should not become big enough to make it a problem
>         $self->do("CREATE TABLE $tablename " .
>                   '(sender_name varchar(64) NOT NULL, ' .
>                   'sender_domain varchar(255) NOT NULL, ' .
>                   'src varchar(39) NOT NULL, ' .
>                   'rcpt varchar(255) NOT NULL, ' .
>                   'first_seen timestamp NOT NULL)')
>             or $self->mydie(...);
>     }
>
>     sub create_connect_indexes($) {
>         my $self = shift;
>         $self->do("CREATE INDEX $connect" . '_idx ' .
>                   "ON $connect (src, sender_domain, sender_name)")
>             or $self->mydie(...);
>         $self->do("CREATE INDEX $connect" . '_fseen ' .
>                   "ON $connect (first_seen)")
>             or $self->mydie(...);
>     }
>
> Does the problem actually exist at all?
>

You are right: the problem exists in the from_awl and (less often) in the domain_awl tables, which do have constraints, but not in the connect table. I see from_awl errors even on a moderately loaded domain (less than 10_000 mails/day).

The same logic can solve it though: if we ignore errors when writing to the awl tables, we can still detect DB errors when reading from them and when accessing the other tables.

Lionel |
From: Michal L. <ml...@lo...> - 2009-08-25 00:53:52
Attachments:
sqlgrey-ignorable-do.patch
|
Lionel Bouton wrote:
> You are right, the problem exists in the from_awl and (less often) in
> the domain_awl tables where there are constraints but not the connect
> table. I see from_awl errors even on a moderately loaded domain (less
> than 10_000 mails/day).

The attached patch seems to fix the problem for artificially triggered conflicts. I don't experience the problem on my server, so I can't verify that it's the right fix. Could someone test it in the real world and let me know if it helped?

Michal |
From: Michal L. <ml...@lo...> - 2009-09-02 05:06:01
|
Michal Ludvig wrote:
> Lionel Bouton wrote:
>
>> You are right, the problem exists in the from_awl and (less often) in
>> the domain_awl tables where there are constraints but not the connect
>> table. I see from_awl errors even on a moderately loaded domain (less
>> than 10_000 mails/day).
>
> The attached patch seems to fix the problem for artificially triggered
> conflicts. I don't seem to experience the problem on my server so I
> can't verify that it's the right fix. Could someone test it in real
> world and let me know if it helped?

Hey guys ... is anyone testing the patch? Did it help?

Michal |
From: Michal L. <ml...@lo...> - 2009-09-28 22:15:59
|
Michal Ludvig wrote:
> Michal Ludvig wrote:
>> Lionel Bouton wrote:
>>
>>> You are right, the problem exists in the from_awl and (less often) in
>>> the domain_awl tables where there are constraints but not the connect
>>> table. I see from_awl errors even on a moderately loaded domain (less
>>> than 10_000 mails/day).
>>
>> The attached patch seems to fix the problem for artificially triggered
>> conflicts. I don't seem to experience the problem on my server so I
>> can't verify that it's the right fix. Could someone test it in real
>> world and let me know if it helped?
>
> Hey guys ... is anyone testing the patch? Did it help?

... a month later ...

Hey guys ... has anyone tested the patch? Did it help?

If nobody bothered to test it I guess the 'problem' doesn't really exist or is not annoying enough and therefore a fix is not needed. Does that sound like the right conclusion? ;-)

Michal |
From: Lionel B. <lio...@bo...> - 2009-08-18 06:48:02
|
Michal Ludvig wrote on 08/18/2009 03:40 AM:
> Dan Faerch wrote:
>
>> There is a bug I'll be hunting this week. It's an undocumented feature
>> for when running multiple servers against one database server. I needed
>> that only one specific server would do the cleanup, so I added (I think
>> in 1.7.5) the possibility to add a file called "dont_db_clean" in the
>> sqlgrey config directory. I've recently noticed that somehow I run into
>> a deadlock where nobody cleans the db, but the "last_clean" timestamp
>> still gets updated, resulting in nothing getting cleaned.
>>
>
> In smtpd_access_policy() it does:
>
>      # Is it time for cleanups ?
>      if ($current_time > ($self->{sqlgrey}{last_dbclean} +
>                           $self->{sqlgrey}{db_cleandelay})) {
>          # updateconfig() returns affected_rows
> 1==>     if ($self->updateconfig('last_dbclean',
>                                  $current_time, $self->{sqlgrey}{last_dbclean})) {
>              # If affected_rows > 0, its my job to clean the db
>              $self->{sqlgrey}{last_dbclean} = $current_time;
> 2==>         $self->start_cleanup();
>              ...
>
> I.e. first it updates the last_dbclean value (#1) and only then calls
> start_cleanup()

It's a design decision: it's meant to prevent several SQLgrey processes from launching the cleanup at the same time (it's still possible but the time window is very short).

> (#2) which means last_dbclean is updated even if cleanup
> failed for some reason. That's clearly a wrong order of actions.
>

Not if you keep the original design, which implies that the cleanup is distributed amongst the SQLgrey processes.

> Perhaps we could have a "cleanup_in_progress" config option that could
> be used as a semaphore. Set it atomically with:
>
>     $rows_affected = $dbi->do("
>         UPDATE config
>            SET cleanup_in_progress=NOW()
>          WHERE cleanup_in_progress IS NULL");
>
> Then if $rows_affected == 1 we succeeded and can go ahead with cleanup.
> If $rows_affected == 0 someone else started cleanup in the meantime

And if this SQLgrey process crashes later on, cleanup is toast: we can't afford that.

> - no problem, we'll go ahead with other processing.
>
> Eventually we could rework the whole cleaning process. Instead of doing
> (potentially massive) cleanups in fixed time periods there could be a
> minor cleanup on every, say, 1000th call. No need to count the calls,
> simple 'if(int(rand(1000))==0) { .. cleanup .. }' should be sufficient.
> Perhaps have three independent rng calls [cleanup_connect(),
> cleanup_from_awl(), cleanup_domain_awl()] to spread the load over time
> even further.
>

Hum, random behaviour is difficult to debug... What's wrong with setting a low db_cleandelay?

> The advantage here is that the cleanup will take place at intervals
> relative to the load of the given server, not in fixed time periods that
> may be difficult to set right. Each of these cleanups should be pretty
> fast, affecting only a few rows,

No, because the amount of work doesn't depend on the current load but on the load present around max_connect_age ago (and awl_age ago, in a smaller proportion).

> therefore the problem with multiple
> sqlgrey processes attempting to cleanup at the same time should go away
> as well (firstly because they're not likely to trigger cleanup at the
> very same moment

This could be implemented with a small random offset relative to the db_cleandelay if it's a problem. But is it really?

> and secondly because the time spent with tables locked
> should be very short.

As said in another mail, if you have problems with tables locked, change your backend.

Lionel |
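For reference, a sketch of the cleanup_in_progress semaphore proposed above, with the failure mode Lionel points out spelled out in a comment. The cleanup_in_progress column and the cleanup_tables() helper are hypothetical, and $dbh is assumed to be a connected DBI handle.

    # Sketch of the proposed semaphore (hypothetical cleanup_in_progress column,
    # hypothetical cleanup_tables() helper). DBI's do() returns the number of
    # affected rows, so only one process can win the UPDATE below.
    my $claimed = $dbh->do(
        'UPDATE config SET cleanup_in_progress = NOW() ' .
        'WHERE cleanup_in_progress IS NULL');

    if (defined $claimed && $claimed > 0) {
        # We won the race: clean up, then release the flag.
        cleanup_tables($dbh);
        $dbh->do('UPDATE config SET cleanup_in_progress = NULL');
    }
    # The objection above: if the winning process crashes before the second
    # UPDATE runs, the flag stays set and nobody ever cleans the database again.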
From: Michal L. <ml...@lo...> - 2009-08-18 09:56:59
|
Lionel Bouton wrote:
> Michal Ludvig wrote on 08/18/2009 03:40 AM:
>> (#2) which means last_dbclean is updated even if cleanup
>> failed for some reason. That's clearly a wrong order of actions.
>
> Not if you don't change the original design which imply that the cleanup
> is distributed amongst SQLgrey processes.

I understand that; however, it shouldn't record "done" before it's really done, because it may record 'done', then keep crashing while doing the job and never actually complete it. And the record would still say 'done'.

> Hum, random behaviour is difficult to debug...

Not really in this case, is it? ;-) For debugging you can turn random off and call cleanup for each email if you like. Random behaviour is often evil, but here it provides a simple way to achieve something that's not that critical. I think it's not inappropriate.

> What's wrong with setting a low db_cleandelay?

How low? On very busy servers even 5 minutes can be too much, while on many others a 5-minute interval may call cleanup more often than emails happen to arrive ;-) My "adaptive interval" seems to fit more sizes without much tweaking.

>> The advantage here is that the cleanup will take place at intervals
>> relative to the load of the given server not in fixed time periods that
>> may be difficult to set right. Each of these cleanups should be pretty
>> fast, affecting only a few rows,
>
> No,

Yes,

> because the amount of work doesn't depend on the current load but on
> the load present around max_connect_age ago

Assuming the load doesn't vary too much from day to day, it does work. I don't operate any really big mail servers; the busiest one I have access to does between 7k and 12k emails per day, which is still an acceptable variation for the method I propose.

>> therefore the problem with multiple
>> sqlgrey processes attempting to cleanup at the same time should go away
>> as well (firstly because they're not likely to trigger cleanup at the
>> very same moment
>
> This could be implemented with a small random offset relative to the
> db_cleandelay if it's a problem. But is it really?

To be honest I don't know. I actually don't experience the problem I'm trying to solve here. I just wanted to bring up a new approach that may eventually help. Take it or leave it, I don't mind either way.

Michal
--
* http://smtp-cli.logix.cz - the ultimate command line smtp client |
From: Kenneth M. <kt...@ri...> - 2009-08-18 13:02:07
|
On Tue, Aug 18, 2009 at 04:48:59PM +1200, Michal Ludvig wrote:
> Michal Ludvig wrote:
>
> > Eventually we could rework the whole cleaning process. Instead of doing
> > (potentially massive) cleanups in fixed time periods there could be a
> > minor cleanup on every, say, 1000th call. No need to count the calls,
> > simple 'if(int(rand(1000))==0) { .. cleanup .. }' should be sufficient.
>
> Actually... attached is a patch that implements it. I'm running it with
> cleanup_chance=10 (ie a 1:10 chance to perform cleanup) on one of my
> light-loaded servers and it seems to do the job. So far I've seen 13
> cleanups in 150 greylist checks, which is quite on track.
>
> BTW how about dropping the async cleanup path completely? Given the
> scary comment in sqlgrey.conf I don't believe anyone dares to use it anyway.
>
> Michal

I would rather get the async cleanup path working. This will allow non-locking backends to continue to process incoming mail connections without blocking during the cleanup. A synchronous cleanup process is pretty painful on a busy mail system, and will get more painful the busier it gets.

Regards,
Ken

> Index: sqlgrey
> ===================================================================
> RCS file: /cvsroot/sqlgrey/sqlgrey/sqlgrey,v
> retrieving revision 1.110
> diff -u -r1.110 sqlgrey
> --- sqlgrey 17 Aug 2009 13:15:03 -0000 1.110
> +++ sqlgrey 18 Aug 2009 03:13:20 -0000
> @@ -86,6 +86,8 @@
>  $dflt{discrimination} = 0;
>  $dflt{discrimination_add_rulenr} = 0;
>
> +$dflt{cleanup_chance} = 1000; # Cleanup the DB roughly in 1:1000 calls.
> +
>  $dflt{log} = { # note values here are not used
>      'grey' => 2,
>      'whitelist' => 2,
> @@ -1608,16 +1610,13 @@
>  }
>
>  ## Choose the actual cleanup method
> -sub start_cleanup {
> +sub try_cleanup {
>      my $self = shift;
>
> -    if ($dflt{dont_db_clean}) {
> -        $self->mylog('conf', 2, "This host has db-cleaning disabled");
> -        return;
> -    }
> +    return if (int(rand($dflt{cleanup_chance})) != 0);
>
>      if ($self->{sqlgrey}{clean_method} eq 'sync') {
> -        $self->cleanup();
> +        $self->cleanup();
>      } else {
>          $self->fork_cleanup();
>      }
> @@ -2309,31 +2308,8 @@
>      # we need the value of now() in the database
>      $self->update_dbnow();
>
> -    # If !defined last_dbclean, reload value from DB
> -    if (!defined $self->{sqlgrey}{last_dbclean}) {
> -        $self->{sqlgrey}{last_dbclean} = $self->getconfig('last_dbclean');
> -
> -        # if last_dbclean not found in db then write it.
> -        if (!defined $self->{sqlgrey}{last_dbclean}) {
> -            # 0 will force a cleanup (unless db_cleandelay is really huge)
> -            $self->setconfig('last_dbclean',0);
> -            $self->{sqlgrey}{last_dbclean} = 0;
> -        }
> -    }
> -
> -    # Is it time for cleanups ?
> -    my $current_time = time();
> -    if ($current_time > ($self->{sqlgrey}{last_dbclean} + $self->{sqlgrey}{db_cleandelay})) {
> -        # updateconfig() returns affected_rows
> -        if ($self->updateconfig('last_dbclean',$current_time,$self->{sqlgrey}{last_dbclean})) {
> -            # If affected_rows > 0, its my job to clean the db
> -            $self->{sqlgrey}{last_dbclean} = $current_time;
> -            $self->start_cleanup();
> -        } else {
> -            #If affected_rows == 0, then someone already cleaned db
> -            $self->{sqlgrey}{last_dbclean} = undef; #make sqlgrey reload time from db on next pass
> -        }
> -    }
> +    # See if we get a chance to clean up the DB a little
> +    $self->try_cleanup();
>
>      # domain scale awl check
>      if ($self->is_in_domain_awl($sender_domain, $cltid)) {
> @@ -2572,7 +2548,6 @@
>      awl_age => $dflt{awl_age},
>      # How many from match a domain/IP before a switch to domain AWL
>      domain_level => $dflt{group_domain_level},
> -    last_dbclean => undef, # triggers reload from db
>      db_cleandelay => $dflt{db_cleandelay}, # between table cleanups (seconds)
>
>      db_prepare_cache => $dflt{db_prepare_cache},
> Index: etc/sqlgrey.conf
> ===================================================================
> RCS file: /cvsroot/sqlgrey/sqlgrey/etc/sqlgrey.conf,v
> retrieving revision 1.19
> diff -u -r1.19 sqlgrey.conf
> --- etc/sqlgrey.conf 17 Aug 2009 12:43:11 -0000 1.19
> +++ etc/sqlgrey.conf 18 Aug 2009 03:13:21 -0000
> @@ -124,13 +124,14 @@
>  # db_prepare_cache = 0   # use prepared statements cache
>  #                        # BEWARE: memory leaks have been reported
>  #                        # when it is active
> -# db_cleandelay = 1800   # in seconds, how much time between database cleanups
>  # clean_method = sync    # sync : cleanup is done in the main process,
>  #                        # delaying other operations
>  #                        # async: cleanup is done in a forked process,
>  #                        # it won't delay mail processing
>  #                        # BEWARE: lockups have been reported
>  #                        # and are still investigated
> +# cleanup_chance = 1000  # How often should DB cleanup be performed?
> +                         # By default once in every 1000 calls.
>
>  ## Database clustering (for advanced setups)
>  # |
From: Lionel B. <lio...@bo...> - 2009-08-18 14:07:48
|
Kenneth Marshall wrote on 08/18/2009 03:01 PM:
> On Tue, Aug 18, 2009 at 04:48:59PM +1200, Michal Ludvig wrote:
>
>> Michal Ludvig wrote:
>>
>>> Eventually we could rework the whole cleaning process. Instead of doing
>>> (potentially massive) cleanups in fixed time periods there could be a
>>> minor cleanup on every, say, 1000th call. No need to count the calls,
>>> simple 'if(int(rand(1000))==0) { .. cleanup .. }' should be sufficient.
>>>
>> Actually... attached is a patch that implements it. I'm running it with
>> cleanup_chance=10 (ie 1:10 chance to perform cleanup) on one of my
>> light-loaded servers and it seems to do the job. So far I've seen 13
>> cleanups in 150 greylist checks which is quite on track.
>>
>> BTW How about dropping the async cleanup path completely? Given the
>> scary comment in sqlgrey.conf I don't believe anyone dares to use it anyway.
>>
>> Michal
>>
> I would rather get the async cleanup path working. This will allow
> non-locking backends to continue to process incoming mail connections
> without blocking during the cleanup. A synchronous cleanup process is
> pretty painful on a busy mail system, and will get more painful the
> busier it gets.
>

My thoughts exactly.

Several facts came up from the discussions on this list:
- currently db_cleandelay is arbitrary and doesn't suit all needs,
- on very busy domains (millions of mails/day) long cleanups are painful to the point of preventing legitimate mail from getting through in a timely fashion when using blocking backends (SQLite and MyISAM),
- on the same very busy domains, multiple SQLgrey processes launch the cleanup concurrently, which is inefficient,
- we don't use non-blocking backends efficiently with the "sync" cleanup.

I think db_cleandelay should go away. If SQLgrey is clever enough, it can compute an appropriate value for it based on the DB behaviour. What we want to avoid are calls to cleanup methods that block SQLgrey processes long enough to make Postfix hit smtpd_policy_service_timeout (100s by default). A good target may be a maximum of 5s for cleanup execution (subject to testing). If db_cleandelay becomes dynamic (between a minimum higher than the target above and 1h) and is allowed to adapt to the cleanup execution time in this range (see earlier mail on the algorithm), we'll solve the problems with SQLite and MyISAM blocking.

What's left is to prevent multiple SQLgrey processes from wasting time on concurrent cleanups. That's not a big problem with MyISAM (it blocks concurrent cleanups, which makes all but the first to execute a no-op), but it could be a performance problem for InnoDB and PostgreSQL. There are two ways:
- force the cleanup into a unique external process; there are drawbacks I don't like: we lose redundancy, make configuration more complex (add a "don't clean" flag) and add another process to attend to (check, launch, ...),
- make sure the cleanups from several processes don't overlap: this is more complex for the code, but makes things more robust and simpler for the admins.

I'd prefer the second approach. Fuzzy delays (introducing randomness in the actual delay used) should help with that, but I still have to think about it.

Then there's the async cleanup. After all, I think we are able to solve the problem with async. We can test it again and if it still fails, there's another way to deal with this. Instead of forking the SQLgrey process (which async does, and which didn't work correctly last time we checked), we can simply fork+exec another process dedicated to the cleanup. It's a little slower, but we'll only lose maybe 10-20ms per cleanup, and this will completely isolate the main process from the database problems occurring when we fork. Doing this would even provide the tools for people willing to isolate the cleanup process from SQLgrey execution. I'm not sure async is worth it if we solve the other problems, though.

Lionel |
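A minimal sketch of the fork+exec variant mentioned above, assuming a dedicated helper program exists (the /usr/sbin/sqlgrey-cleanup path and its options are invented for illustration). The child execs immediately, so the database handle inherited from the parent is never touched on the child side and the parent returns to processing mail right away.

    use POSIX ();

    # Sketch only: run the cleanup in a short-lived dedicated process instead
    # of forking the full SQLgrey process. Helper path and options are invented.
    sub spawn_cleanup {
        my $self = shift;
        my $pid = fork();
        if (!defined $pid) {
            $self->mylog('conf', 2, "couldn't fork for cleanup: $!");
            return;
        }
        if ($pid == 0) {
            # Child: replace ourselves with the cleanup program, so the DBI
            # handle inherited from the parent is never used on this side.
            exec('/usr/sbin/sqlgrey-cleanup', '--config', '/etc/sqlgrey/sqlgrey.conf')
                or POSIX::_exit(1);    # exec() only returns on failure
        }
        # Parent: return immediately; the child is reaped via SIGCHLD/waitpid.
        return;
    }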