| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-19 09:13:38
       | 
| Hi there, (I'm using PostgreSQL, so maybe your mileage may vary if you use another SQL DB system). I tried to optimize DB cleanup time by creating indexes on the last_seen column (domain_awl and from_awl) and on the first_seen column (connect), so the cleanup process can use these indexes to find entries older or newer than a given value rather than having to explore sequentially the complete tables. This can make a big difference for large sites having tens of thousands entries in their tables... However, usingt the PostgreSQL "explain" statement, I've been surprised to notice that the indexes I had created where *NOT* used when performing the kind of queries that SQLgrey performs, with a WHERE clause built such as in: select * from from_awl where last_seen < now() - INTERVAL '30 DAY'; However, the indexes would be used if the WHERE clause used a constant, such as in: select * from from_awl where last_seen < '2005-01-19 08:43:09'; It seems that what prevents the index from being used, is having the "now() - INTERVAL '65 DAY'" calculation performed as an expression in the WHERE clause comparison. So I wonder if it wouldn't be better to have SQLgrey calculate in Perl the wanted date boundary, then build an SQL query using this calculated boundary rather than asking the SQL engine to calculate it itself on the fly... Furthemore, it could help with different SQL DB systems using different INTERVAL syntaxes, as SQLgrey already needs to have an interval() subroutine adaptating the query syntax to each kind of SQL DB system. Probably, having this routine calculate the date boundary by itself, and spit a result that would be used as a constant in the queries would make a more universal solution, AND it would allow PostgreSQL to use indexes for performing this query, rather to have to process the DB sequentially... (I've tried to modify the SQL query syntax so it makes the calculation and uses the index, but I couldn't find a solution. However, I'm really not an SQL guru, so there may be an obvious solution that I missed ?) Comments ? -- Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-19 10:43:36
       
        
          
            Attachments:
            sqlgrey-1.4.7.interval.patch
          
        
       | 
| Le Samedi 19 F=E9vrier 2005 10:13, Michel Bouissou a =E9crit : > > It seems that what prevents the index from being used, is having the "n= ow() > - INTERVAL '65 DAY'" calculation performed as an expression in the WHER= E > clause comparison. > > So I wonder if it wouldn't be better to have SQLgrey calculate in Perl = the > wanted date boundary, then build an SQL query using this calculated > boundary rather than asking the SQL engine to calculate it itself on th= e > fly... > > Furthemore, it could help with different SQL DB systems using different > INTERVAL syntaxes, as SQLgrey already needs to have an interval() > subroutine adaptating the query syntax to each kind of SQL DB system. > > Probably, having this routine calculate the date boundary by itself, an= d > spit a result that would be used as a constant in the queries would mak= e a > more universal solution, AND it would allow PostgreSQL to use indexes f= or > performing this query, rather to have to process the DB sequentially... I've written a patch (attached) that I'm currently testing. It seems to w= ork=20 OK in PostgreSQL so far. I don't know if it would be OK for other SQL DB systems as well, but I su= ppose=20 that probably yes... Comments welcome. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-19 21:32:04
       
        
          
            Attachments:
            sqlgrey-1.4.7.interval2.patch
          
        
       | 
| Le Samedi 19 F=E9vrier 2005 11:43, Michel Bouissou a =E9crit : > > > It seems that what prevents the index from being used, is having the > > "now() - INTERVAL '65 DAY'" calculation performed as an expression in= the > > WHERE clause comparison. [...] > > Probably, having this routine calculate the date boundary by itself, = and > > spit a result [...] > > I've written a patch (attached) that I'm currently testing. It seems to > work OK in PostgreSQL so far. The attached patch supersedes and replaces the one I sent in my previous=20 email. It does basically the same thing, but it does it better ;-) : It specifie= s the=20 date (for PostgreSQL) in a format that is locale and "DateStyle" independ= ent,=20 and thus shouldn't be disturbed by such local settings. It also specifies the date in a format that should be OK for MySQL (teste= rs=20 needed ;-) For SQlite, I'm not sure, and some input is needed. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Klaus A. S. <kse...@gm...> - 2005-02-19 22:09:38
       | 
| > It does basically the same thing, but it does it better ;-) : It specifie= s the > date (for PostgreSQL) in a format that is locale and "DateStyle" independ= ent, > and thus shouldn't be disturbed by such local settings. Instead of using "YYYYMMDD HHMMSS" and localtime(), how about using "YYYYMMDDTHHMMSSZ" and gmtime() for Postgres? Cheers, --=20 Klaus Alexander Seistrup SubZeroNet =B7 Copenhagen =B7 Denmark | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-19 23:17:10
       
        
          
            Attachments:
            sqlgrey-1.4.7.interval3.patch
          
        
       | 
| Le Samedi 19 F=E9vrier 2005 22:31, Michel Bouissou a =E9crit : > > The attached patch supersedes and replaces the one I sent in my previou= s > email. Oops ! A bug somehow made its way into the patch ;-) There misses a ";" at the end of its line #39 Well, sorry, attached is the fixed patch, that "supersedes and replace th= e=20 previous one". Just for a ";" ;-)) --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-19 23:04:22
       | 
| Le Samedi 19 F=E9vrier 2005 23:09, Klaus Alexander Seistrup a =E9crit : > > Instead of using "YYYYMMDD HHMMSS" and localtime(), how about using > "YYYYMMDDTHHMMSSZ" and gmtime() for Postgres? The last_seen / first_seen fields in the tables are currently stored in=20 localtime, as "timestamp without time zone". I didn't change anything in the way this data is stored, only the way=20 intervals are calculated. When we calculate the interval, we need to use the same local time base, = or=20 the calculated intervals will be wrong. There is no special interest in using GMT here : Timestamps used in SQLgr= ey=20 only usage is to calculate durations (basically: How much time elapsed si= nce=20 an entry was added in connect ? How long since wed last saw an entry in=20 *_awl ?). Storing it as localtime seems to make more sense than using GMT= ,=20 and it's what is already done... There's little chance, if any, that the same SQLgrey DB would be used by=20 several servers on different timezones, so why use GMT...? Cheers. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E Tout le malheur des hommes vient de ce qu'ils ne vivent pas dans _le_ mon= de, mais dans _leur_ monde. -- H=E9raclite. | 
| 
      
      
      From: Klaus A. S. <kse...@gm...> - 2005-02-19 23:30:38
       | 
| Michel Bouissou wrote: > There's little chance, if any, that the same SQLgrey DB would be used > by several servers on different timezones, so why use GMT...? I was thinking that backup MXes might use the same database as the primary MX, and they might be in different timezones. What happens to localtime() timestamps (and intervals) as we pass from standard time to daylight saving time? For internal use, including storage, I tend to prefer UTC. For human consumption I tend to prefer localtime. Cheers, --=20 Klaus Alexander Seistrup SubZeroNet =B7 Copenhagen =B7 Denmark | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-19 23:49:03
       | 
| Le Dimanche 20 F=E9vrier 2005 00:30, Klaus Alexander Seistrup a =E9crit : > > > There's little chance, if any, that the same SQLgrey DB would be used > > by several servers on different timezones, so why use GMT...? > > I was thinking that backup MXes might use the same database as the > primary MX, and they might be in different timezones. They would probably not use the database directly, but maybe indirectly t= hru=20 using the same SQLgrey server. The timestamps are then the "internals" of= the=20 SQLgrey server, and are not known to the MXes. > What happens to localtime() timestamps (and intervals) as we pass from > standard time to daylight saving time? Not much. Addresses in AWLs may be kept one hour more or less than planne= d=20 (let's say 1 month +/- 1 hour instead of 1 month), which doesn't really=20 matter, does it ? Addresses in connect will be kept waiting for a reconnection for 1 more/l= ess=20 hour, which shouldn't have a big incidence either. What may be more noticeable is that, at the time the DST changes (in the=20 middle of the night) the "minimum time before accepting reconnection" may= =20 become one hour longer, or not enforced at all, for one hour. The practical consequences for this will be IMHO insignificant. Cheers. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Klaus A. S. <kse...@gm...> - 2005-02-20 06:51:17
       | 
| Michel Bouissou <mi...@bo...> wrote: >> I was thinking that backup MXes might use the same database as >> the primary MX, and they might be in different timezones. >=20 > They would probably not use the database directly, but maybe indirectly > thru using the same SQLgrey server. The timestamps are then the > "internals" of the SQLgrey server, and are not known to the MXes. I was actually imagining several sqlgrey daemons connecting to the same central database, but of course the scenario you mention is also possible. >> What happens to localtime() timestamps (and intervals) as we pass from >> standard time to daylight saving time? >=20 > Not much. Addresses in AWLs may be kept one hour more or less than > planned (let's say 1 month +/- 1 hour instead of 1 month), which doesn't > really matter, does it ? >=20 > Addresses in connect will be kept waiting for a reconnection for 1 more/l= ess > hour, which shouldn't have a big incidence either. It doesn't matter much, but the difference is there. =20 > What may be more noticeable is that, at the time the DST changes (in the > middle of the night) the "minimum time before accepting reconnection" may > become one hour longer, or not enforced at all, for one hour. It will be in the middle of the night at the database machine, but mail might be fetched with e.g. IMAP/POP from everywhere on the globe, or forwarded from the MX to other mail accounts in other timezones. =20 > The practical consequences for this will be IMHO insignificant. Whether it will be insignificant will depend on the actual sitation and use of that particular MX. IMHO the only sane thing to do when dealing with timestamps for other than human consumption is to use UTC. Cheers, --=20 Klaus Alexander Seistrup SubZeroNet =B7 Copenhagen =B7 Denmark | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-20 08:31:48
       | 
| Le Dimanche 20 F=E9vrier 2005 07:51, Klaus Alexander Seistrup a =E9crit : > > I was actually imagining several sqlgrey daemons connecting to the > same central database, but of course the scenario you mention is also > possible. I seems more logical to have several MTAs use the same SQLgrey server, ra= ther=20 than having several SQLgrey servers share the same database. > >> What happens to localtime() timestamps (and intervals) as we pass fr= om > >> standard time to daylight saving time? > > > > Not much. Addresses in AWLs may be kept one hour more or less than > > planned (let's say 1 month +/- 1 hour instead of 1 month), which does= n't > > really matter, does it ? > > > > Addresses in connect will be kept waiting for a reconnection for 1 > > more/less hour, which shouldn't have a big incidence either. > > It doesn't matter much, but the difference is there. Yep. But this difference makes no difference on a practical standpoint. > > What may be more noticeable is that, at the time the DST changes (in = the > > middle of the night) the "minimum time before accepting reconnection"= may > > become one hour longer, or not enforced at all, for one hour. > > It will be in the middle of the night at the database machine, but > mail might be fetched with e.g. IMAP/POP from everywhere on the globe, > or forwarded from the MX to other mail accounts in other timezones. > > > The practical consequences for this will be IMHO insignificant. > > Whether it will be insignificant will depend on the actual sitation > and use of that particular MX.=20 First, having an MTA and its SQLgrey server separated by several timezone= s=20 doesn't look like a good idea on both a performance and reliability=20 standpoint. This would be especially true between a primary and backup MX= ,=20 who IMHO should definitely not share the same SQLgrey server or database,= as=20 it would create a single point of failure. Second, let's be more precise about what may happen when the DST time cha= nges: - First note that this will only affect connections that are in the "new"= =20 status (in the connect table). Connections already known to the AWLs won'= t be=20 affected. - When going from winter time to summer time (2AM suddenly becomes 3AM), = it=20 will only affect connections that first came between (2AM - reconnect_del= ay)=20 and 2 AM (which makes typically 5 minutes...). These connections will be allowed to come back a little SOONER than it wo= uld=20 normally happen : Let's say the first connection comes at 1:58 and=20 reconnect_delay is 5 minutes. Connection normally wouldn't be accepted ag= ain=20 before 2:03, but 2:00 suddenly becomes 3:00, so reconnection will be=20 accepted, which results in having reduced the reconnect_delay from 5 to 2= =20 minutes. This will happen only for connections that were first seen betwe= en=20 1:55 and 2:00... And this will have no practical incidence. - When going from summer time to winter time (3AM becomes 2AM again, and = 1=20 hour is repeated), the phenomenon will be a little more complex, but yet = will=20 affect only first connections in the "connect" table. Let's say a connection first comes at 2:45 and reconnect_delay is 5 minut= es.=20 Reconnection will be accepted after 2:50. Now if the reconnection doesn't= =20 happen between 2:50 and 3:00, at 3:00, times goes back to 2:00 and=20 reconnection won't be accepted until it is 2:50 again. This creates a maximum "one hour greyhole" for a reconnection in the midd= le of=20 the night, on hour being the maximum (connections that first came in at 2= :10=20 and would be accepted at 2:15, for example, will only suffer a 15' greyho= le=20 if they hadn't reconnected before 3:00, when it becomes 2:00 again) On a practical standpoint, having this happen once a year in the middle o= f the=20 (local ;-) night won't make any difference for the vast majority of users= and=20 messages -- Note that greylisting a message always causes a delay over wh= ich=20 we don't have any control, because we can't control the incoming MTA's re= try=20 rate. Sometimes refusing a mail for 3 minutes will actually have it defer= red=20 for several hours if the remote MTA doesn't retry before several hours... So yet, this DST change story has a _little_ influence, but this influenc= e is=20 so little that it can practically be ignored IMHO. > IMHO the only sane thing to do when dealing with timestamps for other t= han > human consumption is to use UTC. Theoretically, yes, but there are always exception cases. Here, keeping t= he=20 database in local time will be more understandable for the admin who take= s an=20 human look ;-) inside the database, and makes no real difference in the w= ay=20 the system works. It is also worth noting that SQLgrey logs its activity in the same system= mail=20 log as Postfix, and Postfix logs in local time... So to keep everything=20 coherent, it's better having SQLgrey log (and work) in local time as well= ... Cheers. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E La plus grave maladie du cerveau c'est de r=E9fl=E9chir. -- Ko=E2n Shadok | 
| 
      
      
      From: Lionel B. <lio...@bo...> - 2005-02-20 23:34:57
       | 
| Michel Bouissou wrote the following on 19.02.2005 10:13 : >Hi there, > >(I'm using PostgreSQL, so maybe your mileage may vary if you use another SQL >DB system). > >I tried to optimize DB cleanup time by creating indexes on the last_seen >column (domain_awl and from_awl) and on the first_seen column (connect), so >the cleanup process can use these indexes to find entries older or newer than >a given value rather than having to explore sequentially the complete tables. > >This can make a big difference for large sites having tens of thousands >entries in their tables... > >However, usingt the PostgreSQL "explain" statement, I've been surprised to >notice that the indexes I had created where *NOT* used when performing the >kind of queries that SQLgrey performs, with a WHERE clause built such as in: > >select * from from_awl where last_seen < now() - INTERVAL '30 DAY'; > >However, the indexes would be used if the WHERE clause used a constant, such >as in: > >select * from from_awl where last_seen < '2005-01-19 08:43:09'; > >It seems that what prevents the index from being used, is having the "now() - >INTERVAL '65 DAY'" calculation performed as an expression in the WHERE clause >comparison. > > You just made me understand why these queries are slow. This isn't the whole syntax that is to blame, but the 'now()' function call. The database as no way of knowing that calls to now() will return the same result during the query (and in fact it doesn't). Not only does the database use a sequential scan which is *BAD* but it also calls now() for each and every one of the awl entries :-( If you do an explain on this : select * from from_awl where last_seen < timestamp '2005-02-21 00:12:46.22726+01' - INTERVAL '30 DAY'; You'll find that PostgreSQL is doing the right thing. I considered doing the now() call outside the database earlier (in fact when SQLite support was added). I want to avoid this if I can because I'd like to allow for several SQLgrey instances accessing the same database. As the default greylisting window is only 1 minute, either there must be ntp in the loop and we pay extra attention to the localtime accross the whole system (we don't want GMT in the database for the poor DBA) or we rely on a single point : the database. One way to do this might be to replace the now() with the "timestamp '<value>'" and fetch the value with a simple "SELECT now()". Leaving "SELECT 1" aside, this should be one of the most light query we can do :-) It works with PostgreSQL, I need to test MySQL and SQLite I'll isolate this into a separate method called only once per connection after whitelists evaluations. If needed we'll have the possibility of avoiding the database query by doing the work in Perl later. Lionel. | 
| 
      
      
      From: Max D. <Max...@lr...> - 2005-02-21 11:56:50
       | 
| Lionel Bouton wrote: > Michel Bouissou wrote the following on 19.02.2005 10:13 : > >> Hi there, >> >> (I'm using PostgreSQL, so maybe your mileage may vary if you use >> another SQL DB system). >> >> I tried to optimize DB cleanup time by creating indexes on the >> last_seen column (domain_awl and from_awl) and on the first_seen >> column (connect), so the cleanup process can use these indexes to find >> entries older or newer than a given value rather than having to >> explore sequentially the complete tables. >> >> This can make a big difference for large sites having tens of >> thousands entries in their tables... >> >> However, usingt the PostgreSQL "explain" statement, I've been >> surprised to notice that the indexes I had created where *NOT* used >> when performing the kind of queries that SQLgrey performs, with a >> WHERE clause built such as in: >> >> select * from from_awl where last_seen < now() - INTERVAL '30 DAY'; >> >> However, the indexes would be used if the WHERE clause used a >> constant, such as in: >> >> select * from from_awl where last_seen < '2005-01-19 08:43:09'; >> >> It seems that what prevents the index from being used, is having the >> "now() - INTERVAL '65 DAY'" calculation performed as an expression in >> the WHERE clause comparison. >> using first_seen and ip_addr as (separate) indices in connect (using MySQL 3.23.55), I found, that MySQL uses index for explain select count(*) from connect where first_seen < '2005-02-21 08:43:09'; as well as for explain select count(*) from connect where first_seen < now() - interval 12 hour; but not for explain select * from connect where first_seen < now() - interval 12 hour; or explain select * from connect where first_seen < '2005-02-21 08:43:09'; (or f.e. explain select sender_name, sender_domain, ip_addr from connect where first_seen < '2005-02-19 08:43:09'; ) So in case of MySQL 3.23.55 it doesn't seem to be a matter of using now() or not. On the other hand, using a compound small index like alter table connect add index (sender_name(8), sender_domain(8), rcpt(8)); wouldn't help on this kind of query, right? -- Max Diehn ------------------------------------------------- Leibniz-Rechenzentrum ! <mailto:di...@lr...> Barer Str. 21 ! 80333 Muenchen, Germany ! Tel: +49 89 289-27823 | 
| 
      
      
      From: Lionel B. <lio...@bo...> - 2005-02-21 13:39:41
       | 
| Max Diehn wrote the following on 21.02.2005 12:56 : > using first_seen and ip_addr as (separate) indices in connect (using > MySQL 3.23.55), I found, that MySQL uses index for > > explain select count(*) from connect where first_seen < > '2005-02-21 08:43:09'; > > as well as for > > explain select count(*) from connect where first_seen < now() - > interval 12 hour; > > but not for > > explain select * from connect where first_seen < now() - > interval 12 hour; > > or > > explain select * from connect where first_seen < '2005-02-21 > 08:43:09'; > > (or f.e. > explain select sender_name, sender_domain, ip_addr from connect > where first_seen < '2005-02-19 08:43:09'; > ) > > So in case of MySQL 3.23.55 it doesn't seem to be a matter of using > now() or not. > On 4.0.22 the explain shows that I don't even need to modify the current syntax to make use of the index (in the from_awl case with 1000 entries) ! Does a "ANALYZE TABLE connect" change anything? It might be that as you want all the column contents, MySQL decides that a full scan is more efficient in this case. Please do the test with a "EXPLAIN DELETE FROM..." or the original SELECT but with an INTERVAL where very few entries will be given back. > On the other hand, using a compound small index like > > alter table connect add index (sender_name(8), > sender_domain(8), rcpt(8)); > > wouldn't help on this kind of query, right? > > No it wouldn't. Lionel. | 
| 
      
      
      From: Max D. <Max...@lr...> - 2005-02-21 15:01:13
       | 
| Lionel Bouton wrote: > Max Diehn wrote the following on 21.02.2005 12:56 : > >> using first_seen and ip_addr as (separate) indices in connect (using >> MySQL 3.23.55), I found, that MySQL uses index for >> >> explain select count(*) from connect where first_seen < >> '2005-02-21 08:43:09'; >> >> as well as for >> >> explain select count(*) from connect where first_seen < now() - >> interval 12 hour; >> >> but not for >> >> explain select * from connect where first_seen < now() - >> interval 12 hour; >> >> or >> >> explain select * from connect where first_seen < '2005-02-21 >> 08:43:09'; >> >> (or f.e. >> explain select sender_name, sender_domain, ip_addr from connect >> where first_seen < '2005-02-19 08:43:09'; >> ) >> >> So in case of MySQL 3.23.55 it doesn't seem to be a matter of using >> now() or not. >> > > On 4.0.22 the explain shows that I don't even need to modify the current > syntax to make use of the index (in the from_awl case with 1000 entries) ! In create_from_awl_table a combined primary key for all columns except last_seen is defined. MySQL will use this PK as the index in all queries where it thinks it makes sense. In connect there is no such key, so I added the indices mentioned (first_seen and ip_addr). > Does a "ANALYZE TABLE connect" change anything? > It might be that as you want all the column contents, MySQL decides that > a full scan is more efficient in this case. Please do the test with a > "EXPLAIN DELETE FROM..." or the original SELECT but with an INTERVAL > where very few entries will be given back. explain select sender_name, sender_domain, ip_addr from connect where first_seen > now() - interval 1 minute; didn't use an index. I'll tell You if I find out something interesting. Now wouldn't it be better to let an external process do the cleanup? In that case, the smtpd doesn't need to wait for the cleanup to finish. Hence, the duration of cleanup wouldn't be that important. > > >> On the other hand, using a compound small index like >> >> alter table connect add index (sender_name(8), >> sender_domain(8), rcpt(8)); >> >> wouldn't help on this kind of query, right? >> >> > > No it wouldn't. > > Lionel. > > > > ------------------------------------------------------- > SF email is sponsored by - The IT Product Guide > Read honest & candid reviews on hundreds of IT Products from real users. > Discover which products truly live up to the hype. Start reading now. > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click > _______________________________________________ > Sqlgrey-users mailing list > Sql...@li... > https://lists.sourceforge.net/lists/listinfo/sqlgrey-users -- Max Diehn ------------------------------------------------- Leibniz-Rechenzentrum ! <mailto:di...@lr...> Barer Str. 21 ! 80333 Muenchen, Germany ! Tel: +49 89 289-27823 | 
| 
      
      
      From: Lionel B. <lio...@bo...> - 2005-02-21 20:17:10
       | 
| Max Diehn wrote the following on 21.02.2005 16:01 : > (...) > >>> >> >> On 4.0.22 the explain shows that I don't even need to modify the >> current syntax to make use of the index (in the from_awl case with >> 1000 entries) ! > > > In create_from_awl_table a combined primary key for all columns except > last_seen is defined. MySQL will use this PK as the index in all > queries where it thinks it makes sense. > In connect there is no such key, so I added the indices mentioned > (first_seen and ip_addr). With a WHERE clause involving only the timestamp, only the index on the timestamp can be used. So as the query I did on from_awl involved a timestamp and where using the same syntax for the WHERE clause I suppos MySQL query planner should use the same strategy. Maybe 4.0.22 is simply clever than your version. > >> Does a "ANALYZE TABLE connect" change anything? >> It might be that as you want all the column contents, MySQL decides >> that a full scan is more efficient in this case. Please do the test >> with a "EXPLAIN DELETE FROM..." or the original SELECT but with an >> INTERVAL where very few entries will be given back. > > > explain select sender_name, sender_domain, ip_addr from connect where > first_seen > now() - interval 1 minute; > > didn't use an index. I'll tell You if I find out something interesting. You did ANALYZE TABLE before the explain, didn't you? > > Now wouldn't it be better to let an external process do the cleanup? > In that case, the smtpd doesn't need to wait for the cleanup to finish. > Hence, the duration of cleanup wouldn't be that important. > This is planned: my current idea is to make SQLgrey fork a process which will connect to the database separately for the cleaning. I don't know yet if its better to make it a standalone process with its own scheduling or more simply fork a short lived process *just after* doing the queries involved to answer a request. Lionel. | 
| 
      
      
      From: Max D. <Max...@lr...> - 2005-02-21 20:32:09
       | 
| Lionel Bouton wrote: > Max Diehn wrote the following on 21.02.2005 16:01 : > >> (...) >> >>>> >>> >>> On 4.0.22 the explain shows that I don't even need to modify the >>> current syntax to make use of the index (in the from_awl case with >>> 1000 entries) ! >> >> >> >> In create_from_awl_table a combined primary key for all columns except >> last_seen is defined. MySQL will use this PK as the index in all >> queries where it thinks it makes sense. >> In connect there is no such key, so I added the indices mentioned >> (first_seen and ip_addr). > > > > With a WHERE clause involving only the timestamp, only the index on the > timestamp can be used. So as the query I did on from_awl involved a > timestamp and where using the same syntax for the WHERE clause I suppos > MySQL query planner should use the same strategy. > Maybe 4.0.22 is simply clever than your version. yes, maybe. I'll switch to 4.1 later anyway. > >> >>> Does a "ANALYZE TABLE connect" change anything? >>> It might be that as you want all the column contents, MySQL decides >>> that a full scan is more efficient in this case. Please do the test >>> with a "EXPLAIN DELETE FROM..." or the original SELECT but with an >>> INTERVAL where very few entries will be given back. >> >> >> >> explain select sender_name, sender_domain, ip_addr from connect where >> first_seen > now() - interval 1 minute; >> >> didn't use an index. I'll tell You if I find out something interesting. > > > > You did ANALYZE TABLE before the explain, didn't you? yes, i did > >> >> Now wouldn't it be better to let an external process do the cleanup? >> In that case, the smtpd doesn't need to wait for the cleanup to finish. >> Hence, the duration of cleanup wouldn't be that important. >> > > This is planned: my current idea is to make SQLgrey fork a process which > will connect to the database separately for the cleaning. I don't know > yet if its better to make it a standalone process with its own > scheduling or more simply fork a short lived process *just after* doing > the queries involved to answer a request. > > Lionel. > > > ------------------------------------------------------- > SF email is sponsored by - The IT Product Guide > Read honest & candid reviews on hundreds of IT Products from real users. > Discover which products truly live up to the hype. Start reading now. > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click > _______________________________________________ > Sqlgrey-users mailing list > Sql...@li... > https://lists.sourceforge.net/lists/listinfo/sqlgrey-users -- Max Diehn ------------------------------------------------- Leibniz-Rechenzentrum ! <mailto:di...@lr...> Barer Str. 21 ! 80333 Muenchen, Germany ! Tel: +49 89 289-27823 | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-21 10:15:20
       | 
| Le Lundi 21 F=E9vrier 2005 00:34, Lionel Bouton a =E9crit : > > If you do an explain on this : > select * from from_awl where last_seen < timestamp '2005-02-21 > 00:12:46.22726+01' - INTERVAL '30 DAY'; > > You'll find that PostgreSQL is doing the right thing. Yes. But if we calculate the now() outside of the SELECT clause, we can a= s=20 well perform the interval calculation at the same time, which will make t= he=20 SQL query simpler. > I considered doing the now() call outside the database earlier (in fact > when SQLite support was added). I want to avoid this if I can because > I'd like to allow for several SQLgrey instances accessing the same > database. What is the interest of having several SQLgrey servers access the same=20 database ? Wouldn't it be more logical to rather have several MTAs access= the=20 same SQLgrey server ? First the network dialog between MTA <=3D> SQLgrey server will be much li= ghter=20 than the network dialog SQLgrey server <=3D> SQL DB. Second, if we have several SQLgrey servers using the same DB, then we wil= l=20 have several SQLgrey servers that will in turn attempt to perform DB clea= nup=20 which is a waste of ressource... (If we can get good performance from the cleanup SQL resquests by allowin= g=20 those requests to properly uses indexes, then cleanup will be real fast, = and=20 it eliminates the need to put the cleanup process in a separate program,=20 which is currently mentioned in the TODO for 1.5.x...) > One way to do this might be to replace the now() with the "timestamp > '<value>'" and fetch the value with a simple "SELECT now()". Leaving > "SELECT 1" aside, this should be one of the most light query we can do So we could perform this "SELECT now();" in the "interval" sub, instead o= f=20 calling "time" in Perl. This sounds logicial as we then always use the DB= 's=20 time. Cheers. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Lionel B. <lio...@bo...> - 2005-02-21 12:33:15
       | 
| Michel Bouissou wrote the following on 21.02.2005 11:11 : >Le Lundi 21 F=E9vrier 2005 00:34, Lionel Bouton a =E9crit : > =20 > >>If you do an explain on this : >>select * from from_awl where last_seen < timestamp '2005-02-21 >>00:12:46.22726+01' - INTERVAL '30 DAY'; >> >>You'll find that PostgreSQL is doing the right thing. >> =20 >> > >Yes. But if we calculate the now() outside of the SELECT clause, we can = as=20 >well perform the interval calculation at the same time, which will make = the=20 >SQL query simpler. > > =20 > Wouldn't it be better to let the SQL database handle the timestamp=20 representation itself ? If we do it in SQLgrey we will have to support=20 every syntax we can get from the SELECT now() output. There are already=20 special cases because some databases aren't really standard compliant=20 (or the standard isn't clear enough...). I really don't want to make=20 things more complex. >>I considered doing the now() call outside the database earlier (in fact >>when SQLite support was added). I want to avoid this if I can because >>I'd like to allow for several SQLgrey instances accessing the same >>database. >> =20 >> > >What is the interest of having several SQLgrey servers access the same=20 >database ? Wouldn't it be more logical to rather have several MTAs acces= s the=20 >same SQLgrey server ? > > =20 > Fault-tolerance. I want to support people who maintain big architectures=20 with several MX with high availability needs. In this case, each MX=20 should have its own local SQLgrey process in order to protect themselves=20 from a single system crash. >First the network dialog between MTA <=3D> SQLgrey server will be much l= ighter=20 >than the network dialog SQLgrey server <=3D> SQL DB. > > =20 > What makes you believe this ? From my point of view they should be more=20 or less the same. There are several information pieces SQLgrey reveives=20 from the MTA which are simply throwed away (helo_name, protocol_name,=20 instance, protocol_state, queue_id, request, size). A few SQL queries=20 shouldn't be far more heavier... From a quick look, the best case=20 (whitelist) is 0 SQL query, the worst is 8 (accepted reconnection,=20 triggering a move from from_awl to domain_awl), the most typical cases=20 involve 2 queries (domain_awl match), 3 queries (from_awl match) and 5=20 queries (no AWL and reconnect match). There may be systems where the awl have such high successful hits that=20 SQLgrey <=3D> db transfers might be less verbose than MTA <=3D> SQLgrey. >Second, if we have several SQLgrey servers using the same DB, then we wi= ll=20 >have several SQLgrey servers that will in turn attempt to perform DB cle= anup=20 >which is a waste of ressource... > >(If we can get good performance from the cleanup SQL resquests by allowi= ng=20 >those requests to properly uses indexes, then cleanup will be real fast,= and=20 >it eliminates the need to put the cleanup process in a separate program,= =20 >which is currently mentioned in the TODO for 1.5.x...) > > =20 > This can be addressed. I've not put much thought into this yet, but we=20 can store in DB the last cleanup time and the delay itself so as to=20 synchronize SQLgrey servers in a way where only one will trigger the=20 cleanup at a time. Cheers, Lionel. | 
| 
      
      
      From: Michel B. <mi...@bo...> - 2005-02-21 13:03:03
       | 
| Le Lundi 21 F=E9vrier 2005 13:33, Lionel Bouton a =E9crit : > > > >Yes. But if we calculate the now() outside of the SELECT clause, we ca= n as > >well perform the interval calculation at the same time, which will mak= e > > the SQL query simpler. > > Wouldn't it be better to let the SQL database handle the timestamp > representation itself ? If we do it in SQLgrey we will have to support > every syntax we can get from the SELECT now() output. There are already > special cases because some databases aren't really standard compliant > (or the standard isn't clear enough...). I really don't want to make > things more complex. That makes sense. So you would issue a "select now();" once, put the resu= lt=20 into a (let's call it $perlnow) Perl variable, and then reinject it in a=20 "select * from from_awl where last_seen < '$perlnow' - INTERVAL '30 DAY'"= =20 without having to pay attention to the contents or syntax inside... Gotten from SQL, fed back into SQL, and let SQL understand what it gave a= nd=20 perform the calculation... That's smart. On the other hand, we don't have support for such a big number of SQL DBs= =20 (MySQL, PostgreSQL and SQLite), so we have only 3 possible different=20 syntaxes... That's why doing everything in Perl isn't that difficult (exc= ept=20 for the possible synchronization issue you rised, if several SQLgrey serv= ers=20 share the same DB. But ntp is an easy solution, and is probably already i= n=20 place at every big site using MTA pools). > >What is the interest of having several SQLgrey servers access the same > >database ? Wouldn't it be more logical to rather have several MTAs acc= ess > > the same SQLgrey server ? > > Fault-tolerance. I want to support people who maintain big architecture= s > with several MX with high availability needs. In this case, each MX > should have its own local SQLgrey process in order to protect themselve= s > from a single system crash. As long as you share a single SQL DB, you have a single point of failure.= =20 Running the SQLgrey server and SQL DB on the same machine, and sharing th= e=20 SQLgrey server between MTAs, or running a SQLgrey server on each MTA serv= er,=20 and sharing the same SQL DB, you're down anyway if the DB server crashes.= .. > >First the network dialog between MTA <=3D> SQLgrey server will be much > > lighter than the network dialog SQLgrey server <=3D> SQL DB. > > What makes you believe this ? From my point of view they should be more > or less the same. There are several information pieces SQLgrey reveives > from the MTA which are simply throwed away (helo_name, protocol_name, > instance, protocol_state, queue_id, request, size). Basically, the dialog between the MTA and SQLgrey server can be described= as=20 "One question - One yes/no answer". Even though the question does include= =20 some elements we don't need, it doesn't make it that big. OTOH, as you state, dialog between the SQLgrey server and SQL DB is a ser= ies=20 of queries - answers, in the best case one, but in most cases several (ev= en=20 if not always 8 ;-), not event talking about periodical DB cleanup. > >Second, if we have several SQLgrey servers using the same DB, then we = will > >have several SQLgrey servers that will in turn attempt to perform DB > > cleanup which is a waste of ressource... > > > >(If we can get good performance from the cleanup SQL resquests by allo= wing > >those requests to properly uses indexes, then cleanup will be real fas= t, > > and it eliminates the need to put the cleanup process in a separate > > program, which is currently mentioned in the TODO for 1.5.x...) > > This can be addressed. I've not put much thought into this yet, but we > can store in DB the last cleanup time and the delay itself so as to > synchronize SQLgrey servers in a way where only one will trigger the > cleanup at a time. Yes. But it would cause one more SQL request ;-) Another possibility would be to add a perform/don't perform DB cleanup as= a=20 config parameter, and if several servers share the same DB, then only one= of=20 them should be configured to perform its cleanup. Cheers. --=20 Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E | 
| 
      
      
      From: Lionel B. <lio...@bo...> - 2005-02-21 13:25:35
       | 
| Michel Bouissou wrote the following on 21.02.2005 14:01 : >>>What is the interest of having several SQLgrey servers access the same >>>database ? Wouldn't it be more logical to rather have several MTAs access >>>the same SQLgrey server ? >>> >>> >>Fault-tolerance. I want to support people who maintain big architectures >>with several MX with high availability needs. In this case, each MX >>should have its own local SQLgrey process in order to protect themselves >>from a single system crash. >> >> > >As long as you share a single SQL DB, you have a single point of failure. >Running the SQLgrey server and SQL DB on the same machine, and sharing the >SQLgrey server between MTAs, or running a SQLgrey server on each MTA server, >and sharing the same SQL DB, you're down anyway if the DB server crashes... > > Not at all. If the database goes down, SQLgrey will switch off greylisting and warn the postmaster. >>>Second, if we have several SQLgrey servers using the same DB, then we will >>>have several SQLgrey servers that will in turn attempt to perform DB >>>cleanup which is a waste of ressource... >>> >>>(If we can get good performance from the cleanup SQL resquests by allowing >>>those requests to properly uses indexes, then cleanup will be real fast, >>>and it eliminates the need to put the cleanup process in a separate >>>program, which is currently mentioned in the TODO for 1.5.x...) >>> >>> >>This can be addressed. I've not put much thought into this yet, but we >>can store in DB the last cleanup time and the delay itself so as to >>synchronize SQLgrey servers in a way where only one will trigger the >>cleanup at a time. >> >> > >Yes. But it would cause one more SQL request ;-) > > > We'll have a cache of the next_scheduled_cleanup value, and only when it will be in the past will we chack the DB. This one we can afford again, althoug a bit more complex for the DB than the SELECT now() :-) >Another possibility would be to add a perform/don't perform DB cleanup as a >config parameter, and if several servers share the same DB, then only one of >them should be configured to perform its cleanup. > > > Makes configuration more tricky (and makes it easy losing the cleanup process without being aware of it). I'd like to allow the same configuration file being used on each SQLgrey instance and let them sort this out. Lionel. |