|
From: Michael S. <Mic...@lr...> - 2005-04-29 14:10:30
|
Since modular design was mentioned in some of the last emails, I will try
to describe my ideas about a modular design of sqlgrey.
First of all, I would like to separate all MTA-specific parts from the
part of the software which deals with grey-, white- and maybe
blacklisting. This would have several benefits.
1) From the discussion on this list I have the feeling that only a few
sites are using sqlgrey, although I think it is one of the best
implementations. If the code were separated into several packages, other
people could implement daemons for different MTAs like sendmail with
milter, exim or qmail. All of these daemons would be able to use the
package for grey-, black- and whitelisting. Since we are not using
postfix, we had to struggle to write glueware which emulates the postfix
policy protocol.
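The protocol itself is simple (attribute=value lines terminated by an
empty line, answered with an "action=" line plus an empty line), but
every daemon still has to speak it. A minimal responder sketch in perl:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Minimal sketch of the postfix policy delegation protocol:
    # attribute=value lines terminated by an empty line; the reply is an
    # action line followed by an empty line. "dunno" defers the decision
    # to the rest of the postfix restrictions.
    $| = 1;
    my %attr;
    while (my $line = <STDIN>) {
        chomp $line;
        if ($line eq '') {
            # request complete; a real daemon would consult the grey-
            # and whitelists here using $attr{client_address},
            # $attr{sender} and $attr{recipient}
            print "action=dunno\n\n";
            %attr = ();
            next;
        }
        my ($k, $v) = split /=/, $line, 2;
        $attr{$k} = $v if defined $v;
    }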
2) A separation of the code would allow splitting functions into
different daemons and/or scripts. E.g. prevalidation would be driven by
outgoing emails, which in our case (not postfix) uses totally different
daemons. Another example is our MX- and A-checks for filling the
domain_awl. These are scripts started by cron every 5 minutes. For these
scripts I had to copy large amounts of code out of sqlgrey and modify it
to work without a reference to the netserver daemon. It would be much
easier for such scripts to just have a use-statement.
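For illustration only, the module, method and table names below are made
up (this is not existing sqlgrey code), but such a script could then look
like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Made-up package: how a split-out listing module could be used by
    # any script or non-postfix daemon. Table/column names are guesses.
    package SQLgrey::Lists;

    sub new {
        my ($class, %opt) = @_;
        my $dbh = DBI->connect($opt{dsn}, $opt{user}, $opt{pass},
                               { RaiseError => 1 });
        return bless { dbh => $dbh }, $class;
    }

    # list logic lives here, independent of any MTA protocol and of the
    # netserver daemon
    sub is_whitelisted {
        my ($self, $src, $sender_domain) = @_;
        my ($hit) = $self->{dbh}->selectrow_array(
            'SELECT 1 FROM domain_awl WHERE src = ? AND sender_domain = ?',
            undef, $src, $sender_domain);
        return defined $hit;
    }

    package main;

    # a cron script or a milter/exim/qmail daemon would only need this:
    my $lists = SQLgrey::Lists->new(dsn  => 'DBI:mysql:sqlgrey',
                                    user => 'sqlgrey', pass => 'secret');
    print $lists->is_whitelisted('192.0.2.7', 'example.com')
        ? "whitelisted\n" : "not listed\n";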
Second: For smaller sites it is definitely nice to have one daemon which
does all the work. Just install the software and let it run. In our case,
however, I would like to be able to tune the system in such a way that it
fits our needs. E.g. I would like to separate the checking of the
databases from the different propagation algorithms, which transport data
from one table to another, into separate daemons or scripts. This is the
reason why I requested the field first_seen in from_awl and domain_awl,
which allows me to process all new entries independently of sqlgrey. This
means I must be able to switch on and off all of the algorithms which are
used by sqlgrey at the moment.
Third: If I am able to switch on and off all of the algorithms (checking,
propagation and maintenance), then I am also able to decide which of the
algorithms I want to use when running sqlgrey. E.g. a smaller site would
not need the connect_awl and rcpt_awl and would propagate entries
directly to from_awl. We, however, would use all of these tables for
checking and use separate scripts to propagate entries from connect_awl
to from_awl or rcpt_awl.
Fourth: This leads to another modular design request: the sequence of
checks and propagations to execute. Do I first try to aggregate entries
from connect_awl to rcpt_awl or to from_awl? Could it be that one site
prefers rcpt_awl first and another from_awl? There must be an ordering of
these actions, and a site should be able to determine it.
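As a sketch of what I mean (the handler names are invented, nothing of
this exists yet):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Invented handler names: a site-configurable ordering of propagation
    # steps. A handler returning true has consumed the entry.
    sub connect_to_rcpt { print "connect_awl -> rcpt_awl\n"; return 0 }
    sub connect_to_from { print "connect_awl -> from_awl\n"; return 1 }

    my %step = (
        connect_to_rcpt => \&connect_to_rcpt,
        connect_to_from => \&connect_to_from,
    );

    # this order would come from the site configuration; another site
    # could simply list connect_to_from first
    my @order = qw(connect_to_rcpt connect_to_from);

    for my $name (@order) {
        last if $step{$name}->();
    }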
Fifth: Let us take a step back and look at the overall design.
Greylisting by itself has nothing to do with spam detection the way e.g.
SpamAssassin does. A lot of people confuse this. The influence on spam
and virus-infected emails is merely a side effect of greylisting (but in
the end the reason why we are using greylisting). And the algorithm of
greylisting ends the moment we accept an email after a successful retry.
The next step, the propagation of the triple to connect_awl or the tuple
to from_awl/rcpt_awl, has to do with whitelisting. I would like to turn
our attention more to the whitelisting part of the software and separate
it from the mere usage with greylisting.
The first thing would be a rename of the tables from _awl = autowhitelist
to just _wl. Why? Because several methods exist to fill these tables with
information. These can be traffic analysis, like the aggregation
algorithms; these can be other propagation algorithms, like our MX- and
A-checks, which take entries from one table and propagate them to another
table based on some conditions. But there are also algorithms possible
like feeding back information from SpamAssassin into white- (or
blacklists). Besides the renaming, the consequence would be to include
(at least) two other fields in every entry (a schema sketch follows after
this list):
* name of the algorithm which created this entry; e.g. we already use
  different algorithms to populate from_awl as well as domain_awl, and we
  would really be able to tell the source of an entry when we examine and
  analyze the tables.
* since the entries are then not only automatically included, but maybe
  also manually entered, in addition to first_seen and last_seen we would
  need an expiration date, to distinguish entries which should be deleted
  automatically from entries which should stay. And different algorithms
  could also mean different expiration dates; maybe one algorithm
  requests 4 days till expiration and another 35 days. In addition this
  would allow incremental extension or reduction of the expiration, maybe
  based on a spam count.
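As a sketch only (the column names are suggestions, not an agreed-upon
schema):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Sketch of an extended whitelist table; column names are only
    # suggestions, not an agreed-upon sqlgrey schema.
    my $dbh = DBI->connect('DBI:mysql:sqlgrey', 'sqlgrey', 'secret',
                           { RaiseError => 1 });
    $dbh->do(q{
        CREATE TABLE from_wl (
            sender_name   VARCHAR(64)  NOT NULL,
            sender_domain VARCHAR(255) NOT NULL,
            src           VARCHAR(39)  NOT NULL,
            first_seen    DATETIME     NOT NULL,
            last_seen     DATETIME     NOT NULL,
            algorithm     VARCHAR(32)  NOT NULL, -- creator of the entry
            expires       DATETIME, -- NULL = manual entry, never expires
            PRIMARY KEY (src, sender_domain, sender_name)
        )
    });

    # cleanup no longer has to guess from last_seen which entries are
    # automatic; manual entries simply have no expiration date
    $dbh->do(q{DELETE FROM from_wl
               WHERE expires IS NOT NULL AND expires < NOW()});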
What kind of whitelist tables are possible? Well, we have 5 variables:
- IP: IP address of the sending email server
- ON: Originator Name
- OD: Originator Domain
- RN: Recipient Name
- RD: Recipient Domain
This leads to 32 different possibilities:
Name of whitelist                    IP  ON  OD  RN  RD
=======================================================
reconnect ok / connection_wl          X   X   X   X   X
                                      X   X   X   X   -
                                      X   X   X   -   X
from_wl                               X   X   X   -   -
-------------------------------------------------------
                                      X   X   -   X   X
                                      X   X   -   X   -
                                      X   X   -   -   X
                                      X   X   -   -   -
-------------------------------------------------------
                                      X   -   X   X   X
                                      X   -   X   X   -
                                      X   -   X   -   X
domain_wl                             X   -   X   -   -
-------------------------------------------------------
rcpt_wl (forward_wl)                  X   -   -   X   X
                                      X   -   -   X   -
                                      X   -   -   -   X
client_ip_whitelist / src_wl          X   -   -   -   -
-------------------------------------------------------
preval_wl (prevalidation)             -   X   X   X   X
                                      -   X   X   X   -
                                      -   X   X   -   X
                                      -   X   X   -   -
-------------------------------------------------------
                                      -   X   -   X   X
                                      -   X   -   X   -
                                      -   X   -   -   X
                                      -   X   -   -   -
-------------------------------------------------------
                                      -   -   X   X   X
                                      -   -   X   X   -
                                      -   -   X   -   X
                                      -   -   X   -   -
-------------------------------------------------------
optout_rcpt                           -   -   -   X   X
                                      -   -   -   X   -
optout_domain                         -   -   -   -   X
no check, all emails whitelisted      -   -   -   -   -
-------------------------------------------------------
If you try to make a directed graph of the evolution/dependencies of the
tables based on the number of variables, then you get 6 levels:
1. level: 1 node (5 X)
2. level: 5 nodes (4 X)
3. level: 10 nodes (3 X)
4. level: 10 nodes (2 X)
5. level: 5 nodes (1 X)
6. level: 1 node (0 X)
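The level sizes are just the binomial coefficients C(5,k); a few lines of
perl reproduce the count (nothing sqlgrey-specific, just the
combinatorics):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Enumerate all 2^5 = 32 subsets of (IP, ON, OD, RN, RD) as bitmasks
    # and group them by how many of the 5 variables they check.
    my %level;
    for my $mask (0 .. 31) {
        my $checked = grep { $mask & (1 << $_) } 0 .. 4;
        push @{ $level{ 6 - $checked } }, $mask;  # level 1 = all 5 checked
    }
    printf "%d. level: %2d nodes (%d X)\n",
           $_, scalar @{ $level{$_} }, 6 - $_
        for sort { $a <=> $b } keys %level;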
I visualized this as a 3-dimensional object with a tool called Zometool
(see http://www.zometool.com/build.html). From this you can see that
there is no single sequential path through the whitelists, as I said
above.
I'll stop here, because this is a lot of information to think about. But
hopefully I have shown some ideas of where sqlgrey could evolve.
Regards,
Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum ! <mailto:St...@lr...>
Barer Str. 21 ! Fax: +49 89 2809460
80333 Muenchen, Germany ! Tel: +49 89 289-28840
|
|
From: Lionel B. <lio...@bo...> - 2005-04-29 23:08:52
|
Michael Storz wrote the following on 29.04.2005 16:10 :

> Since modular design was mentioned in some of the last emails, I will
> try to describe my ideas about a modular design of sqlgrey.
>
> First of all, I would like to separate all MTA-specific parts from the
> part of the software which deals with grey-, white- and maybe
> blacklisting. This would have several benefits.
>
> 1) From the discussion on this list I have the feeling that only a few
> sites are using sqlgrey,

From the whitelist site logs, 39 different IPs have checked the
whitelists' freshness this month. 31 users are subscribed to this list.
So I guess SQLgrey isn't on 1% of the worldwide mailservers yet :-)

> although I think it is one of the best implementations.

Thanks!

> If the code were separated into several packages, other people could
> implement daemons for different MTAs like sendmail with milter, exim or
> qmail. All of these daemons would be able to use the package for grey-,
> black- and whitelisting. Since we are not using postfix, we had to
> struggle to write glueware which emulates the postfix policy protocol.

Splitting the code will probably happen sooner or later. SQLgrey is
starting to look a little too bloated for my taste... I would like to
avoid this for 1.6.0 though, because it will probably take some time and
heavy surgery :-)

> 2) A separation of the code would allow splitting functions into
> different daemons and/or scripts. E.g. prevalidation would be driven by
> outgoing emails, which in our case (not postfix) uses totally different
> daemons. Another example is our MX- and A-checks for filling the
> domain_awl. These are scripts started by cron every 5 minutes. For
> these scripts I had to copy large amounts of code out of sqlgrey and
> modify it to work without a reference to the netserver daemon. It would
> be much easier for such scripts to just have a use-statement.

Agreed.

> Second: For smaller sites it is definitely nice to have one daemon
> which does all the work. Just install the software and let it run. In
> our case, however, I would like to be able to tune the system in such a
> way that it fits our needs. E.g. I would like to separate the checking
> of the databases from the different propagation algorithms, which
> transport data from one table to another, into separate daemons or
> scripts.

Hum... There are problems with separating the propagations from the
greylisting.
* It will create stale entries in the bottom awls, which will still be
  fed by the greylister itself, due to race conditions between the
  greylister and the separate daemons/scripts (not bad, just annoying,
  and it reflects what can already happen when multiple SQLgrey
  instances access the same DB).
* You'll have more overhead, because the propagation algorithms will
  have to query the database for the entries they have to move. Right
  now SQLgrey only queries the src it is working on; the external
  daemons would have to select these srcs by querying the database.
* You'll have to schedule the propagation algorithms carefully: not too
  slow or you will lose awl performance, not too fast or you will bring
  the DB down to its knees. Today no scheduling is needed, as the
  propagation algorithms are event-driven (and so are automagically at
  the ideal point).

The event-driven aspect is quite important if you want to:
- maintain control of what happens on the global scale,
- avoid querying large amounts of data to extract which part should be
  processed.
> This is the reason why I requested the field first_seen in from_awl and
> domain_awl, which allows me to process all new entries independently of
> sqlgrey. This means I must be able to switch on and off all of the
> algorithms which are used by sqlgrey at the moment.
>
> Third: If I am able to switch on and off all of the algorithms
> (checking, propagation and maintenance), then I am also able to decide
> which of the algorithms I want to use when running sqlgrey. E.g. a
> smaller site would not need the connect_awl and rcpt_awl and would
> propagate entries directly to from_awl. We, however, would use all of
> these tables for checking and use separate scripts to propagate entries
> from connect_awl to from_awl or rcpt_awl.

Switching these algorithms can be done in sqlgrey.conf.

> Fourth: This leads to another modular design request: the sequence of
> checks and propagations to execute. Do I first try to aggregate entries
> from connect_awl to rcpt_awl or to from_awl?

I was thinking about this too. I'm not sure the order will have a huge
influence; I think the aggregation level for each propagation will,
though.

> Could it be that one site prefers rcpt_awl first and another from_awl?
> There must be an ordering of these actions, and a site should be able
> to determine it.
>
> Fifth: Let us take a step back and look at the overall design.
> Greylisting by itself has nothing to do with spam detection the way
> e.g. SpamAssassin does. A lot of people confuse this. The influence on
> spam and virus-infected emails is merely a side effect of greylisting
> (but in the end the reason why we are using greylisting). And the
> algorithm of greylisting ends the moment we accept an email after a
> successful retry.
>
> The next step, the propagation of the triple to connect_awl or the
> tuple to from_awl/rcpt_awl, has to do with whitelisting. I would like
> to turn our attention more to the whitelisting part of the software and
> separate it from the mere usage with greylisting.
>
> The first thing would be a rename of the tables from _awl =
> autowhitelist to just _wl. Why? Because several methods exist to fill
> these tables with information.

But they are all more or less automatic :-) I tend to consider awls to
expire automatically and wls to be more static. This is just a name
though, not so important.

> These can be traffic analysis, like the aggregation algorithms; these
> can be other propagation algorithms, like our MX- and A-checks, which
> take entries from one table and propagate them to another table based
> on some conditions. But there are also algorithms possible like feeding
> back information from SpamAssassin into white- (or blacklists). Besides
> the renaming, the consequence would be to include (at least) two other
> fields in every entry:
>
> * name of the algorithm which created this entry; e.g. we already use
>   different algorithms to populate from_awl as well as domain_awl, and
>   we would really be able to tell the source of an entry when we
>   examine and analyze the tables.

Good idea.

> * since the entries are then not only automatically included, but maybe
>   also manually entered, in addition to first_seen and last_seen we
>   would need an expiration date, to distinguish entries which should be
>   deleted automatically from entries which should stay. And different
>   algorithms could also mean different expiration dates; maybe one
>   algorithm requests 4 days till expiration and another 35 days. In
>   addition this would allow incremental extension or reduction of the
>   expiration, maybe based on a spam count.
Makes sense.

> What kind of whitelist tables are possible? Well, we have 5 variables:
>
> - IP: IP address of the sending email server
> - ON: Originator Name
> - OD: Originator Domain
> - RN: Recipient Name
> - RD: Recipient Domain
>
> This leads to 32 different possibilities:

This is a little more complex than that... You can add to these 5
variables: time (probably first/last), helo, hits and some other values
you can get through the policy protocol (SASL auth, fqdn). But you can
probably blow huge holes in the matrix by removing the combinations that
don't make sense (ON without OD isn't really useful, for example)...

> I'll stop here, because this is a lot of information to think about.
> But hopefully I have shown some ideas of where sqlgrey could evolve.

And I thank you. Quite ambitious! This will take some time to get
there...

Lionel.
|
From: Michel B. <mi...@bo...> - 2005-04-30 07:02:09
|
On Saturday 30 April 2005 01:08, Lionel Bouton wrote:

> From the whitelist site logs, 39 different IPs have checked the
> whitelists' freshness this month.

I'm probably directly or indirectly responsible for 3 to 6 of them...

> 31 users are subscribed to this list.
> So I guess SQLgrey isn't on 1% of the worldwide mailservers yet :-)

Rather surprising. More and more sites are using some greylisting, and,
among the different greylisting systems that I have checked out, SQLgrey
is by far the best.

--
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Call by 200 IT professionals for a NO to the European Constitutional
Treaty: http://www.200informaticiens.ras.eu.org
|
From: Michael S. <Mic...@lr...> - 2005-05-06 21:11:31
|
On Sat, 30 Apr 2005, Lionel Bouton wrote:

> Michael Storz wrote the following on 29.04.2005 16:10 :
> ...
> >If the code were separated into several packages, other people could
> >implement daemons for different MTAs like sendmail with milter, exim
> >or qmail. All of these daemons would be able to use the package for
> >grey-, black- and whitelisting. Since we are not using postfix, we had
> >to struggle to write glueware which emulates the postfix policy
> >protocol.
>
> Splitting the code will probably happen sooner or later. SQLgrey is
> starting to look a little too bloated for my taste... I would like to
> avoid this for 1.6.0 though, because it will probably take some time
> and heavy surgery :-)

I agree. I thought it would be nice to have it for a 2.X release.

...

> >Second: For smaller sites it is definitely nice to have one daemon
> >which does all the work. Just install the software and let it run. In
> >our case, however, I would like to be able to tune the system in such
> >a way that it fits our needs. E.g. I would like to separate the
> >checking of the databases from the different propagation algorithms,
> >which transport data from one table to another, into separate daemons
> >or scripts.
>
> Hum... There are problems with separating the propagations from the
> greylisting.
> * It will create stale entries in the bottom awls, which will still be
>   fed by the greylister itself, due to race conditions between the
>   greylister and the separate daemons/scripts (not bad, just annoying,
>   and it reflects what can already happen when multiple SQLgrey
>   instances access the same DB).
> * You'll have more overhead, because the propagation algorithms will
>   have to query the database for the entries they have to move. Right
>   now SQLgrey only queries the src it is working on; the external
>   daemons would have to select these srcs by querying the database.
> * You'll have to schedule the propagation algorithms carefully: not too
>   slow or you will lose awl performance, not too fast or you will bring
>   the DB down to its knees. Today no scheduling is needed, as the
>   propagation algorithms are event-driven (and so are automagically at
>   the ideal point).
>
> The event-driven aspect is quite important if you want to:
> - maintain control of what happens on the global scale,
> - avoid querying large amounts of data to extract which part should be
>   processed.

I'm not sure I understand this correctly. As the underlying database
engine does not allow transactions, you will always have the possibility
of interference between parallel running daemons which access the same
data. If the sequence of operations (insert, delete, update) is carefully
planned with parallel access in mind, no big problems should occur.

We are running our external propagation algorithms every 5 minutes and it
does not seem to bring mysql down to its knees. Since the scripts only
request the new data of the last 6 minutes, this is not much load for
mysql. The processing of the data, however, does need some time, since
heavy DNS queries are done, which in the case of spammer domains may take
a while to complete or time out. With the current design of sqlgrey
(multiplexing) it is not possible to do this event-driven; response time
would be terrible. To allow DNS-based checks, sqlgrey would have to go to
prefork, where several processes run in parallel, like the implementation
of amavisd-new.
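To illustrate, a stripped-down skeleton of one of these scripts (table
and column names simplified, error handling and the real MX/A logic
omitted):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Net::DNS;

    # Skeleton of a cron-driven propagation script. It runs every 5
    # minutes and looks back 6 so that consecutive runs overlap slightly.
    my $dbh = DBI->connect('DBI:mysql:sqlgrey', 'sqlgrey', 'secret',
                           { RaiseError => 1 });
    my $res = Net::DNS::Resolver->new(udp_timeout => 5);

    my $new = $dbh->selectall_arrayref(
        'SELECT src, sender_domain FROM from_awl
          WHERE first_seen > NOW() - INTERVAL 6 MINUTE');

    for my $row (@$new) {
        my ($src, $domain) = @$row;
        # only propagate domains whose MX (or A) record resolves
        next unless $res->query($domain, 'MX')
                 || $res->query($domain, 'A');
        $dbh->do('INSERT IGNORE INTO domain_awl
                  (sender_domain, src, first_seen, last_seen)
                  VALUES (?, ?, NOW(), NOW())',
                 undef, $domain, $src);
    }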
> >What kind of whitelist tables are possible? Well, we have 5 variables:
> >
> >- IP: IP address of the sending email server
> >- ON: Originator Name
> >- OD: Originator Domain
> >- RN: Recipient Name
> >- RD: Recipient Domain
> >
> >This leads to 32 different possibilities:
>
> This is a little more complex than that... You can add to these 5
> variables: time (probably first/last), helo, hits and some other values
> you can get through the policy protocol (SASL auth, fqdn). But you can
> probably blow huge holes in the matrix by removing the combinations
> that don't make sense (ON without OD isn't really useful, for
> example)...

I agree, there are a lot more variables to consider. What I tried was to
see all possibilities from the variables we use at the moment in from_awl
and domain_awl for whitelisting. As you can see, 2 of your new tables fit
nicely into this concept, whereas the other 2 bring a new dimension to
it:
- exception processing of whitelist tables.
This is something which could be valuable for other whitelists too. E.g.
I think every automatic whitelist should have an exception table, which
is manually configured. At the moment I have no example, but I could
imagine that at some point in the future I would wish to express that
some IP address and/or domain should not be propagated to the next awl.

> >I'll stop here, because this is a lot of information to think about.
> >But hopefully I have shown some ideas of where sqlgrey could evolve.
>
> And I thank you. Quite ambitious! This will take some time to get
> there...

I would love to hack some perl code together to implement at least some
of these features. Unfortunately, I'm not allowed to do it, because I
have to manage some other projects for our messaging system. Therefore, I
hope you are keen to implement these features :-)

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   ! <mailto:St...@lr...>
Barer Str. 21           ! Fax: +49 89 2809460
80333 Muenchen, Germany ! Tel: +49 89 289-28840
|
From: Lionel B. <lio...@bo...> - 2005-05-06 22:44:45
|
Michael Storz wrote the following on 06.05.2005 23:11 :

>> Hum... There are problems with separating the propagations from the
>> greylisting.
>> * It will create stale entries in the bottom awls, which will still be
>>   fed by the greylister itself, due to race conditions between the
>>   greylister and the separate daemons/scripts (not bad, just annoying,
>>   and it reflects what can already happen when multiple SQLgrey
>>   instances access the same DB).
>> * You'll have more overhead, because the propagation algorithms will
>>   have to query the database for the entries they have to move. Right
>>   now SQLgrey only queries the src it is working on; the external
>>   daemons would have to select these srcs by querying the database.
>> * You'll have to schedule the propagation algorithms carefully: not
>>   too slow or you will lose awl performance, not too fast or you will
>>   bring the DB down to its knees. Today no scheduling is needed, as
>>   the propagation algorithms are event-driven (and so are
>>   automagically at the ideal point).
>>
>> The event-driven aspect is quite important if you want to:
>> - maintain control of what happens on the global scale,
>> - avoid querying large amounts of data to extract which part should be
>>   processed.
>
> I'm not sure I understand this correctly. As the underlying database
> engine does not allow transactions, you will always have the
> possibility of interference between parallel running daemons which
> access the same data. If the sequence of operations (insert, delete,
> update) is carefully planned with parallel access in mind, no big
> problems should occur.

Indeed, no big problems will occur. The annoying effects I'm speaking of
(which can already happen and are nothing to be afraid of) are the awl
entries that could be created in from_awl although an entry in domain_awl
supersedes them.

The main problem I see with separate independent daemons is that the
propagation algorithms must select from the whole awl tables the entries
they want to handle. I don't like this for two reasons:
- it is inefficient from a pure design standpoint (you have to query the
  database for information you could get directly from the greylister),
- it causes load spikes.

What I would prefer to see is some key points in the code where you could
register hooks. Let's say, for example, that every time an entry is ready
to be added to the from_awl, any registered hook would be able to
short-circuit the default behaviour of adding the entry to from_awl and
do whatever it wants with the entry. You could then add the propagation
to higher-level awls at this point.
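A rough sketch of what I have in mind (the names are invented, none of
this exists in SQLgrey yet):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Invented names: a rough sketch of a hook point, nothing of this
    # exists in SQLgrey yet.
    my @from_awl_hooks;
    sub register_from_awl_hook { push @from_awl_hooks, $_[0] }

    # stand-ins for the real internals:
    sub default_from_awl_insert { print "inserted into from_awl\n" }
    sub propagate_to_domain_awl { print "sent to domain_awl instead\n" }

    # called when an entry is ready for from_awl; the first hook that
    # returns true short-circuits the default insertion
    sub add_to_from_awl {
        my ($entry) = @_; # src, sender_name, sender_domain, rcpt, first_time
        for my $hook (@from_awl_hooks) {
            return if $hook->($entry);
        }
        default_from_awl_insert($entry);
    }

    # example hook: divert entries matching some site-specific test
    register_from_awl_hook(sub {
        my ($entry) = @_;
        return 0 unless $entry->{sender_domain} =~ /\.example$/;
        propagate_to_domain_awl($entry);
        return 1;
    });

    add_to_from_awl({ src => '192.0.2.7',
                      sender_domain => 'test.example' });

A hook could also fork at this point and do its work asynchronously, more
on that below.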
> We are running our external propagation algorithms every 5 minutes and
> it does not seem to bring mysql down to its knees. Since the scripts
> only request the new data of the last 6 minutes, this is not much load
> for mysql. The processing of the data, however, does need some time,
> since heavy DNS queries are done, which in the case of spammer domains
> may take a while to complete or time out. With the current design of
> sqlgrey (multiplexing) it is not possible to do this event-driven;
> response time would be terrible. To allow DNS-based checks, sqlgrey
> would have to go to prefork, where several processes run in parallel,
> like the implementation of amavisd-new.

As long as SQLgrey can answer in a timely fashion (and frankly it should,
or we'll have serious problems) prefork can only bring marginal speedups
(and probably slowdowns if not tuned properly).

Nothing prevents a fork in SQLgrey's code (or a module's, for that
matter), as is already done for the cleanups. For example, if I
understand correctly, the DNS query only comes after the greylisting; the
answer to this query isn't needed to return an answer to Postfix. You
could then fork, returning your answer to the main code while processing
the data asynchronously (in fact I could already implement forking in the
code to do some DB processing asynchronously, mainly the AWL
propagations).

In the example above, where the entry is about to be added to from_awl,
the hook could fork, tell SQLgrey to let the message pass (and decide
whether you want the from_awl entry to be created by SQLgrey or not) and
meanwhile do whatever you want with the "src, sender_name, sender_domain,
rcpt, first_time" array. You could do DNS queries at this point, or, if
you prefer, you can avoid forking and push this information to another
daemon through a socket, or even log the entry for future
batch-processing if you feel like it.

> I would love to hack some perl code together to implement at least some
> of these features. Unfortunately, I'm not allowed to do it, because I
> have to manage some other projects for our messaging system. Therefore,
> I hope you are keen to implement these features :-)

Not everything, not tonight :-) But these paths are interesting and help
generate other ideas.

Thanks,
Lionel.