From: Lionel B. <lio...@bo...> - 2005-04-29 23:08:52
Michael Storz wrote the following on 29.04.2005 16:10 :

> Since modular design was mentioned in some of the last emails, I will try
> to describe my ideas about a modular design of sqlgrey.
>
> First of all, I would like to separate all MTA-specific parts from the
> part of the software which deals with grey-, white- and maybe
> blacklisting. This would have several benefits.
>
> 1) From the discussion on this list I have the feeling only a few
> sites are using sqlgrey,

From the whitelist site logs, 39 different IPs have checked the
whitelists' freshness this month. 31 users are subscribed to this list.
So I guess SQLgrey isn't on 1% of the worldwide mailservers yet :-)

> although I think it is one of the best
> implementations.

Thanks!

> If the code were separated into several packages,
> other people could implement daemons for different MTAs like sendmail
> with milter, exim or qmail. All of these daemons would be able to use
> the package for grey-, black- and whitelisting. Since we are not using
> postfix, we had to struggle to code glueware which emulates the postfix
> policy protocol.

Splitting the code will probably happen sooner or later. SQLgrey is
starting to look a little too bloated to my taste... I would like to
avoid this for 1.6.0 though, because it will probably take some time and
heavy surgery :-)

> 2) A separation of the code would allow splitting functions into
> different daemons and/or scripts. E.g. prevalidation would be driven by
> outgoing emails, which in our case (not postfix) uses totally different
> daemons. Another example are our MX- and A-checks for filling the
> domain_awl. These are scripts started by cron every 5 minutes. For these
> scripts I had to copy large amounts of code out of sqlgrey and modify it
> to use it without a reference to the netserver daemon. It would be much
> easier for such scripts to just have a use-statement.

Agreed.

> Second: For smaller sites it is definitely nice to have one daemon
> which does all the work.
> Just install the software and let it run. In our case
> however, I would like to be able to tune the system in such a way that
> it fits our needs. E.g. I would like to separate the checking of the
> databases from the different propagation algorithms, which transport
> data from one table to another, into separate daemons or scripts.

Hum... There are problems with separating the propagations from the
greylisting:

* It will create stale entries in the bottom awls, which will be fed by
  the greylister itself due to race conditions between the greylister
  and the separate daemons/scripts (not bad, just annoying, and it
  reflects what can already happen when multiple SQLgrey instances
  access the same DB).
* You'll have more overhead, because the propagation algorithms will
  have to query the database for the entries they have to move. Right
  now SQLgrey only queries the src it is working on; the external
  daemons would have to select these srcs by querying the database.
* You'll have to schedule the propagation algorithms carefully: not too
  slow or you will lose awl performance, not too fast or you will bring
  the DB down to its knees. Today no scheduling is needed, as the
  propagation algorithms are event-driven (and so are automagically at
  the ideal point).

The event-driven aspect is quite important if you want to:
- maintain control of what happens on the global scale,
- avoid querying large amounts of data to extract which part should be
  processed.

> This is the reason why I requested the field first_seen in from_awl and
> domain_awl, which allows me to process all new fields independently
> from sqlgrey. This means I must be able to switch on and off all of the
> algorithms which sqlgrey uses at the moment.
>
> Third: If I am able to switch on and off all of the algorithms
> (checking, propagation and maintenance), then I am also able to decide
> which of the algorithms I want to use when running sqlgrey. E.g.
> a smaller site would
> not need the connect_awl and rcpt_awl and would propagate entries
> directly to from_awl. We, however, would use all of these tables for
> checking and use separate scripts to propagate entries from connect_awl
> to from_awl or rcpt_awl.

Switching these algorithms can be done in sqlgrey.conf.

> Fourth: This leads to another modular design request: the sequence of
> checks and propagations to execute. Do I first try to aggregate entries
> from connect_awl to rcpt_awl or to from_awl?

I was thinking about this too. I'm not sure the order will have a huge
influence; I think the aggregation level for each propagation will,
though.

> Could it be that one site
> prefers rcpt_awl first and another from_awl? There must be a
> sequentialization of these actions, and a site should be able to
> determine it.
>
> Fifth: Let us take a step back and look at the overall design.
> Greylisting by itself has nothing to do with spam, unlike e.g.
> SpamAssassin. A lot of people confuse this. The influence on spam and
> virus-infected emails is merely a side effect of greylisting (but in
> the end the reason why we are using greylisting). And the algorithm of
> greylisting ends at the moment we accept an email after a successful
> retry.
>
> The next step, the propagation of the triple to connect_awl or the
> tuple to from_awl/rcpt_awl, has to do with whitelisting. I would like
> to turn our attention more to the whitelisting part of the software and
> separate it from the mere usage with greylisting.
>
> The first thing would be a rename of the tables from _awl
> (= autowhitelist) to just _wl. Why? Because several methods exist to
> fill these tables with information.

But they are all more or less automatic :-) I tend to consider that awls
expire automatically while wls are more static. This is just a name
though, not so important.
> These can be traffic analysis, like the aggregation
> algorithms; these can be other propagation algorithms, like our MX- and
> A-checks, which take entries from one table and propagate them to
> another table based on some conditions. But there are also algorithms
> possible, like feeding back information from SpamAssassin into white-
> (or black-) lists. Besides the renaming, the consequence would be to
> include (at least) two other fields in every entry:
>
> * the name of the algorithm which created this entry; e.g. we already
>   use different algorithms to populate from_awl as well as domain_awl,
>   and we would really be able to tell the source of an entry when we
>   examine and analyze the tables.

Good idea.

> * since the entries are then not automatically included, but maybe also
>   manually entered, in addition to first_seen and last_seen we would
>   need an expiration date, to distinguish entries which should be
>   deleted automatically from entries which should stay. And different
>   algorithms could also mean different expiration dates: maybe one
>   algorithm requests 4 days till expiration and another 35 days. In
>   addition this would allow incremental extension or reduction of the
>   expiration, maybe based on a spam count.

Makes sense.

> What kind of whitelist tables are possible? Well, we have 5 variables:
>
> - IP: IP address of the sending email server
> - ON: Originator Name
> - OD: Originator Domain
> - RN: Recipient Name
> - RD: Recipient Domain
>
> This leads to 32 different possibilities:

This is a little more complex than that... You can add to these 5
variables: time (probably first/last), helo, hits and some other values
you can get through the policy protocol (SASL auth, fqdn). But you can
probably blow huge holes in the matrix by removing the combinations that
don't make sense (ON without OD isn't really useful, for example)...

> I'll stop here, because this is a lot of information to think about.
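For the combinatorics above: the 32 comes from the 2^5 subsets of the 5 variables, and a filter rule like "ON without OD isn't useful" punches a predictable hole in the matrix. A quick sketch (the `sensible` rule is just the one example from the thread, not a complete filter):

```python
from itertools import product

VARS = ("IP", "ON", "OD", "RN", "RD")

def sensible(combo):
    # Example rule only: an Originator Name without its Originator
    # Domain isn't really useful as a whitelist key.
    return not ("ON" in combo and "OD" not in combo)

# Every subset of the 5 variables: one boolean mask per subset.
combos = [tuple(v for v, used in zip(VARS, mask) if used)
          for mask in product((False, True), repeat=len(VARS))]

print(len(combos))                              # 32
print(len([c for c in combos if sensible(c)]))  # 24
```

Just the one rule already removes the 8 subsets that contain ON but not OD; each additional "doesn't make sense" rule would shrink the matrix further.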
> But hopefully I showed some ideas of where sqlgrey could evolve to.

And I thank you. Quite ambitious! It will take some time to get there...

Lionel.