[Sqlgrey-users] modular design

Since modular design was mentioned in some of the last emails, I will try
to describe my ideas about a modular design of sqlgrey.

First of all, I would like to separate all MTA specific parts from the
part of the software which deals with grey-, white- and maybe
blacklisting. This would have several benefits.

1) From the discussion on this list I have the feeling that only a few
sites are using sqlgrey, although I think it is one of the best
implementations. If the code were separated into several packages,
other people could implement daemons for different MTAs, like sendmail
with milter, exim or qmail. All of these daemons would be able to use the
package for grey-, black- and whitelisting. Since we are not using
postfix, we had to struggle to code glueware which emulates the postfix
policy protocol.

2) A separation of the code would allow splitting functions into different
daemons and/or scripts. E.g. prevalidation would be driven by outgoing
emails, which in our case (not postfix) are handled by totally different
daemons. Another example is our MX- and A-checks for filling the
domain_awl. These are scripts started by cron every 5 minutes. For these
scripts I had to copy large amounts of code out of sqlgrey and modify it
so that it works without a reference to netserver-daemon. It would be much
easier for such scripts to just have a use-statement.

Second: For smaller sites it is definitely nice to have one daemon that
does all the work. Just install the software and let it run. In our case,
however, I would like to be able to tune the system in such a way that it
fits our needs. E.g. I would like to separate the checking of the
databases from the different propagation algorithms, which transport data
from one table to another, into separate daemons or scripts. This is the
reason why I requested the field first_seen in from_awl and domain_awl,
which allows me to process all new entries independently of sqlgrey. This
means I must be able to switch on and off all of the algorithms which are
used by sqlgrey at the moment.

Third: If I am able to switch on and off all of the algorithms (checking,
propagation and maintenance), then I am also able to decide which of the
algorithms I want to use when running sqlgrey. E.g. a smaller site would
not need the connect_awl and rcpt_awl and would propagate entries directly
to from_awl. We, however, would use all of these tables for checking and
use separate scripts to propagate entries from connect_awl to from_awl or
rcpt_awl.

Fourth: This leads to another modular design request: the sequence of
checks and propagations to execute. Do I first try to aggregate entries
from connect_awl to rcpt_awl or to from_awl? Could it be that one site
prefers rcpt_awl first and another from_awl? These actions must run in
some defined order, and a site should be able to determine that order.
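
To make the idea of a site-defined order concrete, here is a minimal
sketch in Python (not sqlgrey's Perl, and not taken from sqlgrey). The
table contents, the function names and the simplified two-column keys are
all assumptions made up for illustration. The only point is that the order
of the checks is data supplied by the site, not something fixed in the
code.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# (client IP, sender address, recipient address); a simplified triple.
Triple = Tuple[str, str, str]

# Hypothetical in-memory stand-ins for two whitelist tables.
TABLES: Dict[str, Set[Tuple[str, str]]] = {
    "from_awl": {("192.0.2.1", "alice@example.org")},
    "rcpt_awl": {("192.0.2.1", "bob@example.net")},
}


def check_from_awl(triple: Triple) -> bool:
    """Match on client IP and sender, ignoring the recipient."""
    ip, sender, _rcpt = triple
    return (ip, sender) in TABLES["from_awl"]


def check_rcpt_awl(triple: Triple) -> bool:
    """Match on client IP and recipient, ignoring the sender."""
    ip, _sender, rcpt = triple
    return (ip, rcpt) in TABLES["rcpt_awl"]


# Registry of available checks; a site enables one by naming it.
CHECKS: Dict[str, Callable[[Triple], bool]] = {
    "from_awl": check_from_awl,
    "rcpt_awl": check_rcpt_awl,
}


def first_match(triple: Triple, order: List[str]) -> Optional[str]:
    """Run the enabled checks in the site's order; return the first hit."""
    for name in order:
        if CHECKS[name](triple):
            return name
    return None
```

With a triple that matches both tables, such as
("192.0.2.1", "alice@example.org", "bob@example.net"), the order
["rcpt_awl", "from_awl"] reports rcpt_awl and the reverse order reports
from_awl. Leaving a name out of the list switches that check off.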

Fifth: Let us take a step back and look at the overall design.
Greylisting by itself has nothing to do with spam, unlike e.g.
SpamAssassin. A lot of people confuse this. The influence on spam and
virus-infected emails is merely a side effect of greylisting (but in the
end the reason why we are using greylisting). And the greylisting
algorithm ends at the moment we accept an email after a successful retry.

The next step, the propagation of the triple to connect_awl or the tuple
to from_awl/rcpt_awl has to do with whitelisting. I would like to turn our
attention more to the whitelisting part of the software and separate it
from the mere usage with greylisting.

The first thing would be a rename of the tables from _awl = autowhitelist
to just _wl. Why? Because several methods exist to fill these tables
with information. These can be traffic analysis, like the aggregation
algorithms, or other propagation algorithms, like our MX- and A-checks,
which take entries from one table and propagate them to another table
based on some conditions. But other algorithms are also possible, like
feeding back information from SpamAssassin into white- (or black-)lists.
Besides the renaming, the consequence would be to include (at least) two
other fields in every entry:

* name of the algorithm which created this entry. We already use
  different algorithms to populate from_awl as well as domain_awl, and
  with this field we would be able to tell the source of an entry when we
  examine and analyze the tables.

* an expiration date. Since the entries would then not only be included
  automatically, but maybe also entered manually, we would need, in
  addition to first_seen and last_seen, an expiration date to distinguish
  entries which should be deleted automatically from entries which should
  stay. Different algorithms could also mean different expiration dates:
  maybe one algorithm requests 4 days till expiration and another 35
  days. In addition, this would allow incremental extension or reduction
  of the expiration, maybe based on a spam count.
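
The two extra fields could look roughly like the following Python sketch.
Everything here is an assumption for illustration: the class and function
names, the algorithm names, and the choice to model "should stay" as an
entry without an expiration date. Only the 4-day and 35-day lifetimes are
the examples from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

# Hypothetical per-algorithm lifetimes in days.
LIFETIME_DAYS: Dict[str, int] = {
    "traffic_aggregation": 4,
    "mx_a_check": 35,
}


@dataclass
class WlEntry:
    key: Tuple[str, ...]          # e.g. (IP, originator domain)
    algorithm: str                # which algorithm created the entry
    first_seen: datetime
    last_seen: datetime
    expires: Optional[datetime]   # None: keep until removed by hand


def new_entry(key: Tuple[str, ...], algorithm: str, now: datetime) -> WlEntry:
    """Create an entry whose lifetime depends on the creating algorithm."""
    days = LIFETIME_DAYS.get(algorithm)  # unknown algorithm: no expiry
    expires = now + timedelta(days=days) if days is not None else None
    return WlEntry(key, algorithm, now, now, expires)


def is_expired(entry: WlEntry, now: datetime) -> bool:
    """Only entries with an expiration date are ever deleted automatically."""
    return entry.expires is not None and now >= entry.expires


def extend(entry: WlEntry, days: int) -> None:
    """Move the expiration by a number of days (negative values shorten it)."""
    if entry.expires is not None:
        entry.expires += timedelta(days=days)
```

A maintenance script would then delete every entry for which is_expired()
is true, and a feedback algorithm could call extend() with a positive or
negative number of days.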

What kind of whitelist tables are possible? Well, we have 5 variables:

- IP: IP address of sending email server
- ON: Originator Name
- OD: Originator Domain
- RN: Recipient Name
- RD: Recipient Domain

This leads to 32 different possibilities:

Name of whitelist                               IP   ON   OD   RN   RD
========================================================================
reconnect ok / connection_wl                    X    X    X    X    X
                                                X    X    X    X    -
                                                X    X    X    -    X
from_wl                                         X    X    X    -    -
------------------------------------------------------------------------
                                                X    X    -    X    X
                                                X    X    -    X    -
                                                X    X    -    -    X
                                                X    X    -    -    -
------------------------------------------------------------------------
                                                X    -    X    X    X
                                                X    -    X    X    -
                                                X    -    X    -    X
domain_wl                                       X    -    X    -    -
------------------------------------------------------------------------
rcpt_wl (forward_wl)                            X    -    -    X    X
                                                X    -    -    X    -
                                                X    -    -    -    X
client_ip_whitelist / src_wl                    X    -    -    -    -
------------------------------------------------------------------------
preval_wl (prevalidation)                       -    X    X    X    X
                                                -    X    X    X    -
                                                -    X    X    -    X
                                                -    X    X    -    -
------------------------------------------------------------------------
                                                -    X    -    X    X
                                                -    X    -    X    -
                                                -    X    -    -    X
                                                -    X    -    -    -
------------------------------------------------------------------------
                                                -    -    X    X    X
                                                -    -    X    X    -
                                                -    -    X    -    X
                                                -    -    X    -    -
------------------------------------------------------------------------
optout_rcpt                                     -    -    -    X    X
                                                -    -    -    X    -
optout_domain                                   -    -    -    -    X
no check, all emails whitelisted                -    -    -    -    -
------------------------------------------------------------------------

If you try to make a directed graph of the evolution/dependencies of the
tables based on the number of variables, then you get 6 levels:

1. level:  1 node  (5 X)
2. level:  5 nodes (4 X)
3. level: 10 nodes (3 X)
4. level: 10 nodes (2 X)
5. level:  5 nodes (1 X)
6. level:  1 node  (0 X)
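
The counts above can be checked with a few lines of Python. This is only
a sketch: the helper names are made up, and the edge rule (drop exactly
one used variable to get from one level to the next) is an assumption
about how the directed graph is meant to be built.

```python
from collections import Counter
from itertools import product

VARIABLES = ("IP", "ON", "OD", "RN", "RD")

# Each whitelist type is one choice of used (True) or ignored (False)
# for every variable, i.e. one row of the table.
combos = list(product((True, False), repeat=len(VARIABLES)))

# Number of rows for each count of used variables (the number of X).
per_level = Counter(sum(combo) for combo in combos)


def generalizations(combo):
    """Rows reached from this row by dropping exactly one used variable."""
    result = []
    for i, used in enumerate(combo):
        if used:
            result.append(combo[:i] + (False,) + combo[i + 1:])
    return result


def label(combo):
    """Render a row the way the table does, with names instead of X."""
    return " ".join(n if used else "-" for n, used in zip(VARIABLES, combo))
```

len(combos) gives the 32 rows, and per_level holds the level sizes
1, 5, 10, 10, 5, 1. The top node has five outgoing edges under the
assumed edge rule, so already the first step branches instead of forming
a single chain.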

I visualized this as a 3-dimensional object with a tool called Zometool
(see http://www.zometool.com/build.html). From this you can see that
there is no sequential path through the whitelists, as I said above.

I'll stop here, because this is a lot of information to think about. But
hopefully I have shown some ideas about where sqlgrey could evolve.

Regards,
Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   !   <mailto:St...@lr...>
Barer Str. 21           !   Fax: +49 89 2809460
80333 Muenchen, Germany !   Tel: +49 89 289-28840