Thread: [Sqlgrey-users] RELEASE: 1.5.9

[Sqlgrey-users] RELEASE: 1.5.9

From: Lionel B. <lio...@bo...> - 2005-06-07 16:47:04

Hi,

SQLgrey 1.5.9 tarball is on sourceforge (RPMs should come shortly after,
I'm fighting with failing hardware on the RH8 host I use for building RPMs).

Changelog :
- MySQL timestamp bugfix,
- improved log parser.

Happy greylisting,

Lionel.

Re: [Sqlgrey-users] RELEASE: 1.5.9

From: Lionel B. <lio...@bo...> - 2005-06-07 17:49:42

Lionel Bouton wrote:

>Hi,
>
>SQLgrey 1.5.9 tarball is on sourceforge (RPMs should come shortly after,
>I'm fighting with failing hardware on the RH8 host I use for building RPMs).
>
>  
>

RPMs are on sourceforge.

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Michel B. <mi...@bo...> - 2005-06-08 07:34:08

Attachments: sqlgrey-1.5.9.MiB.connect_throttle_delete_patch

On Tuesday 07 June 2005 18:47, Lionel Bouton wrote:
>
> SQLgrey 1.5.9 tarball is on sourceforge [...]

Hi there,

My "connect throttling" and "connect cleanup" patches have been tested here
and seem to be working very well. Please find attached the complete patch
against 1.5.9.

I've produced 1.5.9 RPMs including this patch, available from
http://www.bouissou.net/sqlgrey/

A sample of throttling at work, taken from my logs:

Jun  8 02:30:29 totor sqlgrey: grey: new: 24.208.114.197, fzj...@bu... -> da...@bo...
Jun  8 02:30:30 totor sqlgrey: grey: new: 24.208.114.197, fzj...@bu... -> cl...@bo...
Jun  8 02:30:31 totor sqlgrey: grey: new: 24.208.114.197, fzj...@bu... -> fi...@bo...
Jun  8 02:30:31 totor sqlgrey: grey: new: 24.208.114.197, fzj...@bu... -> ad...@bo...
Jun  8 02:30:31 totor sqlgrey: grey: new: 24.208.114.197, fzj...@bu... -> ope...@bo...
Jun  8 02:30:34 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> ch...@bo...
Jun  8 02:30:39 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> al...@bo...
Jun  8 02:30:44 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> ad...@bo...
Jun  8 02:30:55 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> ad...@bo...
Jun  8 02:31:02 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> ic...@bo...
Jun  8 02:31:07 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> ca...@bo...
Jun  8 02:31:13 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> aa...@bo...
Jun  8 02:31:19 totor sqlgrey: grey: throttling: 24.208.114.197, fzj...@bu... -> de...@bo...

...and I have several more of the same kind.

I think that throttling may not only save space in connect, but also help
prevent some zombies (which try random addresses from a dictionary or an
infected machine's address book) from eventually getting through greylisting:
by limiting the number of waiting entries for a given source in connect, we
reduce the chances that a random new try from the same source matches a
previous attempt, thus effectively improving the system's efficiency.

Lionel, would you consider integrating this into the mainstream SQLgrey ? As
throttling is completely optional, "it doesn't hurt anyway", and somebody who
doesn't want the feature can just ignore it.

Cheers.

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Lionel B. <lio...@bo...> - 2005-06-08 09:05:25

Michel Bouissou wrote:

>Lionel, would you consider integrating this into the mainstream SQLgrey ? As 
>throttling is completely optional, "it doesn't hurt anyway", and somebody who 
>doesn't want the feature can just ignore it.
>  
>

# Throttling

"It doesn't hurt anyway" isn't enough. It must solve real world
problems. I'm aware that theoretically it is good to have fewer entries
in the connect table but as I said earlier the practical benefits aren't
clear to me yet.

Michel, could you give us a ratio between the results of:
grep "sqlgrey: grey: throttling: " | wc -l (on a log spanning the last
max_connect_age period)
and
select count(*) from connect

on your configuration ? This would help measure the benefits of tarpitting.

If other users could fetch Michel's build and test it in the same manner
too that would be great.

# connect cleanup

I'm worrying about the LIKE. There are 2 problems with it:
- may hurt performance (I've no experience with it, I'm currently
guessing performance is OK),
- I'll have to check SQLite to see if it supports this.

Once I've a better understanding of the two points above, I'll make a
decision on the connect cleanup (should be shortly).

Lionel.

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Michel B. <mi...@bo...> - 2005-06-08 11:06:31

On Wednesday 08 June 2005 11:05, Lionel Bouton wrote:
>
> "It doesn't hurt anyway" isn't enough. It must solve real world
> problems. I'm aware that theoretically it is good to have fewer entries
> in the connect table but as I said earlier the practical benefits aren't
> clear to me yet.

I strongly believe there are benefits, otherwise I wouldn't have asked for it
in the first place and then coded it in the end ;-)

Well, I know, I can be mistaken ;-)

> Michel, could you give us a ratio between the results of:
> grep "sqlgrey: grey: throttling: " | wc -l (on a log spanning the last
> max_connect_age period)
> and
> select count(*) from connect
>
> on your configuration ? This would help measure the benefits of tarpitting.

I'm not sure my server is a good real-life example, as its traffic is really
moderate.

OTOH, I've already seen some tarpitting in action since I installed it
yesterday afternoon, and I recall my "connect" table size had been multiplied
by a factor of 10 when the latest M$ worm came out... Hence the idea I had about
tarpitting for fighting this kind of event.
I guess we need another new M$ worm to figure out the benefits it gives when
such an event occurs...

> If other users could fetch Michel's build and test it in the same manner
> too that would be great.

Yep. I'd love to get some feedback.

> # connect cleanup
>
> I'm worrying about the LIKE. There are 2 problems with it:
> - may hurt performance (I've no experience with it, I'm currently
> guessing performance is OK),

It probably won't hurt, as the query still uses the main index for IP and
sender_domain, leaving the LIKE to filter only a very small subset of entries in
connect...

> - I'll have to check SQLite to see if it supports this.

LIKE is a very standard SQL operator... I would be surprised if a decent SQL
system didn't implement it.

BTW, have you considered creating the tables with "default 0" for timestamp
columns ? "default 0" should be OK with any SQL, shouldn't it ? And it would
prevent MySQL from performing auto-updates...

Cheers.

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Michel B. <mi...@bo...> - 2005-06-08 11:23:02

On Wednesday 08 June 2005 13:06, Michel Bouissou wrote:
>
> > Michel, could you give us a ratio between the results of:
> > grep "sqlgrey: grey: throttling: " | wc -l (on a log spanning the last
> > max_connect_age period)
> > and
> > select count(*) from connect
> >
> > on your configuration ? This would help measure the benefits of
> > tarpitting.
>
> I'm not sure my server is a good real-life example, as its traffic is
> really moderate.
>
> OTOH, I've already seen some tarpitting in action since I installed it
> yesterday afternoon, and I recall my "connect" table size had been
> multiplied by a factor of 10 when the latest M$ worm came out... Hence the idea
> I had about tarpitting for fighting this kind of event.
> I guess we need another new M$ worm to figure out the benefits it gives when
> such an event occurs...

Anyhow, for now I have:

mysql> select count(*) from connect;
+----------+
| count(*) |
+----------+
|      198 |
+----------+

[root@totor etc]# grep -c "totor sqlgrey: grey: throttling:" /var/log/mail/info
29
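
For anyone repeating this measurement, here is a minimal Python sketch that puts the two counts side by side. The function name and the way the two numbers are combined are assumptions of this sketch: the thread does not define the ratio precisely, and a throttled attempt is not guaranteed to correspond to exactly one avoided connect row. The sample line is copied from the log excerpt earlier in the thread.

```python
def throttling_share(log_lines, connect_rows):
    """Count throttled attempts in a log and relate them to the connect table size.

    Assumption: the "ratio" is taken as throttled / (throttled + connect_rows),
    i.e. the share of would-be connect entries that throttling kept out.
    """
    marker = "sqlgrey: grey: throttling:"
    throttled = sum(1 for line in log_lines if marker in line)
    total = throttled + connect_rows
    share = throttled / total if total else 0.0
    return throttled, share


# Counts reported above: 29 throttling lines in the log, 198 rows in connect.
throttled_line = ("Jun  8 02:30:34 totor sqlgrey: grey: throttling: "
                  "24.208.114.197, fzj...@bu... -> ch...@bo...")
new_line = ("Jun  8 02:30:29 totor sqlgrey: grey: new: "
            "24.208.114.197, fzj...@bu... -> da...@bo...")
sample_log = [throttled_line] * 29 + [new_line]

count, share = throttling_share(sample_log, 198)
print(count, round(share * 100, 1))
```

Under that reading, roughly one in eight candidate entries was kept out of connect on this low-traffic server.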


-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Lionel B. <lio...@bo...> - 2005-06-08 11:45:50

Michel Bouissou wrote:

>On Wednesday 08 June 2005 11:05, Lionel Bouton wrote:
>
>>"It doesn't hurt anyway" isn't enough. It must solve real world
>>problems. I'm aware that theoretically it is good to have fewer entries
>>in the connect table but as I said earlier the practical benefits aren't
>>clear to me yet.
>
>I strongly believe there are benefits, otherwise I wouldn't have asked for it
>in the first place and then coded it in the end ;-)
>
>Well, I know, I can be mistaken ;-)
>
>>Michel, could you give us a ratio between the results of:
>>grep "sqlgrey: grey: throttling: " | wc -l (on a log spanning the last
>>max_connect_age period)
>>and
>>select count(*) from connect
>>
>>on your configuration ? This would help measure the benefits of tarpitting.
>
>I'm not sure my server is a good real-life example, as its traffic is really
>moderate.
>
>OTOH, I've already seen some tarpitting in action since I installed it
>yesterday afternoon, and I recall my "connect" table size had been multiplied
>by a factor of 10 when the latest M$ worm came out... Hence the idea I had about
>tarpitting for fighting this kind of event.
>I guess we need another new M$ worm to figure out the benefits it gives when
>such an event occurs...
>
>>If other users could fetch Michel's build and test it in the same manner
>>too that would be great.
>
>Yep. I'd love to get some feedback.
>
>># connect cleanup
>>
>>I'm worrying about the LIKE. There are 2 problems with it:
>>- may hurt performance (I've no experience with it, I'm currently
>>guessing performance is OK),
>
>It probably won't hurt, as the query still uses the main index for IP and
>sender_domain, leaving the LIKE to filter only a very small subset of entries in
>connect...
>
>>- I'll have to check SQLite to see if it supports this.
>
>LIKE is a very standard SQL operator... I would be surprised if a decent SQL
>system didn't implement it.

SQLite supports it (at least SQLite2 and SQLite3 do, which is enough for
me).
I'll take the cleanup code then. But don't start trusting the spam log
entries: there are cases where it will have false positives.

>BTW, have you considered creating the tables with "default 0" for timestamp
>columns ? "default 0" should be OK with any SQL, shouldn't it ? And it would
>prevent MySQL from performing auto-updates...

MySQL already changed its behaviour between versions regarding
timestamps. The fact that "desc from_awl" doesn't output the CREATE
statement used by SQLgrey but a mangled one shows that MySQL
deliberately allows itself to change the CREATE statement. I don't want
to rely on the assumption that future MySQL versions won't
introduce other on-the-fly modifications of CREATE statements...

I prefer to solve this by forcing the databases to set first_seen to the
right value on each update. At least no database seems to modify UPDATE
statements on the fly! This is how it is done in 1.5.9.

Lionel

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Michel B. <mi...@bo...> - 2005-06-08 12:22:28

On Wednesday 08 June 2005 13:46, Lionel Bouton wrote:
>
> MySQL already changed its behaviour between versions regarding
> timestamps. The fact that "desc from_awl" doesn't output the CREATE
> statement used by SQLgrey but a mangled one shows that MySQL
> deliberately allows itself to change the CREATE statement.

Indeed. By adding its "default behaviour" (which is to consider that the first
mentioned timestamp column [for which "default 0" is not specified (*)] is
auto-update...)

(*) Beginning with MySQL 4.1.2 and on...

Furthermore, "desc" output is incomplete and doesn't always mention _all_ the
characteristics of a given column ("on update..." isn't displayed anywhere by
desc).

=> If we don't want any auto-update or auto-initialization at all, we should
probably use the "datetime" data type, rather than "timestamp".

> I don't want to rely on the assumptions that the future MySQL versions won't
> introduce other on the fly modifications of CREATE statements...

Yep, but "default 0" would be harmless anyway ;-)) Yes, I know,
"harmless" ;-)))

...But it would prevent other manual queries run against the tables
from causing the first_seen timestamp to be inadvertently auto-updated.

> I prefer to solve this by forcing the databases to set first_seen to the
> right value on each update. At least no database seems to modify UPDATE
> statements on the fly! This is how it is done in 1.5.9.

Yes, I've checked the diff between 1.5.8 and 1.5.9.

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling

From: Michel B. <mi...@bo...> - 2005-06-08 11:33:45

On Wednesday 08 June 2005 11:05, Lionel Bouton wrote:
>
> Michel, could you give us a ratio [...]

> If other users could fetch Michel's build and test it in the same manner
> too that would be great.

Everybody can easily figure out whether it could save many entries in their
connect table by manually running a simple SQL query such as:

select src, count(*) as cpt from connect group by src having cpt >= 3 order by
cpt desc, src;

(replace >= 3 with any value you would consider for setting the tarpitting
threshold)
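
To try the query without touching a live database, the following Python sketch runs it against an in-memory SQLite table. The two-column table is a simplified stand-in for the real connect table (only src matters for this query), and the addresses and the helper name sources_over_threshold are made up for illustration.

```python
import sqlite3


def sources_over_threshold(conn, threshold):
    """Return (src, count) pairs for sources with at least `threshold` rows in connect."""
    query = (
        "SELECT src, COUNT(*) AS cpt FROM connect "
        "GROUP BY src HAVING COUNT(*) >= ? ORDER BY cpt DESC, src"
    )
    return conn.execute(query, (threshold,)).fetchall()


# Toy stand-in for the real table: only the src column matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE connect (src TEXT, rcpt TEXT)")
rows = [("192.0.2.1", f"user{i}@example.org") for i in range(5)]
rows += [("192.0.2.2", "a@example.org"), ("192.0.2.3", "b@example.org")]
rows += [("192.0.2.4", f"user{i}@example.org") for i in range(3)]
conn.executemany("INSERT INTO connect VALUES (?, ?)", rows)

# Sources that would hit a throttling threshold of 3.
print(sources_over_threshold(conn, 3))
```

Only the two sources with 5 and 3 waiting entries are listed; the sources with a single entry would never be affected by a threshold of 3.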

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Values for throttling (was Re: [Sqlgrey-users] RELEASE: 1.5.9 + with throttling)

From: Michael S. <Mic...@lr...> - 2005-06-09 16:35:32

On Wed, 8 Jun 2005, Michel Bouissou wrote:

> On Wednesday 08 June 2005 11:05, Lionel Bouton wrote:
> >
> > Michel, could you give us a ratio [...]
>
> > If other users could fetch Michel's build and test it in the same manner
> > too that would be great.
>
> Everybody can easily figure out whether it could save many entries in their
> connect table by manually running a simple SQL query such as:
>
> select src, count(*) as cpt from connect group by src having cpt >= 3 order by
> cpt desc, src;
>
> (replace >= 3 with any value you would consider for setting the tarpitting
> threshold)

Here are my values using the above select statement:

number of entries in connect:                 1.072.022
number of different IP addresses in connect:    110.904
average number of entries per IP address:          9.67
max. number of entries per IP address:            2.470

thrott. | num of  | num of    | num of  | left    | % reduc
num.    | IP addr | entries   | thrott  | entries |
========+=========+===========+=========+=========+========
    3   |  55.366 | 1.001.789 | 891.057 | 180.965 | 83.12 %
    5   |  42.124 |   956.904 | 788.408 | 283.614 | 73.54 %
   10   |  29.092 |   870.638 | 608.810 | 463.212 | 56.79 %
   20   |  13.186 |   672.086 | 421.552 | 650.470 | 39.32 %
   30   |   7.908 |   549.151 | 319.819 | 752.203 | 29.83 %
   40   |   5.367 |   463.256 | 253.943 | 818.079 | 23.69 %
   50   |   3.696 |   389.663 | 208.559 | 863.463 | 19.45 %
   60   |   2.432 |   323.176 | 179.688 | 892.334 | 16.76 %
   70   |   1.928 |   290.926 | 157.894 | 914.128 | 14.73 %
   80   |   1.605 |   266.954 | 140.159 | 931.863 | 13.07 %
   90   |   1.367 |   246.908 | 125.245 | 946.777 | 11.68 %
  100   |   1.164 |   227.773 | 112.537 | 959.485 | 10.50 %

thrott. num.: number of entries where throttling begins
num of IP addr: number of unique IP addresses = number of lines of above
                select statement
num of entries: total number of entries from select statement
num of thrott: num of entries - (thrott. num. - 1) * num of IP addr
left entries: number of entries in connect - num of thrott;
% reduc: num of thrott * 100 / number of entries in connect
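
The column definitions can be checked against the table with a few lines of Python. This is only a sketch of the arithmetic: the function and argument names are invented here, and the input figures are the rows for thresholds 3 and 10 from the table.

```python
def throttle_savings(total_entries, threshold, num_ips, num_entries):
    """Apply the column definitions above to one row of the table.

    total_entries: rows in connect; threshold: "thrott. num.";
    num_ips / num_entries: sources at or above the threshold and their rows.
    """
    # Each such source keeps (threshold - 1) rows; the rest would be throttled.
    throttled = num_entries - (threshold - 1) * num_ips
    left = total_entries - throttled
    reduction_pct = round(throttled * 100 / total_entries, 2)
    return throttled, left, reduction_pct


# Rows for thresholds 3 and 10, with 1.072.022 entries in connect overall.
print(throttle_savings(1_072_022, 3, 55_366, 1_001_789))
print(throttle_savings(1_072_022, 10, 29_092, 870_638))
```

Both calls reproduce the "num of thrott", "left entries" and "% reduc" columns of the corresponding rows.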

This means throttling would really decrease the size of our connect
table and hopefully the chance of spam getting through.

My primary goal was to reduce the delay for regular messages. But
after this I wanted to look at algorithms which would reduce the number of
spams. Throttling would have been my first try to reduce spams. But since
I had not thought about how an algorithm could work, it's great that Michel
already did the work.

However, I would not incorporate this algorithm into 1.6.0 but in 1.7.0.
If we put the other tables into sqlgrey about which I talked already, the
algorithm for throttling must be adapted. But even if not, I am not sure
if the algorithm is flexible enough. For example, let's assume the value of
connect_src_throttle is 21 and the value of group_domain_level is 10.

- if there is one or two entries in domain_awl, a new triple would be
  accepted.
- if there are 20 entries in connect as well as in from_awl and 0 in
  domain_awl, a new triple would be throttled, but 20 entries in from_awl
  should be as good as 2 entries in domain_awl because of
  group_domain_level.

Therefore a possible change to the algorithm would be to incorporate the
relation between from_awl and domain_awl, something like:

# Throttling too many connections from same new host
if (defined $self->{sqlgrey}{connect_src_throttle} and
    $self->{sqlgrey}{connect_src_throttle} > 0 and
    $self->count_src_connect($cltid) >= $self->{sqlgrey}{connect_src_throttle}) {

    # without the following tests there is a good chance of losing emails
    # from a new server of a big ISP
    my $threshold = $self->{sqlgrey}{connect_src_throttle} -
        $self->count_src_domain_awl($cltid) * $self->{sqlgrey}{group_domain_level};
    if ($threshold > 0) {
        # only consult from_awl when domain_awl alone was not enough
        $threshold -= $self->count_src_from_awl($cltid);
        if ($threshold > 0) {
            $self->mylog('grey', 2, "throttling: $cltid, $sender_name\@$sender_domain -> $recipient");
            return ($self->{sqlgrey}{reject_first} . ' Throttling too many connections from new source - ' .
                    ' Try again later. ');
        }
    }
}

BTW, this code snippet is not tested!
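
To make the weighting easier to experiment with, here is the same decision rule as a small side-effect-free Python function. It is a sketch only: the three counts are passed in as plain integers instead of being fetched from the database, and the function name is invented for illustration.

```python
def should_throttle_weighted(connect_count, domain_awl_count, from_awl_count,
                             connect_src_throttle, group_domain_level):
    """Weighted rule: each domain_awl entry counts as group_domain_level from_awl entries."""
    if connect_src_throttle <= 0 or connect_count < connect_src_throttle:
        return False  # throttling disabled, or source still under the limit
    threshold = connect_src_throttle - domain_awl_count * group_domain_level
    if threshold <= 0:
        return False  # enough domain_awl entries; from_awl need not be consulted
    return threshold - from_awl_count > 0


# connect_src_throttle = 21 and group_domain_level = 10, as in the example above.
print(should_throttle_weighted(25, 0, 20, 21, 10))  # 20 from_awl entries alone
print(should_throttle_weighted(25, 2, 1, 21, 10))   # 2 domain_awl + 1 from_awl entry
```

With these settings a source needs the equivalent of 21 from_awl entries to escape throttling, so 20 from_awl entries alone are not enough while 2 domain_awl entries plus 1 from_awl entry are.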

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   !   <mailto:St...@lr...>
Barer Str. 21           !   Fax: +49 89 2809460
80333 Muenchen, Germany !   Tel: +49 89 289-28840

[Sqlgrey-users] Re: Values for throttling

From: Michel B. <mi...@bo...> - 2005-06-09 17:12:13

On Thursday 09 June 2005 18:34, Michael Storz wrote:
>
> Here are my values using above select statement:
>
> number of entries in connect:                 1.072.022
> number of different IP addresses in connect:    110.904
> average number of entries per IP address:          9.67
> max. number of entries per IP address:            2.470

Waow ! 2.470 entries for ONE IP ;-)

> thrott. | num of  | num of    | num of  | left    | % reduc
> num.    | IP addr | entries   | thrott  | entries |
> ========+=========+===========+=========+=========+========
[...]
>    10   |  29.092 |   870.638 | 608.810 | 463.212 | 56.79 %
>    20   |  13.186 |   672.086 | 421.552 | 650.470 | 39.32 %
>    30   |   7.908 |   549.151 | 319.819 | 752.203 | 29.83 %
>    40   |   5.367 |   463.256 | 253.943 | 818.079 | 23.69 %

So by setting a throttling threshold between 10 and 40, you would save between
roughly 24 and 57% of your (huge) connect table size...

> However, I would not incorporate this algorithm into 1.6.0 but in 1.7.0.
> If we put the other tables into sqlgrey about which I talked already, the
> algorithm for throttling must be adapted.

With the figures you gave, I believe throttling alone could help a great deal
with your problem of zombie spam accidentally passing through, perhaps without
having to go for a heavier method of multiplying tables.

Maybe you'd like to give throttling alone a try, and check to what extent it
helps you, and whether you still need further improvements (at the cost of
complexity).

> But even if not, I am not sure
> if the algorithm is flexible enough. For example, let's assume the value of
> connect_src_throttle is 21 and the value of group_domain_level is 10.
>
> - if there is one or two entries in domain_awl, a new triple would be
>   accepted.
> - if there are 20 entries in connect as well as in from_awl and 0 in
>   domain_awl, a new triple would be throttled, but 20 entries in from_awl
>   should be as good as 2 entries in domain_awl because of
>   group_domain_level.

For sure we could try to "refine the refinements" a little further, but after
having thought about it for a while, I believed this not to be necessary --
maybe I'm mistaken.

Considering that we stop throttling when "we can be reasonably sure that a
given source (IP) usually retries all or most (*) of its messages", then we
don't need to throttle it anymore as:
a/ Waiting messages will (most probably) come back
b/ Throttling further could have undesired results (such as causing
unnecessarily long delays, or even causing the loss of legitimate messages after
queue lifetime expiration at the sending server)

(*) It doesn't however manage the case of both a legitimate server and a
network of zombies NATted behind the same IP...

Taking this into consideration, I think it's not of high importance to have
the "number of entries in from_awl threshold" match the threshold at which a
given IP goes into domain_awl.

We need to consider both, but we don't necessarily need to match "weights" for
both conditions.

The game is only to make sure that we throttle as long as it may be useful,
but quit throttling for servers which have already succeeded in retrying a
significant number of messages.

Isn't it ?

Cheers.

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

[Sqlgrey-users] Re: Values for throttling

From: Michel B. <mi...@bo...> - 2005-06-09 17:33:29

On Thursday 09 June 2005 18:34, Michael Storz wrote:
>
> Therefore a possible change to the algorithm would be to incorporate the
> relation between from_awl and domain_awl, something like:

To complete what I wrote in my previous message:

One of the reasons I chose _not_ to combine them, but to test domain_awl first,
is performance: if we find an entry in domain_awl, then we don't need to
perform the query against from_awl (the "and" condition in Perl will not
evaluate the following condition if the previous one doesn't match), and thus we
save a query against the bigger from_awl table when there is an entry in
domain_awl -- which is likely to be the case for big servers sending us a lot
of stuff, which are more likely than others to generate a high number of
"legitimate entries" in connect, if their IP changes for example.

If we want to mix the count from domain_awl and the count from from_awl, then
we would need to query both tables every time, which could result in a
performance loss, which would be annoying especially for big sites...

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Re: [Sqlgrey-users] Re: Values for throttling

From: Michael S. <Mic...@lr...> - 2005-06-10 15:22:42

On Thu, 9 Jun 2005, Michel Bouissou wrote:

> On Thursday 09 June 2005 18:34, Michael Storz wrote:
> >
> > Therefore a possible change to the algorithm would be to incorporate the
> > relation between from_awl and domain_awl, something like:
>
> To complete what I wrote in my previous message:
>
> One of the reasons I chose _not_ to combine them, but to test domain_awl first,
> is performance: if we find an entry in domain_awl, then we don't need to
> perform the query against from_awl (the "and" condition in Perl will not
> evaluate the following condition if the previous one doesn't match), and thus we
> save a query against the bigger from_awl table when there is an entry in
> domain_awl -- which is likely to be the case for big servers sending us a lot
> of stuff, which are more likely than others to generate a high number of
> "legitimate entries" in connect, if their IP changes for example.
>
> If we want to mix the count from domain_awl and the count from from_awl, then
> we would need to query both tables every time, which could result in a
> performance loss, which would be annoying especially for big sites...
>

If you look carefully at the algorithm, then you see that we do not have
to check both tables in every case:

    my $threshold = connect_src_throttle -
                    $self->count_src_domain_awl($cltid) * group_domain_level;

If connect_src_throttle == group_domain_level then 1 entry in domain_awl
is enough to circumvent throttling. Only if connect_src_throttle >
group_domain_level do you have to check from_awl in addition.

BTW, we use the algorithm, which checks for the IP address in domain_awl
and from_awl, for the opposite direction and call it fast propagation.
That means, if an IP address is from a well-behaved MTA, then we accept
the triple immediately. This eliminates the delay for forwarded emails,
because most of the time a well-behaved MTA has an entry in domain_awl. But
this is done at the cost of polluting the from_awl, therefore we want
the additional table for forwarding.

Because the algorithm can be used for two different purposes, we should
give it an extra subroutine, e.g. is_wellbehaved_mta.

Using both features, throttling and fast propagation, will result in a
minimum delay, because on the first try only connect_src_throttle entries
will be made in connect and with the first retry all emails, not only the
ones with entries in connect, will be accepted. Without fast propagation,
the first retry will only allow the acceptance of the emails in connect
and the second retry will accept all other emails.

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   !   <mailto:St...@lr...>
Barer Str. 21           !   Fax: +49 89 2809460
80333 Muenchen, Germany !   Tel: +49 89 289-28840

Re: [Sqlgrey-users] Re: Values for throttling

From: Michel B. <mi...@bo...> - 2005-06-12 08:11:28

On Friday 10 June 2005 17:22, Michael Storz wrote:
> >
> > If we want to mix the count from domain_awl and the count from from_awl,
> > then we would need to query both tables every time, which could result in
> > a performance loss, which would be annoying especially for big sites...
>
> If you look carefully at the algorithm, then you see that we do not have
> to check both tables in every case:
>
>     my $threshold = connect_src_throttle -
>                     $self->count_src_domain_awl($cltid) *
>                     group_domain_level;
>
> If connect_src_throttle == group_domain_level then 1 entry in domain_awl
> is enough to circumvent throttling. Only if connect_src_throttle >
> group_domain_level do you have to check from_awl in addition.

I have some objections to using this algorithm instead of the one that I had
proposed:

One entry in domain_awl IMHO "weighs more" than group_domain_level entries
in from_awl, because one entry in domain_awl is equivalent to "AT LEAST
group_domain_level entries (or more...) for the same host and same domain in
from_awl".

For this reason, I had considered that one entry in domain_awl was enough to
consider that a given host was well behaved and known enough to allow it to
bypass throttling.

If you use the algorithm you propose, let's say with a group_domain_level of
10 and a throttling threshold of 20, and you have one MTA that sends mail for
ONLY one domain, then this MTA will make it to domain_awl (and have only one
entry there even though this may correspond to thousands of different
senders), but with your algorithm this will never be enough and this MTA will
still remain "throttleable".

So I still think that we shouldn't mix a count of entries in from_awl and
domain_awl, as they don't have the same meaning, and should rather use my
algorithm: stop throttling for an IP if it has at least 1 entry in
domain_awl, or >= throttling threshold in from_awl.
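
The rule just stated, together with the earlier point about skipping the from_awl query, can be sketched in Python as follows. This is an illustration rather than the patch itself: the two lookups are modelled as zero-argument callables (an assumption of this sketch) so that the second one only runs when the first finds nothing, and the precondition on the connect count mirrors the test quoted from the snippet earlier in the thread.

```python
def should_throttle_simple(connect_count, throttle, has_domain_awl_entry, count_from_awl):
    """Stop throttling once the source has any domain_awl entry, or at least
    `throttle` entries in from_awl.

    The lookups are callables so that from_awl is only queried when
    domain_awl has no entry for the source (short-circuit evaluation).
    """
    if throttle <= 0 or connect_count < throttle:
        return False  # throttling disabled, or source still under the limit
    if has_domain_awl_entry():
        return False  # known source: no need to look at from_awl at all
    return count_from_awl() < throttle


calls = []  # records which (fake) tables were consulted


def fake_domain_lookup():
    calls.append("domain_awl")
    return True


def fake_from_lookup():
    calls.append("from_awl")
    return 0


print(should_throttle_simple(30, 20, fake_domain_lookup, fake_from_lookup), calls)
```

In this run only the domain_awl lookup is recorded, which is the saving argued for above: a source already present in domain_awl never triggers the query against the larger from_awl table.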


> BTW, we use the algorithm, which checks for the IP address in domain_awl
> and from_awl, for the opposite direction and call it fast propagation.
> That means, if an IP address is from a well behaved MTA, then we accept
> the triple immediately. This eliminates the delay for forwarded emails,
> because most of the time a wellbehaved MTA has an entry in domain_awl. But
> this is done with the cost of polluting the from_awl, therefore we want
> the additional table for forwarding.

Hmmm... I'm not sure that I completely understand what you mean here...

-- 
Michel Bouissou <mi...@bo...> OpenPGP ID 0xDDE8AC6E

Re: [Sqlgrey-users] Re: Values for throttling

From: Michael S. <Mic...@lr...> - 2005-06-14 15:29:09

On Sun, 12 Jun 2005, Michel Bouissou wrote:

> If you use the algorithm you propose, let's say with a group_domain_level of
> 10 and a throttling threshold of 20, and you have one MTA that sends mail for
> ONLY one domain, then this MTA will make it to domain_awl (and have only one
> entry there even though this may correspond to thousands of different
> senders), but with your algorithm this will never be enough and this MTA will
> still remain "throttleable".

If the MTA sends ONLY emails with originators from ONE domain, then there
will be an entry in domain_awl and ALL emails will be accepted immediately.
There is no chance for an email to be listed in connect and therefore
throttling will never occur.

>
> So I still think that we shouldn't mix a count of entries in from_awl and
> domain_awl, as they don't have the same meaning, and should rather use my
> algorithm : Stop throttling for an IP if it has at least 1 entry in
> domain_awl, or >= throttling threshold in from_awl.

I want to be able to specify that more than one entry in domain_awl
should be used. To have a simple configuration I thought about linking
entries in domain_awl and from_awl together. But if you say these entries
cannot be linked together, we have to switch to explicit values. This
means we need a vector of values, where each value corresponds to the
number of entries in an awl which would prove that we trust an MTA (I call
these MTAs well-behaved):

connect_src_throttle = (1, 10) # (value for domain_awl, value for from_awl)

Since I want to use a table for triples too, I would need a vector with 3
elements.
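
One way to read this proposal, sketched in Python: each table gets its own threshold, and a source counts as well-behaved as soon as any one table reaches its threshold. The "any" combination and the handling of non-positive thresholds are assumptions of this sketch, not something settled in the thread; the name is_wellbehaved_mta is the subroutine name suggested earlier.

```python
def is_wellbehaved_mta(counts, thresholds):
    """True if any table's entry count for the source reaches that table's threshold.

    counts and thresholds are parallel tuples, e.g. (domain_awl, from_awl).
    A threshold of 0 or less disables that table's test (an assumption).
    """
    return any(t > 0 and c >= t for c, t in zip(counts, thresholds))


connect_src_throttle = (1, 10)  # (value for domain_awl, value for from_awl)

print(is_wellbehaved_mta((0, 4), connect_src_throttle))  # too few entries anywhere
print(is_wellbehaved_mta((1, 0), connect_src_throttle))  # one domain_awl entry suffices
```

Because the tuples are zipped, a third element for a table of triples can be added to both without changing the function.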

>
>
> > BTW, we use the algorithm, which checks for the IP address in domain_awl
> > and from_awl, for the opposite direction and call it fast propagation.
> > That means, if an IP address is from a well behaved MTA, then we accept
> > the triple immediately. This eliminates the delay for forwarded emails,
> > because most of the time a wellbehaved MTA has an entry in domain_awl. But
> > this is done with the cost of polluting the from_awl, therefore we want
> > the additional table for forwarding.
>
> Hmmm... I'm not sure that I completely understand what you mean here...
>
>

Ok, which part can I describe better:

- how fast propagation works
- or what the relationship is between forwarding and fast propagation

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   !   <mailto:St...@lr...>
Barer Str. 21           !   Fax: +49 89 2809460
80333 Muenchen, Germany !   Tel: +49 89 289-28840

Re: [Sqlgrey-users] RELEASE: 1.5.9

From: Who K. <qui...@me...> - 2005-06-24 02:38:31

It would be nice if you put the download link in these release 
announcements.

Thanks,
Jim

P.S. Yes, I'm going to fetch 1.6, but this was the message tagged as a to-do
in my inbox.

Lionel Bouton wrote:

>Hi,
>
>SQLgrey 1.5.9 tarball is on sourceforge (RPMs should come shortly after,
>  
>