|
From: Michael S. <Mic...@lr...> - 2005-04-28 16:42:10
|
On Thu, 28 Apr 2005, Lionel Bouton wrote:

> A quick question though. I'm wondering if a "src_awl" would be of any
> use, could people with large sites check how many entries they have in
> domain_awl for the same src?
> I'm interested in the results of
> SELECT src, count(*) FROM domain_awl GROUP BY src ORDER BY -count(*)
> LIMIT 10;
> and
> SELECT count(*) FROM domain_awl;
>
> This will show if we can reduce the DB load by merging some entries in
> another table for quicker lookups.

Yes, this was something I was also thinking about. At the moment I do
this manually: I analyze the tables and put IP addresses with lots of
entries into client_ip_whitelist.local.

Here is the output of the different SELECT statements I use:

select host_ip,count(*) as cnt from domain_awl group by host_ip
order by cnt desc limit 10;
+----------------+-----+
| host_ip        | cnt |
+----------------+-----+
| 141.40.103.103 |  90 |
| 132.230.2.211  |  69 |
| 130.60.68.105  |  60 |
| 194.95.177.104 |  54 |
| 194.95.177.121 |  53 |
| 130.60.68.106  |  52 |
| 153.96.1.62    |  52 |
| 141.48.3.8     |  51 |
| 195.200.32.20  |  50 |
| 62.153.78.100  |  47 |
+----------------+-----+

select substring_index(host_ip, '.', 3),count(*) as cnt from domain_awl
group by substring_index(host_ip, '.', 3) order by cnt desc limit 10;
+----------------------------------+-----+
| substring_index(host_ip, '.', 3) | cnt |
+----------------------------------+-----+
| 80.237.130                       | 196 |
| 193.125.235                      | 130 |
| 217.115.142                      | 128 |
| 193.109.255                      | 113 |
| 130.60.68                        | 112 |
| 194.95.177                       | 107 |
| 81.209.184                       | 102 |
| 195.200.32                       |  94 |
| 81.209.148                       |  91 |
| 141.40.103                       |  90 |
+----------------------------------+-----+

select count(*) from domain_awl;
+----------+
| count(*) |
+----------+
|    51800 |
+----------+

select host_ip,count(*) as cnt from from_awl group by host_ip
order by cnt desc limit 10;
+-----------------+------+
| host_ip         | cnt  |
+-----------------+------+
| 141.40.103.103  | 8189 |
| 62.216.178.196  | 2010 |
| 80.237.203.120  | 1651 |
| 217.115.139.21  | 1528 |
| 146.82.138.7    | 1126 |
| 80.80.20.42     | 1091 |
| 132.229.231.52  | 1080 |
| 217.172.173.165 | 1063 |
| 192.108.115.12  | 1054 |
| 194.208.88.1    |  984 |
+-----------------+------+

select substring_index(host_ip, '.', 3),count(*) as cnt from from_awl
group by substring_index(host_ip, '.', 3) order by cnt desc limit 10;
+----------------------------------+------+
| substring_index(host_ip, '.', 3) | cnt  |
+----------------------------------+------+
| 141.40.103                       | 8190 |
| 72.5.1                           | 4684 |
| 64.125.87                        | 2018 |
| 62.216.178                       | 2015 |
| 208.184.55                       | 1825 |
| 80.237.203                       | 1669 |
| 206.190.36                       | 1628 |
| 217.115.139                      | 1534 |
| 216.155.197                      | 1380 |
| 140.98.193                       | 1349 |
+----------------------------------+------+

select count(*) from from_awl;
+----------+
| count(*) |
+----------+
|   353241 |
+----------+

As you can see, I have already optimized my domain_awl pretty well. The
only remaining candidate to whitelist is 141.40.103.103, one MTA of a
local mail cluster; the other one I have already whitelisted.
Interesting is the line

| 141.40.103.103 | 8189 |

When I analyzed the from_awl, I found several such IP addresses with
extremely high numbers of entries. Going through the logs, I finally
found out why this was the case: the reason is forwarding. In the above
case there were just 2 people who forwarded their email to mailboxes on
our system. All these entries have originators made up by spammers,
which most of the time do not exist.

This brings me to my next wish :-) I need a forward_awl. And this is
therefore another reason to have the connect_awl, otherwise I have to
populate the forward_awl manually (actually, I have already written a
little script to extract these entries out of the log file). Again,
aggregation would be done to fill the table, but this time on the
originator, whereas for the from_awl aggregating on the recipient would
be used. The forward_awl will decrease the need for a src_awl.
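The per-/24 aggregation above relies on MySQL's substring_index(). As a
minimal, self-contained sketch of the same analysis, here it is
reproduced in Python with SQLite (the two-column schema and sample rows
are hypothetical, just enough to show the grouping; real SQLgrey tables
have more columns):

```python
import sqlite3

# Hypothetical minimal schema with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE from_awl (sender TEXT, host_ip TEXT)")
conn.executemany("INSERT INTO from_awl VALUES (?, ?)",
                 [("a@x", "141.40.103.103"),
                  ("b@y", "141.40.103.200"),
                  ("c@z", "62.216.178.196")])

# SQLite lacks MySQL's substring_index(); emulate it with a UDF that
# keeps the first `count` dot-separated parts of the IP.
def substring_index(s, delim, count):
    return delim.join(s.split(delim)[:count])

conn.create_function("substring_index", 3, substring_index)
top = conn.execute(
    "SELECT substring_index(host_ip, '.', 3) AS net, count(*) AS cnt "
    "FROM from_awl GROUP BY net ORDER BY cnt DESC LIMIT 10").fetchall()
print(top)  # → [('141.40.103', 2), ('62.216.178', 1)]
```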
If you look at the other high numbers from the from_awl, these are
networks of BlueStream Media, a well-known spammer, which I have not
blocked yet, see:

64.125.87.0/24   http://www.spamhaus.org/sbl/sbl.lasso?query=SBL18058
64.125.188.0/25  http://www.spamhaus.org/sbl/sbl.lasso?query=SBL14961
69.25.109.0/24   http://www.spamhaus.org/sbl/sbl.lasso?query=SBL20650
72.5.1.0/24      http://www.spamhaus.org/sbl/sbl.lasso?query=SBL22215
208.184.55.0/25  http://www.spamhaus.org/sbl/sbl.lasso?query=SBL13542

Regards,
Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   ! <mailto:St...@lr...>
Barer Str. 21           ! Fax: +49 89 2809460
80333 Muenchen, Germany ! Tel: +49 89 289-28840
|
From: Lionel B. <lio...@bo...> - 2005-04-28 22:20:05
|
Michael Storz wrote the following on 28.04.2005 18:42:

> Yes, this was something I was also thinking about. At the moment I made
> this manually. I analyzed the tables and put ip addresses with lots of
> entries in the tables in client_ip_whitelist.local.

I see, this is probably the best way of speeding up the whole process:
Perl hashes can't be slower than a database query!

> Here ist the output of the different select-statements I use:
>
> (...)
>
> select substring_index(host_ip, '.', 3),count(*) as cnt from from_awl
> group by substring_index(host_ip, '.', 3) order by cnt desc limit 10;

Given that you manually query for the 3 most significant bytes, do you
use 'full' for the greylisting algorithm? Maybe smart would be better
suited (it would slowly decrease the number of database entries by
replacing IPs with class C nets). It's difficult to say how many fewer
entries you would have without a script applying the algorithm after
DNS lookups, though... I made it the default because it is more
friendly to mail pools, but the side effect is that it is also more
friendly to your database :-)

> As you can see, I have already optimized my domain_awl pretty good. The
> only candidate to whitelist is 141.40.103.103, one MTA of a local
> mailcluster, the otherone I already whitelisted. Interesting is the line
>
> | 141.40.103.103 | 8189 |
>
> When I analyzed the from_awl, I found several such ip addresses with
> extreme high numbers of entries. Going through the logs, I finally found
> out, why this was the case. The reason is forwarding. In the above case
> there were just 2 people which forwarded their email to mailbox on our
> system. All these entries have originators thought up by spammers, which
> most of the time do not exist.
>
> This brings me to my next wish :-) I need a forward_awl. And therefore
> this is another reason to have the connect_awl, otherwise I have to
> populate the forward_awl manually (actually, I have already written a
> little script to extract these entries out of the log file). Again
> aggregation would be done to fill the table, but this time on originator,
> whereas for the from_awl aggregating on recipient would be used.

I understand what you want (it took me some time though :)). The
forward_awl (in fact more of a rcpt_awl, if we refer to the field being
awl'ed) will prevent the from_awl from being filled with hundreds of
entries.

I realised some time ago that I don't need to add the connect_awl (or
forward/rcpt_awl, for that matter) before releasing 1.5.6 with IPv6 and
opt-in/opt-out support. This can wait for 1.5.7, and I won't have to
code any database upgrade, as SQLgrey checks for missing tables and
automatically recreates them during startup. So we still have some time
to discuss the details, which is a good thing.

> The forward_awl will decrease the need for a src_awl.

Yep, I realize that. Really good idea.

Did anyone else on the list see similar behaviours (spammers exploiting
the from_awl weakness and forwards generating lots of from_awl entries)?

Lionel.
|
From: Michael S. <Mic...@lr...> - 2005-04-29 09:28:31
|
On Fri, 29 Apr 2005, Lionel Bouton wrote:
> Given you manually query for the 3 most significant bytes, do you use
> 'full' for the greylisting algorithm? Maybe smart would be better suited
> (would slowly decrease the number of database entries by replacing IPs
> by class C nets). It's difficult to say how much less entries you would
> have without a script applying the algo after DNS lookups though... I
> made it the default because it is more friendly with mail pools, but the
> side effect is that it is also more friendly with your database :-)
I know, at some point we will have to discuss our differing views about
FULL versus SMART :-)
Indeed, we are using FULL, because I think it is the right way to go. It
definitely depends on the amount of email you are receiving. A site with
a small or moderate amount of email may benefit from SMART. Therefore
the default is OK with me.
Now, what are the issues using FULL or SMART?
- One of your arguments was a smaller database. Well, that's actually no
problem for us:
du -h /var/lib/mysql/sqlgrey/
258M /var/lib/mysql/sqlgrey
With 4 GByte of memory the whole database including indexes should be in
memory all the time. Our graph of CPU usage shows that 5 % is used by
sqlgrey and mysqld, and up to another 5 % by our log analysis.
Therefore, from the standpoint of performance, I have no problem with
additional complexity in sqlgrey.
- Loss of emails from mail systems with lots of MTAs/IP addresses for
outgoing email. Such systems either
* always retry from the same IP address (separate queueing systems), or
* use a different IP address for every retry (common queueing
system, for example a database-driven queueing system),
and either
* use a linear backoff algorithm for retries (every 15, 30 or 60
minutes), or
* use an exponential backoff algorithm (e.g. 5, 10, 20, 40, 80, ...
minutes)
IP
MTA | same | diff |
----+------+------+
lin | a | b |
----+------+------+
exp | c | d |
----+------+------+
* Case a: Here, emails will be delayed until an entry for
every MTA is created. It will take longer for FULL than
for SMART, but normally no email will be lost. Most
MTA pools are of this type.
* Case b: From my experience, this setup is rare. Here
there is a chance that FULL will not accept an email if there
are a lot of MTAs in the pool and the retry time is longer than
usual. E.g. if the retry time is 30 minutes, reconnect_delay less than
30 minutes and max_connect_age 24 hours, then the pool can have up to
46 MTAs and we will still accept the email, but it will be delayed for
nearly 24 hours. In reality, emails will not be delayed that long if
the MTAs are chosen randomly.
How many sites will have such big pools of MTAs?
* Case c: Similar to a, but emails will be delayed longer than in case
a. Still, for a well-behaved MTA, no email will be lost.
* Case d: Sites using such a system are very rude to the Net, in my
eyes. If they use a common queuing system, then they can distribute
the load over a cluster of outgoing MTAs, but they MUST shield this
from the outside, e.g. using NAT. Such installations are not
well-behaved MTAs and must be whitelisted.
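The case-b arithmetic can be checked with a tiny model. In this
simplified sketch (function and parameter names are mine, not
SQLgrey's), a pool of N MTAs retries round-robin every retry_min
minutes, so the first repeated IP shows up after N * retry_min; FULL
accepts only if that repeat falls inside the greylist window bounded by
max_connect_age:

```python
def max_full_pool(retry_min, max_connect_age_h):
    """Largest round-robin pool whose first repeated IP still falls
    inside the greylist window under FULL. Simplified model: attempt
    k arrives at time k * retry_min from MTA number (k mod N), so
    MTA 0's second attempt comes at N * retry_min minutes."""
    return (max_connect_age_h * 60) // retry_min

# 30-minute retries and a 24 h max_connect_age give 48 MTAs in this
# model, in the same ballpark as the ~46 quoted above (reconnect_delay
# and boundary effects shave off a couple); the mail is then delayed
# N * 30 min, i.e. nearly 24 hours for a full-size pool.
print(max_full_pool(30, 24))  # → 48
```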
- Higher delay of FULL compared to SMART:
The right medicine against that is good whitelists! Good means both the
number of whitelists and their content. In addition to the
standard sqlgrey algorithms for filling the tables via traffic analysis,
we have implemented our own algorithms :-)
* fast propagation (fills from_awl): This algorithm is based on the
trust we have in a sending MTA. If we trust it, we accept the
email even if there is no entry for this triple in the whitelists.
* MX-check (fills domain_awl): if outgoing and incoming MTAs are the
same, put an entry in domain_awl.
* A-check (fills domain_awl): if sending MTA sends emails for its
hostname only, put it in domain_awl.
These additional algorithms give us a lot of entries in our from_awl and
domain_awl and therefore reduce the delay significantly. And the last 2
algorithms only work with FULL, not with SMART, with the current design
of sqlgrey. Sorry for you, guys :-)
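As I read them, the MX-check and A-check could look like the following
sketch. The DNS record sets are passed in pre-resolved so the logic is
testable offline, and all function and parameter names are illustrative
rather than actual SQLgrey code:

```python
def mx_check(client_ip, sender_mx_ips):
    """MX-check: the connecting client is also one of the inbound MX
    hosts of the sender's domain, i.e. outgoing and incoming MTA are
    the same machine -> candidate for domain_awl."""
    return client_ip in sender_mx_ips

def a_check(client_ip, helo_name, helo_a_records, sender_domain):
    """A-check: the client resolves to its HELO name and only sends
    mail for that hostname -> candidate for domain_awl."""
    return client_ip in helo_a_records and sender_domain == helo_name

# A host that is both an MX for example.org and sends only as
# example.org passes both checks:
print(mx_check("192.0.2.25", {"192.0.2.25", "192.0.2.26"}))  # → True
print(a_check("192.0.2.25", "example.org",
              {"192.0.2.25"}, "example.org"))                # → True
```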
Regarding additional whitelists, forward_awl/rcpt_awl is one of them. At
the moment fast propagation replaces this table, because most of the
time we immediately accept all the spam mails from forwarding where the
remote MTA does not use greylisting, but at the cost of many unnecessary
entries in from_awl.
Another one would be prevalidation, as implemented in other greylisting
software. Here you put the tuple originator/recipient, without an IP
address, in a table for every email you send out.
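Prevalidation could be sketched like this (the schema and names are
hypothetical, just to illustrate the idea): every outgoing mail records
the reply pair we now expect, and a matching incoming mail bypasses
greylisting regardless of the client IP:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prevalid (originator TEXT, recipient TEXT)")

def record_outgoing(local_sender, remote_rcpt):
    # We sent mail to remote_rcpt, so expect a reply addressed back
    # to local_sender with remote_rcpt as the originator.
    db.execute("INSERT INTO prevalid VALUES (?, ?)",
               (remote_rcpt, local_sender))

def prevalidated(originator, recipient):
    # Incoming mail skips greylisting if the pair was seen outbound;
    # note that no IP address appears in the lookup.
    row = db.execute("SELECT 1 FROM prevalid WHERE originator = ? "
                     "AND recipient = ?", (originator, recipient)).fetchone()
    return row is not None

record_outgoing("user@example.edu", "friend@example.net")
print(prevalidated("friend@example.net", "user@example.edu"))   # → True
print(prevalidated("spammer@example.com", "user@example.edu"))  # → False
```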
- Trust in the sending MTA:
SMART reduces the trust we gain in the sending MTA handling temporary
errors in a well-behaved way, as described by the relevant RFCs.
Well-behaved means
* trying to retransmit a message several times until a timeout of 3 - 5
days occurs,
* retransmitting emails in a timely manner (minutes) and not only once
in 24 hours.
For me this is the reason to use FULL. I want to gain trust in the
sending MTA; the more, the better. And if I trust an MTA, I will
accept its emails as fast as I can.
Actually, what I want is to strengthen the trust in the sending MTA,
e.g. by using the domain from HELO/EHLO or by requiring several retries
before I accept a connection. But that's another story and some work for
the future, to be done before spammers change their software. A detailed
analysis of the retransmit behavior of other MTAs is needed first.
Regards,
Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum ! <mailto:St...@lr...>
Barer Str. 21 ! Fax: +49 89 2809460
80333 Muenchen, Germany ! Tel: +49 89 289-28840
|