From: Michael S. <Mic...@lr...> - 2005-05-06 22:13:32
Analyzing our from_awl, I found the following:

The table has 365.208 entries from 178.026 different IP addresses. Of these IP addresses

- 129.210 have exactly one entry, and that entry has sender_domain = "-undef-"
- 38.904 have only entries without sender_domain = "-undef-"
- only 9.912 have entries with both kinds of sender_domain

If we split the from_awl into 2 tables

- from_awl: sender_domain <> "-undef-"
- dsn_awl: sender_domain = "-undef-"

we get a massive reduction of entries in the from_awl and also a massive reduction of table size, since we do not have to store sender_name and sender_domain in dsn_awl.

In our case, dsn_awl would have 139.122 entries and from_awl 48.816 entries. Since we know which table to query based on sender_domain/sender_name, no additional table lookups are needed.

Another advantage would be that the from_awl does not change as much as before, because all the DSNs which result from spammers' backscatter are now excluded. And we could decide to use a different awl_age for each table.

If we include connect_awl, I don't think we need to split that table, because the backscatter DSNs will propagate quickly into the dsn_awl; only normal DSNs will stay in connect_awl.

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum ! <mailto:St...@lr...>
Barer Str. 21 ! Fax: +49 89 2809460
80333 Muenchen, Germany ! Tel: +49 89 289-28840
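The split proposed in the message above could look roughly like the following. This is only a sketch under stated assumptions: it uses SQLite through Python's sqlite3 module for illustration, the column names (sender_name, sender_domain, src, first_seen, last_seen) and the helper names split_from_awl and awl_table_for are hypothetical, and it assumes one dsn_awl row per source IP.

```python
import sqlite3

UNDEF = "-undef-"  # sender_domain value that marks a DSN in from_awl


def split_from_awl(conn):
    """Move DSN rows out of from_awl into a narrower dsn_awl table.

    Column names are assumptions for this sketch, not a confirmed schema.
    """
    cur = conn.cursor()
    # dsn_awl stores no sender_name/sender_domain: both are implied.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS dsn_awl ("
        " src TEXT NOT NULL PRIMARY KEY,"
        " first_seen TEXT NOT NULL,"
        " last_seen TEXT NOT NULL)"
    )
    # Collapse to one row per source IP, keeping the widest time span.
    cur.execute(
        "INSERT OR REPLACE INTO dsn_awl (src, first_seen, last_seen)"
        " SELECT src, MIN(first_seen), MAX(last_seen) FROM from_awl"
        " WHERE sender_domain = ? GROUP BY src",
        (UNDEF,),
    )
    cur.execute("DELETE FROM from_awl WHERE sender_domain = ?", (UNDEF,))
    conn.commit()


def awl_table_for(sender_domain):
    """Pick the single table to query; no second lookup is needed."""
    return "dsn_awl" if sender_domain == UNDEF else "from_awl"
```

Because the sender domain alone decides which table applies, a lookup still touches exactly one table, which matches the point that no additional lookups are needed.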
From: Michael S. <Mic...@lr...> - 2005-05-06 22:43:03
Sorry, I mixed up the number of entries with the number of IP addresses:

On Sat, 7 May 2005, Michael Storz wrote:

> Analyzing our from_awl, I found the following:
>
> The table has 365.208 entries from 178.026 different ip addresses.
> From these ip addresses
>
> - 129.210 have exactly one entry and this is with sender_domain =
>   "-undef-"
> - 38.904 have only entries without sender_domain = "-undef-"
> - only 9.912 have entries with both kind of sender_domains
>
> If we split the from_awl in 2 tables
>
> - from_awl: sender_domain <> "-undef-"
> - dsn_awl: sender_domain = "-undef-"
>
> we get a massive reduction of entries in the from_awl and also a massive
> reduction of table size, since we do not have to store sender_name and
> sender_domain in dsn_awl.
>
> In our case, dsn_awl would have 139.122 entries and from_awl 48.816

In our case, dsn_awl would have 139.122 entries/IP addresses and from_awl 226.086 entries from 48.816 IP addresses.

> entries. Since we know which table to query based on
> sender_domain/sender_name no additional table lookups are needed.
>
> Another advantage would be that the from_awl does not change so much as
> before, because all the DSNs which result as backscatter of the spammers
> are excluded now. And we could decide to use a different awl_age for the
> tables.
>
> If we include connect_awl, I don't think we need a split of this table,
> because the backscatter DSNs will propagate fast into the dsn_awl, only
> normal DSNs will stay in connect_awl.

Michael Storz
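The corrected figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that every IP address with a "-undef-" entry has exactly one such entry, which is what the "139.122 entries/IP addresses" figure implies; the variable names are made up for illustration, and the dots in the mails are read as thousands separators.

```python
# Figures quoted in the thread.
total_entries = 365_208
total_ips = 178_026
ips_dsn_only = 129_210   # exactly one entry, sender_domain = "-undef-"
ips_from_only = 38_904   # no "-undef-" entries at all
ips_both = 9_912         # entries of both kinds

# The three groups partition the set of IP addresses.
assert ips_dsn_only + ips_from_only + ips_both == total_ips

# Assumption: one "-undef-" entry per IP address that has any.
dsn_awl_entries = ips_dsn_only + ips_both
from_awl_ips = ips_from_only + ips_both
from_awl_entries = total_entries - dsn_awl_entries

print(dsn_awl_entries, from_awl_ips, from_awl_entries)
# prints: 139122 48816 226086
```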
From: Lionel B. <lio...@bo...> - 2005-05-06 23:38:35
Michael Storz wrote the following on 07.05.2005 00:13 :

>Analyzing our from_awl, I found the following:
>
>The table has 365.208 entries from 178.026 different ip addresses.
>From these ip addresses
>
>- 129.210 have exactly one entry and this is with sender_domain =
>  "-undef-"
>- 38.904 have only entries without sender_domain = "-undef-"
>- only 9.912 have entries with both kind of sender_domains
>
>If we split the from_awl in 2 tables
>
>- from_awl: sender_domain <> "-undef-"
>- dsn_awl: sender_domain = "-undef-"
>
>(...)

Ok. I consider this a design bug. In the original design, from_awl is the first step towards domain_awl. But DSNs can't go into domain_awl (obviously, because they have no domain...). This won't make it into 1.6.0, but it is now on my TODO.

>If we include connect_awl, I don't think we need a split of this table,
>because the backscatter DSNs will propagate fast into the dsn_awl, only
>normal DSNs will stay in connect_awl.

This I don't understand. How do we tell backscatter DSNs from normal DSNs?

Lionel.
From: Michael S. <Mic...@lr...> - 2005-05-07 19:25:16
On Sat, 7 May 2005, Lionel Bouton wrote:

> Michael Storz wrote the following on 07.05.2005 00:13 :
>
> >Analyzing our from_awl, I found the following:
> >
> >The table has 365.208 entries from 178.026 different ip addresses.
> >From these ip addresses
> >
> >- 129.210 have exactly one entry and this is with sender_domain =
> >  "-undef-"
> >- 38.904 have only entries without sender_domain = "-undef-"
> >- only 9.912 have entries with both kind of sender_domains
> >
> >If we split the from_awl in 2 tables
> >
> >- from_awl: sender_domain <> "-undef-"
> >- dsn_awl: sender_domain = "-undef-"
> >
> >(...)
>
> Ok. I consider this a design bug. In the original design from_awl is the
> first step towards domain_awl. But DSNs can't go into domain_awl
> (obviously because of the lack of domain...). This won't make it in
> 1.6.0, but it is now in my TODO.

Well, I wouldn't call it a design bug, I would call it an optimization. That sounds much more positive :-)

What I am trying to do is get all the backscatter out of from_awl. At the moment, backscatter mainly results from DSNs and forwards, as far as I can see. For both, I've suggested new tables. What I am interested in is how stable from_awl and domain_awl become. This leads to the question of how stable email communication relationships are. How many new communication partners appear, and how many old relationships end? What percentage can we expect?

> >If we include connect_awl, I don't think we need a split of this table,
> >because the backscatter DSNs will propagate fast into the dsn_awl, only
> >normal DSNs will stay in connect_awl.
>
> This I don't understand. How do we make the difference between
> backscatter DSNs and normal DSNs ?

Sorry, my explanation was a little too short. What I meant was this: most of the time, when a spammer uses one of our domains as the originator of his spam, the left side of the address is generated. Therefore, the DSNs coming back to us are directed to a lot of different recipients. Aggregation will move such DSNs to dsn_awl very quickly.

On the other hand, if a local user makes an error in a recipient address, most of these emails will not leave our system. Only a few will be accepted by other systems and will then generate a DSN. Such DSNs will therefore stay in connect_awl, because not enough DSNs are available for aggregation.

This is the reason why I said backscatter DSNs will go to dsn_awl whereas normal DSNs will stay in connect_awl. Looking at a single DSN, you are right: we can't decide whether it is a backscatter DSN or a normal DSN.

Michael Storz
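The aggregation argument above can be illustrated with a small in-memory model. This is a hypothetical sketch, not how any existing code works: the class DsnAggregator, the threshold of 3 distinct recipients, and the idea of keying connect_awl by source IP are all assumptions made only to show why many-recipient backscatter reaches dsn_awl while an occasional normal DSN does not.

```python
from collections import defaultdict

AGGREGATION_THRESHOLD = 3  # hypothetical value, not taken from the thread


class DsnAggregator:
    """Toy model of DSN aggregation from connect_awl into dsn_awl."""

    def __init__(self, threshold=AGGREGATION_THRESHOLD):
        self.threshold = threshold
        # Source IP -> set of distinct recipients that received DSNs from it.
        self.connect_awl = defaultdict(set)
        # Source IPs whose DSNs have been aggregated.
        self.dsn_awl = set()

    def record_dsn(self, src, recipient):
        """Record one DSN and return the table that now holds src."""
        if src in self.dsn_awl:
            return "dsn_awl"
        self.connect_awl[src].add(recipient)
        # Enough distinct recipients: aggregate into dsn_awl.
        if len(self.connect_awl[src]) >= self.threshold:
            del self.connect_awl[src]
            self.dsn_awl.add(src)
            return "dsn_awl"
        return "connect_awl"
```

With this model, a server bouncing spam to many generated local parts crosses the threshold after a few DSNs, while a server returning a single DSN for one mistyped address stays in connect_awl. No individual DSN is classified; only the volume per source differs.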