Thread: [Sqlgrey-users] Improved dynamic/one-shot email address regex

[Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Jeff R. <py...@fi...> - 2005-09-20 22:15:24

Attachments: sqlgrey-regex.patch

Hi,
Just thought I would share a small patch that deals with a number of
single-use email addresses that weren't being recognized by the existing
regex in sqlgrey.  These are addresses such as bounce-return-12310123981.
This patch just tries to mask the parts that appear to be unique, so
the database doesn't get filled with addresses that won't be used again.

I somewhat arbitrarily decided that if an email name contained a
delimiter such as "-","_", or "." along with a string of 12 or more
alphanumeric characters, then those characters should be masked.  That
may or may not result in some emails being masked when they should not,
or some not being masked when they should.  I don't believe the result
will be tragic in either case, and this can be adjusted to your liking.
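
For anyone who wants to try the rule outside sqlgrey, here is a minimal Python sketch of the masking described above. The function name mask_long_tokens and the sample addresses are made up for illustration. The two patterns mirror the {12,} substitutions in the patch, and the loop repeats until nothing changes because two adjacent tokens share a delimiter and cannot both match in one pass.

```python
import re

# Hypothetical helper; mirrors the two "{12,}" substitutions from the patch.
MIDDLE = re.compile(r"([._-])[0-9a-z]{12,}([._-])")
TAIL = re.compile(r"([._-])[0-9a-z]{12,}$")

def mask_long_tokens(user: str) -> str:
    """Mask delimited runs of 12 or more lowercase alphanumerics with '#'."""
    user = user.lower()  # stands in for the /i flag used in the patch
    previous = None
    while previous != user:
        # Adjacent tokens share a delimiter, so repeat until stable.
        previous = user
        user = MIDDLE.sub(r"\1#\2", user)
    return TAIL.sub(r"\1#", user)

# Invented sample names, not taken from any real from_awl table.
for name in ("news-a1b2c3d4e5f6g7-list", "john.smith", "info.administration"):
    print(name, "->", mask_long_tokens(name))
```

The last sample shows the kind of false positive mentioned above: an ordinary 14-letter word after a delimiter is masked as well.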

It might not work as well for other folks, but it seems to catch the
major ones I see.  I am sure there are other patterns that I didn't
catch simply because they don't come up frequently in my email mix.

Jeff

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Lionel B. <lio...@bo...> - 2005-09-21 13:22:29

Jeff Rice wrote the following on 21.09.2005 00:14 :

>Hi,
>Just thought I would share a small patch that deals with a number of
>single-use email addresses that weren't being recognized by the existing
>regex in sqlgrey.  These are the sort of bounce-return-12310123981, etc.
> This patch just tries to mask the parts that appear to be unique, so
>the database doesn't get filled with addresses that won't be used again.
>
>I somewhat arbitrarily decided that if an email name contained a
>delimiter such as "-","_", or "." along with a string of 12 or more
>alphanumeric characters, then those characters should be masked.  That
>may or may not result in some emails being masked when they should not,
>or some not being masked when they should.  I don't believe the result
>will be tragic in either case, and this can be adjusted to your liking.
>
>It might not work as well for other folks, but it seems to catch the
>major ones I see.  I am sure there are other patterns that I didn't
>catch simply because they don't come up frequently in my email mix.
>
>Jeff
>  
>

Thanks, added in the 1.7.x branch, will be in 1.7.2. Comments below in 
the patch.

>--- sqlgrey 2005-09-03 01:09:21.000296554 +0000
>+++ /usr/sbin/sqlgrey   2005-09-03 01:09:02.000989883 +0000
>@@ -986,14 +986,21 @@
>     $user =~ s/^srs1=[^=]+=([^=]+)(=+)[^=]+=[^=]+=([^=]+)=([^=]+)$/srs1=#=$1$2#=#=$3=$4/;
>     # strip extension, used sometimes for mailing-list VERP
>     $user =~ s/\+.*//;
>+
>+    # strip frequently used bounce/return masks
>+    $user =~ s/((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z-_.]+$/$1#/gi;   # Added by JR
>+
>  
>

Good, I believe this is useful. Note: the case-insensitive match isn't 
needed. All addresses are lowercased before being processed. I removed 
it from all your substitutions.

>     # strip hexadecimal sequences (doable in one regexp ?)
>     # don't strip a leading hex sequence though
>     my $tmp = '';
>     while ($tmp ne $user) {
>    $tmp = $user;
>    $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
>-    }
>+   $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi;                                # Added by JR
>  
>

12 is arbitrary but seems good to me. I'm not sure how this one will 
play out in the wild (this is why I prefer to put this code in the 1.7.x 
branch).

>+
>+   }
>     $user =~ s/([._-])[0-9a-f]+$/$1#/g;
>+    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi;                                       # Added by JR
>  
>

OK

Lionel.

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Jeff R. <py...@fi...> - 2005-09-21 16:59:23

Lionel Bouton wrote:
> Good, I believe this is useful. Note: the case insensitive match isn't
> needed. All addresses are lowercased before being processed. I removed
> it from all your substitution.

Good to know.  The other VERP I am experimenting with deals with a
number of emails I get that don't contain tell-tale signs (like -,_, or
.) but are otherwise one-shot emails.  At the moment, I am assuming that
if an email name contains more than 7 consecutive digits, the whole name
should be masked.  I have never seen a normal email account with that
many digits in a row.

# mask long numeric sequences
$user =~ s/.*[0-9]{7,}.*/#/g;

This may be less useful in the wild, but seems to function well enough
for me.  The only drawback I can foresee is that this will probably mask
emails that are sent by cellphones/pagers etc., since those often use
the phone number as the user name.

This and the other patch may very well need tweaking, so it would be
useful if people could look at their from_awl tables and see how things
are looking.

Jeff

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Lionel B. <lio...@bo...> - 2005-09-21 22:43:25

Jeff Rice wrote the following on 21.09.2005 18:58 :

>Lionel Bouton wrote:
>> Good, I believe this is useful. Note: the case insensitive match isn't
>> needed. All addresses are lowercased before being processed. I removed
>> it from all your substitution.
>
>Good to know.  The other VERP I am experimenting with deals with a
>number of emails I get that don't contain tell-tale signs (like -,_, or
>.) but are otherwise one-shot emails.  At the moment, I am assuming that
>if an email name contains more than 7 consecutive digits, the whole name
>should be masked.  I have never seen a normal email account with that
>many digits in a row.
>
># mask long numeric sequences
>$user =~ s/.*[0-9]{7,}.*/#/g;
>  
>

Doesn't that simply replace any string with at least seven successive  
numerical characters by a sharp?

I would have used:

$user =~ s/[0-9]{7,}/#/g;


Lionel.

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Lionel B. <lio...@bo...> - 2005-09-21 23:07:47

Lionel Bouton wrote the following on 22.09.2005 00:43 :

> Jeff Rice wrote the following on 21.09.2005 18:58 :
>
>> Lionel Bouton wrote:
>>> Good, I believe this is useful. Note: the case insensitive match isn't
>>> needed. All addresses are lowercased before being processed. I removed
>>> it from all your substitution.
>>
>> Good to know.  The other VERP I am experimenting with deals with a
>> number of emails I get that don't contain tell-tale signs (like -,_, or
>> .) but are otherwise one-shot emails.  At the moment, I am assuming that
>> if an email name contains more than 7 consecutive digits, the whole name
>> should be masked.  I have never seen a normal email account with that
>> many digits in a row.
>>
>> # mask long numeric sequences
>> $user =~ s/.*[0-9]{7,}.*/#/g;
>>  
>>
>
> Doesn't that simply replace any string with at least seven successive  
> numerical characters by a sharp?
>
> I would have used:
>
> $user =~ s/[0-9]{7,}/#/g;


Sorry, just understood that your regexp reflected your goal.

I've seen GSM-related e-mail addresses too. I'm not sure what to do about 
that one.
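
To make the difference between the two substitutions concrete, here is a small Python sketch. The function names and sample addresses are invented for illustration. The first function follows Jeff's pattern and collapses the whole name, the second follows the alternative and replaces only the digit run.

```python
import re

def mask_whole_name(user: str) -> str:
    # Jeff's variant: a name containing 7+ consecutive digits becomes '#'.
    return re.sub(r".*[0-9]{7,}.*", "#", user)

def mask_digit_runs(user: str) -> str:
    # Alternative: only the run of 7+ digits is replaced, the rest is kept.
    return re.sub(r"[0-9]{7,}", "#", user)

# Invented samples: a one-shot style name, a phone-number style name,
# and an ordinary name with a short digit suffix.
for name in ("promo20050921x", "491701234567", "bob1999"):
    print(name, "->", mask_whole_name(name), "|", mask_digit_runs(name))
```

Both variants reduce a purely numeric name to "#", which is why phone-number style senders are affected either way.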

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Michael S. <Mic...@lr...> - 2005-09-23 12:29:01

On Wed, 21 Sep 2005, Jeff Rice wrote:

> Lionel Bouton wrote:
> > Good, I believe this is useful. Note: the case insensitive match isn't
> > needed. All addresses are lowercased before being processed. I removed
> > it from all your substitution.
>
> Good to know.  The other VERP I am experimenting with deals with a
> number of emails I get that don't contain tell-tale signs (like -,_, or
> .) but are otherwise one-shot emails.  At the moment, I am assuming that
> if an email name contains more than 7 consecutive digits, the whole name
> should be masked.  I have never seen a normal email account with that
> many digits in a row.
>
> # mask long numeric sequences
> $user =~ s/.*[0-9]{7,}.*/#/g;
>
> This may be less-useful in the wild, but seems to function well enough
> for me.  The only drawback I can foresee is that this will probably mask
> emails that are sent by cellphones/pagers etc. since those are often use
> the phone number as the user name.

This definitely matches a lot of mobile phones that send SMS to
mailboxes; examples are in my from_table. I would not recommend using
this regex.

>
> This and the other patch may very well need tweaking, so it would be
> useful if people could look at their from_awl tables and see how things
> are looking.
>
> Jeff
>
>
> -------------------------------------------------------
> SF.Net email is sponsored by:
> Tame your development challenges with Apache's Geronimo App Server. Download
> it for free - -and be entered to win a 42" plasma tv or your very own
> Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
> _______________________________________________
> Sqlgrey-users mailing list
> Sql...@li...
> https://lists.sourceforge.net/lists/listinfo/sqlgrey-users
>

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum   !   <mailto:St...@lr...>
Barer Str. 21           !   Fax: +49 89 2809460
80333 Muenchen, Germany !   Tel: +49 89 289-28840

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Michael S. <Mic...@lr...> - 2005-09-23 12:27:14

On Wed, 21 Sep 2005, Lionel Bouton wrote:

> Jeff Rice wrote the following on 21.09.2005 00:14 :
>
> >Hi,
> >Just thought I would share a small patch that deals with a number of
> >single-use email addresses that weren't being recognized by the existing
> >regex in sqlgrey.  These are the sort of bounce-return-12310123981, etc.
> > This patch just tries to mask the parts that appear to be unique, so
> >the database doesn't get filled with addresses that won't be used again.
> >
> >I somewhat arbitrarily decided that if an email name contained a
> >delimiter such as "-","_", or "." along with a string of 12 or more
> >alphanumeric characters, then those characters should be masked.  That
> >may or may not result in some emails being masked when they should not,
> >or some not being masked when they should.  I don't believe the result
> >will be tragic in either case, and this can be adjusted to your liking.
> >
> >It might not work as well for other folks, but it seems to catch the
> >major ones I see.  I am sure there are other patterns that I didn't
> >catch simply because they don't come up frequently in my email mix.
> >
> >Jeff
> >
> >
>
> Thanks, added in the 1.7.x branch, will be in 1.7.2. Comments below in
> the patch.
>
> >--- sqlgrey 2005-09-03 01:09:21.000296554 +0000
> >+++ /usr/sbin/sqlgrey   2005-09-03 01:09:02.000989883 +0000
> >@@ -986,14 +986,21 @@
> >     $user =~ s/^srs1=[^=]+=([^=]+)(=+)[^=]+=[^=]+=([^=]+)=([^=]+)$/srs1=#=$1$2#=#=$3=$4/;
> >     # strip extension, used sometimes for mailing-list VERP
> >     $user =~ s/\+.*//;
> >+
> >+    # strip frequently used bounce/return masks
> >+    $user =~ s/((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z-_.]+$/$1#/gi;   # Added by JR
> >+
> >
> >
>
> Good, I believe this is useful. Note: the case insensitive match isn't
> needed. All addresses are lowercased before being processed. I removed
> it from all your substitution.

A change in deverp_user should be conservative. It should be carefully
crafted to only match onetime senders but not regular ones. The regex
above is too broad. It destroys the structure of a lot of bounce addresses
but will give you no advantage.

From the data in my from_awl (400 000 entries), the following regex
would indeed bring an advantage, because emails from the provider
cheetahmail.com (email domain b.ABCDEF.chtah.com and others) are not
matched at the moment:

$user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/g;

However, it will also match a lot of spam mails. The rest of the regex will
bring nearly nothing in my case. I do not have a single notice-reply entry in
any of the tables, and only a handful of notice_return entries from a
provider at network 65.160.234.


You can check your database with

select src,sender_domain,sender_name
from from_awl
where
sender_name rlike "((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$"
and not
(sender_name rlike "^(bo|bounce)-[0-9a-z]+$")
order by src,sender_domain,sender_name;
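
If you would rather test against an exported list of sender names than in the database, the same filter can be sketched in Python. This is only an illustration: the function name only_broad and the sample names are hypothetical, and re.search is used because rlike matches anywhere in the string unless the pattern is anchored.

```python
import re

# The two patterns from the query above, unchanged.
BROAD = re.compile(r"((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$")
NARROW = re.compile(r"^(bo|bounce)-[0-9a-z]+$")

def only_broad(sender_names):
    """Names the broad pattern would rewrite but the narrow one leaves alone."""
    return [n for n in sender_names if BROAD.search(n) and not NARROW.search(n)]

# Invented sample names, not real from_awl data.
sample = ["bounce-4711abc", "bo-99x", "jimbo-smith", "bounce-list.owner", "alice"]
print(only_broad(sample))
```

Because the broad pattern is not anchored at the start, a name that merely ends in "bo" before a delimiter is caught too.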

>
> >     # strip hexadecimal sequences (doable in one regexp ?)
> >     # don't strip a leading hex sequence though
> >     my $tmp = '';
> >     while ($tmp ne $user) {
> >    $tmp = $user;
> >    $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
> >-    }
> >+   $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi;                                # Added by JR
> >
> >
>
> 12 is arbitrary but seems good to me. I'm not sure how this one will
> play out in the wild (this is why I prefer to put this code in the 1.7.x
> branch).
>
> >+
> >+   }
> >     $user =~ s/([._-])[0-9a-f]+$/$1#/g;
> >+    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi;                                       # Added by JR
> >
> >
>

And I do not like this either. It matches many more other addresses than
one-time senders. Checking for hashes is better and less error-prone.


> OK
>
> Lionel.
>

Michael Storz

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Lionel B. <lio...@bo...> - 2005-09-23 12:46:02

Michael Storz wrote the following on 23.09.2005 14:26 :

>
>From the data in my from_awl (400 000 entries), the following regex
>would indeed bring an advantage because emails from provider
>cheetahmail.com (email domain b.ABCDEF.chtah.com and others) are not
>matched at the moment:
>
>$user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/g
>
>However it will also match a lot of spam mails. The rest of the regex will
>bring nearly nothing in my case. I have no single entry of notice-reply in
>any of the tables and only a handful of notice_return entries from
>provider at network 65.160.234.
>
>
>You can check your database with
>
>select src,sender_domain,sender_name
>from from_awl
>where
>sender_name rlike "((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$"
>and not
>(sender_name rlike "^(bo|bounce)-[0-9a-z]+$")
>order by src,sender_domain,sender_name;
>
>  
>
>>>    # strip hexadecimal sequences (doable in one regexp ?)
>>>    # don't strip a leading hex sequence though
>>>    my $tmp = '';
>>>    while ($tmp ne $user) {
>>>   $tmp = $user;
>>>   $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
>>>-    }
>>>+   $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi;                                # Added by JR
>>>
>>>
>>>      
>>>
>>12 is arbitrary but seems good to me. I'm not sure how this one will
>>play out in the wild (this is why I prefer to put this code in the 1.7.x
>>branch).
>>
>>    
>>
>>>+
>>>+   }
>>>    $user =~ s/([._-])[0-9a-f]+$/$1#/g;
>>>+    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi;                                       # Added by JR
>>>
>>>
>>>      
>>>
>
>And I do not like this either. It matches much more other addresses than
>onetime senders. Checking for hashes is better and not so error prone.
>
>  
>

Thanks for the input. Having real-life data is the best. Jeff, do you 
have any stats on your own traffic?

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: <py...@fi...> - 2005-09-24 06:02:46

Michael Storz writes: 

> A change in deverp_user should be conservative. It should be carefully
> crafted to only match onetime senders but not regular ones. The regex
> above is too broad. It destroys the structure of a lot of bounce addresses
> but will give you no advantage.

It gives an advantage in my mix.  Obviously, I cannot speak for yours.  If 
it was catching regular emails in my tables, I would not be using it. 

> From the data in my from_awl (400 000 entries), the following regex
> would indeed bring an advantage because emails from provider
> cheetahmail.com (email domain b.ABCDEF.chtah.com and others) are not
> matched at the moment: 
> 
> $user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/g 
> 
> However it will also match a lot of spam mails. The rest of the regex will
> bring nearly nothing in my case. I have no single entry of notice-reply in
> any of the tables and only a handful of notice_return entries from
> provider at network 65.160.234.

I would not have added notice-reply or notice-return had they not appeared 
in my from_awl.  The point here is to try to match a variety of mixtures.  
They appear in my tables, otherwise I would not have suggested them.  If 
they do not appear in yours, then there is little penalty for including 
them. 

Your comment regarding this regex matching spam puzzles me.  If it is in 
your from_awl, then it already got through, hashes or not.  Maybe I 
misunderstood?  If it is in the from_awl, we know the server will retry, and 
whether the regex matches or not seems irrelevant. This seems to be a 
case for filtering that greylisting cannot address. (RBLs?) 

I can only test a regex against the email I see at my server.  This was the point in 
putting the regex up for discussion -- so that people could comment based on 
what they receive.  Regardless, individual tweaks are simple, so I don't 
particularly care if these end up in the main branch.  Had the changes not 
been beneficial for the emails I receive, I would not have put them forward 
or be using them on my server. 

I agree with your comment on the second email I sent -- it is far too broad 
and only seems to catch a small number of emails I see. 

I have rewritten the regexes a bit more, to do away with the loop.  If anyone 
is interested in testing this on their email mix, I would be happy to share 
it. 

Jeff

regex matching spam (was Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex)

From: Michael S. <Mic...@lr...> - 2005-09-27 11:05:38

On Fri, 23 Sep 2005 py...@fi... wrote:

>
> Your comment regarding this regex matching spam puzzles me.  If it is in
> your from_awl, then it already got through, hashes or not.  Maybe I
> misunderstood?  If it is in the from_awl, we know the server will retry and
> whether the regex matches or not seem irrelevent. This case seems to be a
> case for filtering that greylisting cannot address. (RBLs?)
>

Hi Jeff,

you are right. This means that spam was accepted by our relays. These spam
mails are coming from MTAs operated by spammers. Most of the ip addresses
can be identified by sbl.spamhaus.org. Unfortunately, in Germany you are
not allowed to generally block MTAs. This has to be the decision of the
end user.

My comment was that we do not have to make the regex more complicated if all
we get is faster spam acceptance.

Michael Storz

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Michael S. <Mic...@lr...> - 2005-09-27 11:13:46

On Fri, 23 Sep 2005 py...@fi... wrote:

> I have rewriten the regexs a bit more, to do away with the loop.  If anyone
> is interested in testing this on their email mix, I would be happy to share
> it.
>
> Jeff
>

Sure, I would like to see what you changed. Actually, I use my own version
of deverp_user, which includes a patch about metachars for the first
version I sent Lionel and which he included in 1.7.1. It does not include
the bo|bounce matching yet.

Michael Storz
-------------------------------------------------------------------------------
sub deverp_user {
    my ($user, $rcpt) = @_;

    ### Try to match single-use addresses
    # SRS (first and subsequent levels of forwarding)
    $user =~ s/^srs0=[^=]+=[^=]+=([^=]+)=([^=]+)$/srs0=#=#=$1=$2/;
    $user =~ s/^srs1=[^=]+=([^=]+)(=+)[^=]+=[^=]+=([^=]+)=([^=]+)$/srs1=#=$1$2#=#=$3=$4/;

    ### strip extension, used sometimes for mailing-list VERP
    $user =~ s/\+.*//;

    ### eliminate recipient put in originator
    my $dot_sep_re = '[\.\*-]+';
    my $at_sep_re = '[=\?\*~\.]+';
    my ($rcpt_lhs, $rcpt_rhs) = split /\@/, $rcpt, 2;

    # quote all pattern metacharacters and replace '.' with match of possible separators
    $rcpt_lhs = join $dot_sep_re, map { "\Q$_\E"}  split /\./, $rcpt_lhs;
    $rcpt_rhs = join $dot_sep_re, map { "\Q$_\E"}  split /\./, $rcpt_rhs;

    # build pattern with the 3 alternatives to match recipient in originator
    # BATV implementations use third or first alternative (first by abuse.net)
    my $pat = qr/$rcpt_lhs$at_sep_re$rcpt_rhs|$rcpt_rhs$at_sep_re$rcpt_lhs|$rcpt_lhs/;

    # replace address with capital RCPT to be safe with deletes
    # (MySQL matches case insensitive unfortunately)
    $user =~ s/(?<=[\*=\.-])$pat|$pat(?=[\*=\.-])/RCPT/;

    ### strip hexadecimal sequences
    # at the beginning only if user will contain at least 4 consecutive alpha chars
    $user =~ s/^[0-9a-f]{2,}(?=[._\/=-].*[a-zA-Z]{4,})|(?<=[._\/=-])[0-9a-f]+(?=[._\/=-]|$)/#/g;

    #### big german list provider fagms.de, Falk eSolution
    $user =~ s/-emid[0-9a-z]+$/-emid/;

    return $user;
} # deverp_user
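
To see what the hexadecimal step in the sub above does on its own, here is a rough Python transcription of just that one substitution. It is a sketch for experimenting, not part of sqlgrey: the function name strip_hex and the sample names are invented, and the other steps (SRS, extension, RCPT, -emid) are left out.

```python
import re

# Same pattern as the "strip hexadecimal sequences" line, split for readability.
HEX = re.compile(
    # leading hex run of 2+ chars, only when a separator and later
    # 4 or more consecutive letters follow
    r"^[0-9a-f]{2,}(?=[._/=-].*[a-zA-Z]{4,})"
    # hex run that follows a separator and is followed by a separator or the end
    r"|(?<=[._/=-])[0-9a-f]+(?=[._/=-]|$)"
)

def strip_hex(user: str) -> str:
    """Replace hex runs with '#', following the rules of the pattern above."""
    return HEX.sub("#", user)

# Invented sample names for illustration only.
for name in ("bounce-4f3a9c-list", "deadbeef.news", "cafe-12", "list-owner"):
    print(name, "->", strip_hex(name))
```

The third sample keeps its leading "cafe" because no run of four letters follows the separator, which is the guard described in the comment of the original sub.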

Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

From: Michael S. <Mic...@lr...> - 2005-09-27 11:18:52

On Fri, 23 Sep 2005 py...@fi... wrote:

>
> I would not have added notice-reply or notice-return had they not appeared
> in my from_awl.  The point here is to try to match a variety of mixtures.
> They appear in my tables, otherwise I would not have suggested them.  If
> they does not appear in yours, then there is little penalty for including
> them.
>

The question for me was how many of these entries you see in your
from_awl. If there are only a handful, then it is not worth including
them in a regex. Matching all the one-time mailings I see in my
from_awl would make the regex huge. On the other hand, if the regex matches
hundreds of entries (without matching other addresses), then sure, I would
include such a regex.

Michael Storz