From: Michael S. <Mic...@lr...> - 2005-09-23 12:27:14
On Wed, 21 Sep 2005, Lionel Bouton wrote:

> Jeff Rice wrote the following on 21.09.2005 00:14 :
>
> >Hi,
> >Just thought I would share a small patch that deals with a number of
> >single-use email addresses that weren't being recognized by the existing
> >regex in sqlgrey. These are the sort of bounce-return-12310123981, etc.
> >This patch just tries to mask the parts that appear to be unique, so
> >the database doesn't get filled with addresses that won't be used again.
> >
> >I somewhat arbitrarily decided that if an email name contained a
> >delimiter such as "-","_", or "." along with a string of 12 or more
> >alphanumeric characters, then those characters should be masked. That
> >may or may not result in some emails being masked when they should not,
> >or some not being masked when they should. I don't believe the result
> >will be tragic in either case, and this can be adjusted to your liking.
> >
> >It might not work as well for other folks, but it seems to catch the
> >major ones I see. I am sure there are other patterns that I didn't
> >catch simply because they don't come up frequently in my email mix.
> >
> >Jeff
> >
>
> Thanks, added in the 1.7.x branch, will be in 1.7.2. Comments below in
> the patch.
>
> >--- sqlgrey 2005-09-03 01:09:21.000296554 +0000
> >+++ /usr/sbin/sqlgrey 2005-09-03 01:09:02.000989883 +0000
> >@@ -986,14 +986,21 @@
> >     $user =~ s/^srs1=[^=]+=([^=]+)(=+)[^=]+=[^=]+=([^=]+)=([^=]+)$/srs1=#=$1$2#=#=$3=$4/;
> >     # strip extension, used sometimes for mailing-list VERP
> >     $user =~ s/\+.*//;
> >+
> >+    # strip frequently used bounce/return masks
> >+    $user =~ s/((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z-_.]+$/$1#/gi; # Added by JR
> >+
>
> Good, I believe this is useful. Note: the case insensitive match isn't
> needed. All addresses are lowercased before being processed. I removed
> it from all your substitution.

A change in deverp_user should be conservative.
It should be carefully crafted to match only one-time senders, not regular
ones. The regex above is too broad: it destroys the structure of a lot of
bounce addresses without giving you any advantage.

From the data in my from_awl (400 000 entries), the following regex would
indeed bring an advantage, because emails from the provider cheetahmail.com
(email domain b.ABCDEF.chtah.com and others) are not matched at the moment:

    $user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/g

However, it will also match a lot of spam mails.

The rest of the regex brings nearly nothing in my case. I do not have a
single entry of notice-reply in any of the tables, and only a handful of
notice_return entries from a provider at network 65.160.234.

You can check your database with

    select src,sender_domain,sender_name from from_awl
    where sender_name rlike "((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$"
    and not (sender_name rlike "^(bo|bounce)-[0-9a-z]+$")
    order by src,sender_domain,sender_name;

> >     # strip hexadecimal sequences (doable in one regexp ?)
> >     # don't strip a leading hex sequence though
> >     my $tmp = '';
> >     while ($tmp ne $user) {
> >         $tmp = $user;
> >         $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
> >-    }
> >+        $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi; # Added by JR
>
> 12 is arbitrary but seems good to me. I'm not sure how this one will
> play out in the wild (this is why I prefer to put this code in the 1.7.x
> branch).
>
> >+
> >+    }
> >     $user =~ s/([._-])[0-9a-f]+$/$1#/g;
> >+    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi; # Added by JR

And I do not like this either. It matches many more addresses than just
one-time senders. Checking for hashes is better and less error-prone.

> OK
>
> Lionel.

Michael Storz
-------------------------------------------------
Leibniz-Rechenzentrum     ! <mailto:St...@lr...>
Barer Str. 21             ! Fax: +49 89 2809460
80333 Muenchen, Germany   ! Tel: +49 89 289-28840
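
For anyone who wants to see the difference between the broad pattern from the patch and the anchored one, here is a minimal standalone Perl sketch. The helper names mask_broad and mask_strict and the sample local parts are made up for illustration and are not part of sqlgrey. It assumes the input is already lowercased, and it writes the character class as [0-9a-z_.-], as in the SQL query, rather than as in the original patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the broad substitution from the patch.
# It is not anchored at the start, so it can fire in the middle of a name.
sub mask_broad {
    my ($user) = @_;
    $user =~ s/((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$/$1#/;
    return $user;
}

# Hypothetical helper: the anchored substitution, which only fires when
# the whole local part is "bo-" or "bounce-" plus one alphanumeric token.
sub mask_strict {
    my ($user) = @_;
    $user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/;
    return $user;
}

# Made-up sample local parts, not taken from any real from_awl table.
for my $user (qw(bounce-12310123981 bo-a1b2c3 jumbo.smith bounce-news.list-42)) {
    printf "%-22s broad: %-12s strict: %s\n",
        $user, mask_broad($user), mask_strict($user);
}
```

With these samples, both versions mask bounce-12310123981 and bo-a1b2c3. Only the broad one rewrites the ordinary-looking jumbo.smith (to jumbo.#) and flattens the structured bounce-news.list-42 to bounce-#. The anchored version leaves both untouched.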