From: Lionel B. <lio...@bo...> - 2005-09-23 12:46:02
Michael Storz wrote the following on 23.09.2005 14:26:

>From the data in my from_awl (400 000 entries), the following regex
>would indeed bring an advantage, because emails from the provider
>cheetahmail.com (email domain b.ABCDEF.chtah.com and others) are not
>matched at the moment:
>
>$user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/g
>
>However, it will also match a lot of spam mails. The rest of the regex will
>bring nearly nothing in my case. I have not a single entry of notice-reply in
>any of the tables and only a handful of notice_return entries from a
>provider at network 65.160.234.
>
>You can check your database with
>
>select src,sender_domain,sender_name
>from from_awl
>where
>sender_name rlike "((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$"
>and not
>(sender_name rlike "^(bo|bounce)-[0-9a-z]+$")
>order by src,sender_domain,sender_name;
>
>>>     # strip hexadecimal sequences (doable in one regexp ?)
>>>     # don't strip a leading hex sequence though
>>>     my $tmp = '';
>>>     while ($tmp ne $user) {
>>>         $tmp = $user;
>>>         $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
>>>-    }
>>>+        $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi; # Added by JR
>>>
>>12 is arbitrary but seems good to me. I'm not sure how this one will
>>play out in the wild (this is why I prefer to put this code in the 1.7.x
>>branch).
>>
>>>+
>>>+    }
>>>     $user =~ s/([._-])[0-9a-f]+$/$1#/g;
>>>+    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi; # Added by JR
>
>And I do not like this either. It matches many more addresses than just
>one-time senders. Checking for hashes is better and not so error prone.

Thanks for the input. Having real-life data is the best. Jeff, do you have
any stats on your own traffic?
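
For anyone who wants to try the three substitutions quoted above on sample addresses, here is a minimal sketch. The sub names (normalize_hex, normalize_alnum12, normalize_bounce) and the sample local parts are made up for illustration and are not part of the actual code; only the regexes themselves are taken from the quoted patch.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the existing hex-only stripping.
sub normalize_hex {
    my ($user) = @_;
    my $tmp = '';
    # Loop until stable: two adjacent sequences share a separator,
    # so a single /g pass can skip every other one.
    while ($tmp ne $user) {
        $tmp = $user;
        $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
    }
    # Trailing hex sequence (a leading one is deliberately kept).
    $user =~ s/([._-])[0-9a-f]+$/$1#/g;
    return $user;
}

# Hypothetical wrapper around the proposed "12 or more alphanumerics" rule.
sub normalize_alnum12 {
    my ($user) = @_;
    my $tmp = '';
    while ($tmp ne $user) {
        $tmp = $user;
        $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi;
    }
    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi;
    return $user;
}

# Hypothetical wrapper around the bo-/bounce- rule.
sub normalize_bounce {
    my ($user) = @_;
    $user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/;
    return $user;
}

# Invented sample local parts, not taken from any real database.
print normalize_hex('bounce-1a2b3c-4d5e6f'), "\n";    # bounce-#-#
print normalize_hex('john.smithjohnson1'), "\n";      # john.smithjohnson1
print normalize_alnum12('john.smithjohnson1'), "\n";  # john.#
print normalize_bounce('bo-abc123'), "\n";            # bo-#
print normalize_bounce('bob-smith'), "\n";            # bob-smith
```

The third line of output shows the kind of collision raised above: an ordinary long local part is collapsed to the same key as a one-time sender, whereas the hex-only rule leaves it alone.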