Re: [Sqlgrey-users] Improved dynamic/one-shot email address regex

Michael Storz wrote the following on 23.09.2005 14:26 :

>
>From the data in my from_awl (400 000 entries), the following regex
>would indeed be an advantage, because emails from the provider
>cheetahmail.com (email domain b.ABCDEF.chtah.com and others) are not
>matched at the moment:
>
>$user =~ s/^(bo|bounce)-[0-9a-z]+$/$1-#/g
>
>However, it will also match a lot of spam. The rest of the regex gains
>almost nothing in my case: I do not have a single notice-reply entry in
>any of the tables, and only a handful of notice_return entries from a
>provider on network 65.160.234.
>
>
>You can check your database with
>
>select src,sender_domain,sender_name
>from from_awl
>where
>sender_name rlike "((bo|bounce|notice-return|notice-reply)[._-])[0-9a-z_.-]+$"
>and not
>(sender_name rlike "^(bo|bounce)-[0-9a-z]+$")
>order by src,sender_domain,sender_name;
>
>  
>
>>>    # strip hexadecimal sequences (doable in one regexp ?)
>>>    # don't strip a leading hex sequence though
>>>    my $tmp = '';
>>>    while ($tmp ne $user) {
>>>   $tmp = $user;
>>>   $user =~ s/([._-])[0-9a-f]+([._-])/$1#$2/g;
>>>-    }
>>>+   $user =~ s/([._-])[0-9a-z]{12,}([._-])/$1#$2/gi;                                # Added by JR
>>>
>>>
>>>      
>>>
>>12 is arbitrary but seems good to me. I'm not sure how this one will
>>play out in the wild (this is why I prefer to put this code in the 1.7.x
>>branch).
>>
>>    
>>
>>>+
>>>+   }
>>>    $user =~ s/([._-])[0-9a-f]+$/$1#/g;
>>>+    $user =~ s/([._-])[0-9a-z]{12,}$/$1#/gi;                                       # Added by JR
>>>
>>>
>>>      
>>>
>
>And I do not like this either. It matches many more addresses than just
>one-time senders. Checking for hashes is better and less error-prone.
>
>  
>

Thanks for the input. Real-life data is the best guide here. Jeff, do you
have any stats on your own traffic?
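
For anyone who wants to try the patterns quoted above against their own
sender names outside sqlgrey, here is a minimal sketch. It is a Python
transliteration of the three substitutions, not sqlgrey's actual Perl code;
the function name normalize and the sample addresses are made up for
illustration.

```python
import re

# Transliteration of the Perl substitutions quoted in this thread.
# The names below are hypothetical and do not exist in sqlgrey.
BOUNCE_RE = re.compile(r"^(bo|bounce)-[0-9a-z]+$")
LONG_TOKEN_RE = re.compile(r"([._-])[0-9a-z]{12,}([._-])", re.IGNORECASE)
LONG_TAIL_RE = re.compile(r"([._-])[0-9a-z]{12,}$", re.IGNORECASE)


def normalize(user):
    """Replace the variable part of a sender local part with '#'."""
    # bo-XXXX / bounce-XXXX, as proposed for the cheetahmail-style senders
    user = BOUNCE_RE.sub(r"\1-#", user)
    # any run of 12 or more alphanumerics between two separators
    user = LONG_TOKEN_RE.sub(r"\1#\2", user)
    # any run of 12 or more alphanumerics after a separator at the end
    user = LONG_TAIL_RE.sub(r"\1#", user)
    return user


if __name__ == "__main__":
    for sample in ("bounce-12ab34", "newsletter-administration", "john.smith"):
        print(sample, "->", normalize(sample))
```

The second sample shows the concern raised above: an ordinary word of 12
or more letters after a separator is collapsed as well, so unrelated
addresses can end up sharing one normalized form.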