Expression to remove text and symbol

Anonymous
2013-09-11
2013-09-19
  • Anonymous - 2013-09-11

    I'm struggling with how to strip out the alphanumeric part and the @ symbol from a string of text (an email address).

    For example:
    I'll have email address: user456@domain.com

    I want to remove user456@ and just have the domain.com remaining.

    How would I go about doing this?

     
  • Evan Burkitt

    Evan Burkitt - 2013-09-11

    Use a regular expression replace. While the official RE for matching valid emails is pretty hairy, something like "[^ @]+@" (not including the quotes) would probably be sufficient to match the part you want to delete (replace with empty string). If you're replacing dozens use Find Next and visually verify the selection. If you're doing thousands, you'll probably want to save a copy of the original and use a diff tool to ensure nothing got damaged by an erroneous match.
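
As a rough illustration outside Notepad++, the same replace can be tried with Python's standard re module. This is only a sketch: the helper name strip_local_part is made up for the example, and Python's engine is not the one Notepad++ uses, although this simple pattern behaves the same way in both.

```python
import re

# Evan's suggested pattern: one or more characters that are neither
# a space nor "@", followed by "@".
LOCAL_PART_AND_AT = re.compile(r"[^ @]+@")


def strip_local_part(text):
    """Delete everything up to and including the @ of each address."""
    return LOCAL_PART_AND_AT.sub("", text)


print(strip_local_part("user456@domain.com"))  # domain.com
```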

     
  • THEVENOT Guy

    THEVENOT Guy - 2013-09-12

    Hello Steves and Evan,

    Your regular expression is quite correct, Evan, but don't forget that a negated character class such as [^ @], with a space before the @, means exactly what is written:

    Find any single character other than a space or the @ character. So, for example, any EOL character, such as \r or \n, can also match!

    Then suppose a file, with the two successive lines :

    user456@domain.com
    user123@xxx.com

    With your current regex, the search first finds the string user456@ and then the string domain.com\r\nuser123@, which is a wrong result!

    Moreover, we should expect the case of two e-mail addresses separated by tabs instead of spaces.

    So, a better search:replacement could be :

    SEARCH : [^@ \t\r\n]+@
    REPLACE : Nothing
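
The difference between the two character classes can be checked with a short Python sketch (Python's re module is an assumption here, standing in for the Notepad++ search dialog). It runs both patterns over the two-line sample with a Windows line ending.

```python
import re

text = "user456@domain.com\r\nuser123@xxx.com"

# Original class: only space and "@" are excluded, so "\r\n" can match.
loose = re.findall(r"[^ @]+@", text)

# Stricter class: tabs and line-break characters are excluded too.
strict = re.findall(r"[^@ \t\r\n]+@", text)

print(loose)   # ['user456@', 'domain.com\r\nuser123@']
print(strict)  # ['user456@', 'user123@']
```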

    But, personally, I prefer the simple SEARCH regex \w+@, where \w represents any word character ( = [0-9A-Za-z_] or any accented letter )

    Best Regards,

    guy038

     
    Last edit: THEVENOT Guy 2013-09-12
    • Neomi

      Neomi - 2013-09-12

      You wouldn't match my address with word characters, it contains a dash. Dots are also quite common. Well, it seems there are many more possible characters:
      http://en.wikipedia.org/wiki/Email_address#Local_part

      Just out of curiosity: is there an elegant way (other than branching into two alternate sub expressions) to construct a regex that handles these quoted cases with their special characters too? I have basic knowledge about regular expressions and use them every now and then, but not often enough to know all the good tricks.

       
  • THEVENOT Guy

    THEVENOT Guy - 2013-09-14

    Hi Steves, Evan and Neomi,

    @ Neomi

    You are quite right: my last regex \w+@ was much too simple and matched only the basic form given by Steves. Even my personal e-mail address wouldn't be matched! ( guy.038@wanadoo.fr )
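
To see why \w+@ falls short, here is a small Python sketch (Python's re module is assumed, not Notepad++) that applies it to the address quoted above and compares it with the whitespace-excluding class given earlier in the thread.

```python
import re

address = "guy.038@wanadoo.fr"

# \w does not match ".", so only the last dot-free run before "@" is removed.
partial = re.sub(r"\w+@", "", address)

# The class that excludes "@" and whitespace removes the whole local part.
full = re.sub(r"[^@ \t\r\n]+@", "", address)

print(partial)  # guy.wanadoo.fr
print(full)     # wanadoo.fr
```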

    I just realize that I didn't answer exactly to your last post ! I must go out but I'll be back in two hours. See you later !


    @ Steves, Evan, Neomi and All

    So, once I read the Wiki article on the local part of an e-mail address, I deduced some points :

    • Building a regular expression to match every case of the legal local and/or domain part syntax would be an enormous and nasty task !!

    • Moreover, some features like comments, IP addresses and quoted strings/characters are not commonly used in an e-mail address

    • Best is the enemy of Good ( French proverb )

    Thus, I decided to forget local parts containing quoted strings with special characters and/or comments and/or IP addresses between square brackets :)


    So, with the help of Wikipedia documentation, I was able to construct decent regular expressions which :

    • match most of the current local and domain parts

    • recognize accented Latin, Greek, Cyrillic, Hebrew and Arabic letters, if your file is Unicode-encoded

    • find e-mail addresses enclosed in specific characters, like parentheses, or in standard delimiters, like spaces

    IMPORTANT : The "Match case" box must be checked => Search is case sensitive


    SEARCH of the local part ONLY :

    (^|(?<=(\h|[,:<([])))([\w!#-'*+/=?^`{-~-]+)(\.(?3))*(?=@)
    


    SEARCH of the domain part ONLY :

    (?<=@)([\l\d-]+\.)+[a-z]{2,6}((?=(\h|[])>,.]))|$)
    


    BUT we can still improve the detection of the TLD ( Top-Level Domain ), at the very end of the e-mail address: it's a two-letter country code OR a specific generic top-level domain.

    See the web site http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

    So, we obtain :

    SEARCH of the domain part ONLY :

    (?<=@)([\l\d-]+\.)+([a-z]{2}|com|net|org|aero|asia|biz|cat|coop|
    edu|gov|info|int|jobs|mil|mobi|museum|name|post|pro|tel|travel|
    xxx)((?=(\h|[])>,.]))|$)
    

    Notes :

    • The last regex is split over three lines because, in the old IE6 browser of my old XP home computer, it's impossible to see the whole regex. Of course, this regex is a one-line expression !

    • The first part (^|(?<=(\h|[,:<([]))), of the local part regex, is an assertion, which requires that the local part begins a line OR is preceded by a horizontal blank : space (SP), tabulation (TAB) or No-Break space (NBSP) OR by one of the five characters , : < ( [

    • The final part (?=@), of the local part regex, is a lookahead, which must be satisfied, but whose text is NOT part of the match.

    In the same way :

    • The first part (?<=@), of the domain part regex, is a lookbehind, which must be satisfied, but whose text is NOT part of the match.

    • The final part ((?=(\h|[])>,.]))|$), of the domain part regex, is an assertion, which requires that the domain part is followed by a horizontal blank : space (SP), tabulation (TAB) or No-Break space (NBSP) OR by one of the five characters ] ) > , . OR ends a line

    • Moreover :

    In the local part regex, you noticed the form (?3). It stands for the regex of the third group :

    [\w!#-'*+/=?^`{-~-]+
    


    ( a NON-EMPTY string of ANY characters allowed in the local part )
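
The lookahead and lookbehind described in the notes above can be tried in isolation with Python's re module. This is only a sketch: the two character classes are simplified stand-ins chosen for the example (an assumption, not the classes from the post), because Python's engine does not know \h or \l.

```python
import re

text = "Contact (test@entity.com) for details"

# Lookahead: "@" must follow, but it is not included in the match.
local = re.search(r"[\w.+-]+(?=@)", text)

# Lookbehind: "@" must precede, but it is not included in the match.
domain = re.search(r"(?<=@)[a-z\d-]+(?:\.[a-z\d-]+)+", text)

print(local.group())   # test
print(domain.group())  # entity.com
```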


    VERY IMPORTANT :

    Don't confuse the TWO regexes, let's say, (\d+)_\1 and (\d+)_(?1)

    • In the first regex, the \1 back reference represents the exact text matched by group 1

    • In the second regex, the (?1) group reference represents the regex of group 1 itself, so \d+

    You can easily see the difference if you apply each of these two regexes on the subject string below :

    123_123 | 123456_789 | 123456_123456 | 456_123789
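
For readers without Notepad++ at hand, the difference can be reproduced approximately in Python. Note the assumption: the standard re module has no (?1) syntax, so the sketch imitates it by writing the pattern of group 1 a second time. The third-party regex package does accept (?1), but it is not used here.

```python
import re

subject = "123_123 | 123456_789 | 123456_123456 | 456_123789"

GROUP_1 = r"\d+"

# Back reference: \1 must repeat the exact text captured by group 1.
backref = re.compile(rf"({GROUP_1})_\1")

# Stand-in for (\d+)_(?1): the pattern of group 1 is written out again.
subroutine = re.compile(rf"({GROUP_1})_(?:{GROUP_1})")

backref_hits = [m.group(0) for m in backref.finditer(subject)]
subroutine_hits = [m.group(0) for m in subroutine.finditer(subject)]

print(backref_hits)     # ['123_123', '123456_123456']
print(subroutine_hits)  # ['123_123', '123456_789', '123456_123456', '456_123789']
```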


    Finally, to search for a complete legal e-mail address ( = local part@domain part ), we just have to juxtapose the local and domain parts, with the @ symbol between.

    SEARCH of a legal e-mail address :

    (^|(?<=(\h|[,:<([])))([\w!#-'*+/=?^`{-~-]+)(\.(?3))*@([\l\d-]+
    \.)+([a-z]{2}|com|net|org|aero|asia|biz|cat|coop|edu|gov|info|
    int|jobs|mil|mobi|museum|name|post|pro|tel|travel|xxx)
    ((?=(\h|[])>,.]))|$)
    


    Of course, you can get rid of the delimiter assertions, at the beginning and end of this regex, in order to get more e-mail addresses !

    ([\w!#-'*+/=?^`{-~-]+)(\.(?1))*@([\l\d-]+\.)+(com|net|org|aero|
    asia|biz|cat|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|
    post|pro|tel|travel|xxx|[a-z]{2})
    

    Examples :

    niceandsimple@example.com
    very.common@12345.com
    Vacances-d'été@club-med.fr
    a.little-lengthy.but.fine@dept.example.museum
    style.email.+symbols!#$%&'*+-/=?^_{}|~@a-small-example.fr

    With delimiters :

    This is a test@entity.com correct e-mail address
    (test@entity.com)
    [test@entity.net]
    <test@entity.org>
    ,test@entity.edu,
    :test@entity.xxx.
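
A possible Python port of the delimiter-free regex is sketched below, for anyone who wants to test the sample addresses outside Notepad++. Two substitutions are assumptions of this sketch: (?1) is replaced by repeating the character class, and \l is replaced by an explicit a-z range. The names ATOM, TLDS, EMAIL and find_emails are invented for the example. Python matching is case sensitive by default, which corresponds to the "Match case" requirement.

```python
import re

# Characters allowed in an unquoted local part, copied from the post's class.
ATOM = r"[\w!#-'*+/=?^`{-~-]+"

# Generic top-level domains first, two-letter country codes last.
TLDS = ("com|net|org|aero|asia|biz|cat|coop|edu|gov|info|int|jobs|mil|mobi|"
        "museum|name|post|pro|tel|travel|xxx|[a-z]{2}")

EMAIL = re.compile(
    ATOM + r"(?:\." + ATOM + r")*@(?:[a-z\d-]+\.)+(?:" + TLDS + r")"
)


def find_emails(text):
    """Return every substring accepted by the simplified pattern."""
    return EMAIL.findall(text)


print(find_emails("This is a test@entity.com correct e-mail address"))
# ['test@entity.com']
print(find_emails("(test@entity.com) [test@entity.net] <test@entity.org>"))
# ['test@entity.com', 'test@entity.net', 'test@entity.org']
```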

    Whaaaaaaou ! I'm a bit tired, now ! Hope that's useful, somehow.

    Cheers,

    guy038

    Article finished on 14/09/2013, at 21h21 ( French time zone )

     
    Last edit: THEVENOT Guy 2013-09-14
  • THEVENOT Guy

    THEVENOT Guy - 2013-09-15

    Hello Neomi and All,

    Let's go on and try to find a regex that matches almost every form of a valid e-mail address. Aaaahhhh !

    Of course, this assumes a Unicode version of Notepad++, that is, version 6.0 or later


    @ Neomi

    Here are some general regexes for punctuation signs :

    • [[:punct:]] finds ANY punctuation sign, between \x00 and \xff

    • [^[:^punct:]\x80-\xff] finds ANY punctuation sign, between \x00 and \x7f. Notice the double negation, which means : NOT ( NOT a punctuation sign OR a character ABOVE \x7f )

    • [!-/:-@\[-`{-~] is identical and finds ANY punctuation sign, between \x21 and \x7e, both included. This class is the union, between square brackets, of four intervals : from ! to /, from : to @, from [ to ` and from { to ~
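
The four-interval class can be checked against Python's own list of ASCII punctuation, string.punctuation. This is only a sketch in Python (an assumption, since the thread is about Notepad++); the POSIX forms such as [[:punct:]] are not available in Python's re module, so only the explicit class is tested.

```python
import re
import string

# Four ASCII intervals: ! to /, : to @, [ to `, { to ~
ASCII_PUNCT = re.compile(r"[!-/:-@\[-`{-~]")

# Collect every ASCII character accepted by the class.
matched = {chr(code) for code in range(128) if ASCII_PUNCT.fullmatch(chr(code))}

print(matched == set(string.punctuation))  # True
print(len(matched))                        # 32
```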


    @ Neomi and All

    Let's go back to our e-mail address problem. So, let's define three sets of characters :

    Set1 = [\w!#-'*+/=?^`{-~-]+
    Set2 = ([\w !#-,/:-@\]-`{-~[-]|\\["\\])+
    Set3 = "([\w !#-,./:-@\]-`{-~[-]|\\["\\])+"
    


    Set1 is a string, composed of ALL allowed char., except DOT
    Set2 is a string, composed of ALL allowed + special char. OR \" OR \\ except DOT and UNIQUE " or \
    Set3 is a string, composed of ALL allowed + special char. OR \" OR \\ except UNIQUE " or \ , and surrounded by two double quotes "

    After some investigations ( and some hours ! ), I think that a general regex, relative to the local part, with the initial assertion and the final lookahead, could be :

    (^|(?<=(\h|[,:<([])))("Set2(\.Set2)*"|Set1((\.(Set1|Set3))*\.Set1)?)(?=@)

    With that syntax, management of quoted strings seems OK. So, e-mail addresses like abc."def".ghi@example.com or "abc.def.ghi"@example.com will be correct !
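
The composition of the three sets can also be sketched in Python, as a way to test candidate local parts without retyping the long regex. The assumptions here: capturing groups are made non-capturing, the leading assertion and the final lookahead are dropped so that the whole candidate string is checked, and the names SET1, SET2, SET3, LOCAL_PART and is_local_part are invented for the example.

```python
import re

# The three sets from the post, with non-capturing groups.
SET1 = r"[\w!#-'*+/=?^`{-~-]+"
SET2 = r'(?:[\w !#-,/:-@\]-`{-~[-]|\\["\\])+'
SET3 = r'"(?:[\w !#-,./:-@\]-`{-~[-]|\\["\\])+"'

# Plain concatenation is used because the sets contain literal braces,
# which would clash with f-string placeholders.
LOCAL_PART = re.compile(
    '(?:"' + SET2 + r"(?:\." + SET2 + ')*"'
    + "|" + SET1 + r"(?:(?:\.(?:" + SET1 + "|" + SET3 + r"))*\." + SET1 + ")?)"
)


def is_local_part(candidate):
    """True when the whole candidate is accepted by the composed pattern."""
    return LOCAL_PART.fullmatch(candidate) is not None


for candidate in ['abc."def".ghi', '"abc.def.ghi"', 'just"not"right']:
    print(candidate, is_local_part(candidate))
```

The first two candidates are accepted and the third is rejected, which agrees with the behaviour described in the post.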

    So, if you replace the 3 sets by their value, we obtain the regex :

    (^|(?<=(\h|[,:<([])))("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.
    ([\w !#-,/:-@\]-`{-~[-]|\\["\\])+)*"|[\w!#-'*+/=?^`{-~-]+((\.
    ([\w!#-'*+/=?^`{-~-]+|"([\w !#-,./:-@\]-`{-~[-]|\\["\\])+"))*
    \.[\w!#-'*+/=?^`{-~-]+)?)(?=@)
    


    As said in my previous post, this regex must be rewritten as a one-line expression !

    But, if we use some group references (?n) that refer to a previous group n, this new regex, for the search of the local part of an e-mail address, can be shortened to :

    (^|(?<=(\h|[,:<([])))("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.
    (?4)+)*"|([\w!#-'*+/=?^`{-~-])+((\.((?6)+|"([\w !#-,./:-@\]-`
    {-~[-]|\\["\\])+"))*\.(?6)+)?)(?=@)
    


    Now, concerning the regex for the domain part, we just change the beginning ([\l\d-]+\.)+ into (([\l\d-]+\.)+)? so that a domain made of the top-level domain ONLY is also matched; the top-level domain itself is matched further on in the regex

    Finally, if we add the new regex of the domain part, the complete regex to search for an e-mail address becomes :

    (^|(?<=(\h|[,:<([])))("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.
    (?4)+)*"|([\w!#-'*+/=?^`{-~-])+((\.((?6)+|"([\w !#-,./:-@\]-`
    {-~[-]|\\["\\])+"))*\.(?6)+)?)@(([\l\d-]+\.)+)?([a-z]{2}|com|
    net|org|aero|asia|biz|cat|coop|edu|gov|info|int|jobs|mil|mobi|
    museum|name|post|pro|tel|travel|xxx)((?=(\h|[])>,.]))|$)
    

    You'll find, below, a list of some valid e-mail addresses, which are matched by this enormous regex !

    niceandsimple@example.com
    very.common@12345.com
    a.little-lengthy.but.fine@dept.example.museum
    disposable.style.email.with+symbol@example.com
    !#$%&'*+-/=?^_`{}|~@example.org
    Vacances-d'été@club-med.fr
    style.email.+symbols!#$%&'*+-/=?^_{}|~@a-small-example.fr
    UPPERCASE-letters@example.com
    very.common@com
    very.common@fr
    
    !#$%&' *+-/=?^_`{}|~@example.org (A)
    !#$%&'\ *+-/=?^_`{}|~@example.org (A)
    
    just."not".right@example.com
    just."very.not".right@example.com
    just."very..not".right@example.com (B)
    
    "simple"@example.com
    "sim.ple"@example.com
    "much.more unusual"@example.com
    "very.unusual.@.unusual.com"@example.com
    "very.(),:;<>[]\".VERY.\"very@\\ \"\".unusual"@example.com
    "()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~.a"@example.org
    "UPPERCASE-letters"@example.com
    " "@example.org
    
    very.common@abc.com
    very.common@abc.fr
    very.common@abc.123.com
    very.common@abc.123.fr
    very.common@abc.123.fr.com
    very.common@abc.123.fr.fr
    very.common@abc.123.com
    very.common@abc.123.this.is.the-test.to-do.com
    
    This is a test@entity.com correct e-mail address (C)
    This is a   test@entity.com correct e-mail address (D)
    This is a test@entity.com correct e-mail address (E)
    (test@entity.com) (F)
    [test@entity.net] (G)
    <test@entity.org> (H)
    ,test@entity.edu, (I)
    :test@entity.xxx. (J)
    


    Notes :

    (A) Valid e-mail address, matched from just after the SPACE ( the space itself is excluded )
    (B) This e-mail address shouldn't be valid, because of the two SUCCESSIVE Dots
    (C) E-mail address, surrounded by Spaces
    (D) E-mail address, surrounded by Tabulations
    (E) E-mail address, surrounded by No-Break Spaces
    (F) E-mail address, enclosed in Parentheses
    (G) E-mail address, enclosed in Square Brackets
    (H) E-mail address, enclosed in Angle Brackets
    (I) E-mail address, enclosed in Commas
    (J) E-mail address, enclosed between a Colon and a Dot


    And, to finish, a list of some invalid e-mail addresses, which are correctly NOT found by this regex, except for the (Z) case

    !#$%&'"*+-/=?^_`{}|~@example.org (K)
    !#$%&'\*+-/=?^_`{}|~@example.org (L)
    !#$%&'\"*+-/=?^_`{}|~@example.org (M)
    !#$%&'\\*+-/=?^_`{}|~@example.org (N)
    
    style.email..+symbols!#$%&'*+-/=?^_{}|~@a-small-example.fr (O)
    "very.(),:;<>[]\".VERY..\"very@\\ \"\".unusual"@example.com (P)
    
    Abc.example.com (Q)
    A@b@c@example.com (R)
    a"b(c)d,e:f;g<h>i[j\k]l@example.com (S)
    just"not"right@example.com (T)
    
    just."not".@example.com (U)
    just."not"@.example.com (V)
    ."not".right@example.com (W)
    this is"not\allowed@example.com (X)
    this\ still\"not\\allowed@example.com (Y)
    
    user@[IPv6:2001:db8:1ff::a0b:dbd0] (Z)
    very.common@123 (1)
    very.common@.com (2)
    very.common@.k (3)
    very.common@ABC.def.com (4)
    very.common@abc.DEF.com (4)
    


    Notes :

    (K) A double quote " , NOT between double quotes
    (L) A backslash \ , NOT between double quotes
    (M) An escaped double quote \" , NOT between double quotes
    (N) An escaped backslash \\ , NOT between double quotes

    (O) Two successive dots found
    (P) Two successive dots found

    (Q) The symbol @ is absent
    (R) The symbol @ is located, outside double quotes
    (S) The special symbols are NOT escaped
    (T) The quoted strings don't have a DOT delimiter

    (U) A DOT is located just BEFORE the symbol @
    (V) A DOT is located just AFTER the symbol @
    (W) A DOT found at the BEGINNING of e-mail address
    (X) Space, Double Quote and Backslash are NOT double-quoted nor escaped
    (Y) Space, Double Quote and Backslash are NOT double-quoted

    (Z) This NON-treated case is OK, as it's a domain with an IP address
    (1) Top-Level Domain UNKNOWN
    (2) Top-Level Domain preceded by a dot
    (3) ONE letter country or Top-Level Domain
    (4) UPPERCASE letter found
    (4) UPPERCASE letter found


    You may think : " this guy has ( plenty of ) time to waste !" You would be perfectly right. It was just an exercise to boost my neurons and investigate nasty regexes. Indeed, I will certainly never use this complicated regex !!!

    Cheers,

    guy038

    Article finished on 16/09/2013, at 01h50 ( French time zone )

     
    Last edit: THEVENOT Guy 2013-09-15
  • THEVENOT Guy

    THEVENOT Guy - 2013-09-16

    Hi all,

    To be complete on that topic, I forgot, in my last post, to give the regex for the search of an e-mail address when you don't include the delimiter assertions at the beginning and the end. Indeed, the alternatives must be slightly modified : we need to put all the generic top-level domains BEFORE the country-code alternative [a-z]{2}

    Moreover, as the assertion at the start of the regex is absent, the numbers used in the group references of the form (?x) must be reduced by 2 !
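
Both points can be illustrated with a short Python sketch. The assumptions: Python's re module stands in for Notepad++, the top-level domain list is cut down to three entries, and [ \t\xa0] replaces \h, which Python does not support. The first half shows why the order of the alternatives matters once the end assertion is gone; the second half counts the capturing groups with and without the leading assertion.

```python
import re

address = "test@entity.com"

# Without an end assertion, the first alternative that fits wins.
country_first = re.search(r"@(?:[a-z\d-]+\.)+(?:[a-z]{2}|com|net|org)", address)
generic_first = re.search(r"@(?:[a-z\d-]+\.)+(?:com|net|org|[a-z]{2})", address)

print(country_first.group())  # @entity.co
print(generic_first.group())  # @entity.com

# The leading assertion holds two capturing groups, so removing it
# shifts every later group number down by 2.
with_anchor = re.compile(r"(^|(?<=([ \t\xa0]|[,:<(\[])))(\w+)")
without_anchor = re.compile(r"(\w+)")

print(with_anchor.groups, without_anchor.groups)  # 3 1
```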

    So, the strict regex for e-mail address search, becomes :

    ("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.(?2)+)*"|([\w!#-'*+/=?^`
    {-~-])+((\.((?4)+|"([\w !#-,./:-@\]-`{-~[-]|\\["\\])+"))*\.(?4)
    +)?)@(([\l\d-]+\.)+)?(com|net|org|aero|asia|biz|cat|coop|edu|
    gov|info|int|jobs|mil|mobi|museum|name|post|pro|tel|travel|xxx|
    [a-z]{2})
    


    Of course, as this regex imposes fewer conditions than the complete regex of my previous post, it may match part of an invalid e-mail address, when that part is, by itself, a valid e-mail address !

    End of my delirium !!!!!

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2013-09-16
  • GerdB

    GerdB - 2013-09-18

    Whoa, Guy, I hope you recovered from that one :-)

    A good place to look for any regexes is http://www.regular-expressions.info/examples.html You'll find lots of useful stuff there together with explanations and pros and cons.

    regards
    gerd

     
  • THEVENOT Guy

    THEVENOT Guy - 2013-09-19

    Hi GerdB,

    Thank you for this link, but I already know this excellent Web site :-)

    When I was younger, I worked on Unix servers and I very often used regular expressions with the very old VI editor !

    Indeed, when David Brotherstone included the Perl Compatible Regular Expressions ( PCRE ) in the 6.0 version of N++ ( Many thanks to him, again ), I felt the need for good documentation about PCRE and the modern features of regular expressions.

    So, after a while, I came across the site of Jan Goyvaerts, very complete, which allows everyone to find out all the power and the flexibility of regular expressions.

    If you are interested, I have the FULL English tutorial of this site, in a Word document ( 149 pages and about 2 Megabytes ). So, just send me an e-mail, at my e-mail address
    ( guy.038@wanadoo.fr ), with your own e-mail address.

    I will send you back this Word document, as an attached file.

    Of course, the version of this tutorial corresponds to the beginning of 2011 and it may have been updated !

    Best regards,

    guy038

     
    Last edit: THEVENOT Guy 2013-09-19
