Notepad++ / Discussion / [READ ONLY] Open Discussion: Expression to remove text and symbol

Anonymous - 2013-09-11

I'm struggling with how to strip out the alphanumeric characters and the @ symbol from a string of text (an email address).

For example:
I'll have an email address: user456@domain.com

I want to remove user456@ and just have the domain.com remaining.

How would I go about doing this?


Evan Burkitt - 2013-09-11

Use a regular expression replace. While the official RE for matching valid emails is pretty hairy, something like "[^ @]+@" (not including the quotes) would probably be sufficient to match the part you want to delete (replace with empty string). If you're replacing dozens use Find Next and visually verify the selection. If you're doing thousands, you'll probably want to save a copy of the original and use a diff tool to ensure nothing got damaged by an erroneous match.
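For anyone who would rather script the replacement than use the Replace dialog, here is a minimal sketch of the same idea in Python. It assumes Python's built-in re module, whose syntax is close to, but not identical with, the Notepad++ engine; the name strip_local_part is only illustrative. Returning the number of replacements gives a quick sanity check before comparing the result against the saved original.

```python
import re

# The pattern suggested above: one or more characters that are neither a
# space nor "@", followed by a literal "@".
LOCAL_PART_AND_AT = re.compile(r"[^ @]+@")


def strip_local_part(text):
    """Return (new_text, number_of_replacements)."""
    return LOCAL_PART_AND_AT.subn("", text)


result, count = strip_local_part("user456@domain.com")
print(result, count)  # domain.com 1
```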


THEVENOT Guy - 2013-09-12

Hello Steves and Evan,

Your regular expression is quite correct, Evan, but don't forget that a negated character class such as [^ @], with a space before the @, means exactly what it says :

Find any single character other than a space or the @ character. So, for example, any EOL character, such as \r or \n, can also match !

Then suppose a file, with the two successive lines :

user456@domain.com
user123@xxx.com

With your current regex, the search first finds the string user456@ and then the string domain.com\r\nuser123@, which is a wrong result !

Moreover, we should expect the case of two e-mail addresses separated by tabs, instead of spaces

So, a better search:replacement could be :

SEARCH : [^@ \t\r\n]+@
REPLACE : Nothing

But, personally, I prefer the simple SEARCH regex \w+@, where \w represents any word character ( = [0-9A-Za-z_] or any accented letter )
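The difference between the three search regexes can be reproduced outside Notepad++. The sketch below uses Python's re module as a stand-in (an assumption: its behaviour for these simple patterns matches what is described above) on the two-line sample with a Windows line ending.

```python
import re

# Two addresses on successive lines, separated by a Windows line ending.
text = "user456@domain.com\r\nuser123@xxx.com"

# [^ @] also accepts \r and \n, so the second match runs across the line break.
loose = re.findall(r"[^ @]+@", text)
# Excluding tab, \r and \n keeps every match on its own line.
strict = re.findall(r"[^@ \t\r\n]+@", text)
# \w+ stops at anything that is not a word character.
word = re.findall(r"\w+@", text)

print(loose)   # ['user456@', 'domain.com\r\nuser123@']
print(strict)  # ['user456@', 'user123@']
print(word)    # ['user456@', 'user123@']

cleaned = re.sub(r"[^@ \t\r\n]+@", "", text)
print(repr(cleaned))  # 'domain.com\r\nxxx.com'
```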

Best Regards,

guy038

Last edit: THEVENOT Guy 2013-09-12

- Neomi - 2013-09-12
  
  You wouldn't match my address with word characters, it contains a dash. Dots are also quite common. Well, it seems there are many more possible characters:
  http://en.wikipedia.org/wiki/Email_address#Local_part
  
  Just out of curiosity: is there an elegant way (other than branching into two alternate sub expressions) to construct a regex that handles these quoted cases with their special characters too? I have basic knowledge about regular expressions and use them every now and then, but not often enough to know all the good tricks.
  

THEVENOT Guy - 2013-09-14

Hi Steves, Evan and Neomi,

@ Neomi

You are quite right : my last regex \w+@ was much too simple and matched only the basic form given by Steves. Even my personal e-mail address wouldn't be matched ! ( guy.038@wanadoo.fr )
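To make the failure concrete, here is a small sketch, again with Python's re module standing in for the Notepad++ engine (an assumption), that applies \w+@ to addresses containing a dot or a dash. The second address is made up for the illustration. Only the last run of word characters before the @ is removed, so part of the local part is left behind.

```python
import re


def strip_with_word_chars(address):
    """Remove \\w+@ from the address, as the simple regex would."""
    return re.sub(r"\w+@", "", address)


print(strip_with_word_chars("user456@domain.com"))      # domain.com
print(strip_with_word_chars("guy.038@wanadoo.fr"))      # guy.wanadoo.fr
print(strip_with_word_chars("first-last@example.org"))  # first-example.org
```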

I just realized that I didn't exactly answer your last post ! I must go out but I'll be back in two hours. See you later !

@ Steves, Evan, Neomi and All

So, once I read the Wiki article on the local part of an e-mail address, I deduced some points :

Building a regular expression to match every case of the legal local and/or domain part syntax would be an enormous and nasty matter !!

Moreover, some features like comments, IP addresses and quoted strings/characters are not commonly used in an e-mail address

Best is the enemy of Good ( French proverb )

Thus, I decided to forget local parts containing quoted strings with special characters and/or comments and/or IP addresses between square brackets :)

So, with the help of Wikipedia documentation, I was able to construct decent regular expressions which :

match most of the current local and domain parts

recognize accented Latin, Greek, Cyrillic, Hebrew and Arabic letters, if your file is UNICODE encoded

find e-mail addresses enclosed by specific characters, like parentheses, or by standard delimiters, like spaces

IMPORTANT : The "Match case" box must be checked => Search is case sensitive

SEARCH of the local part ONLY :

(^|(?<=(\h|[,:<([])))([\w!#-'*+/=?^`{-~-]+)(\.(?3))*(?=@)

SEARCH of the domain part ONLY :

(?<=@)([\l\d-]+\.)+[a-z]{2,6}((?=(\h|[])>,.]))|$)

BUT we can still improve the detection of the TLD ( Top-Level Domain ), at the very end of the e-mail address : it's either a two-letter country code OR a specific generic top-level domain.

See the web site http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

So, we obtain :

SEARCH of the domain part ONLY :

(?<=@)([\l\d-]+\.)+([a-z]{2}|com|net|org|aero|asia|biz|cat|coop|
edu|gov|info|int|jobs|mil|mobi|museum|name|post|pro|tel|travel|
xxx)((?=(\h|[])>,.]))|$)

Notes :

The last regex is split across three lines because, in the old IE6 browser of my old XP home computer, it's impossible to see the whole regex. Of course, this regex is a one-line expression !

The first part (^|(?<=(\h|[,:<([]))), of the local part regex, is an assertion which requires that the local part begins a line OR is preceded by a horizontal blank : space (SP), tab (TAB) or No-Break space (NBSP) OR by one of the five characters , : < ( [

The final part (?=@), of the local part regex, is a lookahead, which must be matched but is NOT part of the match.

In the same way :

The first part (?<=@), of the domain part regex, is a lookbehind, which must be matched but is NOT part of the match.

The final part ((?=(\h|[])>,.]))|$), of the domain part regex, is an assertion which requires that the domain part is followed by a horizontal blank : space (SP), tab (TAB) or No-Break space (NBSP) OR by one of the five characters ] ) > , . OR ends a line
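The effect of these assertions, which must hold without being included in the match, can be seen in a short sketch. It uses Python's re module, which is an assumption on my side and not the Notepad++ engine: Python has no \h or \l, so they are spelled out as [ \t\xa0] and [a-z], and the name DOMAIN_ONLY is only illustrative. It follows the first, simpler domain regex with the {2,6} TLD.

```python
import re

# Rough Python equivalent of the "domain part ONLY" search.
# Assumptions: \h -> [ \t\xa0] (space, tab, no-break space), \l -> [a-z].
DOMAIN_ONLY = re.compile(
    r"(?<=@)"                      # lookbehind: an @ must precede, not consumed
    r"(?:[a-z\d-]+\.)+[a-z]{2,6}"  # one or more labels, then a 2-6 letter TLD
    r"(?=[ \t\xa0\])>,.]|$)",      # lookahead: delimiter or end of line, not consumed
    re.MULTILINE,
)

sample = "<test@entity.org>, user456@domain.com"
domains = [m.group(0) for m in DOMAIN_ONLY.finditer(sample)]
print(domains)  # ['entity.org', 'domain.com']
```

Neither the @ nor the closing > appears in the results, which is exactly what makes these assertions handy for a "replace the rest, keep the domain" job.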

Moreover :

In the local part regex, you noticed the form (?3). It stands for the regex of the third group :

[\w!#-'*+/=?^`{-~-]+

( a NON-EMPTY string of ANY characters allowed in the local part )

VERY IMPORTANT :

Don't confuse the TWO regexes, let's say, (\d+)_\1 and (\d+)_(?1)

In the first regex, the \1 back reference represents the exact value matched by group 1

In the second regex, the (?1) group reference represents the exact regex of group 1, so \d+

You easily see the difference if you apply each of these two regexes to the subject string below :

123_123 | 123456_789 | 123456_123456 | 456_123789
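Here is a sketch of that comparison on the subject string above. Python's re module (used here as an assumption, since it is not the Notepad++ engine) supports the \1 back reference but has no (?1) group reference, so the second regex is written with the group's pattern \d+ repeated by hand, which is what (?1) stands for.

```python
import re

subject = "123_123 | 123456_789 | 123456_123456 | 456_123789"

# \1 must repeat the exact text captured by group 1.
same_value = [m.group(0) for m in re.finditer(r"(\d+)_\1", subject)]

# (?1) re-uses the regex of group 1; spelled out by hand as \d+ here.
same_regex = [m.group(0) for m in re.finditer(r"(\d+)_\d+", subject)]

print(same_value)  # ['123_123', '123456_123456']
print(same_regex)  # ['123_123', '123456_789', '123456_123456', '456_123789']
```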

Finally, to search for a complete legal e-mail address ( = local part@domain part ), we just have to juxtapose the local and domain parts, with the @ symbol between them.

SEARCH of a legal e-mail address :

(^|(?<=(\h|[,:<([])))([\w!#-'*+/=?^`{-~-]+)(\.(?3))*@([\l\d-]+
\.)+([a-z]{2}|com|net|org|aero|asia|biz|cat|coop|edu|gov|info|
int|jobs|mil|mobi|museum|name|post|pro|tel|travel|xxx)
((?=(\h|[])>,.]))|$)

Of course, you can get rid of the delimiter assertions, at the beginning and end of this regex, in order to get more e-mail addresses !

([\w!#-'*+/=?^`{-~-]+)(\.(?1))*@([\l\d-]+\.)+(com|net|org|aero|
asia|biz|cat|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|
post|pro|tel|travel|xxx|[a-z]{2})

Examples :

niceandsimple@example.com
very.common@12345.com
Vacances-d'été@club-med.fr
a.little-lengthy.but.fine@dept.example.museum
style.email.+symbols!#$%&'*+-/=?^_{}|~@a-small-example.fr

With delimiters :

This is a test@entity.com correct e-mail address
(test@entity.com)
[test@entity.net]
<test@entity.org>
,test@entity.edu,
:test@entity.xxx.
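The examples above can be checked with a rough Python port of the regex without delimiter assertions. This is only a sketch under assumptions: Python's re has no (?1) or \l, so the group is repeated by hand and \l becomes [a-z]; the names ATOM, TLDS, EMAIL and find_addresses are illustrative. As in Notepad++ with "Match case" checked, the domain part must be lowercase.

```python
import re

# One dot-free run of allowed local-part characters (group 1 in the post).
ATOM = r"[\w!#-'*+/=?^`{-~-]+"
# Generic TLDs first, two-letter country codes last.
TLDS = ("com|net|org|aero|asia|biz|cat|coop|edu|gov|info|int|jobs|mil|mobi|"
        "museum|name|post|pro|tel|travel|xxx|[a-z]{2}")
EMAIL = re.compile(
    ATOM + r"(?:\." + ATOM + r")*@(?:[a-z\d-]+\.)+(?:" + TLDS + ")"
)


def find_addresses(text):
    """Return every substring of text that the sketch regex matches."""
    return [m.group(0) for m in EMAIL.finditer(text)]


print(find_addresses("(test@entity.com) [test@entity.net] <test@entity.org>"))
# ['test@entity.com', 'test@entity.net', 'test@entity.org']
print(find_addresses("Vacances-d'été@club-med.fr"))
# ["Vacances-d'été@club-med.fr"]
```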

Whaaaaaaou ! I'm a bit tired, now ! Hope that's useful, somehow.

Cheers,

guy038

Article finished on 14/09/2013, at 21h21 ( French time zone )

Last edit: THEVENOT Guy 2013-09-14

THEVENOT Guy - 2013-09-15

Hello Neomi and All,

Let's go on and try to find a regex that matches almost every form of a valid e-mail address. Aaaahhhh !

Of course, we are supposed to use a Unicode version of Notepad++, i.e. version 6.0 or later

@ Neomi

Here are some general regexes, relative to punctuation signs :

[[:punct:]] finds ANY punctuation sign, between \x00 and \xff

[^[:^punct:]\x80-\xff] finds ANY punctuation sign, between \x00 and \x7f. Notice the double negation, which means : NOT ( NOT a punctuation sign OR a character ABOVE \x7f )

[!-/:-@\[-`{-~] is identical and finds ANY punctuation sign, between \x21 included and \x7e included. This regex is the union, between square brackets, of four ranges : from ! to /, from : to @, from [ to ` and from { to ~
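The claim that these four ranges cover exactly the ASCII punctuation signs can be checked mechanically. The sketch below uses Python (an assumption, not the Notepad++ engine) and compares the character class, written out in full as [!-/:-@\[-`{-~], with the standard string.punctuation constant.

```python
import re
import string

# Four ranges: ! to /, : to @, [ to ` and { to ~
ASCII_PUNCT = re.compile(r"[!-/:-@\[-`{-~]")

# Collect every ASCII character that the class accepts.
matched = {chr(code) for code in range(128) if ASCII_PUNCT.fullmatch(chr(code))}

print(len(matched))                        # 32
print(matched == set(string.punctuation))  # True
```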

@ Neomi and All

Go back to our e-mail address problem. So, let's define three sets of characters :

Set1 = [\w!#-'*+/=?^`{-~-]+
Set2 = ([\w !#-,/:-@\]-`{-~[-]|\\["\\])+
Set3 = "([\w !#-,./:-@\]-`{-~[-]|\\["\\])+"

Set1 is a string composed of ALL the allowed characters, except the DOT
Set2 is a string composed of ALL the allowed + special characters, OR \" OR \\, except the DOT and an unescaped " or \
Set3 is a string composed of ALL the allowed + special characters, OR \" OR \\, except an unescaped " or \ , and surrounded by two double quotes "

After some investigations ( and some hours ! ), I think that a general regex, relative to the local part, with the initial assertion and the final lookahead, could be :

(^|(?<=(\h|[,:<([])))("Set2(\.Set2)*"|Set1((\.(Set1|Set3))*\.Set1)?)(?=@)

With that syntax, management of quoted strings seems OK. So, e-mail addresses like abc."def".ghi@example.com or "abc.def.ghi"@example.com will be correct !
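The way the general regex is assembled from the three sets can be sketched in Python by pasting strings together. This is an illustration under assumptions and not the Notepad++ regex itself: Python's re has no \h, so the initial assertion spells it out as space, tab and no-break space, the groups are made non-capturing, and the names START, LOCAL_PART and local_part are mine.

```python
import re

# The three sets from the post, kept as plain strings.
SET1 = r"[\w!#-'*+/=?^`{-~-]+"
SET2 = r'(?:[\w !#-,/:-@\]-`{-~[-]|\\["\\])+'
SET3 = r'"(?:[\w !#-,./:-@\]-`{-~[-]|\\["\\])+"'

# Assumption: \h is replaced by [ \t\xa0] inside the initial assertion.
START = r"(?:^|(?<=[ \t\xa0,:<(\[]))"

# START ( "Set2(\.Set2)*" | Set1((\.(Set1|Set3))*\.Set1)? ) (?=@)
LOCAL_PART = re.compile(
    START
    + '(?:"' + SET2 + r"(?:\." + SET2 + ')*"'
    + "|" + SET1 + r"(?:(?:\.(?:" + SET1 + "|" + SET3 + r"))*\." + SET1 + ")?)"
    + "(?=@)",
    re.MULTILINE,
)


def local_part(line):
    """Return the local part found in line, or None if there is none."""
    m = LOCAL_PART.search(line)
    return m.group(0) if m else None


print(local_part('abc."def".ghi@example.com'))   # abc."def".ghi
print(local_part('"abc.def.ghi"@example.com'))   # "abc.def.ghi"
print(local_part('just"not"right@example.com'))  # None
```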

So, if you replace the 3 sets by their value, we obtain the regex :

(^|(?<=(\h|[,:<([])))("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.
([\w !#-,/:-@\]-`{-~[-]|\\["\\])+)*"|[\w!#-'*+/=?^`{-~-]+((\.
([\w!#-'*+/=?^`{-~-]+|"([\w !#-,./:-@\]-`{-~[-]|\\["\\])+"))*
\.[\w!#-'*+/=?^`{-~-]+)?)(?=@)

As said in my previous post, this regex must be rewritten as a one-line expression !

But, if we use some group references (?n), which refer to a previous group n, this new regex, for the search of the local part of an e-mail address, can be shortened to :

(^|(?<=(\h|[,:<([])))("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.
(?4)+)*"|([\w!#-'*+/=?^`{-~-])+((\.((?6)+|"([\w !#-,./:-@\]-`
{-~[-]|\\["\\])+"))*\.(?6)+)?)(?=@)

Now, concerning the regex for the domain part, we just change the beginning ([\l\d-]+\.)+ into (([\l\d-]+\.)+)? in order to also match a domain made of the top-level domain ONLY, which is matched further on in the regex

Finally, if we add the new regex of the domain part, the complete regex to search an e-mail address becomes :

(^|(?<=(\h|[,:<([])))("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.
(?4)+)*"|([\w!#-'*+/=?^`{-~-])+((\.((?6)+|"([\w !#-,./:-@\]-`
{-~[-]|\\["\\])+"))*\.(?6)+)?)@(([\l\d-]+\.)+)?([a-z]{2}|com|
net|org|aero|asia|biz|cat|coop|edu|gov|info|int|jobs|mil|mobi|
museum|name|post|pro|tel|travel|xxx)((?=(\h|[])>,.]))|$)

You'll find, below, a list of some valid e-mail addresses, which are matched by this enormous regex !

niceandsimple@example.com
very.common@12345.com
a.little-lengthy.but.fine@dept.example.museum
disposable.style.email.with+symbol@example.com
!#$%&'*+-/=?^_`{}|~@example.org
Vacances-d'été@club-med.fr
style.email.+symbols!#$%&'*+-/=?^_{}|~@a-small-example.fr
UPPERCASE-letters@example.com
very.common@com
very.common@fr
!#$%&' *+-/=?^_`{}|~@example.org (A)
!#$%&'\ *+-/=?^_`{}|~@example.org (A)
just."not".right@example.com
just."very.not".right@example.com
just."very..not".right@example.com (B)
"simple"@example.com
"sim.ple"@example.com
"much.more unusual"@example.com
"very.unusual.@.unusual.com"@example.com
"very.(),:;<>[]\".VERY.\"very@\\ \"\".unusual"@example.com
"()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~.a"@example.org
"UPPERCASE-letters"@example.com
" "@example.org
very.common@abc.com
very.common@abc.fr
very.common@abc.123.com
very.common@abc.123.fr
very.common@abc.123.fr.com
very.common@abc.123.fr.fr
very.common@abc.123.com
very.common@abc.123.this.is.the-test.to-do.com
This is a test@entity.com correct e-mail address (C)
This is a test@entity.com correct e-mail address (D)
This is a test@entity.com correct e-mail address (E)
(test@entity.com) (F)
[test@entity.net] (G)
<test@entity.org> (H)
,test@entity.edu, (I)
:test@entity.xxx. (J)

Notes :

(A) Valid e-mail address, from the SPACE, excluded
(B) This e-mail address shouldn't be valid, because of the two SUCCESSIVE Dots
(C) E-mail address, surrounded by Spaces
(D) E-mail address, surrounded by Tabs
(E) E-mail address, surrounded by No-Break Spaces
(F) E-mail address, surrounded by Parentheses
(G) E-mail address, surrounded by Square Brackets
(H) E-mail address, surrounded by Angle Brackets
(I) E-mail address, surrounded by Commas
(J) E-mail address, surrounded by a Colon and a Dot

And, to end up, a list of some invalid e-mail addresses, which are correctly NOT found by this regex, except the (Z) case

!#$%&'"*+-/=?^_`{}|~@example.org (K)
!#$%&'\*+-/=?^_`{}|~@example.org (L)
!#$%&'\"*+-/=?^_`{}|~@example.org (M)
!#$%&'\\*+-/=?^_`{}|~@example.org (N)
style.email..+symbols!#$%&'*+-/=?^_{}|~@a-small-example.fr (O)
"very.(),:;<>[]\".VERY..\"very@\\ \"\".unusual"@example.com (P)
Abc.example.com (Q)
A@b@c@example.com (R)
a"b(c)d,e:f;g<h>i[j\k]l@example.com (S)
just"not"right@example.com (T)
just."not".@example.com (U)
just."not"@.example.com (V)
."not".right@example.com (W)
this is"not\allowed@example.com (X)
this\ still\"not\\allowed@example.com (Y)
user@[IPv6:2001:db8:1ff::a0b:dbd0] (Z)
very.common@123 (1)
very.common@.com (2)
very.common@.k (3)
very.common@ABC.def.com (4)
very.common@abc.DEF.com (4)

Notes :

(K) A double quote " , NOT between double quotes
(L) A backslash \ , NOT between double quotes
(M) An escaped double quote \" , NOT between double quotes
(N) An escaped backslash \\ , NOT between double quotes

(O) Two successive dots found
(P) Two successive dots found

(Q) The symbol @ is absent
(R) An extra @ symbol is located outside double quotes
(S) The special symbols are NOT escaped
(T) The quoted strings don't have a DOT delimiter

(U) A DOT is located just BEFORE the symbol @
(V) A DOT is located just AFTER the symbol @
(W) A DOT found at the BEGINNING of e-mail address
(X) Space, Double Quote and Backslash are NOT double-quoted nor escaped
(Y) Space, Double Quote and Backslash are NOT double-quoted

(Z) This NON-treated case is OK, as it's a domain with an IP address
(1) Top-Level Domain UNKNOWN
(2) Top-Level Domain preceded by a dot
(3) ONE letter country or Top-Level Domain
(4) UPPERCASE letter found
(4) UPPERCASE letter found

You may think : " this guy has ( plenty of ) time to waste !" You would be perfectly right. It was just an exercise to boost my neurons and investigate nasty regexes. Indeed, I will certainly never use this complicated regex !!!

Cheers,

guy038

Article finished on 16/09/2013, at 01h50 ( French time zone )

Last edit: THEVENOT Guy 2013-09-15

THEVENOT Guy - 2013-09-16

Hi all,

To be complete on that topic, I forgot, in my last post, to give the regex for the search of an e-mail address, when you don't include the anchors at the beginning and the end. Indeed, the different alternatives must be slightly reordered : we need to put all the generic top-level domains BEFORE the country code alternative [a-z]{2}. Otherwise, without the final anchor, [a-z]{2} would be tried first and the match would stop after the first two letters of a longer top-level domain, such as com

Moreover, as the anchor, at the start of the regex, is absent, the group numbers used in the references of the form (?x) must be reduced by 2 !
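Why the order of the alternatives matters once the final anchor is gone can be shown with a cut-down sketch. It uses Python's re module and a shortened TLD list (both are assumptions made for brevity); the names country_first and generic_first are illustrative. An alternation takes the first alternative that lets the match succeed, so [a-z]{2} placed first stops after "co".

```python
import re

# One or more lowercase labels, each followed by a dot, after the @.
DOMAIN = r"@(?:[a-z\d-]+\.)+"

# Country code alternative first, as in the anchored regex.
country_first = re.compile(DOMAIN + r"(?:[a-z]{2}|com|net|org)")
# Generic TLDs first, as required once the final anchor is removed.
generic_first = re.compile(DOMAIN + r"(?:com|net|org|[a-z]{2})")

address = "user@example.com"
print(country_first.search(address).group(0))  # @example.co
print(generic_first.search(address).group(0))  # @example.com
```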

So, the strict regex for e-mail address search, becomes :

("([\w !#-,/:-@\]-`{-~[-]|\\["\\])+(\.(?2)+)*"|([\w!#-'*+/=?^`
{-~-])+((\.((?4)+|"([\w!#-,./:-@\]-`{-~[-]|\\["\\])+"))*\.(?4)
+)?)@(([\l\d-]+\.)+)?(com|net|org|aero|asia|biz|cat|coop|edu|
gov|info|int|jobs|mil|mobi|museum|name|post|pro|tel|travel|xxx|
[a-z]{2})

Of course, as this regex imposes fewer conditions than the complete regex of my previous post, it may find some part of an invalid e-mail address which is, in itself, a valid e-mail address !

End of my delirium !!!!!

Cheers,

guy038

Last edit: THEVENOT Guy 2013-09-16

GerdB - 2013-09-18

Whoa, Guy, I hope you recovered from that one :-)

A good place to look for any regexes is http://www.regular-expressions.info/examples.html You'll find lots of useful stuff there together with explanations and pros and cons.

regards
gerd


THEVENOT Guy - 2013-09-19

Hi GerdB,

Thank you for this link, but I already know this excellent Web site :-)

When I was younger, I worked on Unix servers and I very often used regular expressions with the very old VI editor !

Indeed, when David Brotherstone included the Perl Compatible Regular Expressions (PCRE) in the 6.0 version of N++ ( Many thanks to him, again ), I felt the need for good documentation about PCRE and the modern features of regular expressions.

So, after a while, I came across the site of Jan Goyvaerts, very complete, which allows everyone to find out all the power and the flexibility of regular expressions.

If you are interested, I have the FULL English tutorial of this site, in a Word document ( 149 pages and about 2 Megabytes ). So, just send me an e-mail, at my e-mail address ( guy.038@wanadoo.fr ), with your own e-mail address.

I will send you back this Word document, as an attached file.

Of course, the version of this tutorial corresponds to the beginning of 2011 and it may have been updated !

Best regards,

guy038

Last edit: THEVENOT Guy 2013-09-19
