Francois-R Boyer : New regex code quite OK !

Forum: [READ ONLY] Open Discussion

Creator: THEVENOT Guy

Created: 2013-06-16

Updated: 2013-07-04

THEVENOT Guy - 2013-06-16

Hello, François,

First of all, please have a look at my last two posts, written after I downloaded your NEW Scilexer.dll, at these addresses:

https://sourceforge.net/p/notepad-plus/discussion/331753/thread/9f4742f6/#d8e5

https://sourceforge.net/p/notepad-plus/discussion/331753/thread/9f4742f6/#d4f1

The latter post describes a small problem with highlighting in the Find Mark style (not an important issue!).

So, I have finished all the regex search/replace tests described in my last post, and I'm glad to tell you that, overall, everything seems OK :)

In addition to the small bug described in my last post, I just noticed another issue, concerning recursive patterns.

IMPORTANT: This issue occurs both in the current version of N++ (and certainly in earlier ones!) and in your new code.

Consider the subject string below, in a new file:

---<<54<6>4>---<<123<>78>904>----<>----<12345>----

Searching for <([^<>]|(?R))*> finds the longest sequence <.....>, possibly spanning several lines and/or EMPTY, that contains ONLY other WELL-nested sequences <...>

Thus, the four strings <54<6>4>, <<123<>78>904>, <> and <12345> are found, both with N++ and with the RegEx Helper plug-in.

Now, consider the regex <([^<>]|(?R))+> (the only change is that the star symbol * is replaced with the plus sign +, before the final > symbol).

Normally, this regex should find the longest NON-EMPTY sequence <.....>, possibly spanning several lines, that contains ONLY other WELL-nested NON-EMPTY sequences <...>

Then, with N++, three strings are found: <54<6>4>, <<123<>78>904> and <12345>
The second string, <<123<>78>904>, should not have been found, because it contains the empty sequence <>! It seems that the non-empty constraint is only enforced outside the recursion!

But with the RegEx Helper plug-in, only two strings are found: <54<6>4> and <12345>
That is the correct behaviour!
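For readers who want to check the expected results without a recursion-capable engine, here is a minimal sketch in Python. The standard `re` module does not support `(?R)`, so this does not run the regex itself: it is a small hand-written matcher that follows the same rule (a `<`, then any number of non-bracket characters or nested groups, then a `>`). The names `match_group`, `find_groups` and `min_items` are made up for this illustration; `min_items=0` plays the role of `*` and `min_items=1` the role of `+`.

```python
def match_group(text, start, min_items):
    """Return the end index of a well-nested <...> group starting at
    `start`, or None if no such group starts there."""
    if start >= len(text) or text[start] != "<":
        return None
    pos = start + 1
    items = 0  # number of ([^<>]|(?R)) repetitions consumed so far
    while pos < len(text) and text[pos] != ">":
        if text[pos] == "<":
            # Nested group: apply the same rule recursively, like (?R).
            end = match_group(text, pos, min_items)
            if end is None:
                return None
            pos = end
        else:
            pos += 1  # ordinary character, i.e. [^<>]
        items += 1
    if pos < len(text) and items >= min_items:
        return pos + 1  # text[pos] is the closing '>'
    return None


def find_groups(text, min_items):
    """Scan left to right and collect every non-overlapping match."""
    found = []
    pos = 0
    while pos < len(text):
        end = match_group(text, pos, min_items)
        if end is None:
            pos += 1
        else:
            found.append(text[pos:end])
            pos = end
    return found


subject = "---<<54<6>4>---<<123<>78>904>----<>----<12345>----"
star_matches = find_groups(subject, 0)  # behaves like <([^<>]|(?R))*>
plus_matches = find_groups(subject, 1)  # behaves like <([^<>]|(?R))+>
print(star_matches)
print(plus_matches)
```

With `min_items=1` the group <<123<>78>904> is rejected as soon as its inner <> fails, which is the behaviour described above as correct.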

What do you think about it?

Many thanks, again, for the corrections and improvements to the regex search/replace engine!

I intend to create a new topic about specific bugs and improvements concerning the Search/Replace interface.

Best Regards,

guy038

P.S.

By the way, I also tested your new character class [[:inval:]]. It works fine! I just wrote an accented character, like é, in a dummy file. In UTF-8, it is normally encoded as the two bytes \xc3 and \xa9.

So, with another small search/replace editor, I replaced the first byte with, for example, the byte \xc1, which is never a valid value in a UTF-8 file.

Thus, in N++, this file was displayed with the symbols xC1 and xA9, and these bytes were correctly found with your [[:inval:]] class.
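The same experiment can be reproduced outside N++. The sketch below is only an illustration in Python and says nothing about how [[:inval:]] is implemented: it relies on the standard UTF-8 decoder to report the offsets of bytes that cannot be decoded. The function name `invalid_utf8_offsets` is made up for this example.

```python
def invalid_utf8_offsets(data):
    """Return the offsets of the bytes in `data` that the strict UTF-8
    decoder rejects."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end delimit the offending bytes,
            # relative to the slice that was being decoded.
            offsets.extend(range(pos + err.start, pos + err.end))
            pos += err.end
    return offsets


valid = "é".encode("utf-8")          # the two bytes C3 A9
broken = b"\xc1" + valid[1:]         # first byte replaced by C1
print(valid.hex(), invalid_utf8_offsets(valid))
print(broken.hex(), invalid_utf8_offsets(broken))
```

Both bytes of the modified sequence are reported: \xc1 is never a legal lead byte, and \xa9 is a continuation byte left without a valid lead byte.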

François, the tiny search/replace editor I mentioned above, mtr.exe, may interest you, especially for huge batch search/replace jobs involving hundreds of files and/or hundreds of simultaneous searches!

This very powerful tool, called "Minitrue" v2.0.6, combines a text viewer, a "grep" utility, a "less" pager utility and a fast search/replace program, with support for regular expressions!

This program is generally launched from a DOS session, but all actions can be stored in batch files.

In addition, the list of files to scan, the list of strings to search for and, if needed, the list of replacements to make can all be stored in text files.

Although its regex syntax is a bit less powerful than the N++ PCRE syntax, it has some other interesting proprietary program options and regex features!

You can download it at: http://adoxa.3eeweb.com/minitrue

However, it is better to place this program directly in the root of a drive, to avoid problems with the total length of file paths! Of course, file names containing spaces must be enclosed in double quotes.

The home page of the (productive!) author, Jason Hood, is at: http://adoxa.3eeweb.com

After downloading, just have a look at the fourteen examples at the end of his tutorial, with the -? help option, to be really convinced :)

THEVENOT Guy - 2013-06-16

Hello, François,

Oops, I must have had a problem when creating this new topic!!

First of all, please have a look at my last two posts, written after I downloaded your NEW Scilexer.dll, at these addresses:

https://sourceforge.net/p/notepad-plus/discussion/331753/thread/9f4742f6/#d8e5

https://sourceforge.net/p/notepad-plus/discussion/331753/thread/9f4742f6/#d4f1

The latter post describes a small problem with highlighting in the Find Mark style (not an important issue!).

So, I have finished all the regex search/replace tests described in my last post, and I'm glad to tell you that, overall, everything seems OK :)

In addition to the small bug described in my last post, I just noticed another issue, concerning recursive patterns.

IMPORTANT: This issue occurs both in the current version of N++ (and certainly in earlier ones!) and in your new code!

Consider the subject string below, in a new file:

---<54<6>4>---<<123<>78>904>----<>----<12345>----

Searching for <([^<>]|(?R))*> finds the longest sequence <.....>, possibly spanning several lines and/or EMPTY, that contains ONLY other WELL-nested sequences <...>, which may themselves span several lines and/or be EMPTY

Thus, the four strings <54<6>4>, <<123<>78>904>, <> and <12345> are found, both with N++ and with the RegEx Helper plug-in.

Now, consider the regex <([^<>]|(?R))+> (the only change is that the star symbol * is replaced with the plus sign +, before the final > symbol).

Normally, this regex should find the longest NON-EMPTY sequence <.....>, possibly spanning several lines, that contains ONLY other WELL-nested NON-EMPTY sequences <...>, which may themselves span several lines

Then, with N++, three strings are found: <54<6>4>, <<123<>78>904> and <12345>
The second string, <<123<>78>904>, should not have been found, because it contains the empty sequence <>! It seems that the non-empty constraint is only enforced outside the recursion!

But with the RegEx Helper plug-in, only two strings are found: <54<6>4> and <12345>
That is the correct behaviour!

What do you think about it?

Many thanks, again, for the corrections and improvements to the regex search/replace engine!

I created a new topic about a specific bug and improvements concerning the Search/Replace interface, at this address:

https://sourceforge.net/p/notepad-plus/discussion/331753/thread/328af373/#e087

Best Regards,

guy038

P.S.

By the way, I also tested your new character class [[:inval:]]. It works fine!

I just wrote an accented character, like é, in a dummy file. In UTF-8, it is normally encoded as the two bytes \xc3 and \xa9.

So, with another small search/replace editor, I replaced the first byte with, for example, the byte \xc1, which is never a valid value in a UTF-8 file.

Thus, in N++, this file was displayed with the symbols xC1 and xA9, and these bytes were correctly found with your [[:inval:]] class.

François, the tiny search/replace editor I mentioned above, mtr.exe, may interest you, especially for huge batch search/replace jobs involving hundreds of files and/or hundreds of simultaneous searches!

This very powerful tool, called "Minitrue" v2.0.6, combines a text viewer, a "grep" utility, a "less" pager utility and a fast search/replace program, with support for regular expressions!

This program is generally launched from a DOS session, but all actions can be stored in batch files.

In addition, the list of files to scan, the list of strings to search for and, if needed, the list of replacements to make can all be stored in text files.

Although its regex syntax is a bit less powerful than the N++ PCRE syntax, it has some other interesting proprietary program options and regex features!

You can download it at this address:

http://adoxa.3eeweb.com/minitrue

However, it is better to place this program directly in the root of a drive, to avoid problems with the total length of file paths! Of course, file names containing spaces must be enclosed in double quotes.

The home page of the (productive!) author, Jason Hood, is at this address:

http://adoxa.3eeweb.com

After downloading, just have a look at the fourteen examples at the end of his tutorial, with the -? help option, to be really convinced :)

Also, open a file and keep the TAB key pressed => the two views of the file, normal and hexadecimal, seem to be truly simultaneous!!! Very efficient code :)

Last edit: THEVENOT Guy 2013-06-18
