Francois-R Boyer : New regex code quite OK !

2013-06-16
2013-07-04
  • THEVENOT Guy

    THEVENOT Guy - 2013-06-16

    Hello, François,

    Oops, I must have had a problem when creating this new topic!

    First of all, just have a look at my last two posts, written after I downloaded your NEW SciLexer.dll, at these addresses:

    https://sourceforge.net/p/notepad-plus/discussion/331753/thread/9f4742f6/#d8e5

    https://sourceforge.net/p/notepad-plus/discussion/331753/thread/9f4742f6/#d4f1

    This last post describes a small problem with highlighting in the Find Mark style (not an important issue!).

    So, I finished all the tests of regex search/replacement described in my last post, and I'm glad to tell you that, overall, everything seems OK :)

    In addition to the small bug described in my last post, I just noticed another issue, concerning recursive patterns.

    IMPORTANT: This issue occurs both in the current version of N++ (and certainly in earlier ones!) and with your new code!

    Let us consider the subject string below, in a new file :

    ---<54<6>4>---<<123<>78>904>----<>----<12345>----

    The search for  <([^<>]|(?R))*>   gives the longest sequence <.....>, possibly multi-line and/or EMPTY, containing ONLY other well-nested sequences <...>, which may themselves be multi-line and/or EMPTY.

    Thus, the four strings <54<6>4>, <<123<>78>904>, <> and <12345> are found, both with N++ and with the RegEx Helper plug-in.

    Now, consider the regex <([^<>]|(?R))+> (the only modification is replacing the star symbol * with the plus sign +, before the last symbol >).

    Normally, this regex should find the longest NON-EMPTY sequence <.....>, possibly multi-line, containing ONLY other well-nested NON-EMPTY sequences <...>, possibly multi-line as well.

    Yet, with N++, three strings are found: <54<6>4>, <<123<>78>904> and <12345>.
    The second string, <<123<>78>904>, should not have been found, since it contains the empty sequence <>! It seems that the + quantifier is honoured only outside the recursion phase!

    But with the RegEx Helper plug-in, only two strings are found: <54<6>4> and <12345>.
    That's the correct behaviour!

    What do you think?
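
    To make the expected results concrete, here is a minimal Python sketch that emulates the two patterns by hand. It does not call the Boost or PCRE engines: the helper names match_at and find_all are hypothetical, and the sketch assumes that, for these two particular patterns, a plain left-to-right scan finds the same matches as a backtracking regex engine (the two alternatives can never start with the same character, and only the closing > can end a group).

```python
def match_at(text, start, min_items):
    """Emulate <([^<>]|(?R)){min_items,}> anchored at `start`.

    Returns the index just past the match, or -1 if there is no match.
    min_items=0 mimics the * version, min_items=1 the + version.
    """
    if start >= len(text) or text[start] != "<":
        return -1
    pos = start + 1
    items = 0
    while pos < len(text) and text[pos] != ">":
        if text[pos] == "<":
            # Alternative (?R): a nested sequence must itself match the whole pattern.
            end = match_at(text, pos, min_items)
            if end < 0:
                return -1
            pos = end
        else:
            # Alternative [^<>]: any single character except < and >.
            pos += 1
        items += 1
    if pos < len(text) and items >= min_items:
        return pos + 1  # consume the closing >
    return -1


def find_all(text, min_items):
    """Collect successive non-overlapping matches, scanning left to right."""
    found = []
    pos = 0
    while pos < len(text):
        end = match_at(text, pos, min_items)
        if end < 0:
            pos += 1
        else:
            found.append(text[pos:end])
            pos = end
    return found


SUBJECT = "---<54<6>4>---<<123<>78>904>----<>----<12345>----"

print(find_all(SUBJECT, 0))  # ['<54<6>4>', '<<123<>78>904>', '<>', '<12345>']
print(find_all(SUBJECT, 1))  # ['<54<6>4>', '<12345>']
```

    With min_items=1, the nested empty <> makes the whole string <<123<>78>904> fail, which is the result reported by RegEx Helper.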

    Many thanks, again, for the corrections and improvements to the regex search/replacement engine!

    I created a new topic about a specific bug and some improvements to the Search/Replacement interface, at this address:

    https://sourceforge.net/p/notepad-plus/discussion/331753/thread/328af373/#e087

    Best Regards,

    guy038

    P.S.

    By the way, I also tested your new character class [[:inval:]]. It works fine!

    I just wrote an accented character, such as é, in a dummy file. In UTF-8, it is normally encoded as the two bytes \xc3 and \xa9.

    Then, with another small search/replace editor, I replaced the first byte with, for example, the byte \xc1, which is always a forbidden value in a UTF-8 file.

    As a result, in N++, this file was displayed with the symbols xC1 and xA9, and these bytes were correctly found with your [[:inval:]] class.
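
    The byte manipulation above can be reproduced with a short Python sketch. It only illustrates why the modified file is invalid UTF-8; it says nothing about how [[:inval:]] is implemented, and the helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


good = "é".encode("utf-8")        # the two bytes \xc3 \xa9
bad = bytes([0xC1]) + good[1:]    # first byte replaced by \xc1

print(good, is_valid_utf8(good))
print(bad, is_valid_utf8(bad))
```

    The byte \xc1 could only introduce an overlong two-byte encoding of an ASCII character, which UTF-8 forbids, so a strict decoder rejects the second sequence.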


    François, the tiny search/replace editor I mentioned above, mtr.exe, may interest you, especially for huge batch search/replacements involving hundreds of files and/or hundreds of simultaneous searches!

    This very powerful tool, called "Minitrue" v2.0.6, combines a text viewer, a "grep" utility, a "less" pager utility and a fast search/replacement program, with support for regular expressions!

    Generally, this program is launched in a DOS session, but all actions can be stored in batch files.

    In addition, the list of files to scan, the list of strings to search for and, if needed, the list of replacements to make can all be stored in text files.

    Although its regex syntax is a bit less powerful than the N++ PCRE syntax, it has some other interesting proprietary program options and regex features!

    You can download it at this address:

    http://adoxa.3eeweb.com/minitrue

    However, it's better to place this program directly in the root of a drive, to avoid problems with the total length of the file paths! Of course, file names containing spaces must be enclosed in double quotes.

    The home page of the (productive!) author, Jason Hood, is at this address:

    http://adoxa.3eeweb.com

    After downloading, just have a look at the fourteen examples at the end of his tutorial, shown with the -? help option, to be really convinced :)

    Also, open a file and keep the TAB key pressed => the two views of the file, normal and hexadecimal, seem to be truly simultaneous!!! Very efficient code :)

     

    Last edit: THEVENOT Guy 2013-06-18