Patch / Binary for Full Regular expressions

2012-01-31
2012-11-13
  • Dave Brotherstone

    I've just added a patch that adds full regular expression support (using boost::regex).  It's based on Simon Steele's work for Programmer's Notepad 2.  Someone has done this before, or at least posted instructions on how to do it.

    I've converted it to use boost::regex instead of boost::xpressive, tidied it up a bit, and made some performance improvements.  I've also added a batch file so anyone can build it: just download and unpack boost (from www.boost.org), and run Buildboost.bat from the scintilla\boostregex directory where boost was extracted to.

    I've built a SciLexer.dll that includes it, so you can download it and simply replace your existing one if you want full regular expression support without building anything.  Obviously I can't say whether Don will accept the patch, but I've tried to make it as clean as possible, such that anyone can build it, and it doesn't change scintilla (apart from a single line in the makefile).

    The patch is here:
    https://sourceforge.net/tracker/?func=detail&atid=612384&aid=3482291&group_id=95717

    The binary is here (this should work with any recent version of Notepad++)
    http://www.brotherstone.co.uk/npp/SciLexer.zip

    It "appears" to work with a Chinese character set, in UTF-8 and UCS-2, but without any knowledge of the language it's difficult to say.  If anyone can test that I'd be very grateful.

    An invalid regular expression will find characters on the first line - the patch includes a small change to Notepad++ to fix this, but if you just change the binary you may notice it.

    I actually think the performance can be dramatically increased, but whether I put those changes into a plugin or into another patch depends on whether this patch gets accepted.

    Quick guide to the new regex possibilities:

    {} operator - count of the previous element.  e.g. [a-z]{3,5} will match between 3 and 5 a-z characters.
    | operator  - (cat|dog){2} will match catdog, catcat, dogcat, or dogdog
    Back references  - Refer back to a previous group in the expression:  (cat|dog)\1 will match catcat or dogdog, but not catdog or dogcat.
    Named groups - name your groups: (?<letters>[a-z]+)(?<thedigit>\d)  You can then use the names in your replacement string  - "The letters were $+{letters} and the digit was $+{thedigit}"
    I've set it so that "." does not match newline, but you can easily use \R to match any end-of-line.
    We could possibly extend N++ to allow you to alter this, allowing for easy multi-line matches.
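
    As a rough illustration of the items above, here is a small sketch in Python.  Note the assumption: this uses Python's `re` module as a stand-in, not boost::regex, so it only shows how the patterns behave.  The count, alternation and back-reference syntax is the same in both engines, but Python spells named groups `(?P<name>...)` and `\g<name>` rather than `(?<name>...)` and `$+{name}`, and it has no `\R`, so an explicit end-of-line alternation is used instead.  The sample strings are made up for the example.

```python
import re

# Count operator: between 3 and 5 lowercase letters.
count_ok = bool(re.fullmatch(r"[a-z]{3,5}", "abcd"))
count_too_short = bool(re.fullmatch(r"[a-z]{3,5}", "ab"))

# Alternation repeated twice: any pairing of cat and dog matches.
pairs = ["catdog", "catcat", "dogcat", "dogdog"]
any_pair = [p for p in pairs if re.fullmatch(r"(cat|dog){2}", p)]

# Back reference: the second word must repeat whatever group 1 matched.
same_twice = [p for p in pairs if re.fullmatch(r"(cat|dog)\1", p)]

# Named groups (Python spelling; Boost would use (?<letters>...) and $+{letters}).
named = re.compile(r"(?P<letters>[a-z]+)(?P<thedigit>\d)")
replaced = named.sub(
    r"The letters were \g<letters> and the digit was \g<thedigit>", "abc7"
)

# "." does not match a newline by default, so only "axb" is found.
dot_hits = re.findall(r"a.b", "a\nb axb")

# Stand-in for \R: match any of the common end-of-line sequences explicitly.
eol_hits = re.findall(r"a(?:\r\n|\n|\r)b", "a\nb axb")

print(any_pair, same_twice, replaced, dot_hits, eol_hits)
```

    The back-reference list keeps only the strings whose second word repeats the first, which is the difference from the plain `{2}` repetition.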

    You can also have more than 9 groups: just use ${n} in the replacement, where n is the group number.
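
    To show what referring to a group beyond the ninth looks like, here is a minimal sketch, again assuming Python's `re` module as a stand-in rather than boost::regex.  Python writes the explicit form as `\g<10>`, where the Boost replacement syntax uses `${10}`.  The ten-group pattern and the input string are invented for the example.

```python
import re

# Ten single-letter capture groups; group 10 captures the last letter.
pattern = re.compile(r"([a-z])" * 10)

# \g<10> names group 10 explicitly (the Boost replacement equivalent is ${10}).
result = pattern.sub(r"\g<10>\g<1>", "abcdefghij")

print(result)
```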

    Those are the main ones, I think.
    See http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html for regular expression syntax and http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html for replacement syntax.

    Bugs, feedback, comments are all welcome.

    Cheers,
    Dave.

     
  • Dave Brotherstone

    Just spotted an issue with extended characters.  I think I've got a fix, so will post back when I've fixed that.
    (For latin characters the existing version works fine).

     
  • Don HO

    Don HO - 2012-02-02

    Dave,

    Just quickly checked the binary, very interesting. (Bravo)
    I'll check the 2 patches this weekend and give you the feedback then.

    Don

     
  • Loreia2

    Loreia2 - 2012-02-02

    I often wondered why no one merged PN's support for regex into NPP.
    I even wanted to investigate the issue myself after I am done with the UDL thing.
    Anyway, I just wanted to say "thank you one million times" for this feature. I can't think of a single thing I'd like to see more added to NPP.

    Thank you and best regards
    Loreia

     
  • Jan Schreiber

    Jan Schreiber - 2012-02-03

    For me the patched dll seems to have an unwelcome side-effect: My user-defined expressions for the function list plugin stopped working. This might be due to a badly designed regular expression on my part though. Other than that, I confirm that the patched dll works well and exactly as advertised so far.

     
  • DV

    DV - 2012-02-03

    Did you propose this patch to Neil Hodgson (the Scintilla developer)?

     
  • Dave Brotherstone

    @janschreiber - could you perhaps send me your regular expressions - maybe there's something that doesn't work properly that is fixable.  The patch still doesn't work for extended characters; I've got a fix but it's not implemented yet (I'm away at the moment).

    @dv__ - no, but I'm 99% sure it wouldn't be accepted, as Neil is always against any dependencies for Scintilla.  When I've fixed the extended characters, I might do some work to put it in a "contrib" directory for scintilla, to allow other downstream users to simply use boost as an alternative engine (but as an "opt-in"), and then propose it.

    Cheers,
    Dave.

     
  • Julius

    Julius - 2012-02-05

    Dave, thank you very much for your wonderful work!

    Finally I can do multiline regexp!
    (As far as I know your lexer is the only way to do it)

    Npp is REALLY in need of a modern regexp engine.

    How can I help you to make this known to the devs?

     
  • Dave Brotherstone

    Patch has been updated to fix the UTF8 issue.  Updated binary from http://www.brotherstone.co.uk/npp/n++re.7z (includes a test version of N++ with the relevant changes to support multiline).

    Any feedback welcome.

     
  • Julius

    Julius - 2012-02-09

    On XP32 sp3 can't start the exe: "The procedure entry point ChangeWindowMessageFilter could not be located in the dll user32.dll"

     
  • Dave Brotherstone

    Yeah, I'd kinda expected that.  That's a new feature in the next version of N++, and I had to set it to Vista and up to get it to build.  I presume Don will sort this out before the next release.  I'll build a new one without this that will work with XP.

     
  • Josh Harris

    Josh Harris - 2012-02-12

    Nice…  I just wanted to say thanks for this.  I've been using my own custom PCRE compiled scintilla and Notepad++ mod for years but I never worked out all of the bugs; there were always a few problems with malformed regex syntax that I couldn't get rid of and which would cause Notepad++ to crash (and that's why I never released it).

    So far, this is working great and I love the ability to search multiple lines.  Your patches worked well: I combined them with my custom Notepad++ build and had everything compiled and running in just a few minutes.

     
  • Jan

    Jan - 2012-02-26

    Wow, thanks a lot. I've always been too lazy to look into building np++ myself, so I never checked if there was a way to patch in proper regular expressions instead of the severely limited stuff it contains.
    Luckily I stumbled across this topic, seeing how it's pretty annoying to find stuff in the tracker/forum.

    Thanks, thanks, thanks!

     
  • François-R Boyer

    Good to have more regex features.  But am I the only one for whom it crashes when highlighting some XML tags?  I tested in a freshly decompressed version of 5.9.8.bin unicode (in case it was a problem with another plugin): XML tags are correctly highlighted when using the standard SciLexer, but with the modified SciLexer the highlighting is incorrect and it crashes.

     
  • Dave Brotherstone

    That's been fixed in the latest patch/binary.  You need a fixed N++ too (xml tag highlighting is also considerably quicker on larger documents) - it's all in the link from comment 9. 

    Note that this has now been integrated in the official SVN source, so building from an SVN checkout will also get you the same thing.

    http://www.brotherstone.co.uk/npp/n++re.7z

    Dave.

     
  • Jan

    Jan - 2012-03-05

    I found that it shows some unexpected behaviour if you use Shift-F3, Find previous…

    Forward searching for "^\d" in "1234 5678" properly finds the 1, but going backward there are two matches, both different: 4 and 2.