#142 switch from java.util.regex to ... another one ;-)

closed-works-for-me
nobody
None
5
2007-07-08
2007-05-18
Skeeve
No

java.util.regex has severe problems with regular expressions that use alternation (the "|"). Please refer to Sun bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952

A proposed solution is to switch from java.util.regex to http://javaregex.com/

Discussion

  • Marcelo Vanzin
    2007-05-19

    Logged In: YES
    user_id=75113
    Originator: NO

    The problem is: we switched to java.util.regex exactly so we would avoid depending on an external library. If we wanted to stick with another library, we probably would have kept gnu.regexp around.

    So I don't think switching to another library will be done. If Java's regex system was pluggable (like the XML parsers are), then it would be simple, but it's not.

     
  • Skeeve
    2007-05-20

    Logged In: YES
    user_id=864970
    Originator: YES

    Problems with sticking to Sun's library are:
    1) It's not reliable
    2) Users might search but not find anything because of a stack overflow they don't see
    3) It might be complicated or impossible to find another regex that matches the same strings

    If there's really no one agreeing with me that an unreliable library shouldn't be used, there should at least be a hint that the regex the user entered is too complex to be handled. I guess there is some try ... catch going on which suppresses the stack overflow exception. The stack overflow should be caught and signaled here so that the user at least has a chance to change his regex or even his editor...
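
    A minimal sketch of the suggestion above, assuming a hypothetical helper (SafeRegexSearch, Result and TOO_COMPLEX are illustrative names, not jEdit code): catch the StackOverflowError that java.util.regex can throw on long input and report it as its own outcome, so it no longer looks like "no match".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jEdit.
final class SafeRegexSearch {
    /** TOO_COMPLEX means the regex engine ran out of stack on this particular input. */
    enum Result { FOUND, NOT_FOUND, TOO_COMPLEX }

    static Result search(String regex, CharSequence text) {
        Pattern pattern = Pattern.compile(regex);
        try {
            Matcher matcher = pattern.matcher(text);
            return matcher.find() ? Result.FOUND : Result.NOT_FOUND;
        } catch (StackOverflowError e) {
            // The matcher recursed too deeply. Report that explicitly so the
            // caller can tell the user to try a different regex.
            return Result.TOO_COMPLEX;
        }
    }

    /** Builds an XML comment with n filler characters between the delimiters. */
    static String bigComment(int n) {
        StringBuilder sb = new StringBuilder(n + 7);
        sb.append("<!--");
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.append("-->").toString();
    }

    public static void main(String[] args) {
        String regex = "<!--([^-]|-[^-])*-->";
        System.out.println(search(regex, "<a><!-- small --></a>"));
        System.out.println(search(regex, "no comment here"));
        // The outcome of this call depends on the JVM stack size.
        System.out.println(search(regex, bigComment(1_000_000)));
    }
}
```

    Whether the last call overflows depends on the JVM's stack size and on the length of the text, which is why the same regex can work in one place and fail in another.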

     
  • Skeeve
    2007-05-24

    Logged In: YES
    user_id=864970
    Originator: YES

    Just a comment: I don't insist on Stevesoft, but I request not to use Sun's library unless you can easily come up with a working alternative for the regex "<!--([^-]|-[^-])*-->". It means: search for XML comments. They may contain anything but two adjacent "-".

     
  • Björn Kautler
    2007-05-24

    Logged In: YES
    user_id=918212
    Originator: NO

    I don't think we will change the regex engine again. And certainly not back to gnu.regexp. gnu.regexp had its own problems and quirks and it was unmaintained and an external dependency. For me regexes with alternation work fine. Even your regex for searching XML comments works for me as expected.

     
  • Björn Kautler
    2007-05-24

    Logged In: YES
    user_id=918212
    Originator: NO

    And btw. a replacement for your XML comments searching regexp should be:

    <!--?(-[^-]+)*-->

    This should achieve exactly the same results as your regexp, I think. But if it has errors, tell me; I'm no regex expert.

     
  • Skeeve
    2007-05-24

    Logged In: YES
    user_id=864970
    Originator: YES

    Not exactly the same result, as it also matches <!--->. I already have a working alternative. But it's not just this one regex that shows the problem. And I'm also sure most of them can be translated to working alternatives. But I'm also sure that
    a) It's not always easy to find an alternative
    and
    b) it's impossible to find one if you don't know that your regex made the regex util fail!

    Case b is what worries me most! So alerting the user in such a case with "Your regex is too complex. Please use an alternative one" might be a workaround.

    And to give you the working alternative (tbh... I didn't test it with "gigantic" comments ;-)): it's "<!--[^-]*(-[^-]+)*-->". But I had to ask the perlmonks for assistance.
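
    To make the difference between the three patterns in this thread concrete, here is a small sketch that runs each of them against a few sample strings. The class and constant names are made up for illustration; only the regexes themselves come from the comments above.

```java
import java.util.regex.Pattern;

// Illustrative comparison; names are hypothetical, the regexes are quoted from the thread.
final class CommentRegexComparison {
    static final Pattern ORIGINAL  = Pattern.compile("<!--([^-]|-[^-])*-->");
    static final Pattern SUGGESTED = Pattern.compile("<!--?(-[^-]+)*-->");
    static final Pattern REWRITTEN = Pattern.compile("<!--[^-]*(-[^-]+)*-->");

    /** True if the pattern matches the whole string, not just a part of it. */
    static boolean whole(Pattern pattern, String text) {
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<!-- hi -->",    // ordinary comment
            "<!---->",        // empty comment
            "<!-- a-b -->",   // single dash inside
            "<!--->",         // malformed: closing delimiter overlaps the opening one
            "<!-- a--b -->"   // malformed: two adjacent dashes inside
        };
        for (String s : samples) {
            System.out.printf("%-14s original=%b suggested=%b rewritten=%b%n",
                    s, whole(ORIGINAL, s), whole(SUGGESTED, s), whole(REWRITTEN, s));
        }
    }
}
```

    On these samples the original and the rewritten pattern agree everywhere, while the suggested one additionally accepts <!--->. All three reject the comment with two adjacent dashes.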

     
  • Skeeve
    2007-05-24

    • summary: switch from java.util.regex to the Stevesoft version --> switch from java.util.regex to ... another one ;-)
     
  • Alan Ezust
    2007-05-24

    Logged In: YES
    user_id=935841
    Originator: NO

    If it is for search/replace, you can try using the xsearch plugin, which is still using the gnu.regexp library and behaves as jEdit used to.

     
  • Marcelo Vanzin
    2007-05-24

    Logged In: YES
    user_id=75113
    Originator: NO

    Skeeve, there's no way to detect whether a regex is too complex. Even in your example, the "complex" regex works in some places but not in others. How are we supposed to detect something out of that?

    What if we switched to another regex library and that library caused some other user to hit a similar problem with one of the regexps he's trying to use?

    But I think the main problem is that you're trying to apply regexps to something they're not suited for. In the specific case of the XML comment, why do you want the regexp to detect all the corner cases? Use a simple regexp for searching (what's wrong with "<!--.+?-->"? It will find invalid comments, but read on.). Install the XML plugin, which provides a proper XML parser, and it will warn you if your XML comments have errors. Simple.
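
    A short sketch of what the simple pattern does and does not find, using plain java.util.regex. The class name and the sample strings are made up for illustration. Note that "." does not match line breaks unless Pattern.DOTALL is set, so multi-line comments need that flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the regex is the one quoted above, everything else is hypothetical.
final class SimpleCommentSearch {
    static final Pattern SINGLE_LINE = Pattern.compile("<!--.+?-->");
    static final Pattern ANY_LINES   = Pattern.compile("<!--.+?-->", Pattern.DOTALL);

    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    /** Returns the first match, or null if there is none. */
    static String firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The reluctant ".+?" stops at the first "-->", so only the first comment is returned.
        System.out.println(firstMatch(SINGLE_LINE, "<!-- a --> x <!-- b -->"));
        // Invalid comments (two adjacent dashes) are found as well.
        System.out.println(found(SINGLE_LINE, "<!-- a--b -->"));
        // A comment spanning two lines needs DOTALL.
        System.out.println(found(SINGLE_LINE, "<!-- line1\nline2 -->"));
        System.out.println(found(ANY_LINES, "<!-- line1\nline2 -->"));
        // ".+?" needs at least one character, so the empty comment is missed.
        System.out.println(found(SINGLE_LINE, "<!---->"));
    }
}
```

    If empty comments matter, ".*?" in place of ".+?" would include them; that is a variation on the pattern above, not something proposed in this thread.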

     
  • Skeeve
    2007-05-24

    Logged In: YES
    user_id=864970
    Originator: YES

    > Skeeve, there's no way to detect whether a regex is too complex.
    I beg to differ: It's too complex if you get a stack overflow. Currently this exception is simply ignored (when hypersearching) and you don't know that it was there. You simply think that your pattern didn't hit.

    > What if
    That's not a valid question. There are regex implementations that have worked for years now without the problem Sun introduced. We are currently facing not a "What if" but a "What now"! That's more important than any "What if".

    > But I think the main problem is that you're trying to apply regexps to
    > something they're not suited for.
    And I think that you are wrong here. See the examples on Sun's website. They are not XML related at all. I just used my example as a pretty simple one. I don't want to justify why I don't use any XML parser when editing an XML file. You see: I'm just a jEdit user who stumbled across this problem. Do you want to discuss with every user what's the best practice in his particular case for searching and finding?

    > In the specific case of the XML comment,
    > why do you want the regexp to detect all the corner cases?
    It's not the question WHY I want to do it, it's the question why you/jEdit developers insist on using a library with buggy regex support. Or better: Why, if you want to use it, don't you alert the user when he is using an "improper" regex? See above: Signal the stack overflow with a hint about what the user can do about it.

    Regarding XML Parsers: The XML plugin is installed and it's more than fine when your XML file is complete. It's useless as long as you are still editing. But that's another discussion.

     
  • Marcelo Vanzin
    2007-05-24

    Logged In: YES
    user_id=75113
    Originator: NO

    > I beg to differ: It's too complex if you get a stack overflow. Currently
    > this exception is simply ignored (when hypersearching) and you don't know
    > that it was there. You simply think that your pattern didn't hit.

    My point is that you don't know if you're going to get a stack overflow before running the regexp.

     
  • Skeeve
    2007-05-24

    Logged In: YES
    user_id=864970
    Originator: YES

    > My point is that you don't know if you're going to get a stack overflow before running the regexp.
    Where is the problem? If you don't get the overflow it's not too complex. If you get it, the user needs to find another way to achieve what he wants. Currently he has no chance to do so because he simply does not know about the stack overflow.

    To be specific: It's not a too complex regex but a regex vs. data issue.

    I have the slight feeling that my English is not good enough to explain to you what I want to see.
    a) Either use a library that doesn't have stack limitations
    -> You don't want to, okay, accepted. So this leads us to:
    b) If there are stack overflows tell the user so but not by just showing a stacktrace but by telling him to use another regex or other means to achieve his task.

     
  • Marcelo Vanzin
    2007-05-24

    Logged In: YES
    user_id=75113
    Originator: NO

    Skeeve, that would take care of the StackOverflow, but as you've seen by yourself, that's not the only problem with Sun's library; sometimes it goes into "infinite loops", and that's not something trivial to detect (probably doable in a very hackish and ugly way, but I don't want to go down that path). It would be trivial to catch the stack overflow, though.
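
    For reference, one commonly used workaround for runaway matches looks roughly like the sketch below. It is arguably the kind of hack meant above: java.util.regex reads its input through CharSequence.charAt, so a wrapper can abort the match once a time budget is used up. DeadlineCharSequence and RegexTimeoutException are hypothetical names, and nothing here is taken from jEdit.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper for illustration; it aborts a match that runs past a deadline.
final class DeadlineCharSequence implements CharSequence {
    /** Thrown from charAt once the time budget is exhausted. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    private final CharSequence inner;
    private final long deadlineNanos;

    private DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    static DeadlineCharSequence withBudget(CharSequence text, long budgetMillis) {
        return new DeadlineCharSequence(text, System.nanoTime() + budgetMillis * 1_000_000L);
    }

    @Override
    public char charAt(int index) {
        // Checking the clock on every character is simple but costly;
        // a real implementation would probably check less often.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new RegexTimeoutException("regex search exceeded its time budget");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Sub-sequences share the same deadline.
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }

    /** Runs find() on the text; throws RegexTimeoutException if the budget runs out. */
    static boolean findWithin(String regex, CharSequence text, long budgetMillis) {
        return Pattern.compile(regex).matcher(withBudget(text, budgetMillis)).find();
    }

    public static void main(String[] args) {
        System.out.println(findWithin("a+b", "xxaaab", 5_000));
    }
}
```

    This only bounds the running time; it cannot tell a genuinely slow search from one that would never finish, and it does nothing about stack overflows.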

    I like Alan's idea of keeping Xsearch as a search replacement. That way, the core can remain "lean" (meaning as few external dependencies as possible), and Xsearch can choose its own regex library and have a release schedule that is not tied to the jEdit core.

     
  • Skeeve
    2007-05-24

    Logged In: YES
    user_id=864970
    Originator: YES

    > but as you've seen by yourself, [...] sometimes it goes into "infinite loops"
    *blush* No! I didn't...

    And yes: I know that's impossible to catch in a reliable fashion. One reason more to not use java.util.regex ;-) No! Don't let us start it all over again.

    I think I have to look at Xsearch.

     
  • Alan Ezust
    2007-07-08

    • status: open --> closed-works-for-me
     
  • Alan Ezust
    2007-07-08

    Logged In: YES
    user_id=935841
    Originator: NO

    closed, "works for me" with XSearch plugin.