#2 Regular expression token class

Status: closed
Owner: nobody
Priority: 5
Updated: 2006-01-22
Created: 2005-10-26
Private: No

I've written a class called "Regex", which uses the
internal Python "re" module for my own use, and I
believe that it will be useful as a core feature in
pyparsing.

It can offer large performance increases versus trying
to construct regular-expression analogs using the
current classes (like Word, And, ZeroOrMore etc.),
although it can quite happily live alongside the
existing classes.

This is currently a somewhat immature class, and I will
probably be fixing bugs and adding features as needed
(retaining source compatibility as possible).

Discussion

  • John Beisley

    John Beisley - 2005-10-26

    A patch to pyparsing.py to add the new class

     
  • John Beisley

    John Beisley - 2005-10-26

    Logged In: YES
    user_id=1368227

    And here is the code of the class if you prefer it in
    non-diff form:

    # You will need to "import re" for this to work

    class Regex(Token):
        """Token for matching strings that match a given regular
        expression.
        Defined with string specifying the regular expression
        in a form recognized by the inbuilt Python re module.
        """
        # Yes - this class is based on code from the Word class
        def __init__( self, pattern, flags=0):
            """The parameters pattern and flags are passed to
            the re.compile() function as-is. See the Python re module
            for an explanation of the acceptable patterns and flags."""
            super(Regex,self).__init__()
            self.pattern = pattern
            self.flags = flags

            self.re = re.compile(self.pattern, self.flags)

            self.name = _ustr(self)
            self.errmsg = "Expected " + self.name
            self.myException.msg = self.errmsg
            self.mayIndexError = False

        def parseImpl( self, instring, loc, doActions=True ):
            # Create a buffer object, starting at the current
            # location within the input string, for use in the regular
            # expression pattern matcher
            buf = buffer(instring, loc)
            result = self.re.match(buf)
            if not result:
                exc = self.myException
                exc.loc = loc
                exc.pstr = instring
                raise exc

            # result.end() is relative to the start of the buffer
            loc += result.end()

            return loc, result.group()

        def __str__( self ):
            try:
                return super(Regex,self).__str__()
            except:
                pass

            if self.strRepr is None:
                self.strRepr = "Re:(%s)" % self.pattern

            return self.strRepr
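
Editorial note for readers on Python 3: the `buffer` builtin used in `parseImpl` no longer exists there. The following is only a minimal sketch of the same match-at-offset idea, written against the standard `re` module alone; it passes the start position to the compiled pattern's `match()` method. The function name `match_at` and the use of `ValueError` are illustrative assumptions and are not part of the patch.

```python
import re


def match_at(pattern, instring, loc):
    """Try to match a compiled pattern at position loc of instring.

    Returns (new_loc, matched_text). Hypothetical helper for illustration;
    it mirrors the return shape of parseImpl above but is not pyparsing code.
    """
    # Pattern.match(string, pos) starts matching at index pos without
    # copying or slicing the input string.
    result = pattern.match(instring, loc)
    if result is None:
        raise ValueError("Expected Re:(%s) at position %d" % (pattern.pattern, loc))
    # With a pos argument, result.end() is already an absolute index into
    # instring, so it is returned as-is rather than added to loc.
    return result.end(), result.group()


number = re.compile(r"\d+")
new_loc, token = match_at(number, "abc 123 def", 4)
```

One difference from the buffer approach is worth keeping in mind: when a start position is passed this way, `^` still refers to the real beginning of the string (or a line start under `re.MULTILINE`), not to the position where matching begins.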

     
  • John Beisley

    John Beisley - 2005-11-06

    Updated patch

     
  • Paul McGuire

    Paul McGuire - 2005-12-07

    Logged In: YES
    user_id=893320

    Very cool! I've considered this option in the past, but
    was stuck on how to have the regexp start not at the
    beginning of the string, but at the current parsing loc
    instead. Nice use of buffer instead of string slice for
    this purpose. You've really taken this along pretty well,
    I'll be happy to include this in the next version of
    pyparsing.

    Some questions:
    1. You currently return the match.group() from parseImpl.
    Have you had any problems with conflicts or surprises from
    how the corresponding ParseResults gets built? What
    happens if a regexp has named subfields - do they "play
    nice" with results names as defined for a Regex
    ParseElement?
    2. It's been over a month since you submitted this note
    (sorry for the delayed response - I thought I was
    subscribed to this section, I'll double check with SF).
    How has this code held up for you in that time?
    3. I am strictly an RE duffer. Could you send me some
    test cases that I can include in my regression tests?
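
Editorial note on question 1: the standard `re` module exposes named subfields through the match object's `groupdict()` method. The sketch below only shows what the `re` module itself returns for a pattern with named groups; how those values should map onto pyparsing results names is left open in this thread, so nothing here should be read as the library's behaviour. The pattern and input string are made-up examples.

```python
import re

# Hypothetical pattern with two named groups, for illustration only.
pattern = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")

match = pattern.match("width=42;")

# group() with no arguments is the whole matched text, which is what the
# posted parseImpl returns.
whole = match.group()

# groupdict() maps each group name to the text it captured.
named = match.groupdict()
```

Here `whole` holds the full matched text, while `named` carries the per-group values that a results-name mapping could draw on.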

    Lastly, I can definitely appreciate the potential for
    improved speed of these expressions. I've also considered
    how at some time I might compile a pyparsing grammar
    completely to RE's for evaluating an input string. But
    this is definitely going to be an "advanced" usage
    category feature of pyparsing - if you don't know what you
    are doing with a given regexp, you can easily parse much
    more than you had intended for a given expression.
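
Editorial note: one common way a regexp can consume more than intended is a greedy quantifier. The sketch below is an assumed illustration of that pitfall using only the standard `re` module; the sample text and patterns are made up and do not come from the thread.

```python
import re

text = '"first" and "second"'

# Greedy: ".*" runs to the end of the input and backtracks only as far as
# the last double quote, so the match spans both quoted strings.
greedy = re.match(r'".*"', text).group()

# Tighter: "[^"]*" cannot cross a double quote, so the match stops at the
# end of the first quoted string.
tight = re.match(r'"[^"]*"', text).group()
```

A character class that excludes the delimiter keeps the token from running past its own closing quote.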

    (I will attribute your contribution to "greatred" as I
    don't have your actual name - send me an e-mail if you'd
    like me to be able to give you a more personal credit -
    I'll just post your name, not the e-mail!)

    Nice work!
    -- Paul

     
  • John Beisley

    John Beisley - 2005-12-07

    Logged In: YES
    user_id=1368227

    1. I've not actually looked into result names as such, in
    that I don't yet understand what they are. If you mean named
    groups in the regex, that's certainly an extra feature that
    I can see being very useful, and something that I would like
    to see added to the class (although as I had no immediate
    need for it at the time it just didn't happen ;). I can't
    immediately see how I might serve the intentions of pulling
    named groups out from a Regex object as it fits into
    pyparsing (or making it "play nicely" in a sensible way) -
    my knowledge of the internals of pyparsing is somewhat
    limited. (The code for Regex is based on the Word class,
    which I felt was the nearest matching code already in
    pyparsing.)

    2. I've not had any surprises from how the class works so
    far. It was a class I developed as I needed to make a regex
    matcher in order to match some BNF-like grammar more
    closely, and thus far it has worked nicely - at least in the
    fashion that I have utilised it (which is parsing a CSS
    stylesheet based loosely upon the BNF-like grammar from the
    CSS spec - which can have some quite complex regular
    expressions for matching strings, URLs and so on).

    3. I can certainly submit some Regex constructs that I'm
    using in my own code if that is useful. I'll have a look at
    this presently.

    My real name is John Beisley, I've updated my SF profile so
    it should show now.

     
  • John Beisley

    John Beisley - 2005-12-07

    Logged In: YES
    user_id=1368227

    Okay, here's a little bit of quickly knocked up test code.
    It doesn't test the Regex in the larger context of a full
    grammar, but rather tests it on a few small things. I've
    thrown in a test which doesn't quite work as one might
    desire for a named group, but then, nothing has been stated
    as to which group should be extracted (there is nothing in
    the code to specifically pull out the named groups, as
    observed).

     
  • John Beisley

    John Beisley - 2005-12-07

    Logged In: YES
    user_id=1368227

    Woops, I've actually attached it this time :)

     
  • John Beisley

    John Beisley - 2005-12-07

    Test code

     
  • Paul McGuire

    Paul McGuire - 2005-12-08

    Logged In: YES
    user_id=893320

    Thanks John!

    I've just checked your changes into my SVN repository,
    plus the unit tests.

    I'm going to give the named groups a little more thought,
    and see if I can get them to look more native to the
    pyparsing ParseResults.

    -- Paul

     
  • Paul McGuire

    Paul McGuire - 2005-12-22

    Logged In: YES
    user_id=893320

    John -

    Please check out the latest 1.4 beta1 release, for my
    inclusion of your Regex work.

    Thanks again!
    -- Paul

     
  • Nobody/Anonymous

    Logged In: NO

    Good stuff, I'll be taking a look at that when I get back to
    work after Christmas!

     
  • Paul McGuire

    Paul McGuire - 2006-01-22
    • status: open --> closed
     
  • Paul McGuire

    Paul McGuire - 2006-01-22

    Logged In: YES
    user_id=893320

    Released in version 1.4

     