Re: [Pyparsing] C++ Comments and a Backslash at the End of the Line.
From: <ra...@in...> - 2006-10-07 16:26:16
Hi Paul,

> > Would you accept suggestions for equivalent regexps that execute
> > faster?  I'm thinking of parsing many C++ files frequently and could
> > use the speed.
>
> Of course!
>
> Here is a suggestion for this regexp in particular: back before there
> was a Regex class in pyparsing, I built up these comment definitions
> from normal pyparsing constructs, and it looked like this:
>
> cStyleComment = Combine( Literal("/*") +
>                          ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
>                          Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
> restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).leaveWhitespace()
> dblSlashComment = "//" + restOfLine
> cppStyleComment = ( dblSlashComment | cStyleComment )

OK.

> Note that both paths of cppStyleComment's alternation begin with a
> '/'.  I was able to speed this up quite a bit by adding an assertive
> lookahead with pyparsing's FollowedBy:
>
> cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

Makes sense.

> How does one do such a lookahead in re syntax?  I bet that would speed
> up the re matching.  (I just tested a version that refactors the
> leading '/' from both alternatives to the beginning of the re, with no
> appreciable speed improvement.

Yes, that's how you do it.

> My test suite is a large number of Verilog files, which is a fairly
> complex grammar that uses C++-style comments.  I suspect that the
> reason that adding FollowedBy("/") made such a difference before was
> because the other comment processing was so slow.)

I've found (below) it does make a difference.

> In other expressions, I have done some testing for performance.
> Here's what I've found:
> - an re matching for "[A-Z]" is not appreciably faster than
>   "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"

The RE undergoes optimisation in byte-code form after parsing, so these
become equivalent.  At least, that's what I expect from looking at
sre_compile.py:_optimize_charset(), but perhaps not.

    >>> import re
    >>> r = re.compile('[A-Z]', flags = 128)
    in
      range (65, 90)
    >>> r = re.compile('[ABCDEFGHIJKLMNOPQRSTUVWXYZ]', flags = 128)
    in
      literal 65
      literal 66
      literal 67
      literal 68
      literal 69
      literal 70
      literal 71
      literal 72
      literal 73
      literal 74
      literal 75
      literal 76
      literal 77
      literal 78
      literal 79
      literal 80
      literal 81
      literal 82
      literal 83
      literal 84
      literal 85
      literal 86
      literal 87
      literal 88
      literal 89
      literal 90
    >>>

I haven't looked into this further.

> - there is little/no advantage in using re's that compile faster -
>   runtime matching far outweighs setup/compile time performance

Oh, absolutely.

> But I will be the first to admit that I am no re expert, and I
> *WELCOME* any suggestions you might have on tuning up the re's in
> pyparsing!

_Mastering Regular Expressions_ by Friedl is a book you'd enjoy.  The
latest edition (I read the 1st Ed.) probably has up-to-date Python
coverage.  Anyway, based on a hazy recollection of that, I'd suggest

    r'''
    (?x)
    /
    (?:
        \* (?:[^*]*\*+)+? /
    |
        /[^\n]* (?:\n[^\n]*)*? (?:(?<!\\)|\Z)
    )
    '''

You could pass re.VERBOSE in as a flag instead of (?x); that was just
easier in my testing framework.

It's beneficial to move the common `/' prefix out of both alternatives:
there's overhead in entering a group, and this saves the bother.

Parentheses that capture have an overhead.  If you're only after
group(0) then don't capture, i.e. turn `a(b|c)*d' into `a(?:b|c)*d'.
The Python Regular Expression HOWTO says otherwise, but in theory I
think it's wrong, and it seems so in practice (see test script below).
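To show the suggested regexp in action, here's a small self-contained
snippet; the test strings are invented for illustration, but they
exercise both comment styles, including a backslash-continued `//'
comment:

    import re

    # The suggested pattern: common `/' factored out, non-capturing
    # groups, a tight loop for the C-style body, and a look-behind so a
    # backslash-continued // comment swallows the following line too.
    cpp_comment = re.compile(r'''
        /
        (?:
            \* (?:[^*]*\*+)+? /                      # /* ... */
        |
            /[^\n]* (?:\n[^\n]*)*? (?:(?<!\\)|\Z)    # // ... and continuations
        )
    ''', re.VERBOSE)

    tests = [
        '/* abc */ rest',
        '/**********/',
        '// simple\nnext line',
        '// continued \\\nonto this line\nbut not this one',
    ]
    for s in tests:
        m = cpp_comment.match(s)
        if m:
            print('%r matched %r' % (s, m.group(0)))
        else:
            print('%r did not match' % (s,))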
After `/*' we want to match as few characters as possible before the
`*/'.  Doing r'/\*[\s\S]*?\*/' could be improved on for a few reasons.
It's a non-greedy match, which is correct, but means that matching
against `/*abc*/' does

    Match /?          Yes, consume /.
    Match *?          Yes, consume *.
    Match *?          No.
    Match space?      No.
    Match not-space?  Yes, consume a.
    Match *?          No.
    Match space?      No.
    Match not-space?  Yes, consume b.
    Match *?          No.
    Match space?      No.
    Match not-space?  Yes, consume c.
    Match *?          Yes, consume *.
    Match /?          Yes, consume /.

It `stutters' after every character, having to break out of one loop to
see if we've finished yet.  The `space' and `not-space' come from
`[\s\S]' creating a character class.

    >>> r = re.compile('[\s\S]', flags = 128)
    in
      category category_space
      category category_not_space

The common `/***...***/' comment line takes

    Match /?          Yes, consume /.
    Match *?          Yes, consume *.
    Match *?          Yes, consume *.
    Match /?          No, back-track one to *.
    Match space?      No.
    Match not-space?  Yes, consume *.
    Match *?          Yes, consume *.
    Match /?          No, back-track one to *.
    Match space?      No.
    Match not-space?  Yes, consume *.
    ...these four are repeated for each `middle' *...

So, I've gone for

    /\*([^*]*\*+)+?/

After `/*' we gobble as many non-* as possible in a small tight loop
inside the engine.  We then know we're at a `*' so we gobble one or
more of those quickly too.  Since we're non-greedy, we test to see if
`/' follows the string of *s.  If it does, we're done.  If not, the
character we stopped on is consumed as the first of the non-*s in the
next loop around the group.  Against `/*abc*/':

    Match /?      Yes, consume /.
    Match *?      Yes, consume *.
    Match not-*?  Yes, consume a.
    Match not-*?  Yes, consume b.
    Match not-*?  Yes, consume c.
    Match *?      Yes, consume *.
    Match *?      No.
    Match /?      Yes, consume /.

Against `/***...***/':

    Match /?      Yes, consume /.
    Match *?      Yes, consume *.
    Match not-*?  No.
    Match *?      Yes, consume *.
    ...this one line is repeated for each `middle' *...
    ...and the end one...
    Match *?      No.
    Match /?      Yes, consume /.

The C++ comment is a similar technique.

    1    //
    2    [^\n]*
    3    (\n[^\n]*)*?
    4    ((?<!\\)|\Z)

After `//', gobble up to the end of the line.  Because (3) is
zero-or-more non-greedy, we initially skip it and test (4).  The
`(?<!\\)' is a negative look-behind assertion checking that the last
character before this newline we've arrived at wasn't a backslash.  If
that's the case, the unescaped newline ends the comment, hurrah!  If
not, are we at the end of the string?  If so, `\Z' is true, so we've
again reached the end of the comment.  I put this in for those pesky
Emacs users who don't terminate the last line of the text file with \n
but may have `//\\' as the last four characters of the file.  If we've
reached a newline yet it isn't the end of the comment, we attempt (3)
for the first time.  We consume the \n and then gobble, quickly, all
non-\n before attempting (4) again.

Attached are my test script showing the various regexps I tried, and an
anonymised C++ source file that was the impetus for all this in the
first place.  Usage is `./commentspeed anon.c++ 50'.  It tests all the
regexps against some built-in expected results, and then times all of
them matching against anon.c++ 50 times, checking they all agree on
what makes a comment.  You can comment some of the slow ones out in
order to run the quicker ones more times for more accurate results.
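If you want a feel for the difference without the attached script,
here's a rough stand-alone sketch of the same sort of timing run (the
sample text and the names are invented here, and it only compares the
C-style halves of two variants):

    import re
    import time

    # Banner comments show off the stutter best.
    SAMPLE = ('/' + '*' * 60 + '/\n/* ordinary comment */\nint x;\n') * 100

    PATTERNS = {
        'stutter':  r'/\*[\s\S]*?\*/',      # one character per loop iteration
        'tinyloop': r'/\*(?:[^*]*\*+)+?/',  # tight [^*]* and \*+ loops
    }
    compiled = dict((name, re.compile(p)) for name, p in PATTERNS.items())

    # Check they agree on what makes a comment before timing them.
    found = [compiled[name].findall(SAMPLE) for name in sorted(compiled)]
    assert found[0] == found[1]

    for name in sorted(compiled):
        matcher = compiled[name]
        start = time.time()
        for _ in range(50):
            matcher.findall(SAMPLE)
        print('%-10s %f' % (name, time.time() - start))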
    $ ./commentspeed anon.c++ 240
    original 54.3169460297
    backslashless 55.1692581177
    slashfirst 27.1254348755
    noncapture 14.8059911728
    dropsS 14.2693331242
    tinyloop 2.20497894287
    nocharclass 2.40148186684
    $
    $ e 54.3169460297 / 2.20497894287
    24.63377085996162740307
    $

`original' and `backslashless' shouldn't differ in time; it seems
whichever one runs first comes in slightly faster, so it must be some
Python/OS overhead.  `slashfirst' shows factoring out the leading `/'
helps.  `noncapture' shows using (?:) helps.  `tinyloop' is the one I'd
recommend.  I expected `nocharclass' to be quicker, but it seems the
engine compiles a negated character class of a single character into a
`not_literal' op-code rather than a character class, so the ones used
in tinyloop aren't as slow as I thought.

Cheers,


Ralph.