Re: [Pyparsing] C++ Comments and a Backslash at the End of the Line.


> > Does this look better?
> > 
> > cppStyleComment = Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)").setName("C++ style comment")
> > 
> > It seems to test out okay.  I'll put it in the next update.
> 
> Yes, thanks.
> 
> Would you accept suggestions for equivalent regexps that 
> execute faster?
> I'm thinking of parsing many C++ files frequently and could 
> use the speed.
> 

Of course!  

Here is a suggestion for this regexp in particular: back before there was a
Regex class in pyparsing, I built up these comment definitions from normal
pyparsing constructs, and they looked like this:

cStyleComment = Combine( Literal("/*") +
                         ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
                         Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).leaveWhitespace()
dblSlashComment = "//" + restOfLine
cppStyleComment = ( dblSlashComment | cStyleComment )

Note that both paths of cppStyleComment's alternation begin with a '/'.  I
was able to speed this up quite a bit by adding a positive lookahead with
pyparsing's FollowedBy:

cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

This essentially says, "if the next character is not a '/', don't bother
testing the rest of the expression."

How does one do such a lookahead in re syntax?  I bet that would speed up
the re matching. (I just tested a version that factors the leading '/'
out of both alternatives to the beginning of the re, with no appreciable speed
improvement.  My test suite is a large number of Verilog files; Verilog has a
fairly complex grammar and uses C++-style comments.  I suspect that adding
FollowedBy("/") made such a difference before because the other comment
processing was so slow.)
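
On the lookahead question: Python's re module writes a zero-width positive
lookahead as (?=...), so the re counterpart of FollowedBy("/") would be
(?=/) in front of the alternation.  The sketch below is illustrative only.
The names plain, guarded, and sample are made up for this example, and it
only checks that the guarded pattern matches the same text as the original.
It does not show that the lookahead is any faster, and the refactoring test
described above suggests the gain may be small.

```python
import re

# The comment pattern from the quoted message, unchanged.
plain = re.compile(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)")

# Same alternation, guarded by a zero-width positive lookahead for '/'.
# (?=/) consumes nothing; it only asserts that the next character is '/'.
guarded = re.compile(r"(?=/)(?:(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*))")

# Hypothetical input: a // comment continued with a trailing backslash,
# followed by a /* ... */ comment.
sample = "int x; // first line \\\ncontinued\nint y; /* block */"

plain_hits = [m.group(0) for m in plain.finditer(sample)]
guarded_hits = [m.group(0) for m in guarded.finditer(sample)]

if __name__ == "__main__":
    for hit in guarded_hits:
        print(repr(hit))
```

Both patterns should find the same two comments here, including the
backslash-continued line as part of the first one.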

In other expressions, I have done some testing for performance.  Here's what
I've found:
- an re matching "[A-Z]" is not appreciably faster than one matching
"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
- there is little or no advantage in using re's that compile faster - runtime
matching far outweighs setup/compile time performance
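
A rough way to repeat the first comparison is with the standard timeit
module.  This is only a sketch: the sample text and the time_findall helper
are hypothetical, and the numbers will vary with machine, Python version,
and input, so no expected timings are shown.

```python
import re
import timeit

# The two spellings of the same character class being compared.
range_class = re.compile(r"[A-Z]")
explicit_class = re.compile(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")

# Hypothetical input; substitute real source files for a meaningful test.
text = "module TOP_LEVEL ( input CLK, output Q ); // sample line\n" * 1000


def time_findall(pattern, repeats=20):
    """Return the seconds taken to scan `text` with `pattern`, `repeats` times."""
    return timeit.timeit(lambda: pattern.findall(text), number=repeats)


if __name__ == "__main__":
    print("range   :", time_findall(range_class))
    print("explicit:", time_findall(explicit_class))
```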

But I will be the first to admit that I am no re expert, and I *WELCOME* any
suggestions you might have on tuning up the re's in pyparsing!

-- Paul