Re: [Pyparsing] C++ Comments and a Backslash at the End of the Line.
From: Paul M. <pa...@al...> - 2006-10-05 13:31:21
> > Does this look better?
> >
> > cppStyleComment = Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)").setName("C++ style comment")
> >
> > It seems to test out okay. I'll put it in the next update.
>
> Yes, thanks.
>
> Would you accept suggestions for equivalent regexps that
> execute faster?
> I'm thinking of parsing many C++ files frequently and could
> use the speed.
>

Of course!

Here is a suggestion for this regexp in particular: back before there was a
Regex class in pyparsing, I built up these comment definitions from normal
pyparsing constructs, and it looked like this:

    cStyleComment = Combine( Literal("/*") +
                             ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
                             Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
    restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).leaveWhitespace()
    dblSlashComment = "//" + restOfLine
    cppStyleComment = ( dblSlashComment | cStyleComment )

Note that both paths of cppStyleComment's alternation begin with a '/'. I
was able to speed this up quite a bit by adding an assertive lookahead with
pyparsing's FollowedBy:

    cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

Essentially saying, "if the next character is not a '/', don't bother testing
the rest of the expression." How does one do such a lookahead in re syntax?
I bet that would speed up the re matching.

(I just tested a version that refactors the leading '/' from both
alternatives to the beginning of the re, with no appreciable speed
improvement. My test suite is a large number of Verilog files, which is a
fairly complex grammar that uses C++-style comments. I suspect that the
reason that adding FollowedBy("/") made such a difference before was because
the other comment processing was so slow.)

In other expressions, I have done some testing for performance.
Here's what I've found:
- an re matching for "[A-Z]" is not appreciably faster than
  "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
- there is little/no advantage in using re's that compile faster - runtime
  matching far outweighs setup/compile time performance

But I will be the first to admit that I am no re expert, and I *WELCOME* any
suggestions you might have on tuning up the re's in pyparsing!

-- Paul
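
On the question of how to write such a lookahead in re syntax: Python's re module spells a positive lookahead as `(?=...)`, so the rough counterpart of FollowedBy("/") would be `(?=/)` at the front of the pattern. The sketch below is only an illustration, not the pyparsing implementation. The names PLAIN, GUARDED, find_comments and sample are made up for this example. It checks that the guarded pattern matches the same text as the regex quoted above, and makes no claim about speed.

```python
import re

# Regex quoted in this thread: a /* ... */ block comment, or a // comment
# that continues onto the next line when the line ends in a backslash.
PLAIN = r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)"

# Same pattern behind a positive lookahead. (?=/) consumes no input; it only
# asserts that the next character is '/', much like FollowedBy("/").
GUARDED = r"(?=/)(?:" + PLAIN + r")"


def find_comments(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Hypothetical input: one // comment continued with a backslash, one block comment.
sample = (
    "int x = 1; // set x \\\n"
    "   still a comment\n"
    "int y; /* block\n"
    " comment */ int z;"
)

plain_matches = find_comments(PLAIN, sample)
guarded_matches = find_comments(GUARDED, sample)
```

Both lists should hold the same two comments. Whether the lookahead helps at all would have to be measured on real input, for instance with the standard timeit module, since hoisting the leading '/' showed no appreciable gain in the test described above.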