Re: [Pyparsing] C++ Comments and a Backslash at the End of the Line.
From: <ra...@in...> - 2006-10-07 16:26:16
Hi Paul,

> > Would you accept suggestions for equivalent regexps that execute
> > faster?  I'm thinking of parsing many C++ files frequently and could
> > use the speed.
>
> Of course!
>
> Here is a suggestion for this regexp in particular: back before there
> was a Regex class in pyparsing, I built up these comment definitions
> from normal pyparsing constructs, and it looked like this:
>
> cStyleComment = Combine( Literal("/*") +
>                          ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
>                          Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
> restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).leaveWhitespace()
> dblSlashComment = "//" + restOfLine
> cppStyleComment = ( dblSlashComment | cStyleComment )

OK.

> Note that both paths of cppStyleComment's alternation begin with a
> '/'.  I was able to speed this up quite a bit by adding an assertive
> lookahead with pyparsing's FollowedBy:
>
> cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

Makes sense.

> How does one do such a lookahead in re syntax?  I bet that would speed
> up the re matching.  (I just tested a version that refactors the
> leading '/' from both alternatives to the beginning of the re, with no
> appreciable speed improvement.

Yes, that's how you do it.

> My test suite is a large number of Verilog files, which is a fairly
> complex grammar that uses C++-style comments.  I suspect that the
> reason that adding FollowedBy("/") made such a difference before was
> because the other comment processing was so slow.)

I've found (below) it does make a difference.

> In other expressions, I have done some testing for performance.
> Here's what I've found:
> - an re matching for "[A-Z]" is not appreciably faster than
>   "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"

The RE undergoes optimisation in byte-code form after parsing, so these
become equivalent.  At least, that's what I expect from looking at
sre_compile.py:_optimize_charset(), but perhaps not.

    >>> import re
    >>> r = re.compile('[A-Z]', flags = 128)
    in
      range (65, 90)
    >>> r = re.compile('[ABCDEFGHIJKLMNOPQRSTUVWXYZ]', flags = 128)
    in
      literal 65
      literal 66
      literal 67
      literal 68
      literal 69
      literal 70
      literal 71
      literal 72
      literal 73
      literal 74
      literal 75
      literal 76
      literal 77
      literal 78
      literal 79
      literal 80
      literal 81
      literal 82
      literal 83
      literal 84
      literal 85
      literal 86
      literal 87
      literal 88
      literal 89
      literal 90
    >>>

I haven't looked into this further.

> - there is little/no advantage in using re's that compile faster -
>   runtime matching far outweighs setup/compile time performance

Oh, absolutely.

> But I will be the first to admit that I am no re expert, and I
> *WELCOME* any suggestions you might have on tuning up the re's in
> pyparsing!

_Mastering Regular Expressions_ by Friedl is a book you'd enjoy.  The
latest edition (I read the 1st Ed.) probably has up-to-date Python
coverage.  Anyway, based on a hazy recollection of that, I'd suggest

    r'''
    (?x)
    /
    (?:
        \* (?:[^*]*\*+)+? /
    |
        /[^\n]* (?:\n[^\n]*)*? (?:(?<!\\)|\Z)
    )
    '''

You could pass re.VERBOSE in as a flag instead of (?x); that was just
easier in my testing framework.

It's beneficial to move the common `/' prefix out of both alternatives:
there's overhead in entering a group, and this saves the bother.

Parentheses that capture have an overhead.  If you're only after
group(0) then don't capture, i.e. turn `a(b|c)*d' into `a(?:b|c)*d'.
The Python Regular Expression HOWTO says otherwise, but in theory I
think it's wrong, and it seems so in practice (see test script below).
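To show the suggested regexp in action, here's a small self-contained
snippet; the test strings are invented for illustration, but they
exercise both comment styles, including a backslash-continued `//'
comment:

    import re

    # The suggested pattern: common `/' factored out, non-capturing
    # groups, a tight loop for the C-style body, and a look-behind so a
    # backslash-continued // comment swallows the following line too.
    cpp_comment = re.compile(r'''
        /
        (?:
            \* (?:[^*]*\*+)+? /                      # /* ... */
        |
            /[^\n]* (?:\n[^\n]*)*? (?:(?<!\\)|\Z)    # // ... and continuations
        )
    ''', re.VERBOSE)

    tests = [
        '/* abc */ rest',
        '/**********/',
        '// simple\nnext line',
        '// continued \\\nonto this line\nbut not this one',
    ]
    for s in tests:
        m = cpp_comment.match(s)
        if m:
            print('%r matched %r' % (s, m.group(0)))
        else:
            print('%r did not match' % (s,))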
After `/*' we want to match as few characters as possible before the
`*/'.  Doing r'/\*[\s\S]*?\*/' could be improved on for a few reasons.
It's a non-greedy match, which is correct, but means that matching
against `/*abc*/' does

    Match /?          Yes, consume /.
    Match *?          Yes, consume *.
    Match *?          No.
    Match space?      No.
    Match not-space?  Yes, consume a.
    Match *?          No.
    Match space?      No.
    Match not-space?  Yes, consume b.
    Match *?          No.
    Match space?      No.
    Match not-space?  Yes, consume c.
    Match *?          Yes, consume *.
    Match /?          Yes, consume /.

It `stutters' after every character, having to break out of one loop to
see if we've finished yet.  The `space' and `not-space' come from
`[\s\S]' creating a character class.

    >>> r = re.compile('[\s\S]', flags = 128)
    in
      category category_space
      category category_not_space

The common `/***...***/' comment line takes

    Match /?          Yes, consume /.
    Match *?          Yes, consume *.
    Match *?          Yes, consume *.
    Match /?          No, back-track one to *.
    Match space?      No.
    Match not-space?  Yes, consume *.
    Match *?          Yes, consume *.
    Match /?          No, back-track one to *.
    Match space?      No.
    Match not-space?  Yes, consume *.
    ...these four are repeated for each `middle' *...

So, I've gone for

    /\*([^*]*\*+)+?/

After `/*' we gobble as many non-* as possible in a small tight loop
inside the engine.  We then know we're at a `*' so we gobble one or
more of those quickly too.  Since we're non-greedy, we test to see if
`/' follows the string of *s.  If it does, we're done.  If not, the
character we stopped on is consumed as the first of the non-*s in the
next loop around the group.  Against `/*abc*/':

    Match /?      Yes, consume /.
    Match *?      Yes, consume *.
    Match not-*?  Yes, consume a.
    Match not-*?  Yes, consume b.
    Match not-*?  Yes, consume c.
    Match *?      Yes, consume *.
    Match *?      No.
    Match /?      Yes, consume /.

Against `/***...***/':

    Match /?      Yes, consume /.
    Match *?      Yes, consume *.
    Match not-*?  No.
    Match *?      Yes, consume *.
    ...this one line is repeated for each `middle' *...
    ...and the end one...
    Match *?      No.
    Match /?      Yes, consume /.

The C++ comment is a similar technique.

    1    //
    2    [^\n]*
    3    (\n[^\n]*)*?
    4    ((?<!\\)|\Z)

After `//', gobble up to the end of the line.  Because (3) is
zero-or-more non-greedy, we initially skip it and test (4).  The
`(?<!\\)' is a negative look-behind assertion checking that the last
character before this newline we've arrived at wasn't a backslash.  If
that's the case, the unescaped newline ends the comment, hurrah!  If
not, are we at the end of the string?  If so, `\Z' is true, so we've
again reached the end of the comment.  I put this in for those pesky
Emacs users who don't terminate the last line of the text file with \n
but may have `//\\' as the last four characters of the file.  If we've
reached a newline yet it isn't the end of the comment, we attempt (3)
for the first time.  We consume the \n and then gobble, quickly, all
non-\n before attempting (4) again.

Attached are my test script showing the various regexps I tried, and an
anonymised C++ source file that was the impetus for all this in the
first place.  Usage is `./commentspeed anon.c++ 50'.  It tests all the
regexps against some built-in expected results, and then times all of
them matching against anon.c++ 50 times, checking they all agree on
what makes a comment.  You can comment some of the slow ones out in
order to run the quicker ones more times for more accurate results.
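If you want a feel for the difference without the attached script,
here's a rough stand-alone sketch of the same sort of timing run (the
sample text and the names are invented here, and it only compares the
C-style halves of two variants):

    import re
    import time

    # Banner comments show off the stutter best.
    SAMPLE = ('/' + '*' * 60 + '/\n/* ordinary comment */\nint x;\n') * 100

    PATTERNS = {
        'stutter':  r'/\*[\s\S]*?\*/',      # one character per loop iteration
        'tinyloop': r'/\*(?:[^*]*\*+)+?/',  # tight [^*]* and \*+ loops
    }
    compiled = dict((name, re.compile(p)) for name, p in PATTERNS.items())

    # Check they agree on what makes a comment before timing them.
    found = [compiled[name].findall(SAMPLE) for name in sorted(compiled)]
    assert found[0] == found[1]

    for name in sorted(compiled):
        matcher = compiled[name]
        start = time.time()
        for _ in range(50):
            matcher.findall(SAMPLE)
        print('%-10s %f' % (name, time.time() - start))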
    $ ./commentspeed anon.c++ 240
    original 54.3169460297
    backslashless 55.1692581177
    slashfirst 27.1254348755
    noncapture 14.8059911728
    dropsS 14.2693331242
    tinyloop 2.20497894287
    nocharclass 2.40148186684
    $
    $ e 54.3169460297 / 2.20497894287
    24.63377085996162740307
    $

`original' and `backslashless' shouldn't differ in time; it seems
whichever one runs first comes in slightly faster, so it must be some
Python/OS overhead.  `slashfirst' shows factoring out the leading `/'
helps.  `noncapture' shows using (?:) helps.  `tinyloop' is the one I'd
recommend.  I expected `nocharclass' to be quicker, but it seems the
engine compiles a negated character class of a single character into a
`not_literal' op-code rather than a character class, so the ones used
in tinyloop aren't as slow as I thought.

Cheers,


Ralph.