pyparsing-users Mailing List for Python parsing module (Page 28)
Brought to you by: ptmcg
From: Paul M. <pa...@al...> - 2006-10-11 14:03:20
|
> -----Original Message-----
> From: Ralph Corderoy [mailto:ra...@in...]
> Sent: Wednesday, October 11, 2006 7:08 AM
> To: Paul McGuire
> Cc: pyp...@li...
> Subject: Re: [Pyparsing] Operator Precedence and Associativity.
>
> Hi Paul,
>
> > Well, why not?! Check this out!
> >
> >     def operatorGrammar( baseExpr, opList ):
> >         ret = Forward()
> >         lastExpr = baseExpr | ( Suppress('(') + ret + Suppress(')') )
> > ...
>
> That's impressive and exactly what I was after. You may want
> to consider formalising it as part of pyparsing since
> recursive descent parsers often use operator precedence
> parsers for that part of things.
>
> Thanks!
>
> Ralph.

Ralph -

Note that this does not *quite* do the LR-style recursive descent you were
looking for. That is to say, 9 + 2 + 3 parses as:

    [[9, '+', 2, '+', 3]]

and not as:

    [[[9, '+', 2], '+', 3]]

which is what I think you were looking for originally.

On the whole, though, it seems quite interesting. It took me about 2
minutes to do the following boolean expression grammar:

    boolExpr = operatorGrammar( Word(alphas),
        [ ("not", 1, "R"),
          ("or",  2, "L"),
          ("and", 2, "L"),
        ])

    test = ["p and not q",
            "not not p",
            "not(p and q)",
            "q or not p and r",
            "q or not (p and r)"]

    for t in test:
        print t, '\n', boolExpr.parseString(t), '\n'

Giving:

    p and not q
    [['p', 'and', ['not', 'q']]]

    not not p
    [['not', ['not', 'p']]]

    not(p and q)
    [['not', ['p', 'and', 'q']]]

    q or not p and r
    [[['q', 'or', ['not', 'p']], 'and', 'r']]

    q or not (p and r)
    [['q', 'or', ['not', ['p', 'and', 'r']]]]

I need to give this a little more thought as far as including it in "core"
pyparsing (as opposed to, say, just making it another example - the
examples directory is sort of becoming its own Pyparsing Cookbook). I want
to make sure I get the API right. For instance, I might want to support a
4th entry in each input tuple, representing a parse action to be attached
to the internally generated pyparsing expression.

-- Paul
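Paul's distinction between the flat result `[9, '+', 2, '+', 3]` and the left-nested `[[9, '+', 2], '+', 3]` can be bridged after parsing with a small fold over the token list. A minimal sketch in plain Python; the helper name is mine and is not part of pyparsing:

```python
def left_nest(tokens):
    """Fold a flat [operand, op, operand, op, ...] list into a
    left-associative nested list: [9, '+', 2, '+', 3] becomes
    [[9, '+', 2], '+', 3]."""
    result = tokens[0]
    # Walk the (op, operand) pairs, nesting the accumulated result
    # on the left each time.
    for i in range(1, len(tokens), 2):
        result = [result, tokens[i], tokens[i + 1]]
    return result

print(left_nest([9, '+', 2, '+', 3]))   # [[9, '+', 2], '+', 3]
```

A fold like this is the usual way to attach a parse action that rebuilds the "classic" left-leaning tree from the flat results a `OneOrMore`-style grammar produces.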
From: Ralph C. <ra...@in...> - 2006-10-11 12:09:38
|
Hi Paul,

> Well, why not?! Check this out!
>
>     def operatorGrammar( baseExpr, opList ):
>         ret = Forward()
>         lastExpr = baseExpr | ( Suppress('(') + ret + Suppress(')') )
> ...

That's impressive and exactly what I was after. You may want to consider
formalising it as part of pyparsing since recursive descent parsers often
use operator precedence parsers for that part of things.

Thanks!

Ralph.
From: Paul M. <pa...@al...> - 2006-10-10 22:03:18
|
> -----Original Message-----
> From: pyp...@li...
> [mailto:pyp...@li...] On Behalf Of Ralph Corderoy
> Sent: Wednesday, September 13, 2006 7:23 AM
> To: Paolo Losi
> Cc: pyp...@li...
> Subject: Re: [Pyparsing] Operator Precedence and Associativity.
>
> Hi Paolo,
>
> > Ralph Corderoy wrote:
<snip>
> I guess I'd like to write
>
>     number = Words(digit)
>     factor = Forward()
>     factor << RightBinOpSeq(number, expop, factor)
>     unary = OptRightUnaryOpSeq(signop, factor)
>     term = unary ^ LeftBinOpSeq(unary, multop, unary)
>     expr = term ^ LeftBinOpSeq(term, plusop, term)
>     statement = expr + semicolon
>
> and have the /(Opt)?(Left|Right|Non)(Unary|Bin)OpSeq/ routines do the
> Grouping implicitly so
>
>     1+2*-3^4*5+-+-6
>
> yields the tree
>
>     ( , +, )
>         (1, +, )        (-, )
>             ( , *, 5)       (+, )
>             (2, *, )            (-, 6)
>                 (-, )
>                 (3, ^, 4)
>
> Although, something like
>
>     expop = Literal('^')
>     signop = OneOf('+ -')
>     multop = Literal('*')
>     plusop = Literal('+')
>
>     expr = Forward()
>     term = Forward()
>     unary = Forward()
>     factor = Forward()
>
>     expr = OperatorGrammar( \
>         expop, 2, 'R',
>         signop, 1, 'R',
>         multop, 2, 'L',
>         plusop, 2, 'L',
>         ''')
>
> also seems attractive, and simpler to specify.

Well, why not?! Check this out!

-- Paul

from pyparsing import *

def operatorGrammar( baseExpr, opList ):
    ret = Forward()
    lastExpr = baseExpr | ( Suppress('(') + ret + Suppress(')') )
    i = 0
    for opExpr, arity, rightLeftAssoc in opList:
        thisExpr = Forward()
        thisExpr.setName("expr%d" % i)  # .setDebug()
        if rightLeftAssoc == 'L':
            if arity == 1:
                thisExpr << ( Group( lastExpr + opExpr ) | lastExpr )
            else:
                thisExpr << ( Group( lastExpr + OneOrMore( opExpr + lastExpr ) ) | lastExpr )
        else:
            if arity == 1:
                # try to avoid LR with this extra test
                if isinstance(opExpr, Optional):
                    thisExpr << (( FollowedBy(opExpr.expr + thisExpr) +
                                   Group( opExpr + thisExpr )) | lastExpr )
                else:
                    thisExpr << ( opExpr + thisExpr )
            else:
                thisExpr << ( Group( lastExpr + OneOrMore( opExpr + thisExpr ) ) | lastExpr )
        lastExpr = thisExpr
        thisExpr = Forward()
        i += 1
    ret << lastExpr
    return ret

number = Word(nums).setParseAction(lambda t: int(t[0]))
variable = Word(alphas, exact=1)
operand = number | variable

expop = Literal('^')
signop = Optional( oneOf('+ -') )
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')

expr = operatorGrammar( operand,
    [(factop, 1, 'L'),
     (expop,  2, 'R'),
     (signop, 1, 'R'),
     (multop, 2, 'L'),
     (plusop, 2, 'L'),] )

test = ["9 + 2 * 3",
        "(9 + 2) * 3",
        "(9 + -2) * 3",
        "(9 + -2) * 3^2^2",
        "(9! + -2) * 3^2^2",
        "M*X + B",
        "1+2*-3^4*5+-+-6",]

for t in test:
    print t
    print expr.parseString(t)
    print

Gives the following:

9 + 2 * 3
[[9, '+', [2, '*', 3]]]

(9 + 2) * 3
[[[9, '+', 2], '*', 3]]

(9 + -2) * 3
[[[9, '+', ['-', 2]], '*', 3]]

(9 + -2) * 3^2^2
[[[9, '+', ['-', 2]], '*', [3, '^', [2, '^', 2]]]]

(9! + -2) * 3^2^2
[[[[9, '!'], '+', ['-', 2]], '*', [3, '^', [2, '^', 2]]]]

M*X + B
[[['M', '*', 'X'], '+', 'B']]

1+2*-3^4*5+-+-6
[[1, '+', [2, '*', ['-', [3, '^', 4]], '*', 5], '+', ['-', ['+', ['-', 6]]]]]
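The nested lists a grammar like this returns can be consumed with a short recursive walk. A minimal evaluator sketch in plain Python (the function and operator table are mine, for illustration; it handles unary sign and left-folded binary chains, not the factorial operator):

```python
import operator

# Binary operators and their implementations (assumed table, not pyparsing's).
BINOPS = {'+': operator.add, '-': operator.sub,
          '*': operator.mul, '/': operator.truediv,
          '^': operator.pow}

def evaluate(node):
    """Recursively evaluate a nested [operand, op, operand, ...] tree."""
    if not isinstance(node, list):
        return node                     # a bare number
    if len(node) == 2:                  # unary sign: ['-', x] or ['+', x]
        if node[0] == '-':
            return -evaluate(node[1])
        if node[0] == '+':
            return +evaluate(node[1])
        raise ValueError("unsupported unary form: %r" % (node,))
    # binary chain: fold left over (op, operand) pairs
    result = evaluate(node[0])
    for i in range(1, len(node), 2):
        result = BINOPS[node[i]](result, evaluate(node[i + 1]))
    return result

print(evaluate([[9, '+', 2], '*', 3]))   # 33
print(evaluate([9, '+', [2, '*', 3]]))   # 15
print(evaluate([3, '^', [2, '^', 2]]))   # 81
```

Because the grammar already nests right-associative operators like `^` into pairs, the simple left fold here is safe for the flat chains that remain.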
From: Ralph C. <ra...@in...> - 2006-10-09 13:09:34
|
Hi Paul,

> That reminds me. Are these changes retro-compatible to Python 2.3? A
> while ago I added some features that made pyparsing no longer
> compatible with Python 2.2, and have kind of regretted it, so I'm
> really trying to stay compatible with at least Py2.3.2.
>
> Do you know off-hand if these features (such as non-capturing) work on
> Python 2.3.2?

No, but I've had a look at http://www.python.org/doc/versions/ and
http://www.python.org/doc/2.3/lib/re-syntax.html says that all the
features I've used were in the re module in Python 2.3 AFAICS.

Cheers,

Ralph.
From: Paul M. <pa...@al...> - 2006-10-08 21:25:36
|
> -----Original Message-----
> From: Ralph Corderoy [mailto:ra...@in...]
> Sent: Sunday, October 08, 2006 8:19 AM
> To: Paul McGuire
> Cc: pyp...@li...
> Subject: Re: C++ Comments and a Backslash at the End of the Line.
>
> Hi Paul,
>
> > Care to try your hand at the re's for quoted strings?
>
> OK.
>
>     _escapedChar = Regex(r"\\.")
>
> I don't think this is used anywhere?
>
>     dblQuotedString = Regex(r'"([^"\n\r\\]|("")|(\\.))*"')
>
> There's not much you can do about this. Make it non-capturing. Other
> than that, attempting to match many [^"\n\r\\] at a time without going
> outside the group will result in exponential behaviour on an
> unterminated string because of backtracking. See
> http://www.regular-expressions.info/atomic.html
>
> Unfortunately, Python's re module doesn't support Atomic Grouping or
> Possessive Quantifiers, unlike, e.g. Perl.

That reminds me. Are these changes retro-compatible to Python 2.3? A
while ago I added some features that made pyparsing no longer compatible
with Python 2.2, and have kind of regretted it, so I'm really trying to
stay compatible with at least Py2.3.2.

Do you know off-hand if these features (such as non-capturing) work on
Python 2.3.2?

-- Paul
From: Ralph C. <ra...@in...> - 2006-10-08 13:21:29
|
Hi Paul,

> Care to try your hand at the re's for quoted strings?

OK.

    _escapedChar = Regex(r"\\.")

I don't think this is used anywhere?

    dblQuotedString = Regex(r'"([^"\n\r\\]|("")|(\\.))*"')

There's not much you can do about this. Make it non-capturing. Other
than that, attempting to match many [^"\n\r\\] at a time without going
outside the group will result in exponential behaviour on an
unterminated string because of backtracking. See
http://www.regular-expressions.info/atomic.html

Unfortunately, Python's re module doesn't support Atomic Grouping or
Possessive Quantifiers, unlike, e.g. Perl.

Cheers,

Ralph.
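Ralph's "make it non-capturing" suggestion applied to the quoted-string pattern looks like this. A quick sketch with the stdlib re module (the variable name is mine); the behaviour is unchanged, only the groups become `(?:...)`:

```python
import re

# The dblQuotedString pattern with each (...) turned into non-capturing (?:...).
# It still accepts doubled-quote ("") and backslash escapes inside the string.
dbl_quoted = re.compile(r'"(?:[^"\n\r\\]|(?:"")|(?:\\.))*"')

print(dbl_quoted.match(r'"hello"').group(0))          # "hello"
print(dbl_quoted.match(r'"a\"b"').group(0))           # "a\"b"
print(dbl_quoted.match('"he said ""hi"""').group(0))  # the whole string
```

Non-capturing groups avoid the bookkeeping of saving group text, which is exactly the overhead Ralph measures later in this thread; the exponential backtracking on *unterminated* strings that he warns about is a separate issue and is not fixed by this change.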
From: Paul M. <pa...@al...> - 2006-10-07 20:51:35
|
Ralph -

Thanks for your diligence in tuning up these comment re's. I'll include
these changes in 1.4.4.

Care to try your hand at the re's for quoted strings?

-- Paul
From: <ra...@in...> - 2006-10-07 16:26:16
|
Hi Paul,

> > Would you accept suggestions for equivalent regexps that execute
> > faster? I'm thinking of parsing many C++ files frequently and could
> > use the speed.
>
> Of course!
>
> Here is a suggestion for this regexp in particular: back before there
> was a Regex class in pyparsing, I built up these comment definitions
> from normal pyparsing constructs, and it looked like this:
>
>     cStyleComment = Combine( Literal("/*") +
>         ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
>         Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
>     restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).leaveWhitespace()
>     dblSlashComment = "//" + restOfLine
>     cppStyleComment = ( dblSlashComment | cStyleComment )

OK.

> Note that both paths of cppStyleComment's alternation begin with a
> '/'. I was able to speed this up quite a bit by adding an assertive
> lookahead with pyparsing's FollowedBy:
>
>     cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

Makes sense.

> How does one do such a lookahead in re syntax? I bet that would speed
> up the re matching. (I just tested a version that refactors the
> leading '/' from both alternatives to the beginning of the re, with no
> appreciable speed improvement.

Yes, that's how you do it.

> My test suite is a large number of Verilog files, which is a fairly
> complex grammar that uses C++-style comments. I suspect that the
> reason that adding FollowedBy("/") made such a difference before was
> because the other comment processing was so slow.)

I've found (below) it does make a difference.

> In other expressions, I have done some testing for performance.
> Here's what I've found:
> - an re matching for "[A-Z]" is not appreciably faster than
>   "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"

The RE undergoes optimisation in byte code form after parsing so these
become equivalent. At least that's what I expect looking at
sre_compile.py:_optimize_charset() but perhaps not.

    >>> import re
    >>> r = re.compile('[A-Z]', flags = 128)
    in
      range (65, 90)
    >>> r = re.compile('[ABCDEFGHIJKLMNOPQRSTUVWXYZ]', flags = 128)
    in
      literal 65
      literal 66
      literal 67
      literal 68
      literal 69
      literal 70
      literal 71
      literal 72
      literal 73
      literal 74
      literal 75
      literal 76
      literal 77
      literal 78
      literal 79
      literal 80
      literal 81
      literal 82
      literal 83
      literal 84
      literal 85
      literal 86
      literal 87
      literal 88
      literal 89
      literal 90
    >>>

I haven't looked into this further.

> - there is little/no advantage in using re's that compile faster -
>   runtime matching far outweighs setup/compile time performance

Oh, absolutely.

> But I will be the first to admit that I am no re expert, and I
> *WELCOME* any suggestions you might have on tuning up the re's in
> pyparsing!

_Mastering Regular Expressions_ by Friedl is a book you'd enjoy. The
latest edition (I read the 1st Ed.) probably has up to date Python
coverage. Anyway, based on a hazy recollection of that, I'd suggest

    r'''
    (?x)
    /
    (?:
        \*
        (?:[^*]*\*+)+?
        /
    |
        /[^\n]*
        (?:\n[^\n]*)*?
        (?:(?<!\\)|\Z)
    )
    '''

You could pass re.VERBOSE in as a flag instead of (?x), that was just
easier in my testing framework.

It's beneficial to move the common `/' prefix out of both alternatives
since there's overhead of going into a group and this saves the bother.

Parentheses that capture have an overhead. If you're only after group(0)
then don't capture, i.e. turn `a(b|c)*d' into `a(?:b|c)*d'. The Python
Regular Expression HOWTO says otherwise, but in theory I think it's
wrong, and it seems so in practice (see test script below).

After `/*' we want to match as few characters as possible before the
`*/'. Doing r'/\*[\s\S]*?\*/' could be better for a few reasons. It's a
non-greedy match, which is correct but means that matching against
`/*abc*/' does

    Match /? Yes, consume /.
    Match *? Yes, consume *.
    Match *? No.
    Match space? No.
    Match not-space? Yes, consume a.
    Match *? No.
    Match space? No.
    Match not-space? Yes, consume b.
    Match *? No.
    Match space? No.
    Match not-space? Yes, consume c.
    Match *? Yes, consume *.
    Match /? Yes, consume /.

It `stutters' after every character, having to break out of one loop to
see if we've finished yet. The `space' and `not-space' come from
`[\s\S]' creating a character class.

    >>> r = re.compile('[\s\S]', flags = 128)
    in
      category category_space
      category category_not_space

The common `/***...***/' comment line takes

    Match /? Yes, consume /.
    Match *? Yes, consume *.
    Match *? Yes, consume *.
    Match /? No, back-track one to *.
    Match space? No.
    Match not-space? Yes, consume *.
    Match *? Yes, consume *.
    Match /? No, back-track one to *.
    Match space? No.
    Match not-space? Yes, consume *.
    ...these four are repeated for each `middle' *...

So, I've gone for

    /\*([^*]*\*+)+?/

After `/*' we gobble as many non-* as possible in a small tight loop
inside the engine. We then know we're at a `*' so we gobble one or more
of those quickly too. Since we're non-greedy we test to see if `/'
follows the string of *s. If it does, we're done. If not, we'll consume
the `/' as the first of the non-*s in the next loop around the group.

Against `/*abc*/':

    Match /? Yes, consume /.
    Match *? Yes, consume *.
    Match not-*? Yes, consume a.
    Match not-*? Yes, consume b.
    Match not-*? Yes, consume c.
    Match *? Yes, consume *.
    Match *? No.
    Match /? Yes, consume /.

Against `/***...***/':

    Match /? Yes, consume /.
    Match *? Yes, consume *.
    Match not-*? No.
    Match *? Yes, consume *.
    ...this one line is repeated for each `middle' *...
    ...and the end one...
    Match *? No.
    Match /? Yes, consume /.

The C++ comment is a similar technique.

    1  //
    2  [^\n]*
    3  (\n[^\n]*)*?
    4  ((?<!\\)|\Z)

After `//', gobble up to the end of the line. Because (3) is
zero-or-more non-greedy we initially skip that and test (4). The
`(?<!\\)' is a `look-behind negative assertion' to check that the last
character before this newline we've arrived at wasn't a backslash. If
that's the case then the unescaped newline ends the comment, hurrah!

If not, are we at the end of the string? If so, `\Z' is true so we've
again reached the end of the comment. I put this in for those pesky
Emacs users who don't terminate the last line of the text file with \n
but may have `//\\' as the last four characters of the file.

If we've reached a newline yet it isn't the end of the comment then we
attempt (3) for the first time. We consume the \n and then gobble,
quickly, all non-\n before attempting (4) again.

Attached are my test script showing various regexps I tried, and an
anonymised C++ source file that was the impetus to all this in the
first place. Usage is `./commentspeed anon.c++ 50'. It tests all the
regexps against some built-in expected results, and then times all of
them matching against anon.c++ 50 times, checking they all agree on
what makes a comment. You can comment some of the slow ones out in
order to run the quicker ones more times for more accurate results.

    $ ./commentspeed anon.c++ 240
    original 54.3169460297
    backslashless 55.1692581177
    slashfirst 27.1254348755
    noncapture 14.8059911728
    dropsS 14.2693331242
    tinyloop 2.20497894287
    nocharclass 2.40148186684
    $
    $ e 54.3169460297 / 2.20497894287
    24.63377085996162740307
    $

`original' and `backslashless' shouldn't differ in time. It seems
whichever one runs first comes in slightly faster; must be some
Python/OS overhead. `slashfirst' shows factoring out the leading `/'
helps. `noncapture' shows using (?:) helps. `tinyloop' is the one I'd
recommend. I expected `nocharclass' to be quicker but it seems the
engine compiles negative character classes of a single character into a
`not_literal' op-code rather than a character class, so the ones used
in tinyloop aren't as slow as I thought.

Cheers,

Ralph.
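Ralph's recommended "tinyloop" pattern can be exercised directly with the stdlib re module. A small sketch (variable name mine) combining the block-comment and line-comment branches exactly as he describes:

```python
import re

# The "tinyloop" C/C++ comment pattern: common '/' prefix factored out,
# non-capturing groups throughout.
cpp_comment = re.compile(r'''(?x)
    /
    (?: \* (?:[^*]*\*+)+? /      # block comment: gobble non-*, then *s, until */
      | /[^\n]*                  # line comment: rest of this line ...
        (?:\n[^\n]*)*?           # ... lazily continued over further lines ...
        (?:(?<!\\)|\Z)           # ... until an unescaped newline, or end of input
    )
''')

print(cpp_comment.match('/*abc*/').group(0))             # /*abc*/
print(cpp_comment.match('/****/').group(0))              # /****/
print(cpp_comment.match('// one line\nint x;').group(0)) # // one line
```

The fourth case worth checking is the backslash-continued line comment, where the lazy group (3) is forced to consume the following line before the lookbehind in (4) succeeds.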
From: Paul M. <pa...@al...> - 2006-10-05 13:31:21
|
> > Does this look better?
> >
> >     cppStyleComment =
> >         Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)").setName("C++ style comment")
> >
> > It seems to test out okay. I'll put it in the next update.
>
> Yes, thanks.
>
> Would you accept suggestions for equivalent regexps that execute
> faster? I'm thinking of parsing many C++ files frequently and could
> use the speed.

Of course!

Here is a suggestion for this regexp in particular: back before there
was a Regex class in pyparsing, I built up these comment definitions
from normal pyparsing constructs, and it looked like this:

    cStyleComment = Combine( Literal("/*") +
        ZeroOrMore( CharsNotIn("*") | ( "*" + ~Literal("/") ) ) +
        Literal("*/") ).streamline().setName("cStyleComment enclosed in /* ... */")
    restOfLine = Optional( CharsNotIn( "\n\r" ), default="" ).leaveWhitespace()
    dblSlashComment = "//" + restOfLine
    cppStyleComment = ( dblSlashComment | cStyleComment )

Note that both paths of cppStyleComment's alternation begin with a '/'.
I was able to speed this up quite a bit by adding an assertive lookahead
with pyparsing's FollowedBy:

    cppStyleComment = FollowedBy("/") + ( dblSlashComment | cStyleComment )

Essentially saying, "if the next character is not a '/', don't bother
testing the rest of the expression."

How does one do such a lookahead in re syntax? I bet that would speed up
the re matching. (I just tested a version that refactors the leading '/'
from both alternatives to the beginning of the re, with no appreciable
speed improvement. My test suite is a large number of Verilog files,
which is a fairly complex grammar that uses C++-style comments. I
suspect that the reason that adding FollowedBy("/") made such a
difference before was because the other comment processing was so slow.)

In other expressions, I have done some testing for performance. Here's
what I've found:

- an re matching for "[A-Z]" is not appreciably faster than
  "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
- there is little/no advantage in using re's that compile faster -
  runtime matching far outweighs setup/compile time performance

But I will be the first to admit that I am no re expert, and I *WELCOME*
any suggestions you might have on tuning up the re's in pyparsing!

-- Paul
From: Ralph C. <ra...@in...> - 2006-10-05 12:39:06
|
Hi Paul,

> > AIUI the // style of comment is terminated by the end of a line that
> > is not preceded with a backslash:
>
> Does this look better?
>
>     cppStyleComment =
>         Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)").setName("C++ style comment")
>
> It seems to test out okay. I'll put it in the next update.

Yes, thanks.

Would you accept suggestions for equivalent regexps that execute faster?
I'm thinking of parsing many C++ files frequently and could use the
speed.

Cheers,

Ralph.
From: Paul M. <pa...@al...> - 2006-10-04 03:57:36
|
Does this look better?

    cppStyleComment =
        Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)").setName("C++ style comment")

It seems to test out okay. I'll put it in the next update.

-- Paul

> -----Original Message-----
> From: pyp...@li...
> [mailto:pyp...@li...] On Behalf Of Ralph Corderoy
> Sent: Tuesday, October 03, 2006 2:14 PM
> To: pyp...@li...
> Subject: [Pyparsing] C++ Comments and a Backslash at the End of the Line.
>
> Hi,
>
> Just looking at pyparsing-1.4.3's source and
>
>     cppStyleComment =
>         Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/.*)").setName("C++ style comment")
>
> seems wrong. AIUI the // style of comment is terminated by the end of
> a line that is not preceded with a backslash:
>
>     $ cat x.c++
>     int a; \\ abc\
>     int b; \\ ghi
>     $ g++ -E x.c++
>     # 1 "x.c++"
>     # 1 "<built-in>"
>     # 1 "<command line>"
>     # 1 "x.c++"
>     int a; \\ abcint b; \\ ghi
>     $
>
> I don't know about Java:
>
>     javaStyleComment = cppStyleComment
>
> Cheers,
>
> Ralph.
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT. Join
> SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys -- and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Pyparsing-users mailing list
> Pyp...@li...
> https://lists.sourceforge.net/lists/listinfo/pyparsing-users
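The corrected pattern's handling of a backslash-continued // comment can be checked with the stdlib re module. A quick sketch (variable names mine):

```python
import re

# The corrected cppStyleComment pattern: a lazy /* ... */ block comment,
# or a // comment where an escaped newline (\<newline>) continues it.
cpp_style = re.compile(r"(\/\*[\s\S]*?\*\/)|(\/\/(\\\n|.)*)")

src = "int a; // abc\\\nint b; // ghi\n"
m = cpp_style.search(src)
# The escaped newline keeps the comment going onto the next line:
print(m.group(0))   # // abc\ ... int b; // ghi (two lines, one comment)

# A block comment stops at the first */ thanks to the lazy quantifier:
print(cpp_style.match("/* x */ y").group(0))   # /* x */
```

The `(\\\n|.)` alternation is what the original `.*` version lacked: a backslash immediately before a newline is consumed as a pair, so the match does not stop at that line end.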
From: Ralph C. <ra...@in...> - 2006-10-03 19:20:59
|
Hi,

Just looking at pyparsing-1.4.3's source and

    cppStyleComment =
        Regex(r"(\/\*[\s\S]*?\*\/)|(\/\/.*)").setName("C++ style comment")

seems wrong. AIUI the // style of comment is terminated by the end of a
line that is not preceded with a backslash:

    $ cat x.c++
    int a; \\ abc\
    int b; \\ ghi
    $ g++ -E x.c++
    # 1 "x.c++"
    # 1 "<built-in>"
    # 1 "<command line>"
    # 1 "x.c++"
    int a; \\ abcint b; \\ ghi
    $

I don't know about Java:

    javaStyleComment = cppStyleComment

Cheers,

Ralph.
From: Paul M. <pa...@al...> - 2006-10-02 19:02:40
|
Unless you disable whitespace skipping, do not use delimitedList to read
lists of items separated only by whitespace. This is where pyparsing is
different from regular expressions. Pyparsing will treat whitespace as a
terminating boundary for words, but will implicitly skip over it for
parsing purposes. The delimitedList method returns an expression for
things like comma-delimited lists, or colon-delimited lists. Lists of
items separated only by whitespace can be parsed using a simple
OneOrMore.

This whitespace-skipping behavior makes it tricky to parse grammars with
significant whitespace (like Python), but in the grammar you list
(assuming that commands have some kind of terminating character, like a
';'), whitespace truly is irrelevant. So:

    if (expression)
    {
        do_x;
        do_y;
        do_z;
    }

is the same as:

    if (expression) { do_x; do_y; do_z; }

And you could implement this as:

    if_statement = Keyword("if") + \
        "(" + conditional_expression + ")" + \
        ( command | "{" + OneOrMore(command) + "}" )

assuming that ';' has been included in the definition of command.

If you are parsing Pascal, the ';' is actually a statement separator,
not a terminator, so you would parse:

    if (expression) { do_x; do_y; do_z }

using:

    if_statement = Keyword("if") + \
        "(" + conditional_expression + ")" + \
        ( command | "{" + delimitedList(command, ';') + "}" )

and leave the ';' out of the definition of command.

I would also recommend enclosing command within a Group so as to
maintain some idea of the code hierarchy.

Lastly, assuming that an if_statement is a possible command, then I
assume you have previously defined command as a Forward(), and some
subsequent line would have something like:

    command << ( assignment_statement | if_statement | while_statement |
                 switch_statement | function_call )

-- Paul

> -----Original Message-----
> From: pyp...@li...
> [mailto:pyp...@li...] On Behalf Of Roberto Alsina
> Sent: Monday, October 02, 2006 12:54 PM
> To: pyp...@li...
> Subject: [Pyparsing] Multiline expressions
>
> Hello, I am a newbie to pyparsing (and parsers in general) and things
> are going good, except...
>
> How can I parse something like this
>
>     if (expression) {
>         command
>         command
>         command
>     } else {
>         command
>         command
>     }
>
> My problem is that apparently all expressions match only in a single
> line.
>
> For example:
>
>     >>> from pyparsing import *
>     >>> w = Word(alphanums)
>     >>> lines = delimitedList(w, "\n")
>     >>> lines.parseString("""111111
>     ... 222222
>     ... 333333
>     ... """
>     ... )
>     (['111111'], {})
>     >>>
>
> Rather than the three items I expected.
>
> So, I can't match for a block like "{"+whatever+"}"
>
> I must be missing something very obvious here :-)
>
> -- Roberto Alsina
From: Roberto A. <ra...@kd...> - 2006-10-02 17:54:32
|
Hello, I am a newbie to pyparsing (and parsers in general) and things
are going good, except...

How can I parse something like this

    if (expression) {
        command
        command
        command
    } else {
        command
        command
    }

My problem is that apparently all expressions match only in a single
line.

For example:

    >>> from pyparsing import *
    >>> w = Word(alphanums)
    >>> lines = delimitedList(w, "\n")
    >>> lines.parseString("""111111
    ... 222222
    ... 333333
    ... """
    ... )
    (['111111'], {})
    >>>

Rather than the three items I expected.

So, I can't match for a block like "{"+whatever+"}"

I must be missing something very obvious here :-)

--
 ("\''/").__..-''"`-. .         Roberto Alsina
 `9_ 9  )   `-. (    ).`-._.`)  r...@kd...
 (_Y_.)' ._  ) `._`. " -.-'     KDE Developer (MFCH)
   _..`-'_..-_/ /-'_.'
(l)-'' ((i).' ((!.'    Buenos Aires - Argentina

Feynman's problem solving algorithm: write down the problem -> think
very hard -> write down the answer.
From: Ralph C. <ra...@in...> - 2006-09-15 11:00:36
|
Hi Paul,

> This discussion from c.l.py
> (http://www.velocityreviews.com/forums/t335296-basic-tokenizer.html -
> about 2 years ago) might also shed some light on this topic,
> especially the comments (and posted code) from Andrea Griffini. He
> pointed out the '^' left-associativity bug in fourFn.py which I was
> able to fix in a subsequent release.

Thanks, I had already found that, and you're right, it helped concoct
the right pyparsing grammar, as in fourFn.py. The issue is I don't want

    [1, '+', 2, '+', 3]

returned, but

    [[1, '+', 2], '+', 3]

which doesn't seem possible without post-parsing manipulation. That is,
the `classic' parse tree of an infix algebraic expression isn't
achievable.

> I like the graphical rendering of the parse tree. :)

Yes, it just occurred to me when I was sitting there with the text all
on one line. Not hard to do, either, with vim's Ctrl-Y in insert mode.

Cheers,

Ralph.
From: Paul M. <pa...@al...> - 2006-09-13 13:39:23
|
This discussion from c.l.py
(http://www.velocityreviews.com/forums/t335296-basic-tokenizer.html -
about 2 years ago) might also shed some light on this topic, especially
the comments (and posted code) from Andrea Griffini. He pointed out the
'^' left-associativity bug in fourFn.py which I was able to fix in a
subsequent release.

I like the graphical rendering of the parse tree. :)

-- Paul

> -----Original Message-----
> From: pyp...@li...
> [mailto:pyp...@li...] On Behalf Of Ralph Corderoy
> Sent: Wednesday, September 13, 2006 7:23 AM
> To: Paolo Losi
> Cc: pyp...@li...
> Subject: Re: [Pyparsing] Operator Precedence and Associativity.
>
> Hi Paolo,
>
> > Ralph Corderoy wrote:
> > > Yes, but I don't think the R.D. parser below, written by hand, not
> > > with pyparsing, is left-recursive. It can just support different
> > > associativity depending on how the parse tree is built.
> >
> > It is not left-recursive, of course.
> > In your example you basically build the tree in the correct way
> > instead of rearranging it. But please note that in my example I
> > could not add a parse action to ZeroOrMore but to expr.
>
> I agree. Given
>
>     expr << term + ZeroOrMore(plusop + term)
>
> we can't use Group() because the left-most (term+term) isn't
> accessible.
>
> > a new token as the ones that I suggested could maybe provide a
> > solution (similar to your parse_term or parse_expr in how the tree
> > is built).
> > ...
> > If you think that could be a solution, I could try to provide an
> > implementation...
>
> That's OK. The use of parseAction() that you showed me is enough to
> give me the idea.
>
> I guess I'd like to write
>
>     number = Words(digit)
>     factor = Forward()
>     factor << RightBinOpSeq(number, expop, factor)
>     unary = OptRightUnaryOpSeq(signop, factor)
>     term = unary ^ LeftBinOpSeq(unary, multop, unary)
>     expr = term ^ LeftBinOpSeq(term, plusop, term)
>     statement = expr + semicolon
>
> and have the /(Opt)?(Left|Right|Non)(Unary|Bin)OpSeq/ routines do the
> Grouping implicitly so
>
>     1+2*-3^4*5+-+-6
>
> yields the tree
>
>     ( , +, )
>         (1, +, )        (-, )
>             ( , *, 5)       (+, )
>             (2, *, )            (-, 6)
>                 (-, )
>                 (3, ^, 4)
>
> Although, something like
>
>     expop = Literal('^')
>     signop = OneOf('+ -')
>     multop = Literal('*')
>     plusop = Literal('+')
>
>     expr = Forward()
>     term = Forward()
>     unary = Forward()
>     factor = Forward()
>
>     expr = OperatorGrammar( \
>         expop, 2, 'R',
>         signop, 1, 'R',
>         multop, 2, 'L',
>         plusop, 2, 'L',
>         ''')
>
> also seems attractive, and simpler to specify.
>
> Cheers,
>
> Ralph.
From: Ralph C. <ra...@in...> - 2006-09-13 12:23:32
Hi Paolo, > Ralph Corderoy wrote: > > Yes, but I don't think the R.D. parser below, written by hand, not with > > pyparsing, is left-recursive. It can just support different > > associativity depending on how the parse tree is built. > > It is not left-recursive, of course. > In your example you basically build the tree in the correct way > instead of rearranging it. But please note that in my example I could > not add a parse action to ZeroOrMore but to expr. I agree. Given expr << term + ZeroOrMore(plusop + term) we can't use Group() because the left-most (term+term) isn't accessible. > a new token as the ones that I suggested could maybe provide a solution > (similar to your parse_term or parse_expr in how the tree is built). > ... > If you think that could be a solution, I could try to provide an > implementation... That's OK. The use of parseAction() that you showed me is enough to give me the idea. I guess I'd like to write number = Words(digit) factor = Forward() factor << RightBinOpSeq(number, expop, factor) unary = OptRightUnaryOpSeq(signop, factor) term = unary ^ LeftBinOpSeq(unary, multop, unary) expr = term ^ LeftBinOpSeq(term, plusop, term) statement = expr + semicolon and have the /(Opt)?(Left|Right|Non)(Unary|Bin)OpSeq/ routines do the Grouping implicitly so 1+2*-3^4*5+-+-6 yeilds the tree ( , +, ) (1, +, ) (-, ) ( , *, 5) (+, ) (2, *, ) (-, 6) (-, ) (3, ^, 4) Although, something like expop = Literal('^') signop = OneOf('+ -') multop = Literal('*') plusop = Literal('+') expr = Forward() term = Forward() unary = Forward() factor = Forward() expr = OperatorGrammar( \ expop, 2, 'R', signop, 1, 'R', multop, 2, 'L', plusop, 2, 'L', ''') also seems attractive, and simpler to specify. Cheers, Ralph. |
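The operator-table interface Ralph sketches here (`OperatorGrammar` and the `*OpSeq` helpers are his proposed names, not existing pyparsing API) is essentially what a hand-rolled precedence-climbing parser does. A minimal sketch in plain Python, using the thread's single-character tokens and tuple trees; the table layout, names, and the unary-precedence constant below are illustrative assumptions, not anything from the thread:

```python
# Precedence-climbing sketch of the proposed operator-table interface.
# Each binary operator maps to (precedence, associativity); the unary
# sign sits between '*' and '^', matching Ralph's grammar.
OPS = {'+': (1, 'L'), '*': (2, 'L'), '^': (4, 'R')}
UNARY_PREC = 3

def parse(s):
    tokens = list(s)

    def peek():
        return tokens[0] if tokens else None

    def parse_atom():
        if peek() in ('+', '-'):            # unary sign
            return (tokens.pop(0), parse_expr(UNARY_PREC))
        return tokens.pop(0)                # a single digit

    def parse_expr(min_prec):
        lhs = parse_atom()
        while peek() in OPS and OPS[peek()][0] >= min_prec:
            op = tokens.pop(0)
            prec, assoc = OPS[op]
            # 'L': right operand must bind strictly tighter;
            # 'R': the same level recurses, nesting to the right.
            rhs = parse_expr(prec + 1 if assoc == 'L' else prec)
            lhs = (lhs, op, rhs)
        return lhs

    return parse_expr(0)

parse('1+2+3')   # (('1', '+', '2'), '+', '3')   -- left-associative
parse('1^2^3')   # ('1', '^', ('2', '^', '3'))   -- right-associative
parse('-3^2')    # ('-', ('3', '^', '2'))
```

Later pyparsing releases did grow a helper along these lines, `operatorPrecedence` (since renamed `infixNotation`), which takes exactly this kind of (operator, arity, associativity) table.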
From: Ralph C. <ra...@in...> - 2006-09-13 10:04:22
Hi Paolo,

> Ralph Corderoy wrote:
> > whereas I'd like the left associativity explicit, i.e.
> >
> >                         +
> >                        / \
> >     1 + 2 + 3  ==>    +   3
> >                      / \
> >                     1   2
> >
> > I can achieve this with a right-associative operator, e.g. exponent, but
> > how do I do this for left-associative ones whilst avoiding endless
> > left-recursion?
>
> I think you have no other choice but to rearrange ParseResults by
> setting a specific parse action for the expression:

OK, thanks.  I hadn't got that far, just been banging my head trying to
get the desired result naturally.

> > I wonder if PyParsing would benefit from being able to define operator
> > tables, similar to K&R's, giving the tokens, precedence, and
> > associativity, and being able to plumb that into the rest of the
> > recursive descent parser at some point?
>
> It's not trivial at all.
>
> From http://en.wikipedia.org/wiki/Left_recursion
>
> "A formal grammar that contains left recursion cannot be parsed by a
> recursive descent parser"
>
> PyParsing qualifies as recursive descent parser.

Yes, but I don't think the R.D. parser below, written by hand, not with
pyparsing, is left-recursive.  It can just support different
associativity depending on how the parse tree is built.

Cheers,


Ralph.

#! /usr/bin/python

def main():
    test('1', "'1'")
    test('1+2', "('1', '+', '2')")
    test('1+2+3', "(('1', '+', '2'), '+', '3')")
    test('1*2+3', "(('1', '*', '2'), '+', '3')")
    test('1+2*3', "('1', '+', ('2', '*', '3'))")
    test('1^2^3', "('1', '^', ('2', '^', '3'))")
    test('-3^2', "('-', ('3', '^', '2'))")
    test('-+-3^2', "('-', ('+', ('-', ('3', '^', '2'))))")
    test('1a+2b*3c', "(('1', 'a'), '+', (('2', 'b'), '*', ('3', 'c')))")
    test('5a>b', "(('5', 'a'), 'b')")

def test(s, e):
    r = repr(parse(s))
    if r == e:
        print 'pass: %s => %s' % (s, r)
    else:
        print 'FAIL: %s: want %s. got %s.' % (s, e, r)

def parse(s):
    l = list(s + ';')
    e = parse_statement(l)
    expect(l, ';')
    return e

def parse_statement(s):
    'statement := expr [">" unit]'
    e = parse_expr(s)
    if peek(s) == '>':
        consume(s)
        u = parse_unit(s)
        if not u:
            raise 'Expect unit after >.'
        return (e, u)
    return e

def parse_expr(s):
    'expr := term ["+" term]*'
    a = parse_term(s)
    while peek(s) == '+':
        op = consume(s)
        b = parse_term(s)
        a = (a, op, b)
    return a

def parse_term(s):
    'term := unary ["*" unary]*'
    a = parse_unary(s)
    while peek(s) == '*':
        op = consume(s)
        b = parse_unary(s)
        a = (a, op, b)
    return a

def parse_unary(s):
    'unary := ["+" | "-"] unary | factor'
    sign = '+'
    if peek(s) in '+-':
        sign = consume(s)
        a = parse_unary(s)
        return (sign, a)
    return parse_factor(s)

def parse_factor(s):
    'factor := number ["^" factor]'
    a = parse_number(s)
    while peek(s) == '^':
        op = consume(s)
        b = parse_factor(s)
        a = (a, op, b)
    return a

def parse_number(s):
    'number := "0" | "1" | ... unit'
    if not peek(s).isdigit():
        raise
    i = consume(s)
    u = parse_unit(s)
    if u:
        return (i, u)
    return i

def parse_unit(s):
    'unit := "a" | "b" | "c" | empty'
    if peek(s) not in 'abc':
        return None
    return consume(s)

def peek(s):
    return s[0]

def expect(s, t):
    c = consume(s)
    if c != t:
        raise 'Expected %s but got %s' % (t, c)
    return

def consume(s):
    return s.pop(0)

main()
From: Paolo L. <p....@hy...> - 2006-09-12 20:07:05
Ralph Corderoy wrote: >> "A formal grammar that contains left recursion cannot be parsed by a >> recursive descent parser" >> >> PyParsing qualifies as recursive descent parser. > > Yes, but I don't think the R.D. parser below, written by hand, not with > pyparsing, is left-recursive. It can just support different > associativity depending on how the parse tree is built. It is not left-recursive, of course. In your example you basically build the tree in the correct way instead of rearranging it. But please note that in my example I could not add a parse action to ZeroOrMore but to expr. Simply modifing or parametrizing the current tree building policy in one of the existing tokens is not enough. a new token as the ones that I suggested could maybe provide a solution (similar to your parse_term or parse_expr in how the tree is built). Translating a set of operator precedence and associativity rules in order to adapt parse behaviour could be a step further but, in terms of interface, I'd prefer using the one the I suggested... If you think that could be a solution, I could try to provide an implementation... Ciao Paolo > > Cheers, > > > Ralph. > > > #! /usr/bin/python > > def main(): > test('1', "'1'") > test('1+2', "('1', '+', '2')") > test('1+2+3', "(('1', '+', '2'), '+', '3')") > test('1*2+3', "(('1', '*', '2'), '+', '3')") > test('1+2*3', "('1', '+', ('2', '*', '3'))") > test('1^2^3', "('1', '^', ('2', '^', '3'))") > test('-3^2', "('-', ('3', '^', '2'))") > test('-+-3^2', "('-', ('+', ('-', ('3', '^', '2'))))") > test('1a+2b*3c', "(('1', 'a'), '+', (('2', 'b'), '*', ('3', 'c')))") > test('5a>b', "(('5', 'a'), 'b')") > > def test(s, e): > r = repr(parse(s)) > if r == e: > print 'pass: %s => %s' % (s, r) > else: > print 'FAIL: %s: want %s. got %s.' 
% (s, e, r) > > def parse(s): > l = list(s + ';') > e = parse_statement(l) > expect(l, ';') > return e > > def parse_statement(s): > 'statement := expr [">" unit]' > e = parse_expr(s) > if peek(s) == '>': > consume(s) > u = parse_unit(s) > if not u: > raise 'Expect unit after >.' > return (e, u) > return e > > def parse_expr(s): > 'expr := term ["+" term]*' > a = parse_term(s) > while peek(s) == '+': > op = consume(s) > b = parse_term(s) > a = (a, op, b) > return a > > def parse_term(s): > 'term := unary ["*" unary]*' > a = parse_unary(s) > while peek(s) == '*': > op = consume(s) > b = parse_unary(s) > a = (a, op, b) > return a > > def parse_unary(s): > 'unary := ["+" | "-"] unary | factor' > sign = '+' > if peek(s) in '+-': > sign = consume(s) > a = parse_unary(s) > return (sign, a) > return parse_factor(s) > > def parse_factor(s): > 'factor := number ["^" factor]' > a = parse_number(s) > while peek(s) == '^': > op = consume(s) > b = parse_factor(s) > a = (a, op, b) > return a > > def parse_number(s): > 'number := "0" | "1" | ... unit' > if not peek(s).isdigit(): > raise > i = consume(s) > u = parse_unit(s) > if u: > return (i, u) > return i > > def parse_unit(s): > 'unit := "a" | "b" | "c" | empty' > if peek(s) not in 'abc': > return None > return consume(s) > > def peek(s): > return s[0] > > def expect(s, t): > c = consume(s) > if c != t: > raise 'Expected %s but got %s' % (t, c) > return > > def consume(s): > return s.pop(0) > > main() > |
From: Paolo L. <p....@hy...> - 2006-09-12 15:57:01
Hi Ralph,

Ralph Corderoy wrote:
> whereas I'd like the left associativity explicit, i.e.
>
>                         +
>                        / \
>     1 + 2 + 3  ==>    +   3
>                      / \
>                     1   2
>
> I can achieve this with a right-associative operator, e.g. exponent, but
> how do I do this for left-associative ones whilst avoiding endless
> left-recursion?

I think you have no other choice but to rearrange ParseResults by
setting a specific parse action for the expression:

from pyparsing import *

plus = Literal('+')
num = Word(nums)

def left_associative_regroup(s,l,t):
    new_t = ParseResults( [ t[0], t[2] ] )
    for item in t[3:]:
        if item != "+":
            new_t = ParseResults( [new_t, item] )
    return new_t

expr = num + ZeroOrMore( plus + num )
expr.setParseAction(left_associative_regroup)

print expr.parseString("1 + 2 + 3 + 4 + 5")

> I wonder if PyParsing would benefit from being able to define operator
> tables, similar to K&R's, giving the tokens, precedence, and
> associativity, and being able to plumb that into the rest of the
> recursive descent parser at some point?

It's not trivial at all.

From http://en.wikipedia.org/wiki/Left_recursion

"A formal grammar that contains left recursion cannot be parsed by a
recursive descent parser"

PyParsing qualifies as a recursive descent parser.

I think it would rather be possible to build Specialized Constructs
such as

LeftAssociativeOperator( term=Word(nums), operator=Literal('+') )

but those would only cope with immediate left recursion...

Hope this helps

Ciao
Paolo
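The regrouping that such a parse action performs is just a fold over the flat match list that ZeroOrMore produces. Stripped of ParseResults, the two directions look like this (plain tuples for illustration, and keeping the operator in each nested group):

```python
def fold_left(tokens):
    """Regroup a flat [operand, op, operand, op, ...] list into
    left-associated (lhs, op, rhs) triples."""
    tree = tokens[0]
    for i in range(1, len(tokens) - 1, 2):
        tree = (tree, tokens[i], tokens[i + 1])
    return tree

def fold_right(tokens):
    """Right-associated counterpart, as wanted for exponentiation."""
    tree = tokens[-1]
    for i in range(len(tokens) - 2, 0, -2):
        tree = (tokens[i - 1], tokens[i], tree)
    return tree

fold_left(['1', '+', '2', '+', '3'])    # (('1', '+', '2'), '+', '3')
fold_right(['1', '^', '2', '^', '3'])   # ('1', '^', ('2', '^', '3'))
```

Attaching `fold_left` as a parse action on the whole expression (rather than on ZeroOrMore, which is not possible, as noted above) is exactly the trick being discussed.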
From: Ralph C. <ra...@in...> - 2006-09-12 12:24:22
Hi,

PyParsing seems great; many thanks.  But I'm having trouble concocting
the right grammar for the typical binary operator part.  I'm aware of
the latest fourFn.py but its use of ZeroOrMore() results in

                      +
    1 + 2 + 3  ==>   /|\
                    1 2 3

whereas I'd like the left associativity explicit, i.e.

                        +
                       / \
    1 + 2 + 3  ==>    +   3
                     / \
                    1   2

I can achieve this with a right-associative operator, e.g. exponent, but
how do I do this for left-associative ones whilst avoiding endless
left-recursion?

I wonder if PyParsing would benefit from being able to define operator
tables, similar to K&R's, giving the tokens, precedence, and
associativity, and being able to plumb that into the rest of the
recursive descent parser at some point?

Cheers,


Ralph.
From: Chris L. <ch...@ka...> - 2006-09-01 18:40:23
Hi, Sorry for the delayed response. Vacation, work, life all have conspired to keep me away from this for a little while. Also, I think my previous message may have pulled you away from the direction I was trying to go in. My recursive descent example was meant as an example of how the problem is avoided in other approaches to parsing in the hopes that we could extract the part that makes error detection possible and use that in PyParsing. I hope that I have expressed myself better below. It will probably help that I have a more well formed idea of how to solve the problem now too :) On Wed, Aug 16, 2006 at 09:00:09AM -0500, Paul McGuire wrote: > 1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness. I will not argue this any further. I think I see how it falls in the category of recursive descent even if it is not how recursive descent parsers are traditionally written. > > It is also this use of the stack for backtracking that makes real parse > error detection difficult. I don't think it is the backtracking in general, I think it is the overuse of backtracking. When we branch while parsing we need to know with absolute certainty that we are parsing the correct thing, or that we have just encountered an error. PyParsing does not make a distinction on exceptions between "this is the wrong option" and "there is no option". > > > 2. From your message: > > At each stage we know exactly what tokens we need in order to continue > > parsing, otherwise we have an error. I think that it should > > be possible to give pyparsing this behaviour, but my attempts at > > analysing the source code have left me a bit lost at this point. > > > Well, in one sense, pyparsing already has this behavior - each parse > expression contains its own idea of "what should come next", and they are in > turn composed into more complex expressions. 
> > Do you mean that pyparsing should explode its grammar down to the raw > character level, to become more of a character-based state machine rather > than the "recursive traverser of parsing expression objects" that it > currently is? Certainly, we could take the basic types such as I am saying we need to make a greater distinction between terminals(Token: Literal, Word, Regex, etc) and non terminals (ParseExpression: And,Or,OneOrMore, etc). More traditional parser/lexer libraries make this easy on the library programmer by separating the two. A lexer provides a stream of tokens, the parser combines these tokens into a syntax. This is more complicated for the library user than the interface provided by PyParsing, so not something we want to pursue too closely. The key difference is that the traditional parser knows what state it needs to move to based on what token is next in the token stream, consuming the token when it reaches a terminal rule. PyParsing is not aware of the current token in the input stream in non terminal rules (ParseExpression objects) and so it does not know which branch is the 'correct' one until it tries it and thus needs to take all branches in turn until it finds one that works. When there is no correct branch, there is no way to determine if that is because the input is wrong, or if we are in the wrong branch of a parent object. The assumption in the case that there is no correct branch is that we are currently in a wrong branch, and we backtrack to the previous level instead of knowing that we indeed have an error in the input. A simple solution to this problem is to iterate through our options and call a function which I will call 'match'. If match returns True, then we know that this object is the 'correct' branch. The key difference between match and parseImpl is in what is considered success. 
'And.match' has a significantly different idea of success from 'And.parseImpl' in that we return True (success) if the first sub-object matches, where as parseImpl only registers success if all sub-objects due so as well. This is an important distinction to make because we are looking for the 'correct' branch, not a fully successful parse. 'Or.match' returns True if any of its sub-objects return True. OneOrMore returns True if its sub-object returns True. ZeroOrMore always returns True. Remember that match is used to gate entrance to a sub-object's parseImpl; match returning True means "I am the correct path." Terminals (Literal, Word, Regex, etc) do an actual test to see whether the input stream matches and returns an appropriate value. parseImpl would obviously then need to be modified to use this new information and ultimately would be able to flag an error in the input at the location it actually occurs. One problem with this approach is that we are at least doubling the number of comparisons. That may cancel out because of reducing the amount of backtracking that is done. If we do these modifications, we can look at optimizations after we have something that works. We may need to do some kind of caching mechanism. The other problem with this approach that I can think of is that it changes the class of grammar that can be parsed with PyParsing. It may be possible that some grammar as defined with the current PyParsing will work, but because of the definitive nature of this new mechanism may not. I can't think of a concrete example off the top of my head and am not even sure how to approach definitively prove or disprove it. I hope that was more coherent than my last message and gives you some ideas about how we can proceed, Chris |
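Chris's match() proposal is essentially predictive parsing: each alternative advertises the tokens it can start with, and the chooser commits to a branch before descending. A toy sketch (plain Python, not pyparsing internals; the class names and `first()` method are made up for illustration) of how committing turns "wrong branch" backtracking into a reportable input error:

```python
class Lit:
    """A terminal: matches a fixed string."""
    def __init__(self, s): self.s = s
    def first(self): return {self.s[0]}
    def parse(self, text, loc):
        if not text.startswith(self.s, loc):
            raise SyntaxError('expected %r at %d' % (self.s, loc))
        return loc + len(self.s), self.s

class Seq:
    """A non-terminal: its FIRST set is its first element's
    (assumes the first element cannot match the empty string)."""
    def __init__(self, *parts): self.parts = parts
    def first(self): return self.parts[0].first()
    def parse(self, text, loc):
        out = []
        for p in self.parts:
            loc, v = p.parse(text, loc)
            out.append(v)
        return loc, out

class Or:
    def __init__(self, *alts): self.alts = alts
    def first(self): return set().union(*(a.first() for a in self.alts))
    def parse(self, text, loc):
        for alt in self.alts:
            if loc < len(text) and text[loc] in alt.first():  # the match() gate
                return alt.parse(text, loc)   # commit: no backtracking
        raise SyntaxError('no alternative matches at %d' % loc)

g = Or(Seq(Lit('a'), Lit('b')), Lit('c'))
g.parse('ab', 0)   # (2, ['a', 'b'])
g.parse('c', 0)    # (1, 'c')
# g.parse('ax', 0) raises SyntaxError at position 1: having committed to
# the first branch, the failure is a genuine input error, not a cue to
# try the next alternative.
```

The limitation Chris anticipates shows up immediately: alternatives whose FIRST sets overlap cannot be separated by this gate, which is exactly the change in the class of accepted grammars he mentions.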
From: Paolo L. <p....@hy...> - 2006-08-24 11:07:22
Paul, thank you for your reply Paul McGuire wrote: > Paolo - > > Thanks for the whitespace bugfix. Did you have a test case that raised this > problem in the first place? If so, I'd like to add it to my unit tests. Attached... > Packrat parsing has issues besides the ParseResults problem you describe. > Some parse actions have important side effects that get omitted when results > are retrieved from a cache instead of being explicitly reparsed. The > packrat concept showed such promise I couldn't pass it up, but I had to > leave it disabled by default, so as not to break code for existing > applications. This is an especially interesting phenomenon for me; as far > as pure theoretical parsing is concerned, packratting is a no-brainer. But > pyparsing's implementation, including parse actions, goes beyond the pure > parsing function, and in the process breaks the packrat concept. I crashed against the parseAction problem as well after posting the first email. I agree with you that packratting requires a deep review of pyparsing: I just tried to refactor _parseNoCache in order to handle doAction flag cleanly but I've given up :-) Why not adding doAction to the lookup key? I attach a proof of concept patch ("patch_fix_packrat.diff") that seems to fix all my cases... > In upgrading your app to be able to use packratting, is it possible that > your grammar could be restructured? For instance, in your example, if you > change: > e = A ^ ( A + B + C ) > to: > e2 = A + Optional(B + C) > then packratting does no harm. I'm sure this is a simplified example for > the sake of easy posting, but maybe similar mods could be made to your base > app grammar. Not easy at all... it would require review tons of code :-) > Performance enhancements is a sore subject for me. Pyparsing has gotten > some pretty nasty comments with respect to its parsing speed, so I admit I > committed a number of coding sins in trying to eke out a little more speed > at parsing time. 
I've also been reluctant to port parts of pyparsing to C, > as I think this complicates its inclusion in other projects, and also its > portability. IMO, the real value of pyparsing is not speed but predictability, elagance, flexibility, easy of use... speed should be a second priority goal. > Here are the main performance enhancement techniques, in rough descending > order of importance, as I recall (not counting packratting): > - cacheing of exception objects - pyparsing raises exceptions a *lot* as > various paths are tried and rejected. I optimized the creation of > exceptions as much as possible, and then realized that I could keep raising > the same exception object. This breaks threading as well... > - using startswith in Literal instead of comparing the literal string with a > string slice. Literals are the most common terminal, and so > Literal.parseImpl has to be fast. Ok > - ParseResults reuse with __new__ and __iadd__. There are non-performance > reasons for doing this also, mostly having to do with gathering results from > the exprs in an And. My proof of concept patch remove the functional reason. Do you agree? > - avoidance of try/except - one of the ugliest parts of the _parse method is > the repetition of code, depending on whether debugging methods are defined > or not. But setting up the try/except has a cost, and _parse is about as > high-traffic a method as there is. Ok > - using re's or dict keys for Word.parseImpl testing with initChars and > bodyChars Ok. Which is the overall performance improvement you got from exception caching, parseresults reuse and _parse() optimizations? > But pyparsing seems to be stabilizing - the last release was about 7 weeks > ago, and I've made only minor changes since then. Perhaps pyparsing could > move forward as a pair of releases, one pure Python and one C-enhanced > (similar to elementTree or StringIO). That exactly what I was thinking to. 
Have a clean, portable, reference pure python implementation and add C optimizations... > I've done *no* Python+C development, > so this is the other barrier for me to doing such a task. I *have* put in > way too much time on pyparsing, at the expense of some family and work time, > so I'm not eager to take on a major rewrite just now. I understand perfectly. I'm very busy as well and cannot make promises. If you have a "benchmark" grammar and test cases that you think are representative I'll be glad of work on them in my spare time. > If I had a multithreaded app that used pyparsing, I would probably serialize > access to the parser object. I don't want to make the resetCache invocation > explicit, since the need for it is kind of obscure, and so likely to be > overlooked. I agree. > Thanks for your persistence with pyparsing, Paolo, I know you have been > working with it for a while. (We never did get to try adding the GLR > mode...). I tried. I come to almost complete ugly working solution (parseImpl is generator of solutions) but it was difficult to build predictable grammar with it. I'm now using with great satisfaction the stock pyparsing. ;-) Ciao e Grazie Paolo |
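Paolo's proposed fix above, adding the doAction flag to the packrat lookup key, can be illustrated with a toy memoiser (the decorator and rule below are illustrative, not pyparsing's actual cache code):

```python
cache = {}
work = []   # records every non-cached parse, to show what hits the cache

def packrat(rule):
    """Memoise a parse rule by (rule, position, doAction) -- including
    doAction in the key keeps the actions-suppressed lookahead passes
    from poisoning the cache used by the real parse."""
    def wrapper(text, loc, do_action=True):
        key = (rule.__name__, loc, do_action)
        if key not in cache:
            cache[key] = rule(text, loc, do_action)
        return cache[key]
    return wrapper

@packrat
def digit(text, loc, do_action=True):
    work.append((loc, do_action))            # count real parse work
    if loc < len(text) and text[loc].isdigit():
        return (loc + 1, text[loc])
    return None                              # failure, memoised too

digit('7', 0)                    # parsed: work == [(0, True)]
digit('7', 0)                    # cache hit: work unchanged
digit('7', 0, do_action=False)   # distinct key, parsed afresh
```

Note that, as in pyparsing, the key does not include the input string, so the cache must still be flushed between top-level parses; that is precisely where the resetCache()/per-thread-cache question above comes from.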
From: Paul M. <pa...@al...> - 2006-08-24 07:18:33
Paolo - Thanks for the whitespace bugfix. Did you have a test case that raised this problem in the first place? If so, I'd like to add it to my unit tests. Packrat parsing has issues besides the ParseResults problem you describe. Some parse actions have important side effects that get omitted when results are retrieved from a cache instead of being explicitly reparsed. The packrat concept showed such promise I couldn't pass it up, but I had to leave it disabled by default, so as not to break code for existing applications. This is an especially interesting phenomenon for me; as far as pure theoretical parsing is concerned, packratting is a no-brainer. But pyparsing's implementation, including parse actions, goes beyond the pure parsing function, and in the process breaks the packrat concept. In upgrading your app to be able to use packratting, is it possible that your grammar could be restructured? For instance, in your example, if you change: e = A ^ ( A + B + C ) to: e2 = A + Optional(B + C) then packratting does no harm. I'm sure this is a simplified example for the sake of easy posting, but maybe similar mods could be made to your base app grammar. Performance enhancements is a sore subject for me. Pyparsing has gotten some pretty nasty comments with respect to its parsing speed, so I admit I committed a number of coding sins in trying to eke out a little more speed at parsing time. I've also been reluctant to port parts of pyparsing to C, as I think this complicates its inclusion in other projects, and also its portability. Here are the main performance enhancement techniques, in rough descending order of importance, as I recall (not counting packratting): - cacheing of exception objects - pyparsing raises exceptions a *lot* as various paths are tried and rejected. I optimized the creation of exceptions as much as possible, and then realized that I could keep raising the same exception object. 
- using startswith in Literal instead of comparing the literal string with a string slice. Literals are the most common terminal, and so Literal.parseImpl has to be fast. - ParseResults reuse with __new__ and __iadd__. There are non-performance reasons for doing this also, mostly having to do with gathering results from the exprs in an And. - avoidance of try/except - one of the ugliest parts of the _parse method is the repetition of code, depending on whether debugging methods are defined or not. But setting up the try/except has a cost, and _parse is about as high-traffic a method as there is. - using re's or dict keys for Word.parseImpl testing with initChars and bodyChars But pyparsing seems to be stabilizing - the last release was about 7 weeks ago, and I've made only minor changes since then. Perhaps pyparsing could move forward as a pair of releases, one pure Python and one C-enhanced (similar to elementTree or StringIO). I've done *no* Python+C development, so this is the other barrier for me to doing such a task. I *have* put in way too much time on pyparsing, at the expense of some family and work time, so I'm not eager to take on a major rewrite just now. If I had a multithreaded app that used pyparsing, I would probably serialize access to the parser object. I don't want to make the resetCache invocation explicit, since the need for it is kind of obscure, and so likely to be overlooked. Thanks for your persistence with pyparsing, Paolo, I know you have been working with it for a while. (We never did get to try adding the GLR mode...). -- Paul |
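The startswith point above is about avoiding a temporary string: slicing allocates a copy before comparing, while str.startswith(prefix, loc) tests in place. The two forms are equivalent, and the (machine-dependent) cost difference can be eyeballed with timeit:

```python
import timeit

text = 'x' * 1000 + 'keyword' + 'y' * 1000
lit, loc = 'keyword', 1000

# the two equivalent tests a Literal-style terminal could use
assert text[loc:loc + len(lit)] == lit
assert text.startswith(lit, loc)

slice_t = timeit.timeit(lambda: text[loc:loc + len(lit)] == lit, number=200000)
starts_t = timeit.timeit(lambda: text.startswith(lit, loc), number=200000)
print('slice: %.3fs  startswith: %.3fs' % (slice_t, starts_t))
```

For short literals the difference is small, but in a hot path like Literal.parseImpl it adds up, which is the point being made.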
From: Paolo L. <p....@hy...> - 2006-08-22 23:09:55
Hello Paul and everyone,

I'm trying to upgrade my application from 1.3.3 to 1.4.3, mainly for
the PackRat feature since I expect a huge performance gain from it in
my app.  I'm finding some problems/bugs though...

### PackRat bug

The ParseResults reuse strategy conflicts with PackRat ParseResult
caching.  For example:

-----------------------
from pyparsing import *

A = Literal('A')
B = Literal('B')
C = Literal('C')

e = A ^ ( A + B + C )

print e.parseString("AB")
ParserElement.enablePackrat()
print e.parseString("AB")
-------------------------

The solution could be dropping the reuse strategy (drop __iadd__ and
__new__ from ParseResults).  I think that this not only would solve the
bug but would also make the code easier to understand/maintain with a
little sacrifice in performance (has the performance advantage ever
been measured?).  I know this is a drastic solution and I want to hear
from you Paul since I'd like to keep in synch with your official
releases...

### skipWhitespace bug

I attach a small patch related to whitespace skipping...

### PackRat threading enhancement

The current packrat implementation seems to be thread safe but not
thread friendly: a new thread, via resetCache(), forces a global cache
flush even when other threads are running...  I suggest making the
resetCache invocation explicit, or making a specific per-thread
cache...  what do you think?

### Performance and simple code

I see that in some points of the code (i.e. ParserElement._parse) some
performance optimization is achieved by making the code a little bit
more complex...  IMVHO performance optimization should be obtained by
implementing the critical code in C (many other python projects follow
the same rule: a clean reference implementation in python, a
functionally equivalent fast one in C).  I volunteer for the C
implementation if we agree on the approach...  main targets could be
ParseResults and ParserElement._parse...

Thank you very much for the fantastic work

Ciao
Paolo
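The clash reported above comes down to caching a mutable object and then mutating it in place on a later parse path (ParseResults' `__iadd__`/`__new__` reuse). The failure mode can be shown with nothing but a dict and a list; this is an illustration of the hazard, not pyparsing's code:

```python
cache = {}

def cached_result(loc):
    # packrat-style lookup of a mutable "parse result"
    if loc not in cache:
        cache[loc] = ['A']
    return cache[loc]

r1 = cached_result(0)
r1 += ['B']              # an And accumulating into the result in place
r2 = cached_result(0)    # a later alternative re-reads the cache...
print(r2)                # ['A', 'B'] -- the cached result was corrupted
```

Dropping the in-place reuse, as suggested, breaks the aliasing; so does the other standard remedy, returning a copy of the cached value on every cache hit.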