Re: [Pyparsing] expression not greedy enough
Brought to you by:
ptmcg
From: Diez B. R. <de...@we...> - 2009-09-02 12:37:27
|
Hi Paul, thank you very much for your detailed and helpful answer - I updated my parser, and gained some 20-30% speedup. But I wish it was faster :) Here is the new version: http://pyparsing.pastebin.com/m2736e089 There are some issues I still have, see below: > 1. I see you solved your NUMBER issues, but I think you still have some > misconceptions about repetition, especially about Word. Here are your > NUMBER elements: > > numlit = Word(srange("[0-9]")) > DOT = Literal(".") > NUMBER = Combine(OneOrMore(numlit)) ^ Combine(ZeroOrMore(numlit) + DOT > + OneOrMore(numlit)) > > Here is the reference from the BNF: > > num [0-9]+|[0-9]*"."[0-9]+ > > Word is there to define "word groups" or contiguous characters in a > particular set. A better translation of num to pyparsing would be: > > numlit = Word(srange("[0-9]")) > DOT = Literal(".") > NUMBER = numlit | Combine(Optional(numlit) + "." + numlit) > > Word already takes care of the character repetition, there is no need for > the OneOrMore or ZeroOrMore. > > But in practice, I've found that numeric literal parsing is usually a > frequent step in overall parsing, and that a Regex term is worth the > trouble for measurably better parser performance: > > NUMBER = Regex(r"[0-9]*\.[0-9]+|[0-9]+") Done. > > > 2. Why this definition of FUNCTION and function? (Nevermind, I looked at > your BNF reference and found that this is mapping directly from the YACC > definitions.) > > FUNCTION = Combine(IDENT+ LPAREN) > ... > function = FUNCTION + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + > RPAREN > > This makes it hard to see the matching of parens. I would suggest: > > function = IDENT + LPAREN + ZeroOrMore(Optional(IDENT + EQUAL) + expr) > + RPAREN > > Lastly, to give structure to your results: > > funcarg = Optional(IDENT + EQUAL) + expr > function = IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) + > RPAREN Done that, too. > Now that the arguments are grouped, the parens are unnecessary in the > parsed output, you can suppress them. > > > > 3. expr follows a very common pattern, that of the delimited list. > > expr << (term + ZeroOrMore( Optional(operator) + term)) > > Here you could instead use: > > expr << delimitedList(term, delim=Optional(operator)) Here, I have the problem that the delimiters are operators - and these are swallowed. I could give the combine-parameter, but then I'm getting the whole string, having to parse that again - which makes no sense (especially as here we can have nested expressions, so "simple" approaches don't work. Any other suggestion, or shall I just keep my original rule? > 4. You may have gone a bit overboard in using '^' vs. '|'. For instance: > > LENGTH = Combine(NUMBER + (Literal("px") ^ Literal("cm") ^ > Literal("mm") ^ > Literal("in") ^ Literal("pt") ^ > Literal("pc"))) > > When you use '^', all matches are evaluated, even if there is a match early > in the list. Now in this case, if you parse the 'px' in '100px', there is > no point in checking for a match with 'cm', 'mm', 'in', etc. In this case > a MatchFirst is perfectly okay. Plus you can order the units in some > expected frequency of occurrence. For these cases, your suggestion to use | worked, but not for this one: term = Optional(unary_operator) + ((PERCENTAGE | LENGTH | EMS | EXS | ANGLE | TIME | FREQ | NUMBER)\ | STRING | URI | hexcolor | (function ^ IDENT)) I think the problem here is the IDENT vs. function - both start with an IDENT. But I'd say problem should be solvable with a lookahead of 1 (in normal LL(k)-terms), so maybe you have an idea here, too. Thanks again for your support! Diez |