Re: [Pyparsing] expression not greedy enough
From: Paul M. <pt...@au...> - 2009-08-29 03:58:00
> However, I solved the issue - see the NUMBER-nonterminal. But it might help if
> you guys take a look if that's really the way to go.
>
> Diez

Diez -

1. I see you solved your NUMBER issues, but I think you still have some misconceptions about repetition, especially about Word. Here are your NUMBER elements:

    numlit = Word(srange("[0-9]"))
    DOT = Literal(".")
    NUMBER = Combine(OneOrMore(numlit)) ^ Combine(ZeroOrMore(numlit) + DOT + OneOrMore(numlit))

Here is the reference from the BNF:

    num [0-9]+|[0-9]*"."[0-9]+

Word is there to define "word groups", that is, runs of contiguous characters from a particular set. A better translation of num to pyparsing would be:

    numlit = Word(srange("[0-9]"))
    DOT = Literal(".")
    NUMBER = Combine(Optional(numlit) + "." + numlit) | numlit

Word already takes care of the character repetition, so there is no need for the OneOrMore or ZeroOrMore. (The decimal form is listed first because '|' stops at the first alternative that matches; with numlit first, "12.5" would match only the "12".) But in practice, I've found that numeric literal parsing is usually a frequent step in overall parsing, and that a Regex term is worth the trouble for measurably better parser performance:

    NUMBER = Regex(r"[0-9]*\.[0-9]+|[0-9]+")

2. Why this definition of FUNCTION and function? (Never mind, I looked at your BNF reference and found that this maps directly from the YACC definitions.)

    FUNCTION = Combine(IDENT + LPAREN)
    ...
    function = FUNCTION + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + RPAREN

This makes it hard to see the matching of parens. I would suggest:

    function = IDENT + LPAREN + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + RPAREN

Lastly, to give structure to your results:

    funcarg = Optional(IDENT + EQUAL) + expr
    function = IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) + RPAREN

Now that the arguments are grouped, the parens are unnecessary in the parsed output, so you can suppress them.

3. expr follows a very common pattern, that of the delimited list.

    expr << (term + ZeroOrMore(Optional(operator) + term))

Here you could instead use:

    expr << delimitedList(term, delim=Optional(operator))

4. You may have gone a bit overboard in using '^' vs. '|'. For instance:

    LENGTH = Combine(NUMBER + (Literal("px") ^ Literal("cm") ^ Literal("mm") ^
                               Literal("in") ^ Literal("pt") ^ Literal("pc")))

When you use '^', all alternatives are evaluated, even if there is a match early in the list. In this case, once you have parsed the 'px' in '100px', there is no point in checking for a match with 'cm', 'mm', 'in', etc., so a MatchFirst is perfectly okay. Plus you can order the units by expected frequency of occurrence.

    LENGTH = Combine(NUMBER + (Literal("px") | Literal("cm") | Literal("mm") |
                               Literal("in") | Literal("pt") | Literal("pc")))

Now this could get you in trouble if one of these terms were actually a leading subset of another, like "pt" and "pts". You would have to take care to test for the longer choice first. Pyparsing's helper method oneOf handles this (and internally generates a Regex for performance):

    LENGTH = Combine(NUMBER + oneOf("px cm mm in pt pc"))

Thanks for giving pyparsing a shot!

-- Paul
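
The ordering concern in points 1 and 4 can be seen outside of pyparsing as well. The sketch below uses Python's standard re module as a stand-in, on the assumption that its first-match alternation behaves like '|' for this purpose; it is not a description of how pyparsing implements '|' or oneOf. The names first_match and longest_first are hypothetical helpers made up for this illustration, and "pts" is the invented unit from the example above.

```python
import re

# Number pattern from point 1: the decimal alternative comes first,
# because regex alternation takes the first alternative that matches.
NUMBER = re.compile(r"[0-9]*\.[0-9]+|[0-9]+")
# Same alternatives in the other order, to show what goes wrong.
NUMBER_SHORT_FIRST = re.compile(r"[0-9]+|[0-9]*\.[0-9]+")


def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


def longest_first(choices):
    """Build an alternation that tries longer strings before shorter ones.

    Hypothetical helper: sorting by length is one simple way to make sure
    a string is tried before any of its own prefixes.
    """
    ordered = sorted(choices, key=len, reverse=True)
    return re.compile("|".join(re.escape(c) for c in ordered))


# 'pt' listed before 'pts' shadows the longer unit.
units_naive = re.compile("|".join(["pt", "pts", "px"]))
units_sorted = longest_first(["pt", "pts", "px"])

print(first_match(NUMBER, "12.5px"))              # 12.5
print(first_match(NUMBER_SHORT_FIRST, "12.5px"))  # 12
print(first_match(units_naive, "pts"))            # pt
print(first_match(units_sorted, "pts"))           # pts
```

With the shorter alternative first, the match stops early and leaves ".5" or "s" behind for the rest of the grammar to trip over, which is the situation the ordering advice is meant to avoid.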