Thread: [Pyparsing] expression not greedy enough
Brought to you by:
ptmcg
From: Diez B. R. <de...@we...> - 2009-08-25 17:32:21
|
Hi, I'm in the process of writing a CSS-parser - and encountered a behavior that I don't understand. The attached file shows it - I expect the expression 0 100px to be parsed as two tokens, a NUMBER and an LENGTH. But instead, it seems to be parsed as NUMBER, NUMBER, IDENT. The second test-case shows that if there is only one sub-expression, things work as expected. Any suggestions? Diez |
From: Paul M. <pt...@au...> - 2009-08-26 04:34:54
|
Still no joy in the attachments. Try just including the text in the body of your e-mail. Or use pyparsing.pastebin.com. -- Paul > -----Original Message----- > From: Diez B. Roggisch [mailto:de...@we...] > Sent: Tuesday, August 25, 2009 12:32 PM > To: Pyp...@li... > Subject: [Pyparsing] expression not greedy enough > > Hi, > > I'm in the process of writing a CSS-parser - and encountered a behavior > that I > don't understand. > > The attached file shows it - I expect the expression > > 0 100px > > to be parsed as two tokens, a NUMBER and an LENGTH. > > But instead, it seems to be parsed as NUMBER, NUMBER, IDENT. > > The second test-case shows that if there is only one sub-expression, > things > work as expected. > > Any suggestions? > > Diez |
From: Diez B. R. <de...@we...> - 2009-08-26 08:28:18
|
On Wednesday 26 August 2009 06:17:04 Paul McGuire wrote: > Still no joy in the attachments. Try just including the text in the body > of your e-mail. Or use pyparsing.pastebin.com. > Thanks, here it is: http://pyparsing.pastebin.com/m5df6ab17 However, I solved the issue - see the NUMBER-nonterminal. But it might help if you guys take a look if that's really the way to go. Diez |
From: Paul M. <pt...@au...> - 2009-08-29 03:58:00
|
> However, I solved the issue - see the NUMBER-nonterminal. But it might > help if > you guys take a look if that's really the way to go. > > Diez > Diez - 1. I see you solved your NUMBER issues, but I think you still have some misconceptions about repetition, especially about Word. Here are your NUMBER elements: numlit = Word(srange("[0-9]")) DOT = Literal(".") NUMBER = Combine(OneOrMore(numlit)) ^ Combine(ZeroOrMore(numlit) + DOT + OneOrMore(numlit)) Here is the reference from the BNF: num [0-9]+|[0-9]*"."[0-9]+ Word is there to define "word groups" or contiguous characters in a particular set. A better translation of num to pyparsing would be: numlit = Word(srange("[0-9]")) DOT = Literal(".") NUMBER = numlit | Combine(Optional(numlit) + "." + numlit) Word already takes care of the character repetition, there is no need for the OneOrMore or ZeroOrMore. But in practice, I've found that numeric literal parsing is usually a frequent step in overall parsing, and that a Regex term is worth the trouble for measurably better parser performance: NUMBER = Regex(r"[0-9]*\.[0-9]+|[0-9]+") 2. Why this definition of FUNCTION and function? (Nevermind, I looked at your BNF reference and found that this is mapping directly from the YACC definitions.) FUNCTION = Combine(IDENT+ LPAREN) ... function = FUNCTION + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + RPAREN This makes it hard to see the matching of parens. I would suggest: function = IDENT + LPAREN + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + RPAREN Lastly, to give structure to your results: funcarg = Optional(IDENT + EQUAL) + expr function = IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) + RPAREN Now that the arguments are grouped, the parens are unnecessary in the parsed output, you can suppress them. 3. expr follows a very common pattern, that of the delimited list. expr << (term + ZeroOrMore( Optional(operator) + term)) Here you could instead use: expr << delimitedList(term, delim=Optional(operator)) 4. You may have gone a bit overboard in using '^' vs. '|'. For instance: LENGTH = Combine(NUMBER + (Literal("px") ^ Literal("cm") ^ Literal("mm") ^ Literal("in") ^ Literal("pt") ^ Literal("pc"))) When you use '^', all matches are evaluated, even if there is a match early in the list. Now in this case, if you parse the 'px' in '100px', there is no point in checking for a match with 'cm', 'mm', 'in', etc. In this case a MatchFirst is perfectly okay. Plus you can order the units in some expected frequency of occurrence. LENGTH = Combine(NUMBER + (Literal("px") | Literal("cm") | Literal("mm") | Literal("in") | Literal("pt") | Literal("pc"))) Now this could get you in trouble, if one of these terms was actually a leading subset of another, like "pts" and "pt". You would have to take care to test for the longer choice first. Pyparsing's helper method oneOf handles this (and internally generates a Regex for performance): LENGTH = Combine(NUMBER + oneOf("px cm mm in pt pc")) Thanks for giving pyparsing a shot! -- Paul |
From: Diez B. R. <de...@we...> - 2009-09-02 12:37:27
|
Hi Paul, thank you very much for your detailed and helpful answer - I updated my parser, and gained some 20-30% speedup. But I wish it was faster :) Here is the new version: http://pyparsing.pastebin.com/m2736e089 There are some issues I still have, see below: > 1. I see you solved your NUMBER issues, but I think you still have some > misconceptions about repetition, especially about Word. Here are your > NUMBER elements: > > numlit = Word(srange("[0-9]")) > DOT = Literal(".") > NUMBER = Combine(OneOrMore(numlit)) ^ Combine(ZeroOrMore(numlit) + DOT > + OneOrMore(numlit)) > > Here is the reference from the BNF: > > num [0-9]+|[0-9]*"."[0-9]+ > > Word is there to define "word groups" or contiguous characters in a > particular set. A better translation of num to pyparsing would be: > > numlit = Word(srange("[0-9]")) > DOT = Literal(".") > NUMBER = numlit | Combine(Optional(numlit) + "." + numlit) > > Word already takes care of the character repetition, there is no need for > the OneOrMore or ZeroOrMore. > > But in practice, I've found that numeric literal parsing is usually a > frequent step in overall parsing, and that a Regex term is worth the > trouble for measurably better parser performance: > > NUMBER = Regex(r"[0-9]*\.[0-9]+|[0-9]+") Done. > > > 2. Why this definition of FUNCTION and function? (Nevermind, I looked at > your BNF reference and found that this is mapping directly from the YACC > definitions.) > > FUNCTION = Combine(IDENT+ LPAREN) > ... > function = FUNCTION + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + > RPAREN > > This makes it hard to see the matching of parens. I would suggest: > > function = IDENT + LPAREN + ZeroOrMore(Optional(IDENT + EQUAL) + expr) > + RPAREN > > Lastly, to give structure to your results: > > funcarg = Optional(IDENT + EQUAL) + expr > function = IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) + > RPAREN Done that, too. > Now that the arguments are grouped, the parens are unnecessary in the > parsed output, you can suppress them. > > > > 3. expr follows a very common pattern, that of the delimited list. > > expr << (term + ZeroOrMore( Optional(operator) + term)) > > Here you could instead use: > > expr << delimitedList(term, delim=Optional(operator)) Here, I have the problem that the delimiters are operators - and these are swallowed. I could give the combine-parameter, but then I'm getting the whole string, having to parse that again - which makes no sense (especially as here we can have nested expressions, so "simple" approaches don't work. Any other suggestion, or shall I just keep my original rule? > 4. You may have gone a bit overboard in using '^' vs. '|'. For instance: > > LENGTH = Combine(NUMBER + (Literal("px") ^ Literal("cm") ^ > Literal("mm") ^ > Literal("in") ^ Literal("pt") ^ > Literal("pc"))) > > When you use '^', all matches are evaluated, even if there is a match early > in the list. Now in this case, if you parse the 'px' in '100px', there is > no point in checking for a match with 'cm', 'mm', 'in', etc. In this case > a MatchFirst is perfectly okay. Plus you can order the units in some > expected frequency of occurrence. For these cases, your suggestion to use | worked, but not for this one: term = Optional(unary_operator) + ((PERCENTAGE | LENGTH | EMS | EXS | ANGLE | TIME | FREQ | NUMBER)\ | STRING | URI | hexcolor | (function ^ IDENT)) I think the problem here is the IDENT vs. function - both start with an IDENT. But I'd say problem should be solvable with a lookahead of 1 (in normal LL(k)-terms), so maybe you have an idea here, too. Thanks again for your support! Diez |