Re: [Pyparsing] expression not greedy enough

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi Paul,

thank you very much for your detailed and helpful answer - I updated my 
parser, and gained some 20-30% speedup. But I wish it was faster :)

Here is the new version:

http://pyparsing.pastebin.com/m2736e089

There are some issues I still have, see below:

> 1. I see you solved your NUMBER issues, but I think you still have some
> misconceptions about repetition, especially about Word.  Here are your
> NUMBER elements:
>
>     numlit = Word(srange("[0-9]"))
>     DOT = Literal(".")
>     NUMBER = Combine(OneOrMore(numlit)) ^ Combine(ZeroOrMore(numlit) + DOT
> + OneOrMore(numlit))
>
> Here is the reference from the BNF:
>
>     num		[0-9]+|[0-9]*"."[0-9]+
>
> Word is there to define "word groups" or contiguous characters in a
> particular set.  A better translation of num to pyparsing would be:
>
>     numlit = Word(srange("[0-9]"))
>     DOT = Literal(".")
>     NUMBER = numlit | Combine(Optional(numlit) + "." + numlit)
>
> Word already takes care of the character repetition, there is no need for
> the OneOrMore or ZeroOrMore.
>
> But in practice, I've found that numeric literal parsing is usually a
> frequent step in overall parsing, and that a Regex term is worth the
> trouble for measurably better parser performance:
>
>     NUMBER = Regex(r"[0-9]*\.[0-9]+|[0-9]+")

Done.

>
>
> 2. Why this definition of FUNCTION and function?  (Nevermind, I looked at
> your BNF reference and found that this is mapping directly from the YACC
> definitions.)
>
>     FUNCTION = Combine(IDENT+ LPAREN)
>     ...
>     function = FUNCTION + ZeroOrMore(Optional(IDENT + EQUAL) + expr) +
> RPAREN
>
> This makes it hard to see the matching of parens.  I would suggest:
>
>     function = IDENT + LPAREN + ZeroOrMore(Optional(IDENT + EQUAL) + expr)
> + RPAREN
>
> Lastly, to give structure to your results:
>
>     funcarg = Optional(IDENT + EQUAL) + expr
>     function = IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) +
> RPAREN

Done that, too.

> Now that the arguments are grouped, the parens are unnecessary in the
> parsed output, you can suppress them.
>
>
>
> 3. expr follows a very common pattern, that of the delimited list.
>
>     expr << (term + ZeroOrMore( Optional(operator) +  term))
>
> Here you could instead use:
>
>     expr << delimitedList(term, delim=Optional(operator))

Here, I have the problem that the delimiters are operators - and these are 
swallowed. I could give the combine-parameter, but then I'm getting the whole 
string, having to parse that again - which makes no sense (especially as here 
we can have nested expressions, so "simple" approaches don't work.

Any other suggestion, or shall I just keep my original rule?

> 4. You may have gone a bit overboard in using '^' vs. '|'.  For instance:
>
>     LENGTH = Combine(NUMBER + (Literal("px") ^ Literal("cm") ^
> Literal("mm") ^
>                                Literal("in") ^ Literal("pt") ^
> Literal("pc")))
>
> When you use '^', all matches are evaluated, even if there is a match early
> in the list.  Now in this case, if you parse the 'px' in '100px', there is
> no point in checking for a match with 'cm', 'mm', 'in', etc.  In this case
> a MatchFirst is perfectly okay.  Plus you can order the units in some
> expected frequency of occurrence.

For these cases, your suggestion to use | worked, but not for this one:

        term = Optional(unary_operator) + ((PERCENTAGE | LENGTH | EMS | EXS |
                       ANGLE | TIME | FREQ | NUMBER)\
                       | STRING | URI | hexcolor | (function ^ IDENT))

I think the problem here is the IDENT vs. function - both start with an IDENT. 
But I'd say problem should be solvable with a lookahead of 1 (in normal 
LL(k)-terms), so maybe you have an idea here, too.

Thanks again for your support!

Diez