Re: [Pyparsing] expression not greedy enough

> However, I solved the issue - see the NUMBER-nonterminal. But it might
> help if
> you guys take a look if that's really the way to go.
> 
> Diez
> 

Diez -

1. I see you solved your NUMBER issues, but I think you still have some
misconceptions about repetition, especially about Word.  Here are your
NUMBER elements:

    numlit = Word(srange("[0-9]"))
    DOT = Literal(".")
    NUMBER = Combine(OneOrMore(numlit)) ^ Combine(ZeroOrMore(numlit) + DOT +
                                                  OneOrMore(numlit))

Here is the reference from the BNF:

    num		[0-9]+|[0-9]*"."[0-9]+

Word is there to define "word groups" or contiguous characters in a
particular set.  A better translation of num to pyparsing would be:

    numlit = Word(srange("[0-9]"))
    DOT = Literal(".")
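    NUMBER = Combine(Optional(numlit) + DOT + numlit) | numlit

Word already takes care of the character repetition, so there is no need for
the OneOrMore or ZeroOrMore.  The decimal alternative goes first because '|'
stops at the first alternative that matches, and a leading numlit would
otherwise cut "1.5" short at "1".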
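
A quick way to see that Word does the repetition by itself is the minimal
sketch below.  The input string is just an illustration:

```python
from pyparsing import OneOrMore, Word, srange

numlit = Word(srange("[0-9]"))

# Word consumes the whole run of digits and returns it as one token.
print(numlit.parseString("12345").asList())             # ['12345']

# Wrapping it in OneOrMore adds nothing here: the first Word match
# has already taken every adjacent digit.
print(OneOrMore(numlit).parseString("12345").asList())  # ['12345']
```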

But in practice, I've found that numeric literal parsing is usually a
frequent step in overall parsing, and that a Regex term is worth the trouble
for measurably better parser performance:

    NUMBER = Regex(r"[0-9]*\.[0-9]+|[0-9]+")
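
Pyparsing's Regex is built on Python's re module, so the pattern can be
checked on its own.  The sketch below compares it with a reversed variant
(swapped_re, a name made up for this illustration) to show why the decimal
form is listed first:

```python
import re

number_re = re.compile(r"[0-9]*\.[0-9]+|[0-9]+")   # pattern from above
swapped_re = re.compile(r"[0-9]+|[0-9]*\.[0-9]+")  # alternatives reversed

for text in ("42", "3.14", ".5"):
    # re tries alternatives left to right and keeps the first that matches.
    print(text, number_re.match(text).group(), swapped_re.match(text).group())

# 42 42 42
# 3.14 3.14 3
# .5 .5 .5
```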

2. Why this definition of FUNCTION and function?  (Never mind, I looked at
your BNF reference and found that it maps directly from the YACC
definitions.)

    FUNCTION = Combine(IDENT+ LPAREN)
    ...
    function = FUNCTION + ZeroOrMore(Optional(IDENT + EQUAL) + expr) + RPAREN

This makes it hard to see the matching of parens.  I would suggest:

    function = (IDENT + LPAREN + ZeroOrMore(Optional(IDENT + EQUAL) + expr) +
                RPAREN)

Lastly, to give structure to your results:

    funcarg = Optional(IDENT + EQUAL) + expr
    function = (IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) +
                RPAREN)

Now that the arguments are grouped, the parens are unnecessary in the parsed
output, so you can suppress them (for example, by defining LPAREN and RPAREN
with Suppress).
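
Here is a rough, self-contained sketch of the grouped form.  Your real IDENT,
EQUAL and expr are not shown in this message, so the definitions below are
simplified stand-ins, and the input string is invented for the example:

```python
from pyparsing import (Group, Literal, Optional, Suppress, Word,
                       alphanums, alphas, delimitedList)

# Stand-ins only: the real grammar's IDENT, EQUAL and expr will differ.
IDENT = Word(alphas, alphanums + "_")
EQUAL = Literal("=")
LPAREN, RPAREN = Suppress("("), Suppress(")")   # matched, but dropped from results
expr = Word(alphanums)                          # placeholder for the recursive expr

funcarg = Optional(IDENT + EQUAL) + expr
function = IDENT + LPAREN + Group(Optional(delimitedList(funcarg))) + RPAREN

# The function name comes first, then one nested list holding the arguments.
print(function.parseString("rgb(10, g=20, 30)").asList())
# ['rgb', ['10', 'g', '=', '20', '30']]
```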

3. expr follows a very common pattern, that of the delimited list.

    expr << (term + ZeroOrMore( Optional(operator) +  term))

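Here you could instead use delimitedList, bearing in mind that it suppresses
the delimiters, so the operators are dropped from the parsed results: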

    expr << delimitedList(term, delim=Optional(operator))
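
To compare the two forms side by side, here is a small sketch.  The term and
operator definitions are stand-ins chosen for the example, not your actual
grammar:

```python
from pyparsing import Optional, Word, ZeroOrMore, delimitedList, nums, oneOf

# Stand-ins for illustration; the real term and operator are richer.
term = Word(nums)
operator = oneOf("/ ,")

explicit = term + ZeroOrMore(Optional(operator) + term)
compact = delimitedList(term, delim=Optional(operator))

print(explicit.parseString("1 / 2 3").asList())  # ['1', '/', '2', '3']
print(compact.parseString("1 / 2 3").asList())   # ['1', '2', '3']
```

If the operators matter downstream, the explicit ZeroOrMore form is the one
that keeps them.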

4. You may have gone a bit overboard in using '^' vs. '|'.  For instance:

    LENGTH = Combine(NUMBER + (Literal("px") ^ Literal("cm") ^ Literal("mm") ^
                               Literal("in") ^ Literal("pt") ^ Literal("pc")))

When you use '^', all of the alternatives are evaluated and the longest match
is kept, even if there is a match early in the list.  In this case, once you
parse the 'px' in '100px', there is no point in checking for a match with
'cm', 'mm', 'in', etc., so a MatchFirst ('|') is perfectly okay.  Plus you can
order the units by expected frequency of occurrence.

    LENGTH = Combine(NUMBER + (Literal("px") | Literal("cm") | Literal("mm") |
                               Literal("in") | Literal("pt") | Literal("pc")))

Now this could get you in trouble if one of these terms were actually a
leading substring of another, like "pt" and "pts".  You would have to take
care to test for the longer choice first.  Pyparsing's helper oneOf handles
this reordering for you (and internally generates a Regex for performance):

    LENGTH = Combine(NUMBER + oneOf("px cm mm in pt pc"))
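
The prefix problem is easy to reproduce in a few lines.  In this sketch "pts"
is a made-up unit, added only to create a clash with "pt":

```python
from pyparsing import Literal, oneOf

# "pts" is hypothetical; it exists here only to collide with "pt".
first_match = Literal("pt") | Literal("pts")    # MatchFirst: stops at the first hit
longest_match = Literal("pt") ^ Literal("pts")  # Or: tries all, keeps the longest
reordered = oneOf("pt pts")                     # oneOf tests the longer string first

print(first_match.parseString("pts").asList())    # ['pt']
print(longest_match.parseString("pts").asList())  # ['pts']
print(reordered.parseString("pts").asList())      # ['pts']
```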

Thanks for giving pyparsing a shot!

-- Paul