Python parsing module / Bugs / #107 nestedExpr is splitting on QuotedString by default

nestedExpr is provided in pyparsing as a shortcut for more complex expressions that support nesting on opening and closing grouping strings. But as a shortcut, it does not do much that is meaningful with the contents of the groups. So the question is, what should nestedExpr make of the strings inside the nested groups?

By default, nestedExpr will look for space-delimited words of printables, so that

(a b c (dd ee) ff)

will parse into

[['a', 'b', 'c', ['dd', 'ee'], 'ff']]

(if you call asList() on the ParseResults object that comes back from parseString()).

This raises the question, "what if I use a quoted string to represent a nested item that contains a space?", as in:

(a "b c" (dd ee) ff)

Returning

[['a', '"b', 'c"', ['dd', 'ee'], 'ff']]

is pretty clearly a wrong guess, so nestedExpr also looks for quoted strings while parsing contents of the nested bits, giving:

[['a', '"b c"', ['dd', 'ee'], 'ff']]

This also protects us in case we get a tuple with an open or close paren in quotes:

(a "b )c" (dd ee) ff)

Nine times out of ten, we don't want that ')' to close the outer group; it is just another character in the quoted string.
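To try the default behavior yourself, here is a minimal sketch. It assumes pyparsing is installed and sticks to the camelCase names used in this post; the variable names are only for illustration.

```python
from pyparsing import nestedExpr

# Default nestedExpr: '(' and ')' as delimiters, space-delimited words
# as content, and quoted strings kept together as single tokens.
parser = nestedExpr()

plain = parser.parseString('(a b c (dd ee) ff)').asList()
quoted = parser.parseString('(a "b c" (dd ee) ff)').asList()
paren_in_quotes = parser.parseString('(a "b )c" (dd ee) ff)').asList()

for result in (plain, quoted, paren_in_quotes):
    print(result)
```

Note that the quote characters stay on the token, and that asList() wraps the outermost group in a list of its own.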

But things start to look bad if our nested expression is much more like a Python tuple, with delimiting commas:

(a, b, c (dd, ee) ff)

Then the delimiting commas get mixed in with our parsed text:

[['a,', 'b,', 'c', ['dd,', 'ee'], 'ff']]

So nestedExpr supports an optional content arg, to permit definition of more complex contents in our groups. If we want to try parsing nested delimited lists of alphabetic words, we can write:

nested_alpha_list = nestedExpr('(', ')', content=delimitedList(Word(alphas)))

And now nestedExpr treats the nested contents as delimitedLists, which suppress the delimiting commas and just give back the list items:

[['a', 'b', 'c', ['dd', 'ee'], 'ff']]
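As a quick side-by-side check, this sketch parses the same comma-delimited string with the default contents and with the delimitedList content. The names with_default and with_content are made up for the example.

```python
from pyparsing import Word, alphas, delimitedList, nestedExpr

text = '(a, b, c (dd, ee) ff)'

# Default contents: commas are printable, so they stick to the words.
with_default = nestedExpr().parseString(text).asList()

# delimitedList content: commas are treated as delimiters and suppressed.
nested_alpha_list = nestedExpr('(', ')', content=delimitedList(Word(alphas)))
with_content = nested_alpha_list.parseString(text).asList()

print(with_default)
print(with_content)
```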

Now what if we had something that really looked like a nested tuple, with commas separating every term, including nested lists? If we use nested_alpha_list to parse this string:

(a, b, c, (dd, ee), ff)

We'll get this error:

FAIL: Expected ")" (at char 8), (line:1, col:9)

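Our content definition only expects words separated by commas, with no leading or trailing commas. The comma after 'c' is followed by a nested group instead of a word, so the delimitedList stops after 'c' and nestedExpr then expects a closing ')'. We need to further expand our content argument to look like: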

COMMA = Suppress(',')
smarter_nested_alpha_list = nestedExpr('(', ')', content=Optional(COMMA) + delimitedList(Word(alphas)) + Optional(COMMA))

And now we can parse our nested tuple successfully.

At this point, are we really parsing? This "smarter" nested list is not too smart, really. It will accept this string as well:

(a, b, c (dd, ee) ff)

since the leading and trailing commas on the nested content are optional.
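Here is a small sketch that puts the two content definitions next to each other. The names words, strict and sloppy are only for this example; everything else is taken from the snippets above.

```python
from pyparsing import (Optional, ParseException, Suppress, Word, alphas,
                       delimitedList, nestedExpr)

words = delimitedList(Word(alphas))
COMMA = Suppress(',')

nested_alpha_list = nestedExpr('(', ')', content=words)
smarter_nested_alpha_list = nestedExpr(
    '(', ')', content=Optional(COMMA) + words + Optional(COMMA))

strict = '(a, b, c, (dd, ee), ff)'   # commas between every term
sloppy = '(a, b, c (dd, ee) ff)'     # commas missing around the nested group

# The first content definition cannot handle the comma before '(dd, ee)'.
try:
    nested_alpha_list.parseString(strict)
    first_attempt_failed = False
except ParseException as err:
    first_attempt_failed = True
    print("FAIL:", err)

# The "smarter" version parses the strict form, but the sloppy one too.
strict_result = smarter_nested_alpha_list.parseString(strict).asList()
sloppy_result = smarter_nested_alpha_list.parseString(sloppy).asList()
print(strict_result)
print(sloppy_result)
```

Both strings come back as the same nested list, which is exactly the problem: the expression cannot tell a well-formed tuple from one with missing commas.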

I would argue that at this point, we have exceeded the bounds of the nestedExpr convenience method, and we need to buckle down and actually parse the expression using a nested parser. Something like this:

nested_item_list = Forward()
LPAR, RPAR = map(Suppress, "()")
nested_item = Word(alphas) | Group(LPAR + nested_item_list + RPAR)
nested_item_list <<= delimitedList(nested_item)

And if we revisit the earlier desire to accept quoted strings as items that might contain a space, or comma, or '(', then we just update nested_item to:

nested_item = Word(alphas) | quotedString | Group(LPAR + nested_item_list + RPAR)

And now our parser will handle this tuple-like string as well:

(a, "b c", (dd, ee), ff)

Giving:

[['a', '"b c"', ['dd', 'ee'], 'ff']]
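Put together as a runnable sketch, the recursive parser looks like this. I pass parseAll=True so that leftover text counts as an error; the names good and missing_commas_rejected are just for the example.

```python
from pyparsing import (Forward, Group, ParseException, Suppress, Word,
                       alphas, delimitedList, quotedString)

nested_item_list = Forward()
LPAR, RPAR = map(Suppress, "()")
# An item is a word, a quoted string, or a parenthesized list of items.
nested_item = (Word(alphas)
               | quotedString
               | Group(LPAR + nested_item_list + RPAR))
nested_item_list <<= delimitedList(nested_item)

good = nested_item_list.parseString(
    '(a, "b c", (dd, ee), ff)', parseAll=True).asList()
print(good)

# Unlike the nestedExpr versions, this parser insists on the commas.
try:
    nested_item_list.parseString('(a, b, c (dd, ee) ff)', parseAll=True)
    missing_commas_rejected = False
except ParseException as err:
    missing_commas_rejected = True
    print("FAIL:", err)
```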

Really, using nestedExpr for anything more complex than a space-delimited or comma-delimited list is something of a cheat, which is why I call it more of a shortcut than a real parsing element. It is very handy when parsing a language like C for function definitions, where you want to write a lazy parser that matches a function signature but skips over all the other C syntax that might be found in the function body. Fortunately, unlike Python, C uses braces to delimit the code for a function, so you can define a C "parser" as:

type_decl = oneOf("int char float") + ZeroOrMore("*")
function_name = Word(alphas, alphanums+'_')
function_arg_list = Group(LPAR + Optional(delimitedList(arg_expr)) + RPAR)

function_signature = type_decl('type') + function_name('name') + function_arg_list('args')
function_body = nestedExpr('{', '}')('who_cares')

function_expr = function_signature + function_body

And this parser will find function definitions in C, but not really do much parsing of the C language itself.
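The snippet above leaves arg_expr undefined, so here is a runnable sketch with one filled in. The arg_expr definition (a type followed by a name) and the sample C source are my assumptions for the sake of the example, not a real C argument grammar.

```python
from pyparsing import (Group, Optional, Suppress, Word, ZeroOrMore,
                       alphanums, alphas, delimitedList, nestedExpr, oneOf)

LPAR, RPAR = map(Suppress, "()")
type_decl = oneOf("int char float") + ZeroOrMore("*")
function_name = Word(alphas, alphanums + '_')
# ASSUMPTION: a deliberately simple argument form, "<type> <name>".
arg_expr = Group(type_decl + function_name)
function_arg_list = Group(LPAR + Optional(delimitedList(arg_expr)) + RPAR)

function_signature = (type_decl('type') + function_name('name')
                      + function_arg_list('args'))
# The body is matched brace-for-brace but not otherwise understood.
function_body = nestedExpr('{', '}')('who_cares')
function_expr = function_signature + function_body

# Hypothetical input, made up for this example.
source = """
int add(int a, int b) {
    if (a > b) { return a + b; }
    return b;
}
"""

matches = list(function_expr.scanString(source))
for tokens, start, end in matches:
    print(tokens['name'], tokens['args'].asList())
```

scanString is used here so that the expression can pick function definitions out of surrounding text that it does not understand.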

So finally, to look at your question. You are parsing a string of the form:

{{{ a="blah" }}}

Which might sometimes be written as:

{{{ a= "blah" }}}

Or:

{{{ a = "blah" }}}

Or even:

{{{ a     =      "blah" }}}

And nestedExpr is following its default definition of looking for space-delimited printables and possible quoted strings.

If 'a = "blah"' has some meaning in your text, then you should probably parse it explicitly, or at least define an expression for it and pass that as the content arg to nestedExpr. Something like:

identifier = Word(alphas)
EQ = Suppress('=')
string_literal = quotedString
numeric_literal = Word(nums)
value_term = string_literal | numeric_literal | identifier
value = value_term + ZeroOrMore(oneOf("+ -") + value_term)
assignment = identifier('lhs') + EQ + value('rhs')

Now your nested expr can look like:

nestedExpr('{{{', '}}}', content=assignment)
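As a sketch of how that behaves on your input, the following runs the four spacing variants through the same expression. The names block and variants are made up for the example.

```python
from pyparsing import (Suppress, Word, ZeroOrMore, alphas, nestedExpr,
                       nums, oneOf, quotedString)

identifier = Word(alphas)
EQ = Suppress('=')
value_term = quotedString | Word(nums) | identifier
value = value_term + ZeroOrMore(oneOf("+ -") + value_term)
assignment = identifier('lhs') + EQ + value('rhs')

block = nestedExpr('{{{', '}}}', content=assignment)

variants = [
    '{{{ a="blah" }}}',
    '{{{ a= "blah" }}}',
    '{{{ a = "blah" }}}',
    '{{{ a     =      "blah" }}}',
]
# Whitespace around '=' no longer affects the parsed tokens.
results = [block.parseString(text).asList() for text in variants]
for result in results:
    print(result)
```

Because the assignment is parsed as a unit, all four variants give the same tokens, with the '=' suppressed.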

You don't give an example of a nested expression, so I'm not sure how assignments should handle nesting. But I hope this discussion gives you more background on when nestedExpr is appropriate, and when you need to do more actual parsing.

Here is the test code I wrote to test all these strings and expressions:

from pyparsing import *

tests = """\
    (a b c (dd ee) ff)
    (a "b c" (dd ee) ff)
    (a, b, c (dd, ee) ff)
    (a, b, c, (dd, ee), ff)
    (a, "b c", (dd, ee), ff)
"""
nestedExpr().runTests(tests)

# basic nested list
nested_alpha_list = nestedExpr('(', ')', content=delimitedList(Word(alphas)))
nested_alpha_list.runTests(tests)

# nested list with comma delimiters
COMMA = Suppress(',')
smarter_nested_alpha_list = nestedExpr('(', ')', content=Optional(COMMA) + delimitedList(Word(alphas)) + Optional(COMMA))
smarter_nested_alpha_list.runTests(tests)

# actual parser for a nested list with comma delimiters
nested_item_list = Forward()
LPAR, RPAR = map(Suppress, "()")
nested_item = Word(alphas) | quotedString | Group(LPAR + nested_item_list + RPAR)
nested_item_list <<= delimitedList(nested_item)
nested_item_list.runTests(tests)

# parsing nested assignment statements
identifier = Word(alphas)
EQ = Suppress('=')
string_literal = quotedString
numeric_literal = Word(nums)
value_term = string_literal | numeric_literal | identifier
value = value_term + ZeroOrMore(oneOf("+ -") + value_term)
assignment_expr = identifier('lhs') + EQ + value('rhs')

tests = """\
    {{{ a="blah" }}}
    {{{ a= "blah" }}}
    {{{ a = "blah" }}}
    {{{ a = "blah" + foo }}}
"""
nestedExpr('{{{', '}}}', content=assignment_expr).runTests(tests)
