
#107 nestedExpr is splitting on QuotedString by default

v1.0 (example)
closed-works-for-me
None
5
2018-07-14
2018-03-10
clime
No

Hello,

nestedExpr is splitting on a QuotedString by default. I believe that shouldn't be the case.

When ignoreExpr is set to None, splitting on quotes does not occur, but then the expression is split on whitespace inside quotes, which is not what I want in my case.

from pyparsing import nestedExpr
expr = nestedExpr('{{{', '}}}')
expr.parseString('{{{ a="a b" }}}')
([(['a=', '"a b"'], {})], {})
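
To make the two behaviours concrete, here is a minimal sketch that parses the same input with the default ignoreExpr and with ignoreExpr=None. The variable names are only illustrative, and the exact token lists depend on nestedExpr's default content expression, so they may differ between pyparsing versions.

```python
from pyparsing import nestedExpr

text = '{{{ a="a b" }}}'

# Default: quoted strings are recognized, so '"a b"' stays in one piece
# but is split off from the preceding 'a='.
with_quotes = nestedExpr('{{{', '}}}').parseString(text).asList()

# ignoreExpr=None: quotes get no special treatment, so the contents are
# split purely on whitespace, including the space inside the quotes.
without_quotes = nestedExpr('{{{', '}}}', ignoreExpr=None).parseString(text).asList()

print(with_quotes)
print(without_quotes)
```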

To work around my problem, I need to go through the list of parsed tokens and reconnect two neighbouring elements if the first ends with '=' and the second starts with a quote.
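
A rough sketch of that post-processing step, in plain Python, could look like the following. The function name rejoin_assignments is hypothetical, and the sketch assumes the tokens come from asList(), so nested groups are ordinary lists.

```python
def rejoin_assignments(tokens):
    """Merge a token ending in '=' with a quoted string that follows it.

    Nested lists are processed recursively.
    """
    merged = []
    for tok in tokens:
        if isinstance(tok, list):
            merged.append(rejoin_assignments(tok))
        elif (merged
              and isinstance(merged[-1], str)
              and merged[-1].endswith('=')
              and tok[:1] in ('"', "'")):
            # The previous token is 'name=' and this one is a quoted string.
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


rejoined = rejoin_assignments(['a=', '"a b"', ['b=', "'c d'"], 'e'])
print(rejoined)
```

This only patches up the token list after the fact; it does not handle whitespace around the '=' sign.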

Discussion

  • clime

    clime - 2018-03-10

    I think this should be fixed, but if there is another way around it, please tell me.

     
  • Paul McGuire

    Paul McGuire - 2018-03-31

    nestedExpr is provided in pyparsing as a shortcut for more complex expressions that support nesting on opening and closing grouping strings. But as a shortcut, it does not really do much meaningful with the contents within the groups. So the question is, what should nestedExpr make of the strings that are inside the nested groups?

    By default, nestedExpr will look for space-delimited words of printables, so that

    (a b c (dd ee) ff)
    

    will parse into

    [['a', 'b', 'c', ['dd', 'ee'], 'ff']]
    

    (if you call asList() on the ParseResults object that comes back from parseString()).

    It then raises the question, "what if I use a quoted string to represent a nested item that contains a space?", as in:

    (a "b c" (dd ee) ff)
    

    Returning

    [['a', '"b', 'c"', ['dd', 'ee'], 'ff']]
    

    is pretty clearly a wrong guess, so nestedExpr also looks for quoted strings while parsing contents of the nested bits, giving:

    [['a', '"b c"', ['dd', 'ee'], 'ff']]
    

    This also protects us in case we get a tuple with an open or close paren in quotes:

    (a "b )c" (dd ee) ff)
    

    Nine times out of ten, we don't want that ')' to close the outer group, it is just another character in the nested character string.
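
A quick way to check this behaviour is to parse that string with the default nestedExpr. This is only a small sketch; the variable name parsed is illustrative.

```python
from pyparsing import nestedExpr

# The ')' inside the quoted string is consumed as part of the string,
# so it does not close the outer group.
parsed = nestedExpr().parseString('(a "b )c" (dd ee) ff)').asList()
print(parsed)
```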

    But things start to look bad if our nested expression is much more like a Python tuple, with delimiting commas:

    (a, b, c (dd, ee) ff)
    

    Then the delimiting commas get mixed in with our parsed text:

    [['a,', 'b,', 'c', ['dd,', 'ee'], 'ff']]
    

    So nestedExpr supports an optional content arg, to permit definition of more complex contents in our groups. If we want to try parsing nested delimited lists of alphabetic words, we can write:

    nested_alpha_list = nestedExpr('(', ')', content=delimitedList(Word(alphas)))
    

    And now nestedExpr treats the nested contents as delimitedLists, which suppress the delimiting commas and just give back the list items:

    [['a', 'b', 'c', ['dd', 'ee'], 'ff']]
    

    Now what if we had something that really looked like a nested tuple, with commas separating every term, including nested lists? If we use nested_alpha_list to parse this string:

    (a, b, c, (dd, ee), ff)
    

    We'll get this error:

    FAIL: Expected ")" (at char 8), (line:1, col:9)
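
The success and the failure described above can be reproduced with a short sketch like this one. The names ok and error are illustrative, and the exact wording of the error message may differ between pyparsing versions.

```python
from pyparsing import ParseException, Word, alphas, delimitedList, nestedExpr

nested_alpha_list = nestedExpr('(', ')', content=delimitedList(Word(alphas)))

# Commas only between plain words: each run of words is one delimitedList.
ok = nested_alpha_list.parseString('(a, b, c (dd, ee) ff)').asList()

# Commas around the nested group: the comma after 'c' is not followed by
# a word, so the content expression stops there and ')' is expected.
try:
    nested_alpha_list.parseString('(a, b, c, (dd, ee), ff)')
    error = None
except ParseException as exc:
    error = exc

print(ok)
print(error)
```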
    

    Our content definition only expects words separated by commas, no trailing or leading commas. We need to further expand our content argument to look like:

    COMMA = Suppress(',')
    smarter_nested_alpha_list = nestedExpr('(', ')', content=Optional(COMMA) + delimitedList(Word(alphas)) + Optional(COMMA))
    

    And now we can parse our nested tuple successfully.

    At this point, are we really parsing? This "smarter" nested list is not too smart, really. It will accept this string as well:

    (a, b, c (dd, ee) ff)
    

    since the leading and trailing commas on the nested content are optional.
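
As a small sketch of this over-permissiveness, the same expression can be run against both strings. The names strict and sloppy are illustrative only.

```python
from pyparsing import Optional, Suppress, Word, alphas, delimitedList, nestedExpr

COMMA = Suppress(',')
smarter_nested_alpha_list = nestedExpr(
    '(', ')',
    content=Optional(COMMA) + delimitedList(Word(alphas)) + Optional(COMMA),
)

# Properly comma-separated tuple.
strict = smarter_nested_alpha_list.parseString('(a, b, c, (dd, ee), ff)').asList()

# Commas missing around the nested group: accepted as well, with the same result.
sloppy = smarter_nested_alpha_list.parseString('(a, b, c (dd, ee) ff)').asList()

print(strict)
print(sloppy)
```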

    I would argue that at this point, we have exceeded the bounds of the nestedExpr convenience method, and we need to buckle down and actually parse the expression using a nested parser. Something like this:

    nested_item_list = Forward()
    LPAR, RPAR = map(Suppress, "()")
    nested_item = Word(alphas) | Group(LPAR + nested_item_list + RPAR)
    nested_item_list <<= delimitedList(nested_item)
    

    And if we revisit the earlier desire to accept quoted strings as items that might contain a space, or comma, or '(', then we just update nested_item to:

    nested_item = Word(alphas) | quotedString | Group(LPAR + nested_item_list + RPAR)
    

    And now our parser will handle this tuple-like string as well:

    (a, "b c", (dd, ee), ff)
    

    Giving:

    [['a', '"b c"', ['dd', 'ee'], 'ff']]
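
Put together as a runnable sketch, the recursive parser accepts the tuple-like string and, unlike the nestedExpr variants, rejects the one with missing commas. The helper is_valid is hypothetical and only wraps the exception handling.

```python
from pyparsing import (Forward, Group, ParseException, Suppress, Word,
                       alphas, delimitedList, quotedString)

nested_item_list = Forward()
LPAR, RPAR = map(Suppress, "()")
nested_item = Word(alphas) | quotedString | Group(LPAR + nested_item_list + RPAR)
nested_item_list <<= delimitedList(nested_item)

parsed = nested_item_list.parseString('(a, "b c", (dd, ee), ff)', parseAll=True).asList()
print(parsed)


def is_valid(text):
    """Return True if the whole string parses as a nested item list."""
    try:
        nested_item_list.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


print(is_valid('(a, b, c (dd, ee) ff)'))      # commas missing around the group
print(is_valid('(a, "b )c", (dd, ee), ff)'))  # ')' inside a quoted string
```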
    

    Really, using nestedExpr for anything more complex than a space-delimited or comma-delimited list is something of a cheat, which is why I call it more of a shortcut than a real parsing element. It is very handy when parsing a language like C for function definitions, where you want to write a lazy parser that matches a function signature but skips over all the other C syntax that might be found in the function body. Fortunately, unlike Python, C uses braces to delimit the code for a function, so you can define a C "parser" as:

    type_decl = oneOf("int char float") + ZeroOrMore("*")
    function_name = Word(alphas, alphanums+'_')
    function_arg_list = Group(LPAR + Optional(delimitedList(arg_expr)) + RPAR)
    
    function_signature = type_decl('type') + function_name('name') + function_arg_list('args')
    function_body = nestedExpr('{', '}')('who_cares')
    
    function_expr = function_signature + function_body
    

    And this parser will find function definitions in C, but not really do much parsing of the C language itself.
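
The snippet above leaves arg_expr, LPAR and RPAR undefined. A self-contained sketch, assuming a deliberately simple arg_expr (a type followed by a name, which is an assumption and not part of the original definition) and a made-up C fragment, might look like this:

```python
from pyparsing import (Group, Optional, Suppress, Word, ZeroOrMore, alphanums,
                       alphas, delimitedList, nestedExpr, oneOf)

LPAR, RPAR = map(Suppress, "()")
type_decl = oneOf("int char float") + ZeroOrMore("*")
function_name = Word(alphas, alphanums + '_')
# Assumed for illustration: an argument is just a type followed by a name.
arg_expr = Group(type_decl + function_name)
function_arg_list = Group(LPAR + Optional(delimitedList(arg_expr)) + RPAR)

function_signature = type_decl('type') + function_name('name') + function_arg_list('args')
function_body = nestedExpr('{', '}')('who_cares')
function_expr = function_signature + function_body

source = '''
int add(int a, int b) { if (a) { return a + b; } return b; }
char *label() { return "}"; }
'''

# scanString finds every match in the text; the bodies are skipped over,
# including the '}' hidden inside a string literal.
found = [tokens['name'] for tokens, start, end in function_expr.scanString(source)]
print(found)
```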

    So finally, to look at your question. You are parsing a string of the form:

    {{{ a="blah" }}}
    

    Which might sometimes be written as:

    {{{ a= "blah" }}}
    

    Or:

    {{{ a = "blah" }}}
    

    Or even:

    {{{ a     =      "blah" }}}
    

    And nestedExpr is following its default definition of looking for space-delimited printables and possible quoted strings.

    If 'a = "blah"' has some meaning in your text, then you should probably parse it explicitly, or at least define an expression for it and pass that as the content arg to nestedExpr. Something like:

    identifier = Word(alphas)
    EQ = Suppress('=')
    string_literal = quotedString
    numeric_literal = Word(nums)
    value_term = string_literal | numeric_literal | identifier
    value = value_term + ZeroOrMore(oneOf("+ -") + value_term)
    assignment = identifier('lhs') + EQ + value('rhs')
    

    Now your nested expr can look like:

    nestedExpr('{{{', '}}}', content=assignment)
    

    You don't give an example of a nested expression, so I'm not sure how assignments should handle nesting. But I hope this discussion gives you more background on when nestedExpr is appropriate, and when you need to do more actual parsing.
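
Applied to the string from the original report, a sketch of this approach gives the same tokens regardless of the spacing around '='. The names braced, variants, results and lhs are illustrative only.

```python
from pyparsing import (Suppress, Word, ZeroOrMore, alphas, nestedExpr, nums,
                       oneOf, quotedString)

identifier = Word(alphas)
EQ = Suppress('=')
value_term = quotedString | Word(nums) | identifier
value = value_term + ZeroOrMore(oneOf("+ -") + value_term)
assignment = identifier('lhs') + EQ + value('rhs')

braced = nestedExpr('{{{', '}}}', content=assignment)

variants = [
    '{{{ a="a b" }}}',
    '{{{ a= "a b" }}}',
    '{{{ a     =      "a b" }}}',
]
results = [braced.parseString(text).asList() for text in variants]
print(results)

# Results names are available on the group that nestedExpr returns.
lhs = braced.parseString(variants[0])[0]['lhs']
print(lhs)
```

The quoted string keeps its surrounding quotes here; the '=' is dropped because EQ is wrapped in Suppress.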

    Here is the test code I wrote to test all these strings and expressions:

    from pyparsing import *
    
    tests = """\
        (a b c (dd ee) ff)
        (a "b c" (dd ee) ff)
        (a, b, c (dd, ee) ff)
        (a, b, c, (dd, ee), ff)
        (a, "b c", (dd, ee), ff)
    """
    nestedExpr().runTests(tests)
    
    # basic nested list
    nested_alpha_list = nestedExpr('(', ')', content=delimitedList(Word(alphas)))
    nested_alpha_list.runTests(tests)
    
    # nested list with comma delimiters
    COMMA = Suppress(',')
    smarter_nested_alpha_list = nestedExpr('(', ')', content=Optional(COMMA) + delimitedList(Word(alphas)) + Optional(COMMA))
    smarter_nested_alpha_list.runTests(tests)
    
    # actual parser for a nested list with comma delimiters
    nested_item_list = Forward()
    LPAR, RPAR = map(Suppress, "()")
    nested_item = Word(alphas) | quotedString | Group(LPAR + nested_item_list + RPAR)
    nested_item_list <<= delimitedList(nested_item)
    nested_item_list.runTests(tests)
    
    # parsing nested assignment statements
    identifier = Word(alphas)
    EQ = Suppress('=')
    string_literal = quotedString
    numeric_literal = Word(nums)
    value_term = string_literal | numeric_literal | identifier
    value = value_term + ZeroOrMore(oneOf("+ -") + value_term)
    assignment_expr = identifier('lhs') + EQ + value('rhs')
    
    tests = """\
        {{{ a="blah" }}}
        {{{ a= "blah" }}}
        {{{ a = "blah" }}}
        {{{ a = "blah" + foo }}}
    """
    nestedExpr('{{{', '}}}', content=assignment_expr).runTests(tests)
    
     
  • Paul McGuire

    Paul McGuire - 2018-07-14
    • status: open --> closed-works-for-me
     
