Thread: [Pyparsing] setResultsName on a recursive element of grammar
From: Elizabeth M. <eli...@in...> - 2016-02-20 14:25:58
Hello,

So I am trying to create a recursive grammar that specifies RFC1459-style IRC frames for a pet project. This is what I have so far:

---SNIP---
> command = Word(alphanums).setResultsName("command")
> middle = Word(alphanums) # Middle parameter
> end = Literal(":").suppress() # Last optional parameter
> param = Forward()
> param << Optional((middle + param) | end)
> param = param.setResultsName("param", listAllMatches=True)
>
> line = command + param + stringEnd()

Unfortunately, the result I get out of it is this:

> l = line.parseString("COMMAND param1 param2 :param3")
> print(l.param)
>> ['param1']

It "works" when I do:

> param = Group(param).setResultsName("param", listAllMatches=True)

But then it nests everything:

>> [['param1', 'param2', 'param last']]

I'm not entirely sure what to do here, or if this is a bug? (or a bug in my grammar?)

-- 
Elizabeth
From: Paul M. <pt...@au...> - 2016-02-20 16:36:29
Elizabeth -

Googling for RFC1459, I found this BNF, which looks like what you are working from:

https://tools.ietf.org/html/rfc1459#section-2.3.1

    <message>  ::= [':' <prefix> <SPACE> ] <command> <params> <crlf>
    <prefix>   ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
    <command>  ::= <letter> { <letter> } | <number> <number> <number>
    <SPACE>    ::= ' ' { ' ' }
    <params>   ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
    <middle>   ::= <Any *non-empty* sequence of octets not including SPACE
                   or NUL or CR or LF, the first of which may not be ':'>
    <trailing> ::= <Any, possibly *empty*, sequence of octets not including
                   NUL or CR or LF>
    <crlf>     ::= CR LF

From this BNF, I came up with this translation to pyparsing, very similar to yours:

    COLON = Suppress(':')
    command = Word(alphas) | Word(nums, exact=3)
    middle = ~COLON + Word(printables)
    trailing = Word(printables)
    params = Forward()
    params <<= COLON + trailing | middle + params

I usually leave the assignment of results names until the very end, just assigning them in the expressions where they get composed into groups or the top-most parse expression.

    line = command("command") + Group(params)("params")

    tests = """\
    COMMAND param1 param2 : param3"""
    line.runTests(tests)

And this gives:

    COMMAND param1 param2 : param3
    ['COMMAND', ['param1', 'param2', 'param3']]
    - command: COMMAND
    - params: ['param1', 'param2', 'param3']

This is something of a problem, since we have lost the distinction of which parts of the params are the middle and which are the trailing. The issue is the recursive definition of params, which, as you pointed out, makes the results awkward to work with. The best I could do here was to define params using:

    params <<= (COLON + trailing("trailing") | middle("middle*") + params)

(I'm using the abbreviated version of `setResultsName`, using the expressions as callables - the trailing '*' in "middle*" is equivalent to `middle.setResultsName("middle", listAllMatches=True)`. And as you probably already discovered, if `listAllMatches` is left out, then you will only get the last element of `middle`.)

With this change, I get:

    COMMAND param1 param2 : param3
    ['COMMAND', ['param1', 'param2', 'param3']]
    - command: COMMAND
    - params: ['param1', 'param2', 'param3']
      - middle: [['param1'], ['param2']]
        [0]:
          ['param1']
        [1]:
          ['param2']
      - trailing: param3

Which is *okay*, but that middle bit is not really pleasant to deal with.

But I'd like to look at this recursive construct in the original BNF:

    <params> ::= <SPACE> [ ':' <trailing> | <middle> <params> ]

This is very typical in many BNFs, which will define a repetition of one or more items as:

    <list_of_items> ::= <item> [ <list_of_items> ]

This *can* be implemented in pyparsing as:

    list_of_items = Forward()
    list_of_items <<= item + Optional(list_of_items)

But you'll find in pyparsing that things are usually clearer (and faster) when you define repetition using the OneOrMore or ZeroOrMore classes:

    list_of_items = OneOrMore(item)

If we use a repetition expression instead of a recursive expression for params, it looks like this:

    params = (OneOrMore(middle)("middle") +
              COLON +
              ZeroOrMore(trailing)("trailing"))

And the parsed test string gives:

    COMMAND param1 param2 : param3
    ['COMMAND', ['param1', 'param2', 'param3']]
    - command: COMMAND
    - params: ['param1', 'param2', 'param3']
      - middle: ['param1', 'param2']
      - trailing: ['param3']

Here is the whole parser in one copy/pasteable chunk:

    from pyparsing import (Group, OneOrMore, Suppress, Word, ZeroOrMore,
                           alphas, nums, printables)

    command = Word(alphas) | Word(nums)
    COLON = Suppress(':')
    middle = ~COLON + Word(printables)
    trailing = Word(printables)
    params = (OneOrMore(middle)("middle") +
              COLON +
              ZeroOrMore(trailing)("trailing"))
    line = (command("command") + Group(params)("params"))

    tests = """\
    COMMAND param1 param2 : param3"""
    line.runTests(tests)

And no need to kludge in any `listAllMatches` behavior either.

-- Paul
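The equivalence between the right-recursive BNF rule and plain repetition can also be seen without pyparsing. What follows is only a plain-Python sketch over a pre-split token list, not pyparsing code: `is_item`, `items_recursive` and `items_repeated` are hypothetical names, and an "item" is assumed to be any token that does not start with ':'.

```python
def is_item(token):
    """Assumed stand-in for <item>: any token that does not start with ':'."""
    return not token.startswith(":")


def items_recursive(tokens, pos=0):
    """<list_of_items> ::= <item> [ <list_of_items> ], written as recursion."""
    if pos >= len(tokens) or not is_item(tokens[pos]):
        raise ValueError("expected an item at index %d" % pos)
    head = [tokens[pos]]
    # The bracketed part is optional: recurse only if another item follows.
    if pos + 1 < len(tokens) and is_item(tokens[pos + 1]):
        tail, end = items_recursive(tokens, pos + 1)
        return head + tail, end
    return head, pos + 1


def items_repeated(tokens, pos=0):
    """The same language written as one-or-more repetition."""
    start = pos
    while pos < len(tokens) and is_item(tokens[pos]):
        pos += 1
    if pos == start:
        raise ValueError("expected an item at index %d" % start)
    return tokens[start:pos], pos


tokens = ["param1", "param2", ":param3"]
# Both calls return the matched items and the index where matching stopped.
print(items_recursive(tokens))
print(items_repeated(tokens))
```

Both functions accept the same inputs and stop at the same position; the loop version builds one flat list directly, which is roughly why the OneOrMore form gives flatter results than the Forward-based one.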
From: Elizabeth M. <eli...@in...> - 2016-02-23 11:40:35
On 20/02/16 10:36, Paul McGuire wrote:
> Here is the whole parser in one copy/pasteable chunk:
>
> command = Word(alphas) | Word(nums)
> COLON = Suppress(':')
> middle = ~COLON + Word(printables)
> trailing = Word(printables)
> params = (OneOrMore(middle)("middle") +
>           COLON +
>           ZeroOrMore(trailing)("trailing"))
> line = (command("command") + Group(params)("params"))
>
> tests = """\
> COMMAND param1 param2 : param3"""
> line.runTests(tests)
>
> And no need to kludge in any `listAllMatches` behavior either.
>
> -- Paul

There was one minor thing I forgot. Word(printables) is insufficient in your example, as any 8-bit string is acceptable in parameters (and such strings are often used). The encoding is not specified for these portions (deliberately, it seems), but I usually implicitly assume UTF-8, since that is the de facto standard. As such, UTF-8 is what I decode everything from the wire as, using the 'replace' error handler.

This is my proposed solution to match all characters, but surely there is a better way than the below (perhaps using Regex is better?):

    # 1111998 is the total number of valid Unicode characters.
    utf8_chars = ''.join(chr(x) for x in range(1111998))
    middle = (~COLON + ~White()) + Word(utf8_chars)

-- 
Elizabeth
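On the Regex question: the RFC 1459 BNF quoted earlier in the thread defines <middle> and <trailing> by exclusion, so a negated character class can express them without enumerating code points (note also that range(1111998) covers code points 0 through 1111997, which is not the same set as the valid characters). Below is a minimal sketch using only the standard `re` module. The names `MIDDLE_RE`, `TRAILING_RE` and `split_params` are made up for illustration, and it assumes the line has already been decoded to `str` and stripped of its CR LF. The same pattern strings could presumably be handed to pyparsing's `Regex` class, but that is not tested here.

```python
import re

# <middle>: non-empty, no SPACE/NUL/CR/LF, and the first character is not ':'
MIDDLE_RE = re.compile(r"[^ \x00\r\n:][^ \x00\r\n]*")
# <trailing>: possibly empty, no NUL/CR/LF (spaces are allowed)
TRAILING_RE = re.compile(r"[^\x00\r\n]*")


def split_params(text):
    """Split the text after the command into (middle_list, trailing_or_None)."""
    middles = []
    pos = 0
    while pos < len(text):
        # Skip the run of spaces that separates parameters.
        while pos < len(text) and text[pos] == " ":
            pos += 1
        if pos >= len(text):
            break
        if text[pos] == ":":
            # Everything after the colon is the trailing parameter.
            m = TRAILING_RE.match(text, pos + 1)
            if m.end() != len(text):
                raise ValueError("illegal character in trailing parameter")
            return middles, m.group()
        m = MIDDLE_RE.match(text, pos)
        if m is None:
            raise ValueError("illegal character at index %d" % pos)
        middles.append(m.group())
        pos = m.end()
    return middles, None


print(split_params(" param1 param2 :param3 with spaces"))
print(split_params(" caf\u00e9 :h\u00e9llo w\u00f6rld"))
```

Because the classes are negated, any non-ASCII character is accepted without building a large character string first.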