Re: [Pyparsing] More useful parse errors
Brought to you by:
ptmcg
From: Paul M. <pa...@al...> - 2006-08-16 14:00:15
|
Chris - Thanks for your extensive and thoughtful post. Let me try to respond to some of your comments: 1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness. I know you are not just referring to the recursive nature of some grammars, as distinguished by the presence of Forward() expressions containing references to themselves, but to the actual implementation of pyparsing. I receieved an e-mail shortly after my article got posted at OReilly ONLamp, suggesting that pyparsing was not "recursive descent" because there were no examples of recursive grammars. Here is an excerpt from my reply (in which we also discussed the list-parsing example), marked with '>>>>>'s - especially note the paragraph beginning with "But even if": >>>>> Pyparsing does support recursive grammars. At least one is included in the examples directory that comes with pyparsing. Here is a simple grammar from my upcoming presentation at PyCon. It seems that about once a month, someone on comp.lang.python asks how to convert from a string representation of a list back to the original list, without using eval. Here is a non-recursive version, which parses lists that are composed only of simple quoted strings, integers, and floats: -------------- <snip - description of development of recursive list-parsing grammar, culminating in...> listStr = Forward() listItem = real | integer | quotedString.setParseAction(removeQuotes) | Group(listStr) listStr << ( lbrack + Optional(delimitedList(listItem)) + rbrack ) The << operator indicates that we are not binding a new expression to listStr, but "injecting" an expression into the placeholder previously defined as a Forward. With this minor change, we can now parse nested lists, and get back a nested result: test = "['a', 100, [], 3.14, [ +2.718, 'xyzzy', -1.414] ]" print listStr.parseString(test) -------------- prints: ['a', 100, [], 3.1400000000000001, [2.718, 'xyzzy', -1.4139999999999999]] I think I need to submit another article to ONLamp, with some more advanced grammar examples. But even if the example grammars in the article are not recursive, does this necessarily mean pyparsing's parsers are not "recursive descent"? The grammars themselves are composed of ParserElement objects organized into a hierarchy representing the grammar expressions and their respective composition from other, more basic grammar expressions. In this case, the top level expression, listStr, is made up of 3 expressions, which must all match in their turn. The first and third expressions are just literals, and they call their respective parse methods to see if there is a left and right bracket at the start and end of the string. But the second expression is a composite, made up of an Optional element, containing a delimited list of listItems. delimitedList(expr) is a helper function, that returns expr + ZeroOrMore(delim + expr) where the default delim is a literal comma. So from the top level listStr, we call the Optional expression's parse method, which calls the parse method of the And object created by delimitedList - ___recursively descending___ through the grammar until a basic token (such as a Literal or Word) is found, or no match is found and an exception raised. It is because we have used recursion to traverse the grammar structure, that we can use exceptions and exception handlers to take care of backtracking through the grammar, with no separate maintenance of stacks or backtrack pointers. On the other hand, this is the feature that makes pyparsing vulnerable to left recursion, which I have only partially succeeded in addressing with the validate() method. >>>>> It is also this use of the stack for backtracking that makes real parse error detection difficult. 2. From your message: > At each stage we know exactly what tokens we need in order to continue > parsing, otherwise we have an error. I think that it should > be possible to give pyparsing this behaviour, but my attempts at > analysing the source code have left me a bit lost at this point. > Well, in one sense, pyparsing already has this behavior - each parse expression contains its own idea of "what should come next", and they are in turn composed into more complex expressions. Do you mean that pyparsing should explode its grammar down to the raw character level, to become more of a character-based state machine rather than the "recursive traverser of parsing expression objects" that it currently is? Certainly, we could take the basic types such as Literal("abc") and convert to a state machine that at some point contains logic such as: ch = readnext() if ch != 'a': return (errorFlag,currentLocation) ch = readnext() if ch != 'b': return (errorFlag,currentLocation) ch = readnext() if ch != 'c': return (errorFlag,currentLocation) return 'abc' Or an expression to detect a proper noun - Word(alphas.upper(), alphas.lower()) - could be converted to: ch = readnext() if ch not in alphas.upper(): return (errorFlag, currentLocation) wd = ch while 1: ch = readnext() if ch not in alphas.lower(): unreadlastchar() break wd += ch return wd This is not too different from what pyparsing already does. So then let's look at merging these into a more complex expression: phrase = Literal("abc") | Word(alphas.upper(),alphas.lower()) Disjunctions are not really difficult for state machines, but the Python code gets messy. To keep things clean, let's pretend Python supports 'goto', and we'll limit ourselves to only downward jumps: ch = readnext() if ch != 'a': goto <<readProperNoun>> ch = readnext() if ch != 'b': return (errorFlag,currentLocation) ch = readnext() if ch != 'c': return (errorFlag,currentLocation) return 'abc' ch = readnext() <<readProperNoun>> if ch not in alphas.upper(): return (errorFlag, currentLocation) wd = ch while 1: ch = readnext() if ch not in alphas.lower(): unreadlastchar() break wd += ch return wd Still not terrible. What gets messy is when we start to combine disjunction with repetition. Let's just expand phrase slightly now: phrase = OneOrMore( Literal("abc") | Word(alphas.upper(),alphas.lower()) ) and let's parse the text "Honest abe". We fail on the second word, but what should be the reason? Because "abe" wasn't capitalized, or because it didn't end with a 'c'? So in fact, our OneOrMore is satisfied - we *did* find one matching word, "Honest" - so phrase succeeds, and we continue matching at the 'a' in "abe". In fact, if the overall grammar contains: phrase + Literal("abe") We're doing okay. But if the grammar is: phrase + stringEnd() Then we get an "expected end of string" error. So now we add a third candidate to our list of possible failures at the 'a' in "abe". Here's a fourth: what if the input text was really supposed to read "Honest Gabe"? It's possible that more knowledge of the parsing process could be gained by rearchitecting pyparsing to use richer return values and codes (such as the tuple you propose), rather than the current ParseResults for successful parsing and exceptions for failed parsing. Then a higher-level routine could look for the furthest successful parse before failing, and assume that that was where the actual error occurred. I've taken a stab at such logic, but it has fallen short due to the limited scope for the individual parse expression objects. Here's another idea on how to tackle this. Rewrite pyparsing to pass around a text source rather than just the original string. For one thing, this would allow the text source to be a stream instead of a string, which is something I've wanted to add to pyparsing for a while. The text source could also be an object that kept track of "furthest parse location", and possibly also "exception raised at furthest parse location". When parseString ultimately fails, consult the text source for the furthest progress, and report that as the error location. Anyway, I need to get to work, so I've got to sign off just now. If this gives you more ideas, keep 'em coming. -- Paul |