Thread: [Pyparsing] More useful parse errors
From: Chris L. <ch...@ka...> - 2006-08-11 21:58:30
Hi,

I am using pyparsing to parse CIM MOF files. It is a rather lengthy grammar with lots of instances of OneOrMore. The top-level structure is like this:

    bnf = OneOrMore(rest_of_grammar) + StringEnd()

This raises an exception when there is a parse error, but the error is not flagged at the point where the error occurs. Since everyone here should be familiar with Python grammar, I will show what would happen if we had a pyparsing grammar for Python and attempted to parse something with a syntax error:

    class blah(object):
        pass

    class error_says_here(object):
        def __init__(self):
            self.a = 1
            self.b = 2

        def error_really_here(self:
            pass

In this case the error would be flagged as having occurred at error_says_here rather than at error_really_here.

Is there a way to work around this behavior? I have not yet come across an example of a grammar that does a really good job of error handling. Does the architecture of pyparsing make it easy to write a parser for data that you know is syntactically correct, but difficult or impossible to write a parser that can give useful error reports to the user in the face of syntax errors?

A traditional parser would know all the possible tokens that could come next, and if it encountered something that was not one of them, it would flag an error at that location. pyparsing, in contrast, asks: will this parse? No? OK, will this parse? No? OK, will this parse? The result is that it can throw away large portions of valid syntax and eventually fail immediately after parsing the last full rest_of_grammar from our OneOrMore(rest_of_grammar) + StringEnd() idiom.

It may be possible for pyparsing to precompute what the possible next tokens are, which might have the side effect of making pyparsing more efficient, since it would not have so much trial and error.

-Chris
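The misreporting Chris describes can be reproduced with a toy model. The sketch below is hypothetical and self-contained, not pyparsing's actual code: it models the OneOrMore(item) + StringEnd() idiom, where each failed repetition discards the position of the mismatch inside the item.

```python
import re

ITEM = re.compile(r"def (\w+)\((\w*)\):")

def parse_item(text, pos):
    """Try to match one item starting at pos.  On failure, return None:
    the position of the mismatch *inside* the item is discarded, which
    is exactly the information a good error message would need."""
    m = ITEM.match(text, pos)
    return m.end() if m else None

def parse(text):
    """Model of OneOrMore(item) + StringEnd(): repeat item greedily,
    then require end of input.  On bad input, the reported location is
    where the last (failed) repetition began."""
    pos = 0
    while True:
        newpos = parse_item(text, pos)
        if newpos is None:
            break
        pos = newpos
    if pos != len(text):
        raise SyntaxError("expected end of text at position %d" % pos)
    return pos

print(parse("def a():def b():"))   # whole string consumed: position 16
# "def b(:" is broken at offset 14 (':' where ')' belongs), but the
# error below points at offset 8, the start of the broken item:
try:
    parse("def a():def b(:")
except SyntaxError as e:
    print(e)
```

The error lands at the start of the last repetition rather than at the character that actually failed, which is the behavior the thread is about.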
From: Paul M. <pa...@al...> - 2006-08-14 22:56:00
|
Chris -

I have struggled with this behavior since version 0.1 of pyparsing. I have made a faint stab at addressing it with ParseFatalException: when a parse action detects some semantic incorrectness in the input, it can raise ParseFatalException, which will stop all parsing and report the exact location. But since pyparsing uses exceptions as signals to proceed to the next grammar expression, it is hard to know in advance which exceptions are really, well, exceptional.

If anyone can dig up some research on how other recursive-descent parsers address this error-handling issue, I may be able to fold it into a future release. But I can't delve into this kind of research on my own at the moment (pyparsing doesn't pay like my day job does, and isn't likely to anytime soon).

-- Paul
From: Chris L. <ch...@ka...> - 2006-08-15 22:07:35
Hi,

I am not sure how I would classify a pyparsing parser, but I am not sure that I would call it a recursive descent parser. Let's define a recursive descent parser for parsing nested lists of the form:

    "[1,2,3,4,[1,2,3]]"

Traditionally there is a separation between the lexer (tokenizer) and the parser. For our purposes, let's assume that we have a lexer object available that has a next_token method, which returns the next token without consuming it, and a pop method, which returns and consumes the next token. Both return a (type, value, line) tuple.

    class ParseError(Exception):
        pass

    class ListParser(object):
        def __init__(self, lexer):
            self.lex = lexer

        def match(self, token_type):
            # consume the next token; it must be of the expected type
            tokt, tokv, tokl = self.lex.pop()
            if tokt != token_type:
                raise ParseError("Parse error at line %d: expected a %s, got a %s(%s)"
                                 % (tokl, token_type, tokt, str(tokv)))
            return tokv

        def parse(self):
            """Start the parsing process."""
            return self.list()

        def list(self):
            """list ::= '[' <list_item> {',' <list_item>} ']'"""
            self.match('lbrace')
            val = []
            while True:
                val.append(self.list_item())
                tokt, tokv, tokl = self.lex.next_token()
                if tokt == 'rbrace':
                    break
                self.match('comma')
            self.match('rbrace')
            return val

        def list_item(self):
            """list_item ::= int | string | float | list"""
            tokt, tokv, tokl = self.lex.next_token()
            if tokt in ('int', 'float', 'string'):
                self.lex.pop()
                return tokv
            if tokt == 'lbrace':
                return self.list()
            raise ParseError("Parse error at line %d: unexpected %s(%s)."
                             % (tokl, tokt, tokv))

At each stage we know exactly what tokens we need in order to continue parsing; otherwise we have an error. I think it should be possible to give pyparsing this behaviour, but my attempts at analysing the source code have left me a bit lost at this point. Once the ParserElement tree has been generated by the user, it should be possible to precompute the possible paths that lead to a particular terminal.
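The sketch above assumes a lexer object without defining one. A minimal stdlib-only version of that interface might look like the following; the class name ListLexer and the token regexes are hypothetical, chosen only to match the (type, value, line) tuples and the next_token/pop calls used in the parser.

```python
import re

class ListLexer(object):
    """Hypothetical lexer for the nested-list grammar: next_token()
    peeks, pop() consumes; both return a (type, value, line) tuple."""
    TOKEN_RE = re.compile(r"""
        (?P<lbrace>\[) | (?P<rbrace>\]) | (?P<comma>,)
      | (?P<float>[+-]?\d+\.\d+)       # try float before int
      | (?P<int>[+-]?\d+)
      | (?P<string>"[^"]*")
      | (?P<ws>\s+)
    """, re.VERBOSE)

    def __init__(self, text):
        self.tokens = []
        line, pos = 1, 0
        while pos < len(text):
            m = self.TOKEN_RE.match(text, pos)
            if m is None:
                raise ValueError("lex error at line %d" % line)
            if m.lastgroup != "ws":
                self.tokens.append((m.lastgroup, m.group(), line))
            line += m.group().count("\n")
            pos = m.end()
        self.tokens.append(("eof", None, line))
        self.i = 0

    def next_token(self):          # peek without consuming
        return self.tokens[self.i]

    def pop(self):                 # consume and return
        tok = self.tokens[self.i]
        self.i += 1
        return tok

lex = ListLexer('[1, 2.5, [3]]')
print([t[:2] for t in lex.tokens])
```

Each token carries its line number, which is what lets the parser's error messages point at the actual location of a problem.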
For instance, for the above we might write the following with pyparsing:

    integer = Regex(r'[+-]?(0|[1-9][0-9]*)')
    float = Regex(r'[+-]?[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?')
    string = QuotedString('"', '\\', multiline=True)
    lbrace = Literal('[')
    rbrace = Literal(']')
    mylist = Forward()
    mylist << lbrace + delimitedList(integer | float | string | mylist) + rbrace
    mysyntax = mylist.compile()

(compile() does not exist today; it is the new step I am proposing.) There are three main non-terminal types represented above: first, the anonymous Or; second, delimitedList, which is really a ZeroOrMore; third, mylist, which is an And object. Let's look at what would happen in compile for each of these, starting with the anonymous Or.

In compile, the Or would iterate through each of its sub-elements asking for its terminals, and build a dictionary from each terminal to the next ParserElement in the chain:

    self.path_cache = dict((x.get_terminal(), x) for x in self.exprs)

In a parse, instead of attempting a full parse of each alternative, test each terminal against the next token:

    for terminal in self.path_cache:
        if terminal.matches():
            self.path_cache[terminal].parse()
            break  # we take the first terminal that matches

Next, mylist, which is the And object: compile proceeds as above, except you might want to key the dictionary on the non-terminal and point to the terminal type for that object. The other notable difference in this case is that delimitedList gives a set of terminals that could match (float, string, integer). Parsing proceeds as normal, but we know whether we fail or not based on our cached terminal token matcher. In the ZeroOrMore case, we know whether we should continue parsing based on our cached token matcher.

OK, so this is a long post and I suspect it gets less coherent towards the end. If you are open to discussion on making this technique work, let me know. I am happy to help, and would give it a go myself if I fully understood what was happening in the parse functions as they are.
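The path_cache idea amounts to precomputing FIRST sets and dispatching on the upcoming input instead of trying each alternative in turn. A schematic, stdlib-only sketch of that dispatch (all names here are hypothetical, not pyparsing internals):

```python
import re

# Three terminal alternatives, as in the integer | float | string Or above.
branches = {
    "int":    re.compile(r"[+-]?\d+"),
    "float":  re.compile(r"[+-]?\d+\.\d+"),
    "string": re.compile(r'"[^"]*"'),
}

# Precomputed FIRST table: which alternatives can possibly start with a
# given character.  A quote can only start a string; a digit or sign can
# start a float or an int (float listed first, i.e. longest match first).
first = {}
for ch in '0123456789+-':
    first[ch] = ["float", "int"]
first['"'] = ["string"]

def parse_item(text, pos):
    """Dispatch on text[pos]: at most the small FIRST-set subset of
    alternatives is ever attempted, and an impossible first character
    is reported as an error at its exact position."""
    candidates = first.get(text[pos])
    if candidates is None:
        raise SyntaxError("unexpected %r at %d" % (text[pos], pos))
    for name in candidates:
        m = branches[name].match(text, pos)
        if m:
            return name, m.group(), m.end()
    raise SyntaxError("malformed %s at %d" % (candidates[0], pos))

print(parse_item('"abc" rest', 0))
```

When no candidate exists for the next character, the parser knows immediately that the input is wrong at that spot, rather than backtracking out of the whole branch.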
-Chris

On Mon, Aug 14, 2006 at 05:55:55PM -0500, Paul McGuire wrote:
> [...]
From: Paul M. <pa...@al...> - 2006-08-16 14:00:15
Chris -

Thanks for your extensive and thoughtful post. Let me try to respond to some of your comments:

1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness. I know you are not just referring to the recursive nature of some grammars, as distinguished by the presence of Forward() expressions containing references to themselves, but to the actual implementation of pyparsing. I received an e-mail shortly after my article got posted at O'Reilly ONLamp, suggesting that pyparsing was not "recursive descent" because there were no examples of recursive grammars. Here is an excerpt from my reply (in which we also discussed the list-parsing example), marked with '>>>>>'s - especially note the paragraph beginning with "But even if":

>>>>> Pyparsing does support recursive grammars. At least one is included in the examples directory that comes with pyparsing. Here is a simple grammar from my upcoming presentation at PyCon. It seems that about once a month, someone on comp.lang.python asks how to convert from a string representation of a list back to the original list, without using eval. Here is a non-recursive version, which parses lists that are composed only of simple quoted strings, integers, and floats:

--------------
<snip - description of development of recursive list-parsing grammar, culminating in...>

    listStr = Forward()
    listItem = real | integer | quotedString.setParseAction(removeQuotes) | Group(listStr)
    listStr << ( lbrack + Optional(delimitedList(listItem)) + rbrack )

The << operator indicates that we are not binding a new expression to listStr, but "injecting" an expression into the placeholder previously defined as a Forward.
With this minor change, we can now parse nested lists, and get back a nested result:

    test = "['a', 100, [], 3.14, [ +2.718, 'xyzzy', -1.414] ]"
    print listStr.parseString(test)

--------------

prints:

    ['a', 100, [], 3.1400000000000001, [2.718, 'xyzzy', -1.4139999999999999]]

I think I need to submit another article to ONLamp, with some more advanced grammar examples. But even if the example grammars in the article are not recursive, does this necessarily mean pyparsing's parsers are not "recursive descent"? The grammars themselves are composed of ParserElement objects organized into a hierarchy representing the grammar expressions and their respective composition from other, more basic grammar expressions. In this case, the top-level expression, listStr, is made up of three expressions, which must all match in their turn. The first and third expressions are just literals, and they call their respective parse methods to see if there is a left and right bracket at the start and end of the string. But the second expression is a composite, made up of an Optional element containing a delimited list of listItems. delimitedList(expr) is a helper function that returns

    expr + ZeroOrMore(delim + expr)

where the default delim is a literal comma. So from the top-level listStr, we call the Optional expression's parse method, which calls the parse method of the And object created by delimitedList - ___recursively descending___ through the grammar until a basic token (such as a Literal or Word) is found, or no match is found and an exception is raised. It is because we have used recursion to traverse the grammar structure that we can use exceptions and exception handlers to take care of backtracking through the grammar, with no separate maintenance of stacks or backtrack pointers. On the other hand, this is the feature that makes pyparsing vulnerable to left recursion, which I have only partially succeeded in addressing with the validate() method.
>>>>>

It is also this use of the stack for backtracking that makes real parse error detection difficult.

2. From your message:

> At each stage we know exactly what tokens we need in order to continue
> parsing, otherwise we have an error. I think that it should
> be possible to give pyparsing this behaviour, but my attempts at
> analysing the source code have left me a bit lost at this point.

Well, in one sense, pyparsing already has this behavior - each parse expression contains its own idea of "what should come next", and they are in turn composed into more complex expressions.

Do you mean that pyparsing should explode its grammar down to the raw character level, to become more of a character-based state machine rather than the "recursive traverser of parsing expression objects" that it currently is? Certainly, we could take a basic type such as Literal("abc") and convert it to a state machine that at some point contains logic such as:

    ch = readnext()
    if ch != 'a': return (errorFlag, currentLocation)
    ch = readnext()
    if ch != 'b': return (errorFlag, currentLocation)
    ch = readnext()
    if ch != 'c': return (errorFlag, currentLocation)
    return 'abc'

Or an expression to detect a proper noun - Word(alphas.upper(), alphas.lower()) - could be converted to:

    ch = readnext()
    if ch not in alphas.upper(): return (errorFlag, currentLocation)
    wd = ch
    while 1:
        ch = readnext()
        if ch not in alphas.lower():
            unreadlastchar()
            break
        wd += ch
    return wd

This is not too different from what pyparsing already does. So then let's look at merging these into a more complex expression:

    phrase = Literal("abc") | Word(alphas.upper(), alphas.lower())

Disjunctions are not really difficult for state machines, but the Python code gets messy.
To keep things clean, let's pretend Python supports 'goto', and we'll limit ourselves to only downward jumps:

    ch = readnext()
    if ch != 'a': goto <<readProperNoun>>
    ch = readnext()
    if ch != 'b': return (errorFlag, currentLocation)
    ch = readnext()
    if ch != 'c': return (errorFlag, currentLocation)
    return 'abc'

    <<readProperNoun>>
    if ch not in alphas.upper(): return (errorFlag, currentLocation)
    wd = ch
    while 1:
        ch = readnext()
        if ch not in alphas.lower():
            unreadlastchar()
            break
        wd += ch
    return wd

Still not terrible. What gets messy is when we start to combine disjunction with repetition. Let's just expand phrase slightly now:

    phrase = OneOrMore( Literal("abc") | Word(alphas.upper(), alphas.lower()) )

and let's parse the text "Honest abe". We fail on the second word, but what should be the reason? Because "abe" wasn't capitalized, or because it didn't end with a 'c'? In fact, our OneOrMore is satisfied - we *did* find one matching word, "Honest" - so phrase succeeds, and we continue matching at the 'a' in "abe". If the overall grammar contains:

    phrase + Literal("abe")

we're doing okay. But if the grammar is:

    phrase + StringEnd()

then we get an "expected end of string" error. So now we add a third candidate to our list of possible failures at the 'a' in "abe". Here's a fourth: what if the input text was really supposed to read "Honest Gabe"?

It's possible that more knowledge of the parsing process could be gained by rearchitecting pyparsing to use richer return values and codes (such as the tuple you propose), rather than the current ParseResults for successful parsing and exceptions for failed parsing. Then a higher-level routine could look for the furthest successful parse before failing, and assume that that was where the actual error occurred. I've taken a stab at such logic, but it has fallen short due to the limited scope of the individual parse expression objects.

Here's another idea on how to tackle this.
Rewrite pyparsing to pass around a text source rather than just the original string. For one thing, this would allow the text source to be a stream instead of a string, which is something I've wanted to add to pyparsing for a while. The text source could also be an object that kept track of the "furthest parse location", and possibly also the "exception raised at the furthest parse location". When parseString ultimately fails, consult the text source for the furthest progress, and report that as the error location.

Anyway, I need to get to work, so I've got to sign off just now. If this gives you more ideas, keep 'em coming.

-- Paul
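The furthest-parse-location idea can be sketched as a thin wrapper around the input text. The class and method names below are hypothetical, not pyparsing's actual API; ValueError stands in for ParseException.

```python
class TrackingSource(object):
    """Wrap the input text and record the furthest location at which
    any expression failed, plus the message raised there."""
    def __init__(self, text):
        self.text = text
        self.max_loc = 0
        self.max_msg = None

    def fail(self, loc, msg):
        # Remember only the deepest failure seen so far, then raise as
        # usual so backtracking still works unchanged.
        if loc >= self.max_loc:
            self.max_loc, self.max_msg = loc, msg
        raise ValueError(msg)

def match_literal(src, loc, lit):
    """Match lit character by character, failing at the exact offset
    of the first mismatch."""
    for i, ch in enumerate(lit):
        if loc + i >= len(src.text) or src.text[loc + i] != ch:
            src.fail(loc + i, "expected %r" % ch)
    return loc + len(lit)

# Try several alternatives the way a disjunction would; every failure
# updates the tracker, and the deepest failure is the best error report.
src = TrackingSource("abcx")
for candidate in ("abd", "abcd", "ab"):
    try:
        match_literal(src, 0, candidate)
    except ValueError:
        pass
print(src.max_loc, src.max_msg)
```

When the top-level parse ultimately fails, reporting src.max_loc and src.max_msg points the user at the deepest progress made, not at wherever the last alternative happened to give up.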
From: Chris L. <ch...@ka...> - 2006-09-01 18:40:23
Hi,

Sorry for the delayed response. Vacation, work, and life have all conspired to keep me away from this for a little while. Also, I think my previous message may have pulled you away from the direction I was trying to go in. My recursive descent example was meant to show how the problem is avoided in other approaches to parsing, in the hope that we could extract the part that makes error detection possible and use it in PyParsing. I hope that I have expressed myself better below. It will probably help that I have a more well-formed idea of how to solve the problem now, too :)

On Wed, Aug 16, 2006 at 09:00:09AM -0500, Paul McGuire wrote:
> 1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness.

I will not argue this any further. I think I see how it falls into the category of recursive descent, even if it is not how recursive descent parsers are traditionally written.

> It is also this use of the stack for backtracking that makes real parse
> error detection difficult.

I don't think it is the backtracking in general; I think it is the overuse of backtracking. When we branch while parsing, we need to know with absolute certainty that we are parsing the correct thing, or that we have just encountered an error. PyParsing's exceptions make no distinction between "this is the wrong option" and "there is no option".

> 2. From your message:
> > At each stage we know exactly what tokens we need in order to continue
> > parsing, otherwise we have an error. I think that it should
> > be possible to give pyparsing this behaviour, but my attempts at
> > analysing the source code have left me a bit lost at this point.
>
> Well, in one sense, pyparsing already has this behavior - each parse
> expression contains its own idea of "what should come next", and they are in
> turn composed into more complex expressions.
> Do you mean that pyparsing should explode its grammar down to the raw
> character level, to become more of a character-based state machine rather
> than the "recursive traverser of parsing expression objects" that it
> currently is?

I am saying we need to make a greater distinction between terminals (Token: Literal, Word, Regex, etc.) and non-terminals (ParseExpression: And, Or, OneOrMore, etc.). More traditional parser/lexer libraries make this easy on the library programmer by separating the two: a lexer provides a stream of tokens, and the parser combines those tokens into a syntax. That is more complicated for the library user than the interface provided by PyParsing, so it is not something we want to pursue too closely.

The key difference is that the traditional parser knows what state it needs to move to based on what token is next in the token stream, consuming the token when it reaches a terminal rule. PyParsing is not aware of the current token in the input stream in non-terminal rules (ParseExpression objects), and so it does not know which branch is the 'correct' one until it tries it; thus it needs to take all branches in turn until it finds one that works. When there is no correct branch, there is no way to determine whether that is because the input is wrong, or because we are in the wrong branch of a parent object. The assumption in that case is that we are currently in a wrong branch, and we backtrack to the previous level instead of knowing that we indeed have an error in the input.

A simple solution to this problem is to iterate through our options and call a function which I will call 'match'. If match returns True, then we know that this object is the 'correct' branch. The key difference between match and parseImpl is in what is considered success.
'And.match' has a significantly different idea of success from 'And.parseImpl', in that match returns True (success) if the first sub-object matches, whereas parseImpl only registers success if all sub-objects do so as well. This is an important distinction, because we are looking for the 'correct' branch, not a fully successful parse. 'Or.match' returns True if any of its sub-objects return True. OneOrMore returns True if its sub-object returns True. ZeroOrMore always returns True. Remember that match is used to gate entrance to a sub-object's parseImpl; match returning True means "I am the correct path." Terminals (Literal, Word, Regex, etc.) do an actual test of the input stream and return the appropriate value. parseImpl would then need to be modified to use this new information, and ultimately it would be able to flag an error in the input at the location where it actually occurs.

One problem with this approach is that we are at least doubling the number of comparisons. That may be cancelled out by the reduction in backtracking; if we make these modifications, we can look at optimizations after we have something that works, perhaps with some kind of caching mechanism. The other problem I can think of is that it changes the class of grammars that can be parsed with PyParsing: a grammar that works with the current PyParsing might not work under this more definitive mechanism. I can't think of a concrete example off the top of my head, and I am not even sure how to go about definitively proving or disproving it.

I hope that was more coherent than my last message and gives you some ideas about how we can proceed,

Chris
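The match() gating proposal can be sketched with toy expression classes. These are hypothetical stand-ins, not pyparsing's classes: match() cheaply answers "am I the correct branch?", so once Or commits to a branch, a failure inside it is reported at its true location instead of triggering backtracking.

```python
class Lit(object):
    """Toy terminal: matches a fixed string."""
    def __init__(self, s):
        self.s = s
    def match(self, text, loc):
        # Terminals do an actual test against the input stream.
        return text.startswith(self.s, loc)
    def parse(self, text, loc):
        if not self.match(text, loc):
            raise SyntaxError("expected %r at %d" % (self.s, loc))
        return loc + len(self.s)

class And(object):
    def __init__(self, *exprs):
        self.exprs = exprs
    def match(self, text, loc):
        # Per the proposal: correct branch iff the *first* sub-object matches.
        return self.exprs[0].match(text, loc)
    def parse(self, text, loc):
        for e in self.exprs:
            loc = e.parse(text, loc)   # a failure here is a real input error
        return loc

class Or(object):
    def __init__(self, *exprs):
        self.exprs = exprs
    def match(self, text, loc):
        return any(e.match(text, loc) for e in self.exprs)
    def parse(self, text, loc):
        for e in self.exprs:
            if e.match(text, loc):     # commit to the gated branch
                return e.parse(text, loc)
        raise SyntaxError("no alternative matches at %d" % loc)

g = Or(And(Lit("ab"), Lit("c")), Lit("xy"))
print(g.parse("abc", 0))
```

Parsing "abd" with this grammar commits to the And branch (its first element "ab" matches) and then fails with "expected 'c' at 2" - the actual error location - rather than falling back to try "xy".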