Re: [Pyparsing] More useful parse errors
Brought to you by:
ptmcg
From: Chris L. <ch...@ka...> - 2006-08-15 22:07:35
|
Hi, I am not sure how I would classify a pyparsing parser, but I am not sure that I would call it a recursive decent parser. Lets define a recursive decent parser for parsing nested lists of the form: "[1,2,3,4,[1,2,3]]" A traditionally there is a separation between the lexer(tokenizer) and parser. For our purposes lets assume that we have a lex object available that has a next_token method that returns the next token and a pop method that returns and consumes the next token. It will return a (type, value, line) tuple. class ListParser(object): def __init__(self, lexer): self.lex = lexer def match(self, token_type): tokt,tokv,tokl = lex.pop() if tokt != token_type: raise ParseError("Parse error at line %d: Expected a %s, got a %s(%s)" % (tokl, token_type, tokt, str(tokv)) ) def parse(self): """ this will start the parsing process""" return list() def list(self): """ list ::= '[' <list_item> {',' <list_item>} ']' """ self.match('lbrace') val = [] while True: li = self.list_item() val.append(li) tokt, tokv, tokl = lex.next_token() if tokt == 'rbrace': break; self.match('comma') self.match('rbrace') return val def list_item(self): """ list_item ::= int | string | float | list """ tokt, tokv, tokl = lex.next_token() if tokt in ('int', 'float', 'string'): lex.pop() return tokv if tokt == 'lbrace': return self.list() raise ParseError("Parse error at line %d: Unexpected %s(%s)." % (tokl, tokt, tokv) At each stage we know exactly what tokens we need in order to continue parsing, otherwise we have an error. I think that it should be possible to give pyparsing this behaviour, but my attempts at analysing the source code have left me a bit lost at this point. Once the ParserElement tree has been generated by the user it should be possible to precompute the possible recursions that lead to a particular tuple. For instance for the above we might write the following with pyparsing: integer = Regex(r'[+-]?0|([1-9][0-9]*)') float = Regex(r'[+-]?[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?') string = QuotedString('"', '\\', multiline=True) lbrace = Literal('[') rbrace = Literal('[') mylist = Forward() mylist << lbrace + delimitedList(integer | float | string | mylist ) + rbrace mysyntax = mylist.compile() There are 3 main non-terminal types represented above. First we have the anonymous Or, second, delimitedList which is really a ZeroOrMore, third we have the mylist which is an And object. Lets look at what happens for compile in each of these instances, starting with the anonymous Or In compile it would iterate through each of its sub elements asking for terminals and create a dictionary from a particular terminal to the next ParserElement in the chain: self.path_cache = dict((x.get_terminal(),x) for x in self.exprs) In a parse instead of trying a parse on each element, try a parse on each token: for terminal in self.path_cache: if terminal.matches(): self.path_cache[terminal].parse() break # we take the first mylist which is the And object: Compile procedes as above, except you might want to make the dictionary key on the non-terminal and point to the terminal type for that object. The other notable difference in this case is that delimitedList gives a bunch of terminals that could match(float, string, integer). parsing procedes as normal, but we know whether we fail or not based on our cahed terminal token matcher. In the ZeroOrMany case, we know whether we should continue parsing based on our cached token matcher. Ok, so this is a long post and I suspect it gets less coherent towards the end. If you are open for discussion on making this technique work let me know, I am happy to help and would give it a go myself if I fully understood what was happening in the parse functions as they are. -Chris On Mon, Aug 14, 2006 at 05:55:55PM -0500, Paul McGuire wrote: > Chris - > > I have struggled with this behavior since version 0.1 of pyparsing. I have > made a faint stab at addressing this using the ParseFatalException: when a > parse action detects some semantic incorrectness in input it can raise > ParseFatalException which will stop all parsing and report the exact > location. But since pyparsing uses exceptions as signals to proceed to the > next grammar expression, it's hard to know in advance which exceptions are > really, well, exceptional. > > If anyone can dig up some research on how other recursive-descent parsers > address this error-handling issue, I may be able to fold it into a future > release. But I can't delve into this kind of research on my own at the > moment (pyparsing doesn't pay like my day job does, and isn't likely to > anytime soon). > > -- Paul > > > > -----Original Message----- > > From: pyp...@li... > > [mailto:pyp...@li...] On > > Behalf Of Chris Lambacher > > Sent: Friday, August 11, 2006 4:59 PM > > To: pyp...@li... > > Subject: [Pyparsing] More useful parse errors > > > > Hi, > > > > I am using pyparsing to parse CIM mof files. It is a rather > > lengthy grammar > > with lots of instances of OneOrMore. The top level structure > > is a like this: > > > > bnf = OneOrMore(rest_of_grammar) + StringEnd() > > > > This raises an exception when there is a parse error, but the > > error is not > > flagged at the point where the error occurs. Since everyone > > here should be > > familiar with Python grammar, I will provide an example of > > what would happen > > if we had a pyparsing grammar for python and attempted to > > parse something with > > a syntax error: > > > > > > class blah(object): > > pass > > > > class error_says_here(object): > > def __init__(self): > > self.a = 1 > > self.b = 2 > > > > def error_really_here(self: > > pass > > > > In this case the error would flag as having occurred at > > error_says_here rather > > than error_really_here. > > > > Is there a way to work around this behavior? I have not yet > > come across an > > example of a grammar that looks like it does a really good > > job of error > > handling. Does the architecture of pyparsing make it easy to > > write a parser > > for data that you know is syntacticly correct, but difficult > > or impossible to > > write a parser that can give useful error reports to the user > > in the face of > > syntax errors? > > > > Traditional parsers would know all possible tokens that could > > be next and if > > it encountered something that wasn't one of the above tokens, > > it would flag an > > error at the location. pyparsing in contrast says, will this > > parse, no, ok > > will this parse, no, ok, will this parse? The result is that > > it can throw > > away large portions of valid syntax to eventually end > > immediately after parsing > > the last full rest_of_grammar from our OneOrMore(rest_of_grammar) + > > StringEnd() idiom. > > > > It may be possible for pyparsing to precompute what the > > possible next tokens > > are, which may have the side effect of making pyparsing more > > efficient since > > it won't have so much trial and error. > > > > -Chris > > > > > > -------------------------------------------------------------- > > ----------- > > Using Tomcat but need to do more? Need to support web > > services, security? > > Get stuff done quickly with pre-integrated technology to make > > your job easier > > Download IBM WebSphere Application Server v.1.0.1 based on > > Apache Geronimo > > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057& > > dat=121642 > > _______________________________________________ > > Pyparsing-users mailing list > > Pyp...@li... > > https://lists.sourceforge.net/lists/listinfo/pyparsing-users > > > > |