Re: [Pyparsing] More useful parse errors

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi,

I am not sure how I would classify a pyparsing parser, but I am not sure that
I would call it a recursive decent parser.  

Lets define a recursive decent parser for parsing nested lists of the form:
"[1,2,3,4,[1,2,3]]"

A traditionally there is a separation between the lexer(tokenizer) and parser.
For our purposes lets assume that we have a lex object available that has a
next_token method that returns the next token and a pop method that returns
and consumes the next token.  It will return a (type, value, line) tuple.

class ListParser(object):
    def __init__(self, lexer):
        self.lex = lexer

    def match(self, token_type):
        tokt,tokv,tokl = lex.pop()
        if tokt != token_type:
            raise ParseError("Parse error at line %d: Expected a %s, got a
%s(%s)" % (tokl, token_type, tokt, str(tokv)) )

    def parse(self):
        """ this will start the parsing process"""

        return list()

    def list(self):
        """ list ::= '[' <list_item> {',' <list_item>} ']' """
        self.match('lbrace')
        val = []

        while True:
            li = self.list_item()
            val.append(li)

            tokt, tokv, tokl = lex.next_token()
            if tokt == 'rbrace':
                break;

            self.match('comma')

        self.match('rbrace')

        return val

    def list_item(self):
        """ list_item ::= int |
                          string |
                          float |
                          list 
        """
        tokt, tokv, tokl = lex.next_token()

        if tokt in ('int', 'float', 'string'):
            lex.pop()
            return tokv
        if tokt == 'lbrace':
            return self.list()

        raise ParseError("Parse error at line %d: Unexpected %s(%s)." % (tokl, tokt,
tokv)

At each stage we know exactly what tokens we need in order to continue
parsing, otherwise we have an error.  I think that it should be possible to
give pyparsing this behaviour, but my attempts at analysing the source code
have left me a bit lost at this point.

Once the ParserElement tree has been generated by the user it should be
possible to precompute the possible recursions that lead to a particular
tuple.

For instance for the above we might write the following with pyparsing:

integer = Regex(r'[+-]?0|([1-9][0-9]*)')
float = Regex(r'[+-]?[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?')
string = QuotedString('"', '\\', multiline=True)

lbrace = Literal('[')
rbrace = Literal('[')

mylist = Forward()
mylist << lbrace + delimitedList(integer | float | string | mylist ) + rbrace

mysyntax = mylist.compile()

There are 3 main non-terminal types represented above.  First we have the
anonymous Or, second, delimitedList which is really a ZeroOrMore, third we
have the mylist which is an And object.

Lets look at what happens for compile in each of these instances, starting
with the anonymous Or

In compile it would iterate through each of its sub elements asking for
terminals and create a dictionary from a particular terminal to the next
ParserElement in the chain:
    self.path_cache = dict((x.get_terminal(),x) for x in self.exprs)

In a parse instead of trying a parse on each element, try a parse on each token:
for terminal in self.path_cache:
    if terminal.matches():
        self.path_cache[terminal].parse()
        break # we take the first

mylist which is the And object:

Compile procedes as above, except you might want to make the dictionary key on
the non-terminal and point to the terminal type for that object.  The other
notable difference in this case is that delimitedList gives a bunch of
terminals that could match(float, string, integer).

parsing procedes as normal, but we know whether we fail or not based on our
cahed terminal token matcher.

In the ZeroOrMany case, we know whether we should continue parsing based on
our cached token matcher.

Ok, so this is a long post and I suspect it gets less coherent towards the
end.  If you are open for discussion on making this technique work let me
know, I am happy to help and would give it a go myself if I fully understood
what was happening in the parse functions as they are.

-Chris

On Mon, Aug 14, 2006 at 05:55:55PM -0500, Paul McGuire wrote:
> Chris -
> 
> I have struggled with this behavior since version 0.1 of pyparsing.  I have
> made a faint stab at addressing this using the ParseFatalException: when a
> parse action detects some semantic incorrectness in input it can raise
> ParseFatalException which will stop all parsing and report the exact
> location.  But since pyparsing uses exceptions as signals to proceed to the
> next grammar expression, it's hard to know in advance which exceptions are
> really, well, exceptional.
> 
> If anyone can dig up some research on how other recursive-descent parsers
> address this error-handling issue, I may be able to fold it into a future
> release.  But I can't delve into this kind of research on my own at the
> moment (pyparsing doesn't pay like my day job does, and isn't likely to
> anytime soon).
> 
> -- Paul
> 
> 
> > -----Original Message-----
> > From: pyp...@li... 
> > [mailto:pyp...@li...] On 
> > Behalf Of Chris Lambacher
> > Sent: Friday, August 11, 2006 4:59 PM
> > To: pyp...@li...
> > Subject: [Pyparsing] More useful parse errors
> > 
> > Hi,
> > 
> > I am using pyparsing to parse CIM mof files.  It is a rather 
> > lengthy grammar
> > with lots of instances of OneOrMore.  The top level structure 
> > is a like this:
> > 
> > bnf = OneOrMore(rest_of_grammar) + StringEnd()
> > 
> > This raises an exception when there is a parse error, but the 
> > error is not
> > flagged at the point where the error occurs.  Since everyone 
> > here should be
> > familiar with Python grammar, I will provide an example of 
> > what would happen
> > if we had a pyparsing grammar for python and attempted to 
> > parse something with
> > a syntax error:
> > 
> > 
> > class blah(object):
> >     pass
> > 
> > class error_says_here(object):
> >     def __init__(self):
> >         self.a = 1
> >         self.b = 2
> > 
> >     def error_really_here(self:
> >         pass
> > 
> > In this case the error would flag as having occurred at 
> > error_says_here rather
> > than error_really_here.
> > 
> > Is there a way to work around this behavior?  I have not yet 
> > come across an
> > example of a grammar that looks like it does a really good 
> > job of error
> > handling.  Does the architecture of pyparsing make it easy to 
> > write a parser
> > for data that you know is syntacticly correct, but difficult 
> > or impossible to
> > write a parser that can give useful error reports to the user 
> > in the face of
> > syntax errors?
> > 
> > Traditional parsers would know all possible tokens that could 
> > be next and if
> > it encountered something that wasn't one of the above tokens, 
> > it would flag an
> > error at the location.  pyparsing in contrast says, will this 
> > parse, no, ok
> > will this parse, no, ok, will this parse?  The result is that 
> > it can throw
> > away large portions of valid syntax to eventually end 
> > immediately after parsing
> > the last full rest_of_grammar from our OneOrMore(rest_of_grammar) +
> > StringEnd() idiom.
> > 
> > It may be possible for pyparsing to precompute what the 
> > possible next tokens
> > are, which may have the side effect of making pyparsing more 
> > efficient since
> > it won't have so much trial and error.
> > 
> > -Chris
> > 
> > 
> > --------------------------------------------------------------
> > -----------
> > Using Tomcat but need to do more? Need to support web 
> > services, security?
> > Get stuff done quickly with pre-integrated technology to make 
> > your job easier
> > Download IBM WebSphere Application Server v.1.0.1 based on 
> > Apache Geronimo
> > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&
> > dat=121642
> > _______________________________________________
> > Pyparsing-users mailing list
> > Pyp...@li...
> > https://lists.sourceforge.net/lists/listinfo/pyparsing-users
> > 
> >