Re: [Pyparsing] More useful parse errors
Brought to you by:
ptmcg
From: Chris L. <ch...@ka...> - 2006-09-01 18:40:23
|
Hi, Sorry for the delayed response. Vacation, work, life all have conspired to keep me away from this for a little while. Also, I think my previous message may have pulled you away from the direction I was trying to go in. My recursive descent example was meant as an example of how the problem is avoided in other approaches to parsing in the hopes that we could extract the part that makes error detection possible and use that in PyParsing. I hope that I have expressed myself better below. It will probably help that I have a more well formed idea of how to solve the problem now too :) On Wed, Aug 16, 2006 at 09:00:09AM -0500, Paul McGuire wrote: > 1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness. I will not argue this any further. I think I see how it falls in the category of recursive descent even if it is not how recursive descent parsers are traditionally written. > > It is also this use of the stack for backtracking that makes real parse > error detection difficult. I don't think it is the backtracking in general, I think it is the overuse of backtracking. When we branch while parsing we need to know with absolute certainty that we are parsing the correct thing, or that we have just encountered an error. PyParsing does not make a distinction on exceptions between "this is the wrong option" and "there is no option". > > > 2. From your message: > > At each stage we know exactly what tokens we need in order to continue > > parsing, otherwise we have an error. I think that it should > > be possible to give pyparsing this behaviour, but my attempts at > > analysing the source code have left me a bit lost at this point. > > > Well, in one sense, pyparsing already has this behavior - each parse > expression contains its own idea of "what should come next", and they are in > turn composed into more complex expressions. > > Do you mean that pyparsing should explode its grammar down to the raw > character level, to become more of a character-based state machine rather > than the "recursive traverser of parsing expression objects" that it > currently is? Certainly, we could take the basic types such as I am saying we need to make a greater distinction between terminals(Token: Literal, Word, Regex, etc) and non terminals (ParseExpression: And,Or,OneOrMore, etc). More traditional parser/lexer libraries make this easy on the library programmer by separating the two. A lexer provides a stream of tokens, the parser combines these tokens into a syntax. This is more complicated for the library user than the interface provided by PyParsing, so not something we want to pursue too closely. The key difference is that the traditional parser knows what state it needs to move to based on what token is next in the token stream, consuming the token when it reaches a terminal rule. PyParsing is not aware of the current token in the input stream in non terminal rules (ParseExpression objects) and so it does not know which branch is the 'correct' one until it tries it and thus needs to take all branches in turn until it finds one that works. When there is no correct branch, there is no way to determine if that is because the input is wrong, or if we are in the wrong branch of a parent object. The assumption in the case that there is no correct branch is that we are currently in a wrong branch, and we backtrack to the previous level instead of knowing that we indeed have an error in the input. A simple solution to this problem is to iterate through our options and call a function which I will call 'match'. If match returns True, then we know that this object is the 'correct' branch. The key difference between match and parseImpl is in what is considered success. 'And.match' has a significantly different idea of success from 'And.parseImpl' in that we return True (success) if the first sub-object matches, where as parseImpl only registers success if all sub-objects due so as well. This is an important distinction to make because we are looking for the 'correct' branch, not a fully successful parse. 'Or.match' returns True if any of its sub-objects return True. OneOrMore returns True if its sub-object returns True. ZeroOrMore always returns True. Remember that match is used to gate entrance to a sub-object's parseImpl; match returning True means "I am the correct path." Terminals (Literal, Word, Regex, etc) do an actual test to see whether the input stream matches and returns an appropriate value. parseImpl would obviously then need to be modified to use this new information and ultimately would be able to flag an error in the input at the location it actually occurs. One problem with this approach is that we are at least doubling the number of comparisons. That may cancel out because of reducing the amount of backtracking that is done. If we do these modifications, we can look at optimizations after we have something that works. We may need to do some kind of caching mechanism. The other problem with this approach that I can think of is that it changes the class of grammar that can be parsed with PyParsing. It may be possible that some grammar as defined with the current PyParsing will work, but because of the definitive nature of this new mechanism may not. I can't think of a concrete example off the top of my head and am not even sure how to approach definitively prove or disprove it. I hope that was more coherent than my last message and gives you some ideas about how we can proceed, Chris |