Re: [Pyparsing] More useful parse errors

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi,

Sorry for the delayed response.  Vacation, work, life all have conspired to
keep me away from this for a little while.

Also, I think my previous message may have pulled you away from the direction
I was trying to go in.  My recursive descent example was meant as an example
of how the problem is avoided in other approaches to parsing in the hopes that
we could extract the part that makes error detection possible and use that in
PyParsing.  I hope that I have expressed myself better below.  It will
probably help that I have a more well formed idea of how to solve the problem
now too :)

On Wed, Aug 16, 2006 at 09:00:09AM -0500, Paul McGuire wrote:
> 1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness. 
I will not argue this any further.  I think I see how it falls in the category
of recursive descent even if it is not how recursive descent parsers are
traditionally written.
> 
> It is also this use of the stack for backtracking that makes real parse
> error detection difficult.
I don't think it is the backtracking in general, I think it is the overuse of
backtracking.  When we branch while parsing we need to know with absolute
certainty that we are parsing the correct thing, or that we have just
encountered an error.  PyParsing does not make a distinction on exceptions
between "this is the wrong option" and "there is no option".
> 
> 
> 2. From your message:
> > At each stage we know exactly what tokens we need in order to continue
> > parsing, otherwise we have an error.  I think that it should 
> > be possible to give pyparsing this behaviour, but my attempts at 
> > analysing the source code have left me a bit lost at this point.
> >
> Well, in one sense, pyparsing already has this behavior - each parse
> expression contains its own idea of "what should come next", and they are in
> turn composed into more complex expressions.
> 
> Do you mean that pyparsing should explode its grammar down to the raw
> character level, to become more of a character-based state machine rather
> than the "recursive traverser of parsing expression objects" that it
> currently is?  Certainly, we could take the basic types such as
I am saying we need to make a greater distinction between terminals(Token:
Literal, Word, Regex, etc) and non terminals (ParseExpression:
And,Or,OneOrMore, etc).

More traditional parser/lexer libraries make this easy on the library
programmer by separating the two.  A lexer provides a stream of tokens, the
parser combines these tokens into a syntax.  This is more complicated for the
library user than the interface provided by PyParsing, so not something we
want to pursue too closely.

The key difference is that the traditional parser knows what state it needs to
move to based on what token is next in the token stream, consuming the token
when it reaches a terminal rule.  PyParsing is not aware of the current token
in the input stream in non terminal rules (ParseExpression objects) and so it
does not know which branch is the 'correct' one until it tries it and thus
needs to take all branches in turn until it finds one that works.  When there
is no correct branch, there is no way to determine if that is because the
input is wrong, or if we are in the wrong branch of a parent object.  The
assumption in the case that there is no correct branch is that we are
currently in a wrong branch, and we backtrack to the previous level instead of
knowing that we indeed have an error in the input.

A simple solution to this problem is to iterate through our options and call a
function which I will call 'match'.  If match returns True, then we know that
this object is the 'correct' branch.  The key difference between match and
parseImpl is in  what is considered success.  'And.match' has a significantly
different idea of success from 'And.parseImpl' in that we return True
(success) if the first sub-object matches, where as parseImpl only registers
success if all sub-objects due so as well.  This is an important distinction
to make because we are looking for the 'correct' branch, not a fully
successful parse.  'Or.match' returns True if any of its sub-objects return
True.  OneOrMore returns True if its sub-object returns True.  ZeroOrMore
always returns True.  Remember that match is used to gate entrance to a
sub-object's parseImpl; match returning True means "I am the correct path."

Terminals (Literal, Word, Regex, etc) do an actual test to see whether the
input stream matches and returns an appropriate value.

parseImpl would obviously then need to be modified to use this new information
and ultimately would be able to flag an error in the input at the location it
actually occurs.  

One problem with this approach is that we are at least doubling the number of
comparisons.  That may cancel out because of reducing the amount of
backtracking that is done.  If we do these modifications, we can look at
optimizations after we have something that works.  We may need to do some kind
of caching mechanism.

The other problem with this approach that I can think of is that it changes
the class of grammar that can be parsed with PyParsing.  It may be possible
that some grammar as defined with the current PyParsing will work, but because
of the definitive nature of this new mechanism may not.  I can't think of a
concrete example off the top of my head and am not even sure how to approach
definitively prove or disprove it.

I hope that was more coherent than my last message and gives you some ideas
about how we can proceed,
Chris