Re: [Pyparsing] More useful parse errors
Brought to you by:
ptmcg
|
From: Paul M. <pa...@al...> - 2006-08-16 14:00:15
|
Chris -
Thanks for your extensive and thoughtful post. Let me try to respond to
some of your comments:
1. I'm pretty sure pyparsing *does* qualify for "recursive descent"-ness. I
know you are not just referring to the recursive nature of some grammars, as
distinguished by the presence of Forward() expressions containing references
to themselves, but to the actual implementation of pyparsing. I receieved
an e-mail shortly after my article got posted at OReilly ONLamp, suggesting
that pyparsing was not "recursive descent" because there were no examples of
recursive grammars. Here is an excerpt from my reply (in which we also
discussed the list-parsing example), marked with '>>>>>'s - especially note
the paragraph beginning with "But even if":
>>>>>
Pyparsing does support recursive grammars. At least one is included in the
examples directory that comes with pyparsing. Here is a simple grammar from
my upcoming presentation at PyCon. It seems that about once a month,
someone on comp.lang.python asks how to convert from a string representation
of a list back to the original list, without using eval. Here is a
non-recursive version, which parses lists that are composed only of simple
quoted strings, integers, and floats:
--------------
<snip - description of development of recursive list-parsing grammar,
culminating in...>
listStr = Forward()
listItem = real | integer | quotedString.setParseAction(removeQuotes) |
Group(listStr)
listStr << ( lbrack + Optional(delimitedList(listItem)) + rbrack )
The << operator indicates that we are not binding a new expression to
listStr, but "injecting" an expression into the placeholder previously
defined as a Forward.
With this minor change, we can now parse nested lists, and get back a nested
result:
test = "['a', 100, [], 3.14, [ +2.718, 'xyzzy', -1.414] ]"
print listStr.parseString(test)
--------------
prints:
['a', 100, [], 3.1400000000000001, [2.718, 'xyzzy', -1.4139999999999999]]
I think I need to submit another article to ONLamp, with some more advanced
grammar examples.
But even if the example grammars in the article are not recursive, does this
necessarily mean pyparsing's parsers are not "recursive descent"? The
grammars themselves are composed of ParserElement objects organized into a
hierarchy representing the grammar expressions and their respective
composition from other, more basic grammar expressions. In this case, the
top level expression, listStr, is made up of 3 expressions, which must all
match in their turn. The first and third expressions are just literals, and
they call their respective parse methods to see if there is a left and right
bracket at the start and end of the string. But the second expression is a
composite, made up of an Optional element, containing a delimited list of
listItems. delimitedList(expr) is a helper function, that returns
expr + ZeroOrMore(delim + expr)
where the default delim is a literal comma. So from the top level listStr,
we call the Optional expression's parse method, which calls the parse method
of the And object created by delimitedList - ___recursively descending___
through the grammar until a basic token (such as a Literal or Word) is
found, or no match is found and an exception raised. It is because we have
used recursion to traverse the grammar structure, that we can use exceptions
and exception handlers to take care of backtracking through the grammar,
with no separate maintenance of stacks or backtrack pointers. On the other
hand, this is the feature that makes pyparsing vulnerable to left recursion,
which I have only partially succeeded in addressing with the validate()
method.
>>>>>
It is also this use of the stack for backtracking that makes real parse
error detection difficult.
2. From your message:
> At each stage we know exactly what tokens we need in order to continue
> parsing, otherwise we have an error. I think that it should
> be possible to give pyparsing this behaviour, but my attempts at
> analysing the source code have left me a bit lost at this point.
>
Well, in one sense, pyparsing already has this behavior - each parse
expression contains its own idea of "what should come next", and they are in
turn composed into more complex expressions.
Do you mean that pyparsing should explode its grammar down to the raw
character level, to become more of a character-based state machine rather
than the "recursive traverser of parsing expression objects" that it
currently is? Certainly, we could take the basic types such as
Literal("abc") and convert to a state machine that at some point contains
logic such as:
ch = readnext()
if ch != 'a': return (errorFlag,currentLocation)
ch = readnext()
if ch != 'b': return (errorFlag,currentLocation)
ch = readnext()
if ch != 'c': return (errorFlag,currentLocation)
return 'abc'
Or an expression to detect a proper noun - Word(alphas.upper(),
alphas.lower()) - could be converted to:
ch = readnext()
if ch not in alphas.upper(): return (errorFlag, currentLocation)
wd = ch
while 1:
ch = readnext()
if ch not in alphas.lower():
unreadlastchar()
break
wd += ch
return wd
This is not too different from what pyparsing already does. So then let's
look at merging these into a more complex expression:
phrase = Literal("abc") | Word(alphas.upper(),alphas.lower())
Disjunctions are not really difficult for state machines, but the Python
code gets messy. To keep things clean, let's pretend Python supports
'goto', and we'll limit ourselves to only downward jumps:
ch = readnext()
if ch != 'a': goto <<readProperNoun>>
ch = readnext()
if ch != 'b': return (errorFlag,currentLocation)
ch = readnext()
if ch != 'c': return (errorFlag,currentLocation)
return 'abc'
ch = readnext()
<<readProperNoun>>
if ch not in alphas.upper(): return (errorFlag, currentLocation)
wd = ch
while 1:
ch = readnext()
if ch not in alphas.lower():
unreadlastchar()
break
wd += ch
return wd
Still not terrible.
What gets messy is when we start to combine disjunction with repetition.
Let's just expand phrase slightly now:
phrase = OneOrMore( Literal("abc") | Word(alphas.upper(),alphas.lower())
)
and let's parse the text "Honest abe".
We fail on the second word, but what should be the reason? Because "abe"
wasn't capitalized, or because it didn't end with a 'c'? So in fact, our
OneOrMore is satisfied - we *did* find one matching word, "Honest" - so
phrase succeeds, and we continue matching at the 'a' in "abe". In fact, if
the overall grammar contains:
phrase + Literal("abe")
We're doing okay. But if the grammar is:
phrase + stringEnd()
Then we get an "expected end of string" error.
So now we add a third candidate to our list of possible failures at the 'a'
in "abe". Here's a fourth: what if the input text was really supposed to
read "Honest Gabe"?
It's possible that more knowledge of the parsing process could be gained by
rearchitecting pyparsing to use richer return values and codes (such as the
tuple you propose), rather than the current ParseResults for successful
parsing and exceptions for failed parsing. Then a higher-level routine
could look for the furthest successful parse before failing, and assume that
that was where the actual error occurred. I've taken a stab at such logic,
but it has fallen short due to the limited scope for the individual parse
expression objects.
Here's another idea on how to tackle this. Rewrite pyparsing to pass around
a text source rather than just the original string. For one thing, this
would allow the text source to be a stream instead of a string, which is
something I've wanted to add to pyparsing for a while. The text source
could also be an object that kept track of "furthest parse location", and
possibly also "exception raised at furthest parse location". When
parseString ultimately fails, consult the text source for the furthest
progress, and report that as the error location.
Anyway, I need to get to work, so I've got to sign off just now. If this
gives you more ideas, keep 'em coming.
-- Paul
|