Re: [Pyparsing] packrat bug + various

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Paolo -

Thanks for the whitespace bugfix.  Did you have a test case that raised this
problem in the first place?  If so, I'd like to add it to my unit tests.

Packrat parsing has issues besides the ParseResults problem you describe.
Some parse actions have important side effects that get omitted when results
are retrieved from a cache instead of being explicitly reparsed.  The
packrat concept showed such promise I couldn't pass it up, but I had to
leave it disabled by default, so as not to break code for existing
applications.  This is an especially interesting phenomenon for me; as far
as pure theoretical parsing is concerned, packratting is a no-brainer.  But
pyparsing's implementation, including parse actions, goes beyond the pure
parsing function, and in the process breaks the packrat concept.

In upgrading your app to be able to use packratting, is it possible that
your grammar could be restructured?  For instance, in your example, if you
change:
    e = A ^ ( A + B + C )
to:
    e2 = A + Optional(B + C)
then packratting does no harm.  I'm sure this is a simplified example for
the sake of easy posting, but maybe similar mods could be made to your base
app grammar.

Performance enhancements is a sore subject for me.  Pyparsing has gotten
some pretty nasty comments with respect to its parsing speed, so I admit I
committed a number of coding sins in trying to eke out a little more speed
at parsing time.  I've also been reluctant to port parts of pyparsing to C,
as I think this complicates its inclusion in other projects, and also its
portability.  

Here are the main performance enhancement techniques, in rough descending
order of importance, as I recall (not counting packratting):
- cacheing of exception objects - pyparsing raises exceptions a *lot* as
various paths are tried and rejected.  I optimized the creation of
exceptions as much as possible, and then realized that I could keep raising
the same exception object.
- using startswith in Literal instead of comparing the literal string with a
string slice. Literals are the most common terminal, and so
Literal.parseImpl has to be fast.
- ParseResults reuse with __new__ and __iadd__.  There are non-performance
reasons for doing this also, mostly having to do with gathering results from
the exprs in an And.
- avoidance of try/except - one of the ugliest parts of the _parse method is
the repetition of code, depending on whether debugging methods are defined
or not.  But setting up the try/except has a cost, and _parse is about as
high-traffic a method as there is.
- using re's or dict keys for Word.parseImpl testing with initChars and
bodyChars

But pyparsing seems to be stabilizing - the last release was about 7 weeks
ago, and I've made only minor changes since then.  Perhaps pyparsing could
move forward as a pair of releases, one pure Python and one C-enhanced
(similar to elementTree or StringIO).  I've done *no* Python+C development,
so this is the other barrier for me to doing such a task.  I *have* put in
way too much time on pyparsing, at the expense of some family and work time,
so I'm not eager to take on a major rewrite just now.

If I had a multithreaded app that used pyparsing, I would probably serialize
access to the parser object.  I don't want to make the resetCache invocation
explicit, since the need for it is kind of obscure, and so likely to be
overlooked.

Thanks for your persistence with pyparsing, Paolo, I know you have been
working with it for a while.  (We never did get to try adding the GLR
mode...).

-- Paul