My current grammar is about as complex as that of K&R C, and parsing a file now takes up to 10 seconds. I'd like some tips on how to restructure my grammar to speed up parsing. I imagine you want rules to match as early as possible, to cut down on iterations and backtracking.
This is my first big project in Python, so I don't have any optimization experience with the language. Is it possible to speed pyparsing up with techniques like pre-allocation, or is pyparsing already about as fast as it can be?
Not having seen your specific grammar, I can only make some general suggestions:
1. Use '^' (Or) sparingly; prefer '|' (MatchFirst) and the oneOf helper so that alternations short-circuit. '^' has to evaluate every alternative before choosing the longest match - sometimes that behavior is what you need, but often it isn't.
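A minimal sketch of the difference, using a hypothetical set of keywords (assumes the pyparsing package is installed):

```python
from pyparsing import Literal, oneOf

# '^' (Or) tries every alternative, then keeps the longest match:
slow_keyword = Literal("if") ^ Literal("ifdef") ^ Literal("ifndef")

# '|' (MatchFirst) stops at the first alternative that matches, so put
# longer literals first to keep a short match from "stealing" the input:
fast_keyword = Literal("ifndef") | Literal("ifdef") | Literal("if")

# oneOf does the longest-first ordering for you:
fastest_keyword = oneOf("if ifdef ifndef")

print(fast_keyword.parseString("ifdef")[0])      # -> ifdef
print(fastest_keyword.parseString("ifndef")[0])  # -> ifndef
```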
2. Use repetition instead of recursion for lists. Many BNFs define "one or more" using something like:
. listOfAs = A | A + listOfAs
To implement this directly in pyparsing requires Forward to set up a recursive grammar:
. listOfAs = Forward()
. listOfAs << ( Literal("A") ^ Literal("A") + listOfAs )
(Note the parentheses: '<<' binds more tightly than '^', so without them the Forward would be assigned only the first Literal.)
This is *much* more efficiently implemented as:
. listOfAs = OneOrMore( Literal("A") )
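The two forms side by side, as a runnable sketch (assumes pyparsing is installed; '|' is used in the recursive form, with the longer alternative first so it can recurse at all):

```python
from pyparsing import Forward, Literal, OneOrMore

# Recursive form, straight from the BNF -- works, but every extra "A"
# adds another level of nested match attempts:
recursive = Forward()
recursive << (Literal("A") + recursive | Literal("A"))

# Repetition form -- a single expression that loops internally:
repeated = OneOrMore(Literal("A"))

print(recursive.parseString("A A A").asList())  # -> ['A', 'A', 'A']
print(repeated.parseString("A A A").asList())   # -> ['A', 'A', 'A']
```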
3. Use Optional instead of "or empty". Another BNF idiom is this:
. optionalA = "A" | empty
I first saw this in a grammar where it was less obvious: an optional argument was written as
. arg = argDefn | empty
The pyparsing form for this is:
. arg = Optional( argDefn )
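To make this concrete, here is a small sketch with a hypothetical argDefn (a bare identifier) inside a function-call expression whose single argument may be omitted (assumes pyparsing is installed):

```python
from pyparsing import Optional, Suppress, Word, alphas

argDefn = Word(alphas)               # hypothetical: argument is an identifier
arg = Optional(argDefn, default="")  # matches argDefn, or matches nothing

call = Word(alphas) + Suppress("(") + arg + Suppress(")")

print(call.parseString("f(x)").asList())  # -> ['f', 'x']
print(call.parseString("f()").asList())   # -> ['f', '']
```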
4. Use psyco. This will shave about 30-40% off your parsing time, especially with a complex grammar. If you are writing your grammar for a platform where you aren't sure psyco is available, guard the import so the grammar still runs without it:
. try:
.     import psyco
.     psyco.full()
. except ImportError:
.     pass
Back around version 1.0.4, I did a lot of performance tuning on pyparsing with the hotshot profiler to wring out what I could. (I've also wanted to keep pyparsing implemented in pure Python, although a Pyrex version could be an option.) Probably the biggest performance improvement was pre-allocating exception objects when ParserElements are constructed. It turns out that each element raises one particular exception whenever it fails to match, varying only in the location value, and I was creating and destroying those exceptions by the *thousands*.

The second biggest improvement was using startswith to avoid creating a string slice in the Literal class, which is *very* heavily used (a suggestion by Dan Griffith). Since then I've looked at other seemingly obvious performance opportunities, but they haven't panned out. pyparsing is a type of parser called a "combinator", and raw performance is simply one of the areas where combinators are comparatively weak.
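The startswith trick can be shown in isolation; this is a sketch of the idea, not pyparsing's actual source:

```python
instring = "keyword and more text"
loc = 0
match = "keyword"

# Slicing allocates a new string object on every match attempt:
slow = instring[loc:loc + len(match)] == match

# startswith tests in place, with no intermediate allocation:
fast = instring.startswith(match, loc)

print(slow, fast)  # -> True True
```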
If you want to e-mail me your grammar, I'll try to see if there are any other obvious things you could improve.
Thanks for the reply! I'll update my grammar with those tips in mind. My grammar is still being finalized, but once it is done I can mail it to you.