My current grammar is about as complex as K&R C, and parsing a file now takes up to 10 seconds. I'd like some tips on how I can optimize my grammar to improve parsing speed. I imagine you want rules to match as early as possible to cut down on iterations.
This is my first big project in Python, so I don't have any optimization experience with this language. Would it be possible to speed up pyparsing with techniques like pre-allocation, or is pyparsing already as fast as it can be?
-Klaas
Not having seen your specific grammar, I can only make some general suggestions:
1. Use '^' sparingly; try to use '|' and the oneOf helper so that alternations will short-circuit ('^' has to evaluate all alternatives before choosing the longest match - but sometimes this is unavoidable).
2. Use repetition instead of recursion for lists. Many BNFs define "one or more" using something like:
    listOfAs = A | A + listOfAs
To implement this directly in pyparsing requires Forward to set up a recursive grammar:
    listOfAs = Forward()
    listOfAs << (Literal("A") ^ Literal("A") + listOfAs)
(The parentheses matter: '<<' binds more tightly than '^' in Python, so without them only the first alternative would be assigned into the Forward.)
This is *much* more efficiently implemented as:
    listOfAs = OneOrMore(Literal("A"))
3. Use Optional instead of "or empty". Another BNF idiom is this:
    optionalA = "A" | empty
I first saw this in a grammar where it was not so obvious, where an optional argument was shown as
    arg = argDefn | empty
The pyparsing form for this is:
    arg = Optional(argDefn)
4. Use psyco. This will shave about 30-40% off your parsing time, especially in a complex grammar. If you are writing your grammar for a platform where you aren't sure whether psyco is available, use this form:
    try:
        import psyco
        psyco.full()
    except ImportError:
        pass
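Taken together, tips 1-3 can be sketched in a few lines. This is a minimal illustration assuming a recent pyparsing is installed; the names keyword, listOfAs, and arg are just placeholders echoing the examples above:

```python
from pyparsing import Literal, OneOrMore, Optional, Word, nums, oneOf

# Tip 1: oneOf builds a short-circuiting alternation ('|' semantics),
# so matching stops at the first alternative that succeeds.
keyword = oneOf("int char float")

# Tip 2: repetition instead of recursion for "one or more".
listOfAs = OneOrMore(Literal("A"))

# Tip 3: Optional instead of "X or empty".
arg = Optional(Word(nums))

print(listOfAs.parseString("A A A").asList())  # ['A', 'A', 'A']
print(keyword.parseString("char").asList())    # ['char']
print(arg.parseString("").asList())            # []
```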
Back around version 1.0.4 I did a lot of performance tuning on pyparsing with the hotshot profiler to wring out what I could (I've also wanted to keep pyparsing implemented in pure Python, although I think a Pyrex version could be an option). Probably the biggest performance improvement was the pre-allocation of exception objects when ParserElements are constructed. It turns out that each element raises one particular exception whenever it fails to match, varying only in the location value, and I was creating and destroying them by the *thousands*. The second biggest improvement was using startswith to avoid creating a string slice in the Literal class, which is *very* heavily used (a suggestion by Dan Griffith). Since then I've looked at some other obvious performance opportunities, but they haven't panned out. pyparsing is a type of parser called a "combinator", and raw performance is just one of those things that combinators are less strong at.
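The two internal optimizations described above can be illustrated with a toy, stdlib-only sketch. This is not pyparsing's actual implementation - ParseFailure and ToyLiteral are invented names for illustration:

```python
text = "spam and eggs " * 1000

# Slice-based matching allocates a temporary string on every attempt...
def match_slice(text, loc, lit):
    return text[loc : loc + len(lit)] == lit

# ...while str.startswith with a start offset does not.
def match_startswith(text, loc, lit):
    return text.startswith(lit, loc)

class ParseFailure(Exception):
    pass

class ToyLiteral:
    def __init__(self, lit):
        self.lit = lit
        # Pre-allocate one exception per element, instead of constructing
        # a fresh exception object on every failed match.
        self._exc = ParseFailure("expected " + lit)

    def try_match(self, text, loc):
        if not text.startswith(self.lit, loc):
            raise self._exc  # reused object; only the location varies
        return loc + len(self.lit)

elem = ToyLiteral("spam")
print(elem.try_match(text, 0))  # 4
```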
If you want to e-mail me your grammar, I'll try to see if there are any other obvious things you could improve.
-- Paul
Paul,
Thanks for the reply! I'll update my grammar with those tips in mind. It's still being finalized, but once it's done I can mail it to you.
Cheers,
Klaas