request for parsing speed optimization tips

  • Klaas Hofstra

    Klaas Hofstra - 2005-03-24


    My current grammar is about as complex as K&R C. Parsing files does now take up to 10 seconds. I'd like some tips about how I can optimize my grammar in order to optimize parsing speed. I can imagine that you'd like rules to match asap to cut down iterations and stuff.

    This is my first big project in Python so I don't have any optimization experience with this language. Would it be possible to optimize pyparsing by doing stuff like pre-allocation and such or is pyparsing as fast as it can be?


    • Paul McGuire

      Paul McGuire - 2005-03-25

      Not having seen your specific grammar, I can only make some general suggestions:
      1. Use '^' sparingly, try to use '|' and the oneOf helper so that alternations will short-circuit ('^' has to evaluate all paths before choosing the longest - but sometimes this is unavoidable).
      2. Use repetition instead of recursion for lists.  Many BNFs define "one or more" using something like:
      .    listOfAs = A | A + listOfAs
      To implement this directly in pyparsing requires Forward to set up a recursive grammar:
      .    listOfAs = Forward()
      .    listOfAs << Literal("A") ^ Literal("A") + listOfAs
      This is *much* more efficiently implemented as:
      .    listOfAs = OneOrMore( Literal("A") )
      3. Use Optional instead of "or empty".  Another BNF idiom is this:
      .    optionalA = "A" | empty
      I first saw this in a grammar where it was not so obvious, where an optional argument was shown as
      .    arg = argDefn | empty
      The pyparsing form for this is:
      .    arg = Optional( argDefn )
      4. Use psyco.  This will shave about 30-40% off your parsing time, especially in a complex grammar.  If you are writing your grammar for a platform where you aren't sure if psyco is available, use this form:
      .    import psyco
      .    psyco.full()
      .    pass

      Back around version 1.0.4 I did a lot of performance tuning to pyparsing with the hotshot profiler to try to wring out what I could (I've also wanted to keep pyparsing implemented in pure Python, although I think a pyrex version could be an option.)  Probably the biggest performance improvement was the pre-allocation of exception objects when ParserElements are constructed.  It turns out that each element has a particular exception that it raises whenever it fails to match, varying only in the location value, and I was creating and destroying them by the *thousands*.  The second biggest perf improvement was using startswith to avoid creating a string slice in the Literal class, which is *very* heavily used (suggestion by Dan Griffith). Since then I've looked at some other obvious performance opportunities, but they haven't panned out.  pyparsing is a parser type called a "combinator" and raw performance is just one of those things that combinators are less strong.

      If you want to e-mail me your grammar, I'll try to see if there are any other obvious things you could improve.

      -- Paul

      • Klaas Hofstra

        Klaas Hofstra - 2005-03-25


        Thanks for the reply! I'll update my grammar with those tips in mind. My grammar is still being finalized, but once it is done I can mail it to you.




Log in to post a comment.