Speed optimization tips (again)

2005-06-04
2013-05-14
  • Hello!
    First I want to say that pyparsing is very intuitive and easy to use. This is the first time I written a parser/searcher and I manage to do a prototype in just 10 minutes!

    But the performance was not equally good. Parsing a large (14000 lines) c header file takes 6 minutes, so I want some optimization tips for this implementation:

    def get_atomic_typedefs(filename):
      """
      Scan and get typedefs that are atomic (not composite types).
      signed and unsigned integers of different sizes and enums are extracted.
      Returns a dictionary where key is the name and data is the representation.
      """
      fh = file(filename, 'r')

      typedef_name = Word(alphanums + '_')
      integer_typedef = Keyword('typedef') + Optional(Keyword('unsigned')) + \                    oneOf("char short int long") + typedef_name
      enum_typedef = Keyword('typedef') + Keyword('enum') + typedef_name
      atomic_typedef = integer_typedef | enum_typedef

      typedef_dic = {}
      text = fh.read()
      atomic_typedef.ignore(cStyleComment)
      for typedef,start,end in atomic_typedef.scanString(text):
        if typedef[1] == 'enum':
          typedef_dic[typedef[2]] = 'int' # ANSI-C represents enum with int
        else:
          if len(typedef) == 4:
            typedef_dic[typedef[3]] = typedef[1] + ' ' + typedef[2]
          elif len(typedef) == 3:
            typedef_dic[typedef[2]] = typedef[1]

      fh.close()

      return typedef_dic

     
    • Paul McGuire
      Paul McGuire
      2005-06-04

      Henrik -

      Here is a tip that will speed up your parser, but it definitely falls into the "tricks" category.

      Note that your grammar is an alternative of two different 'typedef' declarations.  The scanString() method indexes through the input, trying to match the grammar at successive places in the input string.  Your grammar checks for 'typedef' <integer type definition> and then for 'typedef' <enum definition>.  It would be nice to be able to short-circuit this test if the current string position doesn't even start with 'typedef'.

      You can do this by modifying your definition of atomic_typedef to:

        atomic_typedef = FollowedBy('typedef') + ( integer_typedef | enum_typedef )

      This way, you avoid the more expensive integer_typedef and enum_typedef tests if the current string position doesn't start with 'typedef'.

      This wasn't a whirlwind improvement, it cut my test from ~2 minutes to ~1 minute, so about 50%.  Give it a try with your test data and see what you find.

      (Meanwhile, I'm also looking at ways to improve scanString's performance based on this test, but my initial attempts are "close but not quite right."  I'll let you know if I have any scanString breakthroughs.)

      -- Paul

       
      • Paul McGuire
        Paul McGuire
        2005-06-06

        Okay, I made a slight enhancement to scanString()'s logic, that my tests show about another 20% performance improvement.  It's not 10X or anything, but 20% is better than nothing.

        I'll release it in 1.3.1, due out "real soon now."

        -- Paul

         
    • Paul McGuire
      Paul McGuire
      2005-06-04

      Also, I think you have a bug in your grammar - aren't enums of the form

      typedef enum { enum values list } name;

      Here is the modified pyparsing source to handle this.  Note that I am suppressing the enum list itself, as it seems to be secondary to your goal of finding the typedef's.  Also by suppressing the returned tokens, your existing code will continue to work without changes.

        enumDef = '{' + SkipTo('}') + '}'
        enum_typedef = Keyword('typedef') + Keyword('enum') + enumDef.suppress() + typedef_name

      -- Paul

       
    • Paul-
      Your grammar suggestion and version 1.3.1 resulted in the following speed improvement:

      with 1.3 and old gramar:
      404.81u 0.10s 6:46.03 99.7%

      with 1.3 and new gramar:
      259.48u 0.06s 4:20.82 99.5%

      with 1.3.1 and new gramar:
      211.49u 0.17s 3:33.76 99.0%

      Thats nearly half of the original time! Thanks Pa(u)l! =)

       
      • Paul McGuire
        Paul McGuire
        2005-06-14

        Wunderbar!

        And to think that about this time last year, I thought I had nothing else to optimize in pyparsing!

        I hope you noticed that I mentioned in the change log that it was your work and questions that helped me find this enhancement.

        Best of luck to you!

        -- Paul