Performance on parsing BibTex

2009-05-17
2013-05-14
  • Joseph Reagle
    2009-05-17

    I recently noticed that a script I use was getting laggy. It was taking almost 2s to parse my very large BibTeX file using bibstuff (which uses SimpleParse). I thought, why not write something in pyparsing and see how that fares? It took about 60s! (Though I'm sure I've done some clumsy stuff, including the need to flatten.) So I thought, what if I just wrote a simple regexp BibTeX parser? 0.24s!

    I then read that psyco and packrat can speed things up, so I report the various timings below. All of these functions return dict[ident][field] = value. My BibTeX is machine generated, so I can make a lot of simplifying assumptions in the regexp parser.

    regexGet = 0.24 seconds
    bibstuffGet = 1.75 seconds
    pypGet = 59.83 seconds
    pypGet+psyco = 53.20 seconds
    pypGet+packrat = 102.27 seconds
    pypGet+psyco+packrat = 96.97 seconds

    The test code is here:
    http://reagle.org/joseph/2009/05/parsing-tests.py
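
As a rough illustration of the regexp approach described above, here is a minimal sketch that returns dict[ident][field] = value. It is not the regexGet from the linked test code; the names `regex_get`, `ENTRY_RE`, and `FIELD_RE` are hypothetical, and it assumes machine-generated BibTeX with one `field = {value},` per line and the closing brace on a line of its own.

```python
import re

# Assumption: each entry looks like
#   @kind{ident,
#      field = {value},
#   }
# with one field per line and the closing "}" alone on its line.
ENTRY_RE = re.compile(
    r'@(\w+)\{([^,\s]+),\s*\n(.*?)\n\s*\}\s*$',
    re.DOTALL | re.MULTILINE,
)
# The greedy (.*) keeps inner braces, so {{User:172}} yields "{User:172}".
FIELD_RE = re.compile(r'^\s*(\w+)\s*=\s*\{(.*)\},?\s*$', re.MULTILINE)


def regex_get(text):
    """Return a nested dict: entries[ident][field] = value."""
    entries = {}
    for _kind, ident, body in ENTRY_RE.findall(text):
        entries[ident] = dict(FIELD_RE.findall(body))
    return entries


sample = (
    "@online{1722005u1,\n"
    "   author = {172, User},\n"
    "   title = {{User:172}},\n"
    "   year = {2005},\n"
    "}\n"
    "@article{zenspider2007bln1,\n"
    "   author = {zenspider},\n"
    "   year = {2007},\n"
    "}\n"
)
entries = regex_get(sample)
print(entries["1722005u1"]["author"])
```

Hand-written BibTeX (multi-line values, quoted strings, `@string` macros) would break these assumptions, which is the trade-off for the speed.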

     
    • Paul McGuire
      2009-05-17

      I'm realizing that I should offer a non-grouping form of nestedExpr, that just returns the entire nested expression as a single string.

      Try my modified version of your test program: http://pyparsing.pastebin.com/m5dc1ef86

      -- Paul

       
    • Joseph Reagle
      2009-05-18

      For some reason, that version chokes on the first bibentry:

      test = '''@online{1722005u1,
         author = {172, User},
         shorttitle = {{User:172} (Version 10702240)},
         title = {{User:172}},
         day = {2},
         year = {2005},
         urldate = {2005-03-24},
         url = {http://en.wikipedia.org/w/index.php?title=User:172&oldid=10702240},
         month = {3},
         custom1 = {20050324},
         organization = {Wikimedia},
      }
      @article{zenspider2007bln1,
         author = {zenspider},
         title = {{Burnout} and the Late Night Rant},
         day = {28},
         year = {2007},
         urldate = {2007-03-01 14:26Z},
         url = {http://blog.zenspider.com/archives/2007/02/burnout_and_the_late_night_rant.html},
         journal = {Polishing Ruby},
         month = {2},
         custom1 = {20070301 14:26 UTC},
         keyword = {frustration},
      }
      '''
      Traceback (most recent call last):
        File "./parsing-tests-paul.py", line 177, in <module>
          print "pypGet = %.2f seconds\n" % float(tp.timeit(number=rep)/rep)
        File "/usr/lib/python2.5/timeit.py", line 161, in timeit
          timing = self.inner(it, self.timer)
        File "<timeit-src>", line 6, in inner
        File "./parsing-tests-paul.py", line 125, in pypGet
          bibtex = pypParse(text)
        File "./parsing-tests-paul.py", line 118, in pypParse
          bibtex_file.parseString(text)
        File "/home/reagle/bin/lib/python2.5/site-packages/pyparsing-1.5.2-py2.5.egg/pyparsing.py", line 1076, in parseString
          raise exc
      pyparsing.ParseException: Expected "}" (at char 48), (line:3, col:4)

      btw: I see you moved a packrat test before psyco, but that didn't make a difference in timing, except to hurt the subsequent psyco test.

       
      • Paul McGuire
        2009-05-18

        Well, that would be because of my bug in nestedExprWithoutGrouping!  Now things run much better (http://pyparsing.pastebin.com/f4f01dba9).

        Some other notes:
        - ignoreExpr should not be set to None.  I assume that if you had an article with the title "How to use a }", that should not be interpreted as a closing brace.  Now that I am using originalTextFor, the ignore expression won't suppress the returned text.
        - Your setKey/setValue parse actions are a valiant stab at converting a list of pairs into a dict-like set of entries.  If only you had some pyparsing documentation, the Dict class might have suggested an alternative.  The modified version uses results names and a Dict to dynamically give named results in the returned ParseResults structure.  By calling dump() we get:

        [u'article', '{', u'Ball2007mws',...
        - data: [[u'author', u'Ball, Philip'], [u'shorttitle',...
          - author: Ball, Philip
          - custom1: 20070301
          - journal: news@nature.com
          - month: 2
          - shorttitle: {The} More, the Wikier
          - url: http://www.nature.com/news/2007/070226/full/news070226-6.html
          - year: 2007
        - ident: Ball2007mws

        And you can reference these fields like dict entries (results["ident"], or results["data"]["author"]) or like object attributes (results.ident or results.data.author).
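
The results-name and Dict approach described above might look roughly like the following. This is only a sketch, not the code from the pastebin link: the grammar names (`value`, `pair`, `entry`) and the sample text are assumptions made for illustration, and it relies on the default ignoreExpr of nestedExpr, so values containing stray quote characters may need different handling.

```python
from pyparsing import (Dict, Group, OneOrMore, Optional, Suppress, Word,
                       alphanums, alphas, nestedExpr, originalTextFor)

LBRACE, RBRACE, EQ, COMMA = map(Suppress, "{}=,")

# originalTextFor returns the whole braced value as one string instead of
# nested lists; the added parse action strips only the outermost braces.
value = originalTextFor(nestedExpr("{", "}"))
value.addParseAction(lambda t: t[0][1:-1])

# Each pair is grouped as [field, value] so that Dict can use the first
# token of every group as a key.
pair = Group(Word(alphas, alphanums) + EQ + value + Optional(COMMA))

entry = (Suppress("@") + Word(alphas)("kind") + LBRACE
         + Word(alphanums)("ident") + COMMA
         + Dict(OneOrMore(pair))("data") + RBRACE)

sample = """@article{Ball2007mws,
   author = {Ball, Philip},
   shorttitle = {{The} More, the Wikier},
   year = {2007},
}"""

result = entry.parseString(sample)
print(result.dump())
print(result["ident"], result["data"]["author"])
```

Calling `result["data"].asDict()` should then give a plain dictionary of the fields for one entry.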

        -- Paul

         
    • Joseph Reagle
      2009-05-18

      Thanks so much for your response Paul.

      1. I see that you moved the nestedExpr from the outer structure to just the values, which makes sense.
      2. I was using ignoreExpr because, even though the open/close chars I used were "{}", I believe the function still uses the default quotedString, which caused me problems with a field like:
        title = {'The {Spinster} and the {Prophet'} by {A.B.} {McKillop}},
      (I shouldn't be creating BibTeX like that, but for the meantime...) So I turned it back on.
      3. Is there any difference in setting a results name at the assignment versus within an expression? e.g.:
        pairs = Dict(OneOrMore(pair))("data")
      versus
        pairs = Dict(OneOrMore(pair)) ... + pairs("data") + '}')
      4. Trust me, I've read HowToUsePyparsing, PyCon2006, module docs, and OnLamp many times -- not that I understand it :) -- and did want to use Dict(), but still can't figure out how to implement what I want directly. (BTW: the HowToUsePyparsing on wikispaces is out of date.) Part of this goes back to my continuing confusion about the structure of the ParseResults given a particular grammar. In any case, I thought constructing the dictionary from within the parse actions would be faster than creating it later, and it made it easy for me to debug -- which is what I do now:

              entries = {}
              for result in results:
                  entries[result.ident] = {}
                  for field, value in result.data:
                      entries[result.ident][field] = value
              return entries
      5. Results with current code http://reagle.org/joseph/2009/05/parsing-tests.py:
      regexGet = 0.22 seconds
      bibstuffGet = 1.63 seconds
      pypGet = 48.83 seconds
      pypGet+psyco = 41.12 seconds
      pypGet+packrat = 81.06 seconds
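
For anyone wanting to reproduce numbers in this format, a timing harness along these lines could be used. It is a sketch built on the standard timeit module, in the spirit of the `tp.timeit(number=rep)/rep` line visible in the traceback above; the helper names `average_seconds` and `report` are hypothetical and are not taken from the linked test script.

```python
import timeit


def average_seconds(func, text, repeats=3):
    """Return the mean wall-clock time of func(text) over `repeats` calls."""
    timer = timeit.Timer(lambda: func(text))
    return timer.timeit(number=repeats) / repeats


def report(name, func, text, repeats=3):
    """Print and return a line such as 'regexGet = 0.24 seconds'."""
    line = "%s = %.2f seconds" % (name, average_seconds(func, text, repeats))
    print(line)
    return line


# Stand-in for a real parser function, used here only so the sketch runs.
line = report("lenGet", len, "@article{x,\n}\n", repeats=2)
```

Each parser would be passed in place of `len`, with the full BibTeX file contents as `text`.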