Python parsing module / Discussion / Help/Open Discussion: Performance on parsing BibTex

Joseph Reagle - 2009-05-17

I recently noted that a script I use was getting kind of laggy. It was taking almost 2s to parse my very large bibtex file using bibstuff (which uses SimpleParse). I thought, why not write something in pyparsing and see how that fares? It took about 60s! (Though I'm sure I've done some clumsy stuff, including the need to flatten.) So I thought, what if I just wrote a simple regexp bibtex parser? 0.24s!

I then read that psyco and packrat can speed things up, so I report the various timings below. All of these functions return dict[ident][field] = value. My bibtex is machine generated, so I can make a lot of simplifying assumptions in the regexp parser.

regexGet = 0.24 seconds
bibstuffGet = 1.75 seconds
pypGet = 59.83 seconds
pypGet+psyco = 53.20 seconds
pypGet+packrat = 102.27 seconds
pypGet+psyco+packrat = 96.97 seconds
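
For anyone reproducing figures like these, here is a minimal sketch of averaging a parse function over a few runs with the standard timeit module. The helper names `average_seconds` and `report` are hypothetical and are not taken from the linked test script; the sketch is written for Python 3, whereas the thread's script ran on Python 2.5.

```python
import timeit


def average_seconds(func, *args, rep=3):
    """Average seconds per call over `rep` runs (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(*args))
    return timer.timeit(number=rep) / rep


def report(label, func, *args, rep=3):
    # Same "name = N.NN seconds" layout as the figures in the post.
    return "%s = %.2f seconds" % (label, average_seconds(func, *args, rep=rep))
```

Calling `report("regexGet", some_parser, text)` would yield one line in the same layout, where `some_parser` stands for whichever function is being timed.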

The test code is here:
http://reagle.org/joseph/2009/05/parsing-tests.py
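
To illustrate the kind of shortcut a regexp parser can take on machine-generated bibtex, here is a rough sketch. It is not the code from the linked script. It assumes every entry starts with `@type{ident,` on its own line and every field sits on a single line as `name = {value},`; the names `ENTRY_RE`, `FIELD_RE` and `regex_get_sketch` are made up for this example.

```python
import re

# Assumption: "@type{ident," opens an entry on its own line.
ENTRY_RE = re.compile(r"@(\w+)\{([^,\s]+),")
# Assumption: each field is a single line of the form "name = {value},".
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\},?\s*$")


def regex_get_sketch(text):
    """Return entries[ident][field] = value for well-behaved bibtex."""
    entries = {}
    current = None  # field dict of the entry being read, if any
    for line in text.splitlines():
        entry = ENTRY_RE.match(line.strip())
        if entry:
            current = entries.setdefault(entry.group(2), {})
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            # Only the outermost braces are stripped; inner ones stay.
            current[field.group(1)] = field.group(2)
    return entries
```

Hand-written bibtex with multi-line values, quoted values or string concatenation would break these assumptions, which is the price of the speed.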

- Paul McGuire - 2009-05-17
  
  I'm realizing that I should offer a non-grouping form of nestedExpr, that just returns the entire nested expression as a single string.
  
  Try my modified version of your test program: http://pyparsing.pastebin.com/m5dc1ef86
  
  -- Paul
  
- Joseph Reagle - 2009-05-18
  
  For some reason, that version chokes on the first bibentry:
  
  test = '''@online{1722005u1,
     author = {172, User},
     shorttitle = {{User:172} (Version 10702240)},
     title = {{User:172}},
     day = {2},
     year = {2005},
     urldate = {2005-03-24},
     url = {http://en.wikipedia.org/w/index.php?title=User:172&oldid=10702240},
     month = {3},
     custom1 = {20050324},
     organization = {Wikimedia},
  }
  @article{zenspider2007bln1,
     author = {zenspider},
     title = {{Burnout} and the Late Night Rant},
     day = {28},
     year = {2007},
     urldate = {2007-03-01 14:26Z},
     url = {http://blog.zenspider.com/archives/2007/02/burnout_and_the_late_night_rant.html},
     journal = {Polishing Ruby},
     month = {2},
     custom1 = {20070301 14:26 UTC},
     keyword = {frustration},
  }
  '''
  Traceback (most recent call last):
  File "./parsing-tests-paul.py", line 177, in <module>
      print "pypGet = %.2f seconds\n" % float(tp.timeit(number=rep)/rep)
  File "/usr/lib/python2.5/timeit.py", line 161, in timeit
      timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "./parsing-tests-paul.py", line 125, in pypGet
      bibtex = pypParse(text)
  File "./parsing-tests-paul.py", line 118, in pypParse
      bibtex_file.parseString(text)
  File "/home/reagle/bin/lib/python2.5/site-packages/pyparsing-1.5.2-py2.5.egg/pyparsing.py", line 1076, in parseString
      raise exc
  pyparsing.ParseException: Expected "}" (at char 48), (line:3, col:4)
  
  btw: I see you moved a packrat test before psyco, but that didn't make a difference in timing, except to hurt the subsequent psyco test.
  
  - Paul McGuire - 2009-05-18
    
    Well, that would be because of my bug in nestedExprWithoutGrouping! Now things run much better (http://pyparsing.pastebin.com/f4f01dba9).
    
    Some other notes:
    - ignoreExpr should not be set to None. I assume that if you had an article with the title "How to use a }", that should not be interpreted as a closing brace. Now that I am using originalTextFor, the ignore expression won't suppress the returned text.
    - Your setKey/setValue parse actions are a valiant stab at converting a list of pairs into a dict-like set of entries. If only you had some pyparsing documentation, the Dict class might have suggested an alternative. The modified version uses results names and a Dict to dynamically give named results in the returned ParseResults structure. By calling dump() we get:
    
    [u'article', '{', u'Ball2007mws',...
    - data: [[u'author', u'Ball, Philip'], [u'shorttitle',...
    - author: Ball, Philip
    - custom1: 20070301
    - journal: news@nature.com
    - month: 2
    - shorttitle: {The} More, the Wikier
    - url: http://www.nature.com/news/2007/070226/full/news070226-6.html
    - year: 2007
    - ident: Ball2007mws
    
    And you can reference these fields like dict entries (results["ident"], or results["data"]["author"]) or like object attributes (results.ident or results.data.author).
    
    -- Paul
    
- Joseph Reagle - 2009-05-18
  
  Thanks so much for your response Paul.
  
  1. I see that you moved the nestedExpr from the outer structure to just the values; that makes sense.
  2. I was using ignoreExpr because, even though the open/close chars I was using were "{}", I believe the function still uses the default quotedString, which caused me problems with a field like:
  title = {'The {Spinster} and the {Prophet'} by {A.B.} {McKillop}},
  (I shouldn't be creating bibtex like that, but for the meantime...) So I turned it back on.
  3. Is there any difference between setting a results name at the assignment versus within an expression? e.g.:
  pairs = Dict(OneOrMore(pair))("data")
  versus
  pairs = Dict(OneOrMore(pair)) ... + pairs("data") + '}')
  4. Trust me, I've read HowToUsePyparsing, PyCon2006, module docs, and OnLamp many times -- not that I understand it :) -- and did want to use Dict(), but still can't figure out how to implement what I want directly. (BTW: the HowToUsePyparsing on wikispaces is out of date.) Part of this goes back to my continuing confusion about the structure of the ParseResults given a particular grammar. In any case, I thought constructing the dictionary from within the parse actions would be faster than creating a dictionary later, and it made it easy for me to debug. Creating the dictionary afterwards is what I do now:
  
          entries = {}
          for result in results:
              entries[result.ident] = {}
              for field, value in result.data:
                  entries[result.ident][field] = value
          return entries
  5. Results with current code http://reagle.org/joseph/2009/05/parsing-tests.py:
  regexGet = 0.22 seconds
  bibstuffGet = 1.63 seconds
  pypGet = 48.83 seconds
  pypGet+psyco = 41.12 seconds
  pypGet+packrat = 81.06 seconds
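
The disagreement over ignoreExpr in point 2 comes down to whether braces inside quoted text should count. Here is a small standard-library sketch of that trade-off. It is a stand-in for illustration only, not pyparsing's implementation, and `find_closing_brace` is a hypothetical helper.

```python
def find_closing_brace(text, start=0, skip_quotes=True):
    """Return the index of the brace closing the '{' at text[start].

    With skip_quotes=True, braces inside '...' or "..." are not counted,
    which loosely mimics ignoring quoted strings while matching.
    """
    if text[start] != "{":
        raise ValueError("text[start] must be an opening brace")
    depth = 0
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None  # the quoted run ends here
        elif skip_quotes and ch in "'\"":
            quote = ch  # a quoted run starts; braces in it are ignored
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")


quoted_brace = '{How to use a "}"}'
stray_quotes = "{'The {Spinster} and the {Prophet'} by {A.B.} {McKillop}}"
```

With quote skipping on, `quoted_brace` is matched to its final brace, but `stray_quotes` is cut off right after `{Prophet'}` because the apostrophes are treated as a quoted string that hides unbalanced braces. With quote skipping off, the results swap: `stray_quotes` matches in full and `quoted_brace` ends early at the brace inside the double quotes.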
  