I recently noticed that a script I use was getting kind of laggy. It was taking almost 2s to parse my very large bibtex file using bibstuff (which uses SimpleParse). I thought, why not write something in pyparsing and see how that fares? It took about 60s! (Though I'm sure I've done some clumsy stuff, including the need to flatten.) So I thought, what if I just wrote a simple regexp bibtex parser? 0.24s!
I then read that psyco and packrat can speed things up, so I report the various timings below. All of these functions return dict[ident][field] = value. My bibtex is machine generated, so I can make a lot of simplifying assumptions in the regexp parser.
regexGet = 0.24 seconds
bibstuffGet = 1.75 seconds
pypGet = 59.83 seconds
pypGet+psyco = 53.20 seconds
pypGet+packrat = 102.27 seconds
pypGet+psyco+packrat = 96.97 seconds
The test code is here:
http://reagle.org/joseph/2009/05/parsing-tests.py
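For readers without access to the test code, here is a minimal sketch of what a regexp parser for machine-generated bibtex might look like. It is not the regexGet function from the linked script; the names ENTRY_RE, FIELD_RE and regex_parse are made up for illustration, and it assumes one "name = {value}," field per line with the closing brace of each entry on its own line.

```python
import re

# Assumption: each entry looks like "@type{ident," followed by one field
# per line and a closing "}" on a line of its own (machine-generated input).
ENTRY_RE = re.compile(r"@(\w+)\{([^,\s]+),\s*\n(.*?)\n\}", re.DOTALL)

# Assumption: each field looks like "name = {value}," on a single line.
FIELD_RE = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*\{(.*)\},?[ \t]*$", re.MULTILINE)


def regex_parse(text):
    """Return entries[ident][field] = value for a bibtex string."""
    entries = {}
    for _entry_type, ident, body in ENTRY_RE.findall(text):
        # findall yields (field, value) tuples, which dict() accepts directly.
        entries[ident] = dict(FIELD_RE.findall(body))
    return entries


SAMPLE = """@online{1722005u1,
author = {172, User},
title = {{User:172}},
year = {2005},
}
@article{zenspider2007bln1,
author = {zenspider},
title = {{Burnout} and the Late Night Rant},
year = {2007},
}
"""

parsed = regex_parse(SAMPLE)
print(parsed["zenspider2007bln1"]["title"])
```

Because the value pattern is greedy up to the last closing brace on the line, nested braces inside a value survive untouched. That only works under the one-field-per-line assumption; hand-written bibtex would need a real parser.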
I'm realizing that I should offer a non-grouping form of nestedExpr, that just returns the entire nested expression as a single string.
Try my modified version of your test program: http://pyparsing.pastebin.com/m5dc1ef86
-- Paul
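The pastebin version is not reproduced here, but the idea of a non-grouping nested expression can be sketched with two helpers that pyparsing does ship, nestedExpr and originalTextFor. The name braced_value is made up for this example; it is not Paul's nestedExprWithoutGrouping.

```python
from pyparsing import nestedExpr, originalTextFor

text = "{{Burnout} and the Late Night Rant}"

# Plain nestedExpr returns nested lists of tokens, which then need flattening.
grouped = nestedExpr("{", "}").parseString(text)

# Wrapping it in originalTextFor returns the matched source text instead,
# as a single string that still includes the outer braces.
braced_value = originalTextFor(nestedExpr("{", "}"))
flat = braced_value.parseString(text)

print(grouped.asList())
print(flat[0])
```

The second form avoids the flattening step entirely, since the nested structure is only used to find where the value ends.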
For some reason, that version chokes on the first bibentry:
test = '''@online{1722005u1,
author = {172, User},
shorttitle = {{User:172} (Version 10702240)},
title = {{User:172}},
day = {2},
year = {2005},
urldate = {2005-03-24},
url = {http://en.wikipedia.org/w/index.php?title=User:172&oldid=10702240},
month = {3},
custom1 = {20050324},
organization = {Wikimedia},
}
@article{zenspider2007bln1,
author = {zenspider},
title = {{Burnout} and the Late Night Rant},
day = {28},
year = {2007},
urldate = {2007-03-01 14:26Z},
url = {http://blog.zenspider.com/archives/2007/02/burnout_and_the_late_night_rant.html},
journal = {Polishing Ruby},
month = {2},
custom1 = {20070301 14:26 UTC},
keyword = {frustration},
}
'''
Traceback (most recent call last):
  File "./parsing-tests-paul.py", line 177, in <module>
    print "pypGet = %.2f seconds\n" % float(tp.timeit(number=rep)/rep)
  File "/usr/lib/python2.5/timeit.py", line 161, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "./parsing-tests-paul.py", line 125, in pypGet
    bibtex = pypParse(text)
  File "./parsing-tests-paul.py", line 118, in pypParse
    bibtex_file.parseString(text)
  File "/home/reagle/bin/lib/python2.5/site-packages/pyparsing-1.5.2-py2.5.egg/pyparsing.py", line 1076, in parseString
    raise exc
pyparsing.ParseException: Expected "}" (at char 48), (line:3, col:4)
btw: I see you moved a packrat test before psyco, but that didn't make a difference in timing, except to hurt the subsequent psyco test.
Well, that would be because of my bug in nestedExprWithoutGrouping! Now things run much better (http://pyparsing.pastebin.com/f4f01dba9).
Some other notes:
- ignoreExpr should not be set to None. I assume that if you had an article with the title "How to use a }", that should not be interpreted as a closing brace. Now that I am using originalTextFor, the ignore command won't suppress the returned text.
- Your setKey/setValue parse actions are a valiant stab at converting a list of pairs into a dict-like set of entries. If only you had some pyparsing documentation, the Dict class might have suggested an alternative. The modified version uses results names and a Dict to dynamically give named results in the returned ParseResults structure. By calling dump() we get:
[u'article', '{', u'Ball2007mws',...
- data: [[u'author', u'Ball, Philip'], [u'shorttitle',...
  - author: Ball, Philip
  - custom1: 20070301
  - journal: news@nature.com
  - month: 2
  - shorttitle: {The} More, the Wikier
  - url: http://www.nature.com/news/2007/070226/full/news070226-6.html
  - year: 2007
- ident: Ball2007mws
And you can reference these fields like dict entries (results["ident"], or results["data"]["author"]) or like object attributes (results.ident or results.data.author).
-- Paul
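A minimal sketch of the results-name plus Dict approach described in this post follows. The grammar is a simplified stand-in written for illustration, not the pastebin version: it assumes every field ends with a comma, and the parse action that strips the outer braces is my own addition.

```python
from pyparsing import (Dict, Group, OneOrMore, Suppress, Word,
                       alphanums, alphas, nestedExpr, originalTextFor)

SAMPLE = """@article{zenspider2007bln1,
author = {zenspider},
title = {{Burnout} and the Late Night Rant},
year = {2007},
}
"""

# A braced value, returned as one string with the outer braces stripped
# (the stripping parse action is an assumption made for this example).
value = originalTextFor(nestedExpr("{", "}")).addParseAction(lambda t: t[0][1:-1])

# Each "name = {value}," pair is grouped so that Dict can use the first
# token of the group as the key and the second as its value.
pair = Group(Word(alphas, alphanums) + Suppress("=") + value + Suppress(","))

entry = (Suppress("@") + Word(alphas)("type")
         + Suppress("{") + Word(alphanums)("ident") + Suppress(",")
         + Dict(OneOrMore(pair))("data")
         + Suppress("}"))

results = entry.parseString(SAMPLE)
print(results.dump())

# Dict-style and attribute-style access reach the same values.
print(results["ident"], results["data"]["author"])
print(results.ident, results.data.title)
```

No setKey/setValue parse actions are needed here; the keys come from the parsed field names at parse time.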
Thanks so much for your response Paul.
1. I see that you moved the nestedExpr from the outer structure to just the values; that makes sense.
2. I was setting ignoreExpr to None because even though the open/close chars I was using were "{}", I believe the function still uses the default quotedString, which causes problems with a field like:
title = {'The {Spinster} and the {Prophet'} by {A.B.} {McKillop}},
(I shouldn't be creating bibtex like that, but for the meantime...) So I turned it back on.
3. Is there any difference between setting a results name at the assignment versus within an expression? e.g.:
pairs = Dict(OneOrMore(pair))("data")
versus
pairs = Dict(OneOrMore(pair)) ... + pairs("data") + '}')
4. Trust me, I've read HowToUsePyparsing, PyCon2006, the module docs, and OnLamp many times -- not that I understand it all :) -- and did want to use Dict(), but still can't figure out how to implement what I want directly. (BTW: the HowToUsePyparsing on wikispaces is out of date.) Part of this goes back to my continuing confusion about the structure of the ParseResults given a particular grammar. In any case, I thought constructing the dictionary from within the parse actions would be faster than creating it later, and it made it easy for me to debug. Creating it afterwards is what I do now:
entries = {}
for result in results:
    entries[result.ident] = {}
    for field, value in result.data:
        entries[result.ident][field] = value
return entries
5. Results with current code http://reagle.org/joseph/2009/05/parsing-tests.py:
regexGet = 0.22 seconds
bibstuffGet = 1.63 seconds
pypGet = 48.83 seconds
pypGet+psyco = 41.12 seconds
pypGet+packrat = 81.06 seconds
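For anyone wanting to reproduce this kind of comparison, the per-call averages above can be collected along the lines of the tp.timeit(number=rep)/rep call visible in the earlier traceback. The sketch below is not the test script itself: time_parser and toy_parse are hypothetical names, and toy_parse only stands in for a real function such as regexGet or pypGet.

```python
import timeit


def time_parser(parse_func, text, rep=3):
    """Return the average number of seconds per call of parse_func(text)."""
    timer = timeit.Timer(lambda: parse_func(text))
    return timer.timeit(number=rep) / rep


def toy_parse(text):
    # Hypothetical stand-in for a real parser: maps "name" to "value"
    # for every "name = value" line.
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            name, _, value = line.partition("=")
            fields[name.strip()] = value.strip()
    return fields


seconds = time_parser(toy_parse, "author = {zenspider},\nyear = {2007},\n" * 1000)
print("toyParse = %.2f seconds" % seconds)
```

Averaging over several repetitions smooths out one-off costs, though any setup that only happens on the first call (such as building a grammar) will still be spread across the average.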