Re: [Pyparsing] Painfully slow parsing.
Brought to you by:
ptmcg
From: John K. <jkr...@lt...> - 2009-08-31 21:09:08
|
On Sat, 2009-08-29 at 00:00 -0500, Paul McGuire wrote: > > -----Original Message----- > > From: John Krukoff [mailto:jkr...@lt...] > > Sent: Friday, August 28, 2009 6:09 PM > > To: pyp...@li... > > Subject: [Pyparsing] Painfully slow parsing. > > > > Hello, > > > > I have a serious speed problem with a parser written using pyparsing, > > where it's taking ~13 minutes to parse a 30 line file. I'm totally lost > > on what might be causing it, as small variations seem to be causing > > large differences in parsing time. I was hoping I could get some tips on > > general optimization strategies to follow. For instance, I'm suspicious > > that I should be trying harder to use the '-' operator, and wonder if > > that would help... > > > > John - > > I've not seen people use '-' as a way to speed up parsing, but I imagine it > could help. '?load', '?attribute', and '?element' look like good places > where '-' would be a fit (right after the keyword literal). > > But I am struggling as to where to even begin. You have posted 500+ lines > of parser code, without much guidance as to what BNF you are working from, > or what you are trying to get from the parser. But you didn't post the 30 > line test file, so I have nothing to run your parser with. Hello, Thank you for giving my poorly asked question more time than it deserves. As an excuse, I truly had no idea where in that block of parser code I was having a speed problem, and was at a complete loss as to how to narrow it down. There's really no formal description of the grammar, the best I've got for documentation is here: http://code.google.com/p/compactxml/source/browse/compact.rst Which is really more of a set of examples than anything. The grammar developed fairly organically to scratch a very specific itch. The specific test case that was slow is here, as part of the (nose based) module test suite: http://code.google.com/p/compactxml/source/browse/compactxml/tests/speed_test.py I do say was, though, as I stumbled across this comment on the discussion board: http://pyparsing.wikispaces.com/message/view/home/13557997 Which was an instance of someone else having a performance problem with pyparsing.indentedBlock. His solution worked for me, and didn't break any of my test suite. My parser still isn't speedy, but at least it's not ridiculous anymore (from 13 minutes to parse that test case down to about a tenth of a second). I patched indentedBlock as so: http://code.google.com/p/compactxml/source/browse/compactxml/pyparsingaddons.py I really appreciate the tips you've given me, and they've given me a good starting place to try and shave down the parsing time further. > > You already mention that packratting isn't an option, how about psyco? The memory footprint hasn't worked out well for me in the past, but I went and took a closer look at it. Unfortunately it looks like even if I get over that, it's incompatible with the 64-bit production environment I have to run on. > > Why do you write this: > restartIndentation = pyparsing.Literal( '<' ).setParseAction( lambda s, l, > t: push_indent( ) ).suppress( ) > resumeIndentation = pyparsing.Literal( '>' ).setParseAction( lambda s, l, t: > pop_indent( ) ).suppress( ) > > Instead of: > > restartIndentation = pyparsing.Literal( '<' ).setParseAction( push_indent > ).suppress( ) > resumeIndentation = pyparsing.Literal( '>' ).setParseAction( pop_indent > ).suppress( ) Wow, that's a handy shortcut to know. I had no idea from the pyparsing documentation that parse actions could take a variable number of arguments like that. Now that you've pointed it out, I see the rules stated in the setParseAction docstring, which I'll use to shorten some of the grammar up a bit. > > > Here's an idea: follow your definition of endStatement with this: > > endStatement.setName("endStatement").setDebug() > > Re-run your test, and see how much retracing of your steps is going on. You > might find that you parse to the end, and then spend most of the time > figuring out that you're actually AT the end. Perfect! I didn't understand what these methods were for, and this really gives me the tools to decipher what the parser is doing. > > > This code also looks like a likely performance problem: > > def create_block( simple, compound ): > block = pyparsing.Forward( ) > simpleStatement = simple + endStatement > compoundStatement = compound + endStatement + > pyparsing.Optional( block ) > statement = compoundStatement | simpleStatement > block << addons.indentedBlock( statement, aIndentations ) > block.setParseAction( lambda s, l, t: t[ 0 ] ) > return compoundStatement > > You can try adding some more setName/setDebug calls, to get more insight to > how pyparsing is working its way through your grammar. > > As for getting response to your questions, the mailing list and wiki > Discussion tab are about the same, although I think other people besides me > are more likely to chime in on the list. This actually bodes well for > getting a faster response, as I just started a new job, and am pretty busy > trying to get off on a good start. You could also try posting on > stackoverflow.com - you might get a response from Alex Martelli himself! > > Good luck! > -- Paul > > -- John Krukoff <jkr...@lt...> Land Title Guarantee Company |