Re: [Pyparsing] Painfully slow parsing.

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

On Sat, 2009-08-29 at 00:00 -0500, Paul McGuire wrote:
> > -----Original Message-----
> > From: John Krukoff [mailto:jkr...@lt...]
> > Sent: Friday, August 28, 2009 6:09 PM
> > To: pyp...@li...
> > Subject: [Pyparsing] Painfully slow parsing.
> > 
> > Hello,
> > 
> > I have a serious speed problem with a parser written using pyparsing,
> > where it's taking ~13 minutes to parse a 30 line file. I'm totally lost
> > on what might be causing it, as small variations seem to be causing
> > large differences in parsing time. I was hoping I could get some tips on
> > general optimization strategies to follow. For instance, I'm suspicious
> > that I should be trying harder to use the '-' operator, and wonder if
> > that would help...
> > 
> 
> John -
> 
> I've not seen people use '-' as a way to speed up parsing, but I imagine it
> could help.  '?load', '?attribute', and '?element' look like good places
> where '-' would be a fit (right after the keyword literal).
> 
> But I am struggling as to where to even begin.  You have posted 500+ lines
> of parser code, without much guidance as to what BNF you are working from,
> or what you are trying to get from the parser.  But you didn't post the 30
> line test file, so I have nothing to run your parser with.

Hello,

Thank you for giving my poorly asked question more time than it
deserves. As an excuse, I truly had no idea where in that block of
parser code I was having a speed problem, and was at a complete loss as
to how to narrow it down.

There's really no formal description of the grammar, the best I've got
for documentation is here:

http://code.google.com/p/compactxml/source/browse/compact.rst

Which is really more of a set of examples than anything. The grammar
developed fairly organically to scratch a very specific itch.

The specific test case that was slow is here, as part of the (nose
based) module test suite:

http://code.google.com/p/compactxml/source/browse/compactxml/tests/speed_test.py

I do say was, though, as I stumbled across this comment on the
discussion board:

http://pyparsing.wikispaces.com/message/view/home/13557997

Which was an instance of someone else having a performance problem with
pyparsing.indentedBlock. His solution worked for me, and didn't break
any of my test suite. My parser still isn't speedy, but at least it's
not ridiculous anymore (from 13 minutes to parse that test case down to
about a tenth of a second). I patched indentedBlock as so:

http://code.google.com/p/compactxml/source/browse/compactxml/pyparsingaddons.py

I really appreciate the tips you've given me, and they've given me a
good starting place to try and shave down the parsing time further.

> 
> You already mention that packratting isn't an option, how about psyco?

The memory footprint hasn't worked out well for me in the past, but I
went and took a closer look at it. Unfortunately it looks like even if I
get over that, it's incompatible with the 64-bit production environment
I have to run on.

> 
> Why do you write this:
> restartIndentation = pyparsing.Literal( '<' ).setParseAction( lambda s, l,
> t: push_indent( ) ).suppress( )
> resumeIndentation = pyparsing.Literal( '>' ).setParseAction( lambda s, l, t:
> pop_indent( ) ).suppress( )
> 
> Instead of:
> 
> restartIndentation = pyparsing.Literal( '<' ).setParseAction( push_indent
> ).suppress( )
> resumeIndentation = pyparsing.Literal( '>' ).setParseAction( pop_indent
> ).suppress( )

Wow, that's a handy shortcut to know. I had no idea from the pyparsing
documentation that parse actions could take a variable number of
arguments like that. Now that you've pointed it out, I see the rules
stated in the setParseAction docstring, which I'll use to shorten some
of the grammar up a bit.

> 
> 
> Here's an idea: follow your definition of endStatement with this:
> 
> endStatement.setName("endStatement").setDebug()
> 
> Re-run your test, and see how much retracing of your steps is going on.  You
> might find that you parse to the end, and then spend most of the time
> figuring out that you're actually AT the end.

Perfect! I didn't understand what these methods were for, and this
really gives me the tools to decipher what the parser is doing.

> 
> 
> This code also looks like a likely performance problem:
> 
> 	def create_block( simple, compound ):
> 		block = pyparsing.Forward( )
> 		simpleStatement = simple + endStatement
> 		compoundStatement = compound + endStatement +
> pyparsing.Optional( block )
> 		statement = compoundStatement | simpleStatement
> 		block << addons.indentedBlock( statement, aIndentations )
> 		block.setParseAction( lambda s, l, t: t[ 0 ] )
> 		return compoundStatement
> 
> You can try adding some more setName/setDebug calls, to get more insight to
> how pyparsing is working its way through your grammar.
> 
> As for getting response to your questions, the mailing list and wiki
> Discussion tab are about the same, although I think other people besides me
> are more likely to chime in on the list.  This actually bodes well for
> getting a faster response, as I just started a new job, and am pretty busy
> trying to get off on a good start.  You could also try posting on
> stackoverflow.com - you might get a response from Alex Martelli himself!
> 
> Good luck!
> -- Paul
> 
> 
-- 
John Krukoff <jkr...@lt...>
Land Title Guarantee Company