Thread: [Pyparsing] Painfully slow parsing.
Brought to you by:
ptmcg
From: John K. <jkr...@lt...> - 2009-08-28 23:43:57
|
Hello, I have a serious speed problem with a parser written using pyparsing, where it's taking ~13 minutes to parse a 30 line file. I'm totally lost on what might be causing it, as small variations seem to be causing large differences in parsing time. I was hoping I could get some tips on general optimization strategies to follow. For instance, I'm suspicious that I should be trying harder to use the '-' operator, and wonder if that would help... I've posted my project on google code for easy access, the relevant bit that defines the parsing grammar if I can interest someone in taking a look, is at: http://code.google.com/p/compactxml/source/browse/compactxml/expand.py I'm making heavy use of significant whitespace and pyparsing.indentedBlock, so it does feel like I'm fighting against pyparsing a bit. Unfortunately, it looks like indentedBlock is incompatible with packrat parsing, so the most obvious performance improving tip looks to be unusable. I'm also unsure if the mailing list is the best place to ask for help, as it looks like there's more traffic on the pyparsing home page discussion tab? -- John Krukoff <jkr...@lt...> Land Title Guarantee Company |
From: Gre7g L. <haf...@ya...> - 2009-08-29 00:44:15
|
Try: import pyparsing as PP PP.ParserElement.enablePackrat() It's on the website somewhere, but not nearly prominent enough IMHO. Gre7g ________________________________ From: John Krukoff <jkr...@lt...> To: pyp...@li... Sent: Friday, August 28, 2009 5:08:39 PM Subject: [Pyparsing] Painfully slow parsing. Hello, I have a serious speed problem with a parser written using pyparsing, where it's taking ~13 minutes to parse a 30 line file. I'm totally lost on what might be causing it, as small variations seem to be causing large differences in parsing time. I was hoping I could get some tips on general optimization strategies to follow. For instance, I'm suspicious that I should be trying harder to use the '-' operator, and wonder if that would help... I've posted my project on google code for easy access, the relevant bit that defines the parsing grammar if I can interest someone in taking a look, is at: http://code.google.com/p/compactxml/source/browse/compactxml/expand.py I'm making heavy use of significant whitespace and pyparsing.indentedBlock, so it does feel like I'm fighting against pyparsing a bit. Unfortunately, it looks like indentedBlock is incompatible with packrat parsing, so the most obvious performance improving tip looks to be unusable. I'm also unsure if the mailing list is the best place to ask for help, as it looks like there's more traffic on the pyparsing home page discussion tab? -- John Krukoff <jkr...@lt...> Land Title Guarantee Company ------------------------------------------------------------------------------ Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july _______________________________________________ Pyparsing-users mailing list Pyp...@li... https://lists.sourceforge.net/lists/listinfo/pyparsing-users |
From: Paul M. <pt...@au...> - 2009-08-29 05:00:59
|
> -----Original Message----- > From: John Krukoff [mailto:jkr...@lt...] > Sent: Friday, August 28, 2009 6:09 PM > To: pyp...@li... > Subject: [Pyparsing] Painfully slow parsing. > > Hello, > > I have a serious speed problem with a parser written using pyparsing, > where it's taking ~13 minutes to parse a 30 line file. I'm totally lost > on what might be causing it, as small variations seem to be causing > large differences in parsing time. I was hoping I could get some tips on > general optimization strategies to follow. For instance, I'm suspicious > that I should be trying harder to use the '-' operator, and wonder if > that would help... > John - I've not seen people use '-' as a way to speed up parsing, but I imagine it could help. '?load', '?attribute', and '?element' look like good places where '-' would be a fit (right after the keyword literal). But I am struggling as to where to even begin. You have posted 500+ lines of parser code, without much guidance as to what BNF you are working from, or what you are trying to get from the parser. But you didn't post the 30 line test file, so I have nothing to run your parser with. You already mention that packratting isn't an option, how about psyco? Why do you write this: restartIndentation = pyparsing.Literal( '<' ).setParseAction( lambda s, l, t: push_indent( ) ).suppress( ) resumeIndentation = pyparsing.Literal( '>' ).setParseAction( lambda s, l, t: pop_indent( ) ).suppress( ) Instead of: restartIndentation = pyparsing.Literal( '<' ).setParseAction( push_indent ).suppress( ) resumeIndentation = pyparsing.Literal( '>' ).setParseAction( pop_indent ).suppress( ) Here's an idea: follow your definition of endStatement with this: endStatement.setName("endStatement").setDebug() Re-run your test, and see how much retracing of your steps is going on. You might find that you parse to the end, and then spend most of the time figuring out that you're actually AT the end. This code also looks like a likely performance problem: def create_block( simple, compound ): block = pyparsing.Forward( ) simpleStatement = simple + endStatement compoundStatement = compound + endStatement + pyparsing.Optional( block ) statement = compoundStatement | simpleStatement block << addons.indentedBlock( statement, aIndentations ) block.setParseAction( lambda s, l, t: t[ 0 ] ) return compoundStatement You can try adding some more setName/setDebug calls, to get more insight to how pyparsing is working its way through your grammar. As for getting response to your questions, the mailing list and wiki Discussion tab are about the same, although I think other people besides me are more likely to chime in on the list. This actually bodes well for getting a faster response, as I just started a new job, and am pretty busy trying to get off on a good start. You could also try posting on stackoverflow.com - you might get a response from Alex Martelli himself! Good luck! -- Paul |
From: John K. <jkr...@lt...> - 2009-08-31 21:09:08
|
On Sat, 2009-08-29 at 00:00 -0500, Paul McGuire wrote: > > -----Original Message----- > > From: John Krukoff [mailto:jkr...@lt...] > > Sent: Friday, August 28, 2009 6:09 PM > > To: pyp...@li... > > Subject: [Pyparsing] Painfully slow parsing. > > > > Hello, > > > > I have a serious speed problem with a parser written using pyparsing, > > where it's taking ~13 minutes to parse a 30 line file. I'm totally lost > > on what might be causing it, as small variations seem to be causing > > large differences in parsing time. I was hoping I could get some tips on > > general optimization strategies to follow. For instance, I'm suspicious > > that I should be trying harder to use the '-' operator, and wonder if > > that would help... > > > > John - > > I've not seen people use '-' as a way to speed up parsing, but I imagine it > could help. '?load', '?attribute', and '?element' look like good places > where '-' would be a fit (right after the keyword literal). > > But I am struggling as to where to even begin. You have posted 500+ lines > of parser code, without much guidance as to what BNF you are working from, > or what you are trying to get from the parser. But you didn't post the 30 > line test file, so I have nothing to run your parser with. Hello, Thank you for giving my poorly asked question more time than it deserves. As an excuse, I truly had no idea where in that block of parser code I was having a speed problem, and was at a complete loss as to how to narrow it down. There's really no formal description of the grammar, the best I've got for documentation is here: http://code.google.com/p/compactxml/source/browse/compact.rst Which is really more of a set of examples than anything. The grammar developed fairly organically to scratch a very specific itch. The specific test case that was slow is here, as part of the (nose based) module test suite: http://code.google.com/p/compactxml/source/browse/compactxml/tests/speed_test.py I do say was, though, as I stumbled across this comment on the discussion board: http://pyparsing.wikispaces.com/message/view/home/13557997 Which was an instance of someone else having a performance problem with pyparsing.indentedBlock. His solution worked for me, and didn't break any of my test suite. My parser still isn't speedy, but at least it's not ridiculous anymore (from 13 minutes to parse that test case down to about a tenth of a second). I patched indentedBlock as so: http://code.google.com/p/compactxml/source/browse/compactxml/pyparsingaddons.py I really appreciate the tips you've given me, and they've given me a good starting place to try and shave down the parsing time further. > > You already mention that packratting isn't an option, how about psyco? The memory footprint hasn't worked out well for me in the past, but I went and took a closer look at it. Unfortunately it looks like even if I get over that, it's incompatible with the 64-bit production environment I have to run on. > > Why do you write this: > restartIndentation = pyparsing.Literal( '<' ).setParseAction( lambda s, l, > t: push_indent( ) ).suppress( ) > resumeIndentation = pyparsing.Literal( '>' ).setParseAction( lambda s, l, t: > pop_indent( ) ).suppress( ) > > Instead of: > > restartIndentation = pyparsing.Literal( '<' ).setParseAction( push_indent > ).suppress( ) > resumeIndentation = pyparsing.Literal( '>' ).setParseAction( pop_indent > ).suppress( ) Wow, that's a handy shortcut to know. I had no idea from the pyparsing documentation that parse actions could take a variable number of arguments like that. Now that you've pointed it out, I see the rules stated in the setParseAction docstring, which I'll use to shorten some of the grammar up a bit. > > > Here's an idea: follow your definition of endStatement with this: > > endStatement.setName("endStatement").setDebug() > > Re-run your test, and see how much retracing of your steps is going on. You > might find that you parse to the end, and then spend most of the time > figuring out that you're actually AT the end. Perfect! I didn't understand what these methods were for, and this really gives me the tools to decipher what the parser is doing. > > > This code also looks like a likely performance problem: > > def create_block( simple, compound ): > block = pyparsing.Forward( ) > simpleStatement = simple + endStatement > compoundStatement = compound + endStatement + > pyparsing.Optional( block ) > statement = compoundStatement | simpleStatement > block << addons.indentedBlock( statement, aIndentations ) > block.setParseAction( lambda s, l, t: t[ 0 ] ) > return compoundStatement > > You can try adding some more setName/setDebug calls, to get more insight to > how pyparsing is working its way through your grammar. > > As for getting response to your questions, the mailing list and wiki > Discussion tab are about the same, although I think other people besides me > are more likely to chime in on the list. This actually bodes well for > getting a faster response, as I just started a new job, and am pretty busy > trying to get off on a good start. You could also try posting on > stackoverflow.com - you might get a response from Alex Martelli himself! > > Good luck! > -- Paul > > -- John Krukoff <jkr...@lt...> Land Title Guarantee Company |
From: Eike W. <eik...@gm...> - 2009-09-03 15:09:05
|
On Monday 31 August 2009, John Krukoff wrote: > I do say was, though, as I stumbled across this comment on the > discussion board: > > http://pyparsing.wikispaces.com/message/view/home/13557997 > > Which was an instance of someone else having a performance problem > with pyparsing.indentedBlock. His solution worked for me, and > didn't break any of my test suite. My parser still isn't speedy, > but at least it's not ridiculous anymore (from 13 minutes to parse > that test case down to about a tenth of a second). I patched > indentedBlock as so: That's an interesting find. I put the modification into my patched version of Pyparsing and it didn't break any tests in my project too. The modification also solves one problem where Pyparsing can't parse a complex but syntactically correct file. Without the modification the parser crashes with "RuntimeError: maximum recursion depth exceeded". So this modification seems to be quite usefull. John, you should put a patch into Pyparsing's tracker, so that it doesn't get lost. http://sourceforge.net/tracker/?atid=617313&group_id=97203&func=browse Kind regards, Eike. |