Re: [Pyparsing] packrat bug + various
Brought to you by:
ptmcg
From: Paul M. <pa...@al...> - 2006-08-24 07:18:33
|
Paolo - Thanks for the whitespace bugfix. Did you have a test case that raised this problem in the first place? If so, I'd like to add it to my unit tests. Packrat parsing has issues besides the ParseResults problem you describe. Some parse actions have important side effects that get omitted when results are retrieved from a cache instead of being explicitly reparsed. The packrat concept showed such promise I couldn't pass it up, but I had to leave it disabled by default, so as not to break code for existing applications. This is an especially interesting phenomenon for me; as far as pure theoretical parsing is concerned, packratting is a no-brainer. But pyparsing's implementation, including parse actions, goes beyond the pure parsing function, and in the process breaks the packrat concept. In upgrading your app to be able to use packratting, is it possible that your grammar could be restructured? For instance, in your example, if you change: e = A ^ ( A + B + C ) to: e2 = A + Optional(B + C) then packratting does no harm. I'm sure this is a simplified example for the sake of easy posting, but maybe similar mods could be made to your base app grammar. Performance enhancements is a sore subject for me. Pyparsing has gotten some pretty nasty comments with respect to its parsing speed, so I admit I committed a number of coding sins in trying to eke out a little more speed at parsing time. I've also been reluctant to port parts of pyparsing to C, as I think this complicates its inclusion in other projects, and also its portability. Here are the main performance enhancement techniques, in rough descending order of importance, as I recall (not counting packratting): - cacheing of exception objects - pyparsing raises exceptions a *lot* as various paths are tried and rejected. I optimized the creation of exceptions as much as possible, and then realized that I could keep raising the same exception object. - using startswith in Literal instead of comparing the literal string with a string slice. Literals are the most common terminal, and so Literal.parseImpl has to be fast. - ParseResults reuse with __new__ and __iadd__. There are non-performance reasons for doing this also, mostly having to do with gathering results from the exprs in an And. - avoidance of try/except - one of the ugliest parts of the _parse method is the repetition of code, depending on whether debugging methods are defined or not. But setting up the try/except has a cost, and _parse is about as high-traffic a method as there is. - using re's or dict keys for Word.parseImpl testing with initChars and bodyChars But pyparsing seems to be stabilizing - the last release was about 7 weeks ago, and I've made only minor changes since then. Perhaps pyparsing could move forward as a pair of releases, one pure Python and one C-enhanced (similar to elementTree or StringIO). I've done *no* Python+C development, so this is the other barrier for me to doing such a task. I *have* put in way too much time on pyparsing, at the expense of some family and work time, so I'm not eager to take on a major rewrite just now. If I had a multithreaded app that used pyparsing, I would probably serialize access to the parser object. I don't want to make the resetCache invocation explicit, since the need for it is kind of obscure, and so likely to be overlooked. Thanks for your persistence with pyparsing, Paolo, I know you have been working with it for a while. (We never did get to try adding the GLR mode...). -- Paul |