Thread: [Pyparsing] packrat bug + various
Brought to you by:
ptmcg
From: Paolo L. <p....@hy...> - 2006-08-22 23:09:55
Attachments:
patch_whitespaces.diff
|
Hello Paul and everyone, I'm trying to upgrade my application from 1.3.3 to 1.4.3, mainly for the PackRat feature since I expect huge performance gain from it in my app. I'm finding some problems/bugs though... ### PackRat bug reuse of ParseResults strategy conflicts with PackRat ParseResult caching. For an example: ----------------------- from pyparsing import * A = Literal('A') B = Literal('B') C = Literal('C') e = A ^ ( A + B + C ) print e.parseString("AB") ParserElement.enablePackrat() print e.parseString("AB") ------------------------- The solution could be dropping the reuse strategy (drop __iadd__ and __new__ from ParseResults). I think that this not only would solve the bug but would also make the code easier to understand/maintain with a little sacrifice in performance (has the performance advantage ever been measured?). I know this is a drastic solution and I want to hear from you Paul since I'd like to keep in synch with your official releases... ### skipWhitespace bug I attach a small patch related to whitespaces skipping... ### PackRat threading enhancement Current packrat implementation seems to be thread safe but not thread friendly: a new thread, via resetCache(), forces a global cache flush also when other threads are running... I suggest making resetCache invocation explicit or make specific per thread cache... what do you think? ### Performance and simple code I see that in some points of the code (i.e. ParserElement._parse) some performance optimization is achieved by making the code a little bit more complex... IMVHO performance optimization should be obtained implementing the critical code in C (many other python projects follow the same rule: in python reference clean code implementation, in C functionally equivalent fast code. I volunteer for C implementation if we agree on the approach... main targets could be ParseResults and ParserElement._parse... Thank you very much for the fantastic work Ciao Paolo |
From: Paul M. <pa...@al...> - 2006-08-24 07:18:33
|
Paolo - Thanks for the whitespace bugfix. Did you have a test case that raised this problem in the first place? If so, I'd like to add it to my unit tests. Packrat parsing has issues besides the ParseResults problem you describe. Some parse actions have important side effects that get omitted when results are retrieved from a cache instead of being explicitly reparsed. The packrat concept showed such promise I couldn't pass it up, but I had to leave it disabled by default, so as not to break code for existing applications. This is an especially interesting phenomenon for me; as far as pure theoretical parsing is concerned, packratting is a no-brainer. But pyparsing's implementation, including parse actions, goes beyond the pure parsing function, and in the process breaks the packrat concept. In upgrading your app to be able to use packratting, is it possible that your grammar could be restructured? For instance, in your example, if you change: e = A ^ ( A + B + C ) to: e2 = A + Optional(B + C) then packratting does no harm. I'm sure this is a simplified example for the sake of easy posting, but maybe similar mods could be made to your base app grammar. Performance enhancements is a sore subject for me. Pyparsing has gotten some pretty nasty comments with respect to its parsing speed, so I admit I committed a number of coding sins in trying to eke out a little more speed at parsing time. I've also been reluctant to port parts of pyparsing to C, as I think this complicates its inclusion in other projects, and also its portability. Here are the main performance enhancement techniques, in rough descending order of importance, as I recall (not counting packratting): - cacheing of exception objects - pyparsing raises exceptions a *lot* as various paths are tried and rejected. I optimized the creation of exceptions as much as possible, and then realized that I could keep raising the same exception object. - using startswith in Literal instead of comparing the literal string with a string slice. Literals are the most common terminal, and so Literal.parseImpl has to be fast. - ParseResults reuse with __new__ and __iadd__. There are non-performance reasons for doing this also, mostly having to do with gathering results from the exprs in an And. - avoidance of try/except - one of the ugliest parts of the _parse method is the repetition of code, depending on whether debugging methods are defined or not. But setting up the try/except has a cost, and _parse is about as high-traffic a method as there is. - using re's or dict keys for Word.parseImpl testing with initChars and bodyChars But pyparsing seems to be stabilizing - the last release was about 7 weeks ago, and I've made only minor changes since then. Perhaps pyparsing could move forward as a pair of releases, one pure Python and one C-enhanced (similar to elementTree or StringIO). I've done *no* Python+C development, so this is the other barrier for me to doing such a task. I *have* put in way too much time on pyparsing, at the expense of some family and work time, so I'm not eager to take on a major rewrite just now. If I had a multithreaded app that used pyparsing, I would probably serialize access to the parser object. I don't want to make the resetCache invocation explicit, since the need for it is kind of obscure, and so likely to be overlooked. Thanks for your persistence with pyparsing, Paolo, I know you have been working with it for a while. (We never did get to try adding the GLR mode...). -- Paul |
From: Paolo L. <p....@hy...> - 2006-08-24 11:07:22
Attachments:
test_skipws.py
patch_fix_packrat.diff
|
Paul, thank you for your reply Paul McGuire wrote: > Paolo - > > Thanks for the whitespace bugfix. Did you have a test case that raised this > problem in the first place? If so, I'd like to add it to my unit tests. Attached... > Packrat parsing has issues besides the ParseResults problem you describe. > Some parse actions have important side effects that get omitted when results > are retrieved from a cache instead of being explicitly reparsed. The > packrat concept showed such promise I couldn't pass it up, but I had to > leave it disabled by default, so as not to break code for existing > applications. This is an especially interesting phenomenon for me; as far > as pure theoretical parsing is concerned, packratting is a no-brainer. But > pyparsing's implementation, including parse actions, goes beyond the pure > parsing function, and in the process breaks the packrat concept. I crashed against the parseAction problem as well after posting the first email. I agree with you that packratting requires a deep review of pyparsing: I just tried to refactor _parseNoCache in order to handle doAction flag cleanly but I've given up :-) Why not adding doAction to the lookup key? I attach a proof of concept patch ("patch_fix_packrat.diff") that seems to fix all my cases... > In upgrading your app to be able to use packratting, is it possible that > your grammar could be restructured? For instance, in your example, if you > change: > e = A ^ ( A + B + C ) > to: > e2 = A + Optional(B + C) > then packratting does no harm. I'm sure this is a simplified example for > the sake of easy posting, but maybe similar mods could be made to your base > app grammar. Not easy at all... it would require review tons of code :-) > Performance enhancements is a sore subject for me. Pyparsing has gotten > some pretty nasty comments with respect to its parsing speed, so I admit I > committed a number of coding sins in trying to eke out a little more speed > at parsing time. I've also been reluctant to port parts of pyparsing to C, > as I think this complicates its inclusion in other projects, and also its > portability. IMO, the real value of pyparsing is not speed but predictability, elagance, flexibility, easy of use... speed should be a second priority goal. > Here are the main performance enhancement techniques, in rough descending > order of importance, as I recall (not counting packratting): > - cacheing of exception objects - pyparsing raises exceptions a *lot* as > various paths are tried and rejected. I optimized the creation of > exceptions as much as possible, and then realized that I could keep raising > the same exception object. This breaks threading as well... > - using startswith in Literal instead of comparing the literal string with a > string slice. Literals are the most common terminal, and so > Literal.parseImpl has to be fast. Ok > - ParseResults reuse with __new__ and __iadd__. There are non-performance > reasons for doing this also, mostly having to do with gathering results from > the exprs in an And. My proof of concept patch remove the functional reason. Do you agree? > - avoidance of try/except - one of the ugliest parts of the _parse method is > the repetition of code, depending on whether debugging methods are defined > or not. But setting up the try/except has a cost, and _parse is about as > high-traffic a method as there is. Ok > - using re's or dict keys for Word.parseImpl testing with initChars and > bodyChars Ok. Which is the overall performance improvement you got from exception caching, parseresults reuse and _parse() optimizations? > But pyparsing seems to be stabilizing - the last release was about 7 weeks > ago, and I've made only minor changes since then. Perhaps pyparsing could > move forward as a pair of releases, one pure Python and one C-enhanced > (similar to elementTree or StringIO). That exactly what I was thinking to. Have a clean, portable, reference pure python implementation and add C optimizations... > I've done *no* Python+C development, > so this is the other barrier for me to doing such a task. I *have* put in > way too much time on pyparsing, at the expense of some family and work time, > so I'm not eager to take on a major rewrite just now. I understand perfectly. I'm very busy as well and cannot make promises. If you have a "benchmark" grammar and test cases that you think are representative I'll be glad of work on them in my spare time. > If I had a multithreaded app that used pyparsing, I would probably serialize > access to the parser object. I don't want to make the resetCache invocation > explicit, since the need for it is kind of obscure, and so likely to be > overlooked. I agree. > Thanks for your persistence with pyparsing, Paolo, I know you have been > working with it for a while. (We never did get to try adding the GLR > mode...). I tried. I come to almost complete ugly working solution (parseImpl is generator of solutions) but it was difficult to build predictable grammar with it. I'm now using with great satisfaction the stock pyparsing. ;-) Ciao e Grazie Paolo |