Menu

#86 packrat caching can not be disabled

v1.0 (example)
closed-fixed
nobody
None
5
2018-09-19
2016-01-08
No

In our project we implement several parsers with pyparsing. For some of the more complex parsers setting:

pyparsing.ParserElement.enablePackrat()

does result in speed up but for some simple parsers it results in significant slow downs and worse yet, significant memory usage. For example this simple parser:

https://codereview.appspot.com/277630043/diff/40001/tools/layout_expert/layout_expert/parser/trimming_parser.py

is used on a large file with C code (approximately 1.7mb - 50kloc). Without packrat it completes in about 20 seconds but with packrat it takes 1 min 15 seconds (almost 4 times longer). It also uses several GB of memory. Looking at the code for packrat it implements a simple memoization scheme:

    # this method gets repeatedly called during backtracking with the same arguments -                                                                                                        
    # we can cache these arguments and save ourselves the trouble of re-parsing the contained expression                                                                                      
    def _parseCache( self, instring, loc, doActions=True, callPreParse=True ):
        lookup = (self,instring,loc,callPreParse,doActions)
        if lookup in ParserElement._exprArgCache:
            value = ParserElement._exprArgCache[ lookup ]
            if isinstance(value, Exception):
            raise value
            return (value[0],value[1].copy())
        else:
            try:
                value = self._parseNoCache( instring, loc, doActions, callPreParse )
                ParserElement._exprArgCache[ lookup ] = (value[0],value[1].copy())
                return value
            except ParseBaseException as pe:
                pe.__traceback__ = None
                ParserElement._exprArgCache[ lookup ] = pe
                raise

There are several issues with this code:
1. The cache is global and shared between all parsers for the life of the program.
2. There is no limit on the size of the cache - the cache grows until the program exist or resetCache() is called (which resets the cache for all users at once). The user needs to be aware that they should call resetCache() frequently to ensure reasonable memory usage.
3. The cache key contains instring - yes this is the entire parsed string for each string being parsed. If the string is large this uses a lot of memory. Although technically python will just keep a reference to the same string once (unless it is being modified) it will still need to calculate a hash of the entire string every time it looks up the cache. A very large input string is a performance killer here!
4. This function turns out to be extremely hot - If the additional cost of looking up the cache is not offset by the benefits of the cache, then the cache kills performance. Since any potential benefits of the cache depend on the specific parser the utility of the cache is also parser dependant. This issue should also be clarified more in the documentation (which currently suggests this is always beneficial).

There is a way to enable the cache but there is no official way to disable it. I propose adding a new method:

  def disablePackrat():
    ParserElement._packratEnabled = False
    ParserElement._parse = ParserElement._parseNoCache

but this is a stop gap because the real problem is that the cache is global. Actually thinking about it the cache is only useful towards the top of the parse tree in the case where there are repeated identical tokens shared by many subbranches. As one gets to the leafs of the parse tree (which backtrack a lot more) the cache becomes more expensive (since there will rarely be a cache hit). It might make sense to allow the node itself to carry its own cache state (so like setDebug() maybe we need setCache() ).

Discussion

  • Paul McGuire

    Paul McGuire - 2016-09-09

    Please look at the recent upgrades to packrat parsing's implementation.

     
  • Paul McGuire

    Paul McGuire - 2018-09-19
    • status: open --> closed-fixed
     
  • Paul McGuire

    Paul McGuire - 2018-09-19

    Packrat parsing has undergone major rewrite/improvements

     

Log in to post a comment.