Pyparsing uses an abnormally high amount of stack frames to parse a simple query based on the Lucene parser. On Alpine linux, which has a default stack size of 80KB, this causes a segfault. The stack size can be increased with threading.stack_size()
of course, but that doesn't explain why such recursion is being used for a simple query. The stack size is much higher when using ParserElement.enablePackrat()
. Without caching, it does not crash but of course takes more operations.
I've reduced the code to the smallest I could to reproduce the problem:
import inspect import threading from threading import Thread from pyparsing import (CaselessKeyword, Forward, Suppress, Optional, Group, infixNotation, opAssoc, ParserElement) threading.stack_size(80 * 1024) ParserElement.enablePackrat() LPAR, RPAR = map(Suppress, "()") and_, or_ = map(CaselessKeyword, "AND OR".split()) expression = Forward() term = Forward() term << Group(LPAR + expression + RPAR) expression << infixNotation(term, [ ((and_ | '&&').setParseAction(lambda: "AND"), 2, opAssoc.LEFT), (Optional(or_ | '||').setParseAction(lambda: "OR"), 2, opAssoc.LEFT), ]) _parse = ParserElement.preParse def _inspect(*args, **kwargs): print 'Stack size %s' % len(inspect.stack()) return _parse(*args, **kwargs) ParserElement.preParse = _inspect def do_it(): # This query is what causes the deep recursion. # The original query was `( (role:foo) || (role:bar) )` which is # overly escaped but a valid Lucene query coming from a user. Just # nested parenthesis is enough to cause the issue. expression.parseString('( () )') def do_it_in_thread(): t = Thread(target=do_it) t.start() return t t = do_it_in_thread() t.join() print 'Done'
With caching:
Stack size 8 Stack size 20 Stack size 27 Stack size 36 Stack size 43 Stack size 49 Stack size 58 Stack size 70 Stack size 77 Stack size 86 Stack size 93 Stack size 99 Stack size 108 Stack size 120 Stack size 127 Stack size 136 Stack size 143 Stack size 149 Segmentation fault
Without caching:
Stack size 7 Stack size 15 Stack size 20 Stack size 26 Stack size 31 Stack size 35 Stack size 41 Stack size 49 Stack size 54 Stack size 60 Stack size 65 Stack size 69 Stack size 82 Stack size 88 Stack size 93 Stack size 97 Stack size 97 Stack size 88 Stack size 88 Stack size 77 [snip over 1169 lines total] Stack size 44 Stack size 50 Stack size 55 Stack size 59 Stack size 59 Stack size 50 Stack size 50 Stack size 39 Stack size 43 Stack size 48 Stack size 52 Stack size 52 Stack size 43 Stack size 43 Exception in thread Thread-1: Traceback (most recent call last): File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner self.run() File "/usr/local/lib/python2.7/threading.py", line 754, in run self.__target(*self.__args, **self.__kwargs) File "tmp.py", line 35, in do_it expression.parseString('( () )') File "/usr/local/lib/python2.7/site-packages/pyparsing.py", line 1632, in parseString raise exc ParseException: Expected {[{"OR" | "||"}] term | {"AND" | "&&"} term} (at char 3), (line:1, col:4) Done
To avoid the segfault on Alpine I'm increasing the stack size, but I'm concerned about how many stack frames are used for this simple of a query. There seems to be a recurssion problem.
( ( ) )
may be valid Lucene, but it isn't valid as far asinfixNotation
is concerned. There have to be some operands in the string to be parsed, not just empty grouping ()'s. That is what that ParseException is trying to say.Your definition of
term
is just the parens aroundexpression
, which is unnecessary sinceinfixNotation
does this recursive part for you. Plus, you haven't defined just what a valid operand looks like. Reading your comments, a valid query string would look like( (role:foo) || (role:bar) )
, so I defined your grammar as:(Note: no Forwards - infixNotation takes care of that). There is a lot of recursion going on here, I believe stemming from the use of the Optional
||
operator. But with these changes to term and expression, I was able to run this code against the original query, with and without packratting (much faster with), and with and without the call tothreading.stack_size
.