Pyparsing uses an abnormally high amount of stack frames to parse a simple query based on the Lucene parser. On Alpine linux, which has a default stack size of 80KB, this causes a segfault. The stack size can be increased with threading.stack_size()
of course, but that doesn't explain why such recursion is being used for a simple query. The stack size is much higher when using ParserElement.enablePackrat()
. Without caching, it does not crash but of course takes more operations.
I've reduced the code to the smallest I could to reproduce the problem:
import inspect
import threading
from threading import Thread
from pyparsing import (CaselessKeyword, Forward, Suppress,
Optional, Group, infixNotation, opAssoc, ParserElement)
threading.stack_size(80 * 1024)
ParserElement.enablePackrat()
LPAR, RPAR = map(Suppress, "()")
and_, or_ = map(CaselessKeyword, "AND OR".split())
expression = Forward()
term = Forward()
term << Group(LPAR + expression + RPAR)
expression << infixNotation(term,
[
((and_ | '&&').setParseAction(lambda: "AND"), 2, opAssoc.LEFT),
(Optional(or_ | '||').setParseAction(lambda: "OR"), 2, opAssoc.LEFT),
])
_parse = ParserElement.preParse
def _inspect(*args, **kwargs):
print 'Stack size %s' % len(inspect.stack())
return _parse(*args, **kwargs)
ParserElement.preParse = _inspect
def do_it():
# This query is what causes the deep recursion.
# The original query was `( (role:foo) || (role:bar) )` which is
# overly escaped but a valid Lucene query coming from a user. Just
# nested parenthesis is enough to cause the issue.
expression.parseString('( () )')
def do_it_in_thread():
t = Thread(target=do_it)
t.start()
return t
t = do_it_in_thread()
t.join()
print 'Done'
With caching:
Stack size 8
Stack size 20
Stack size 27
Stack size 36
Stack size 43
Stack size 49
Stack size 58
Stack size 70
Stack size 77
Stack size 86
Stack size 93
Stack size 99
Stack size 108
Stack size 120
Stack size 127
Stack size 136
Stack size 143
Stack size 149
Segmentation fault
Without caching:
Stack size 7
Stack size 15
Stack size 20
Stack size 26
Stack size 31
Stack size 35
Stack size 41
Stack size 49
Stack size 54
Stack size 60
Stack size 65
Stack size 69
Stack size 82
Stack size 88
Stack size 93
Stack size 97
Stack size 97
Stack size 88
Stack size 88
Stack size 77
[snip over 1169 lines total]
Stack size 44
Stack size 50
Stack size 55
Stack size 59
Stack size 59
Stack size 50
Stack size 50
Stack size 39
Stack size 43
Stack size 48
Stack size 52
Stack size 52
Stack size 43
Stack size 43
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "tmp.py", line 35, in do_it
expression.parseString('( () )')
File "/usr/local/lib/python2.7/site-packages/pyparsing.py", line 1632, in parseString
raise exc
ParseException: Expected {[{"OR" | "||"}] term | {"AND" | "&&"} term} (at char 3), (line:1, col:4)
Done
To avoid the segfault on Alpine I'm increasing the stack size, but I'm concerned about how many stack frames are used for this simple of a query. There seems to be a recurssion problem.
( ( ) )
may be valid Lucene, but it isn't valid as far asinfixNotation
is concerned. There have to be some operands in the string to be parsed, not just empty grouping ()'s. That is what that ParseException is trying to say.Your definition of
term
is just the parens aroundexpression
, which is unnecessary sinceinfixNotation
does this recursive part for you. Plus, you haven't defined just what a valid operand looks like. Reading your comments, a valid query string would look like( (role:foo) || (role:bar) )
, so I defined your grammar as:(Note: no Forwards - infixNotation takes care of that). There is a lot of recursion going on here, I believe stemming from the use of the Optional
||
operator. But with these changes to term and expression, I was able to run this code against the original query, with and without packratting (much faster with), and with and without the call tothreading.stack_size
.