Thread: [Pyparsing] grammar with bad performance (with examples :))
Brought to you by:
ptmcg
From: david <asd...@gn...> - 2007-12-10 14:50:18
|
Hi all, I use pyparsing to parse a simple, self-made, query language. Here is the grammar: ----8<------------------------------------------------------------------- import pyparsing as psg EQUAL = psg.Keyword('=') NOT_EQUAL = psg.Keyword('!=') CONTAINS = psg.Keyword('~') NOT_CONTAINS = psg.Keyword('!~') STARTWITH = psg.Keyword('~>') OPERATOR = psg.Or([EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS, STARTWITH]) FIELD = psg.Word(psg.alphanums + '/').setParseAction(lambda s, loc, toks: Field(toks[0])) VALUE = psg.quotedString.setParseAction(psg.removeQuotes) ATOM = FIELD + OPERATOR + VALUE LOGIC_OP = psg.Or([ psg.Keyword('and').setParseAction(lambda *args: And()), psg.Keyword('or').setParseAction(lambda *args: Or()) ]) BIN_EXPR = psg.Forward() UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR]) BIN_EXPR << (LOGIC_OP + '(' + UNABIN_EXPR + ',' + UNABIN_EXPR + ')') GRAMMAR = psg.Or([BIN_EXPR, ATOM]) ---->8------------------------------------------------------------------- This grammar works like a charme for small input string, but hang with CPU 100% for minutes with a large input string. For large input string I mean something like (sorry this will break your newsreader): ----8<------------------------------------------------------------------- data = """or(/BI/BIB/BIBH = '00000509', or(/BI/BIB/BIBH = '00000769', or(/ BI/BIB/BIBH = '00001673', or(/BI/BIB/BIBH = '00000058', or(/BI/BIB/BIBH = '00000764', or(/BI/BIB/BIBH = '00001238', or(/BI/BIB/BIBH = '00001592', or (/BI/BIB/BIBH = '00001017', or(/BI/BIB/BIBH = '00002676', or(/BI/BIB/BIBH = '00001554', or(/BI/BIB/BIBH = '00002193', or(/BI/BIB/BIBH = '00001907', or(/BI/BIB/BIBH = '00000366', or(/BI/BIB/BIBH = '00001161', or(/BI/BIB/ BIBH = '00002991', or(/BI/BIB/BIBH = '00002820', or(/BI/BIB/BIBH = '00000610', or(/BI/BIB/BIBH = '00000521', or(/BI/BIB/BIBH = '00003143', or (/BI/BIB/BIBH = '00002869', or(/BI/BIB/BIBH = '00000410', or(/BI/BIB/BIBH = '00001926', or(/BI/BIB/BIBH = '00000061', or(/BI/BIB/BIBH = '00000165', or(/BI/BIB/BIBH = '00000669', or(/BI/BIB/BIBH = '00002675', or(/BI/BIB/ BIBH = '00000770', or(/BI/BIB/BIBH = '00000981', or(/BI/BIB/BIBH = '00001841', or(/BI/BIB/BIBH = '00002668', or(/BI/BIB/BIBH = '00001949', or (/BI/BIB/BIBH = '00002819', or(/BI/BIB/BIBH = '00000623', or(/BI/BIB/BIBH = '00002365', or(/BI/BIB/BIBH = '00000865', or(/BI/BIB/BIBH = '00000176', or(/BI/BIB/BIBH = '00002036', or(/BI/BIB/BIBH = '00002640', or(/BI/BIB/ BIBH = '00001297', or(/BI/BIB/BIBH = '00000811', or(/BI/BIB/BIBH = '00001594', or(/BI/BIB/BIBH = '00001160', or(/BI/BIB/BIBH = '00000709', or (/BI/BIB/BIBH = '00001908', or(/BI/BIB/BIBH = '00002836', or(/BI/BIB/BIBH = '00001283', or(/BI/BIB/BIBH = '00002946', or(/BI/BIB/BIBH = '00001877', or(/BI/BIB/BIBH = '00002378', or(/BI/BIB/BIBH = '00000474', or(/BI/BIB/ BIBH = '00002710', or(/BI/BIB/BIBH = '00001884', or(/BI/BIB/BIBH = '00000968', or(/BI/BIB/BIBH = '00000926', or(/BI/BIB/BIBH = '00001902', or (/BI/BIB/BIBH = '00001018', or(/BI/BIB/BIBH = '00002989', or(/BI/BIB/BIBH = '00001590', or(/BI/BIB/BIBH = '00001300', or(/BI/BIB/BIBH = '00002754', or(/BI/BIB/BIBH = '00001923', or(/BI/BIB/BIBH = '00002771', or(/BI/BIB/ BIBH = '00000768', or(/BI/BIB/BIBH = '00002034', or(/BI/BIB/BIBH = '00001851', or(/BI/BIB/BIBH = '00001303', or(/BI/BIB/BIBH = '00002642', or (/BI/BIB/BIBH = '00002949', or(/BI/BIB/BIBH = '00000821', or(/BI/BIB/BIBH = '00000989', or(/BI/BIB/BIBH = '00001377', or(/BI/BIB/BIBH = '00001514', or(/BI/BIB/BIBH = '00000510', or(/BI/BIB/BIBH = '00000054', or(/BI/BIB/ BIBH = '00001280', or(/BI/BIB/BIBH = '00002370', or(/BI/BIB/BIBH = '00001924', or(/BI/BIB/BIBH = '00000063', or(/BI/BIB/BIBH = '00001632', or (/BI/BIB/BIBH = '00000434', or(/BI/BIB/BIBH = '00003035', or(/BI/BIB/BIBH = '00001657', or(/BI/BIB/BIBH = '00002053', or(/BI/BIB/BIBH = '00000966', or(/BI/BIB/BIBH = '00002699', or(/BI/BIB/BIBH = '00003178', or(/BI/BIB/ BIBH = '00002602', or(/BI/BIB/BIBH = '00002194', or(/BI/BIB/BIBH = '00001591', or(/BI/BIB/BIBH = '00001910', or(/BI/BIB/BIBH = '00000174', or (/BI/BIB/BIBH = '00002132', /BI/BIB/BIBH = '00000665'))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))""" import time start = time.time() tokens = GRAMMAR.parseString(data) print time.time() - start ---->8------------------------------------------------------------------- I tried with last pyparsing version (1.4.8) with no luck, have you any ideas? The grammar is self-made, I can change it if you think the bad performance is fault of it. |
From: Paul M. <pt...@au...> - 2007-12-10 16:53:33
|
David - Here is a modified version of your program, that runs for me in 0.05 seconds or so. The main problem was your universal use of Or instead of MatchFirst. Using Or in the definition of OPERATOR is not so big a problem, but you can convert to MatchFirst if you are careful about the ordering of your operators. And the operators may as well be Literals, and not Keywords, since Keyword only looks for boundary alphanumeric characters, not symbols or whitespace. Is there any reason you choose the OPERATOR = psg.Or([EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS, STARTWITH]) style, instead of using the built-in operators: OPERATOR = EQUAL ^ NOT_EQUAL ^ CONTAINS ^ NOT_CONTAINS ^ STARTWITH or even better: OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS ?? But the real performance culprit is this: UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR]) Since BIN_EXPR has a recursive component, and since Or has to evaluate both expressions to test for a longest match, things end up taking a Really. Long. Time. Changing this item to a MatchFirst makes things move right along. And there is no chance of misparsing, since an ATOM definition cannot be a leading part of a BIN_EXPR (which is when MatchFirst has issues - note how I reordered the items in the OPERATOR list, so that '~>' would be tested ahead of '~'). Lastly, I added a Group definition to LOGIC_EXPR. This will structure the results in a hierarchy that matches any nesting of and's and or's. pprint will print out the data in this hierarchy. Try it with and without the Group and you can see the difference. Cheers, -- Paul import pyparsing as psg EQUAL = psg.Literal('=') NOT_EQUAL = psg.Literal('!=') CONTAINS = psg.Literal('~') NOT_CONTAINS = psg.Literal('!~') STARTWITH = psg.Literal('~>') #~ OPERATOR = psg.MatchFirst([STARTWITH, EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS]) OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS #~ FIELD = psg.Word(psg.alphanums + '/').setParseAction(lambda s, loc, toks: Field(toks[0])) FIELD = psg.Word(psg.alphanums + '/')#.setParseAction(lambda s, loc, toks: Field(toks[0])) VALUE = psg.quotedString.setParseAction(psg.removeQuotes) ATOM = FIELD + OPERATOR + VALUE #~ LOGIC_OP = psg.Or([ #~ psg.Keyword('and').setParseAction(lambda *args: And()), #~ psg.Keyword('or').setParseAction(lambda *args: Or()) #~ ]) #~ LOGIC_OP = psg.Keyword('and').setParseAction(lambda *args: And()) | \ #~ psg.Keyword('or').setParseAction(lambda *args: Or()) LOGIC_OP = psg.Keyword('and') | \ psg.Keyword('or') BIN_EXPR = psg.Forward() #~ UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR]) #~ UNABIN_EXPR = psg.MatchFirst([BIN_EXPR,ATOM]) UNABIN_EXPR = BIN_EXPR | ATOM # this is the culprit - change operator to '^' and this takes forever! BIN_EXPR << psg.Group(LOGIC_OP + '(' + UNABIN_EXPR + ',' + UNABIN_EXPR + ')') #~ GRAMMAR = psg.Or([BIN_EXPR, ATOM]) #~ GRAMMAR = psg.MatchFirst([BIN_EXPR, ATOM]) GRAMMAR = UNABIN_EXPR # I used the Python textwrap module to get this under control! :) data = "or(/BI/BIB/BIBH = '00000509', or(/BI/BIB/BIBH = '00000769'," \ "or(/BI/BIB/BIBH = '00001673', or(/BI/BIB/BIBH = '00000058'," \ "or(/BI/BIB/BIBH = '00000764', or(/BI/BIB/BIBH = '00001238'," \ "or(/BI/BIB/BIBH = '00001592', or (/BI/BIB/BIBH = '00001017'," \ "or(/BI/BIB/BIBH = '00002676', or(/BI/BIB/BIBH = '00001554'," \ "or(/BI/BIB/BIBH = '00002193', or(/BI/BIB/BIBH = '00001907'," \ "or(/BI/BIB/BIBH = '00000366', or(/BI/BIB/BIBH = '00001161'," \ "or(/BI/BIB/BIBH = '00002991', or(/BI/BIB/BIBH = '00002820'," \ "or(/BI/BIB/BIBH = '00000610', or(/BI/BIB/BIBH = '00000521'," \ "or(/BI/BIB/BIBH = '00003143', or (/BI/BIB/BIBH = '00002869'," \ "or(/BI/BIB/BIBH = '00000410', or(/BI/BIB/BIBH = '00001926'," \ "or(/BI/BIB/BIBH = '00000061', or(/BI/BIB/BIBH = '00000165'," \ "or(/BI/BIB/BIBH = '00000669', or(/BI/BIB/BIBH = '00002675'," \ "or(/BI/BIB/BIBH = '00000770', or(/BI/BIB/BIBH = '00000981'," \ "or(/BI/BIB/BIBH = '00001841', or(/BI/BIB/BIBH = '00002668'," \ "or(/BI/BIB/BIBH = '00001949', or (/BI/BIB/BIBH = '00002819'," \ "or(/BI/BIB/BIBH = '00000623', or(/BI/BIB/BIBH = '00002365'," \ "or(/BI/BIB/BIBH = '00000865', or(/BI/BIB/BIBH = '00000176'," \ "or(/BI/BIB/BIBH = '00002036', or(/BI/BIB/BIBH = '00002640'," \ "or(/BI/BIB/BIBH = '00001297', or(/BI/BIB/BIBH = '00000811'," \ "or(/BI/BIB/BIBH = '00001594', or(/BI/BIB/BIBH = '00001160'," \ "or(/BI/BIB/BIBH = '00000709', or (/BI/BIB/BIBH = '00001908'," \ "or(/BI/BIB/BIBH = '00002836', or(/BI/BIB/BIBH = '00001283'," \ "or(/BI/BIB/BIBH = '00002946', or(/BI/BIB/BIBH = '00001877'," \ "or(/BI/BIB/BIBH = '00002378', or(/BI/BIB/BIBH = '00000474'," \ "or(/BI/BIB/BIBH = '00002710', or(/BI/BIB/BIBH = '00001884'," \ "or(/BI/BIB/BIBH = '00000968', or(/BI/BIB/BIBH = '00000926'," \ "or(/BI/BIB/BIBH = '00001902', or (/BI/BIB/BIBH = '00001018'," \ "or(/BI/BIB/BIBH = '00002989', or(/BI/BIB/BIBH = '00001590'," \ "or(/BI/BIB/BIBH = '00001300', or(/BI/BIB/BIBH = '00002754'," \ "or(/BI/BIB/BIBH = '00001923', or(/BI/BIB/BIBH = '00002771'," \ "or(/BI/BIB/BIBH = '00000768', or(/BI/BIB/BIBH = '00002034'," \ "or(/BI/BIB/BIBH = '00001851', or(/BI/BIB/BIBH = '00001303'," \ "or(/BI/BIB/BIBH = '00002642', or (/BI/BIB/BIBH = '00002949'," \ "or(/BI/BIB/BIBH = '00000821', or(/BI/BIB/BIBH = '00000989'," \ "or(/BI/BIB/BIBH = '00001377', or(/BI/BIB/BIBH = '00001514'," \ "or(/BI/BIB/BIBH = '00000510', or(/BI/BIB/BIBH = '00000054'," \ "or(/BI/BIB/BIBH = '00001280', or(/BI/BIB/BIBH = '00002370'," \ "or(/BI/BIB/BIBH = '00001924', or(/BI/BIB/BIBH = '00000063'," \ "or(/BI/BIB/BIBH = '00001632', or (/BI/BIB/BIBH = '00000434'," \ "or(/BI/BIB/BIBH = '00003035', or(/BI/BIB/BIBH = '00001657'," \ "or(/BI/BIB/BIBH = '00002053', or(/BI/BIB/BIBH = '00000966'," \ "or(/BI/BIB/BIBH = '00002699', or(/BI/BIB/BIBH = '00003178'," \ "or(/BI/BIB/BIBH = '00002602', or(/BI/BIB/BIBH = '00002194'," \ "or(/BI/BIB/BIBH = '00001591', or(/BI/BIB/BIBH = '00001910'," \ "or(/BI/BIB/BIBH = '00000174', or (/BI/BIB/BIBH = '00002132'," \ "/BI/BIB/BIBH = '00000665')))))))))))))))))))))))))))))))))))))))))))))" \ ")))))))))))))))))))))))))))))))))))))))))))))))" import time start = time.time() tokens = GRAMMAR.parseString(data) print time.time() - start import pprint pprint.pprint(tokens.asList()) |
From: david <asd...@gn...> - 2007-12-13 01:10:36
|
On Mon, 10 Dec 2007 10:53:13 -0600, Paul McGuire wrote: > David - > > Here is a modified version of your program, that runs for me in 0.05 > seconds or so. Paul, Thank you very much, your explanation is excellent, and the new grammar works like a charms :) > The main problem was your universal use of Or instead of MatchFirst. > Using Or in the definition of OPERATOR is not so big a problem, but you > can convert to MatchFirst if you are careful about the ordering of your > operators. And the operators may as well be Literals, and not Keywords, > since Keyword only looks for boundary alphanumeric characters, not > symbols or whitespace. I understand the performance implications between Or and MatchFirst, but I still don't understand why Literals are faster than of Keywords. > Is there any reason you choose the > > OPERATOR = psg.Or([EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS, STARTWITH]) > > style, instead of using the built-in operators: > > OPERATOR = EQUAL ^ NOT_EQUAL ^ CONTAINS ^ NOT_CONTAINS ^ STARTWITH > > or even better: > > OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS Writing grammars is not my business, thanks to pyparsing's simplicity I wrote my first grammar (the one your tune) months ago in one day; I tend to prefer a more readable syntax for non repetitive tasks. Thank you david |