Re: [Pyparsing] grammar with bad performance (with examples :))
Brought to you by:
ptmcg
From: Paul M. <pt...@au...> - 2007-12-10 16:53:33
|
David - Here is a modified version of your program, that runs for me in 0.05 seconds or so. The main problem was your universal use of Or instead of MatchFirst. Using Or in the definition of OPERATOR is not so big a problem, but you can convert to MatchFirst if you are careful about the ordering of your operators. And the operators may as well be Literals, and not Keywords, since Keyword only looks for boundary alphanumeric characters, not symbols or whitespace. Is there any reason you choose the OPERATOR = psg.Or([EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS, STARTWITH]) style, instead of using the built-in operators: OPERATOR = EQUAL ^ NOT_EQUAL ^ CONTAINS ^ NOT_CONTAINS ^ STARTWITH or even better: OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS ?? But the real performance culprit is this: UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR]) Since BIN_EXPR has a recursive component, and since Or has to evaluate both expressions to test for a longest match, things end up taking a Really. Long. Time. Changing this item to a MatchFirst makes things move right along. And there is no chance of misparsing, since an ATOM definition cannot be a leading part of a BIN_EXPR (which is when MatchFirst has issues - note how I reordered the items in the OPERATOR list, so that '~>' would be tested ahead of '~'). Lastly, I added a Group definition to LOGIC_EXPR. This will structure the results in a hierarchy that matches any nesting of and's and or's. pprint will print out the data in this hierarchy. Try it with and without the Group and you can see the difference. Cheers, -- Paul import pyparsing as psg EQUAL = psg.Literal('=') NOT_EQUAL = psg.Literal('!=') CONTAINS = psg.Literal('~') NOT_CONTAINS = psg.Literal('!~') STARTWITH = psg.Literal('~>') #~ OPERATOR = psg.MatchFirst([STARTWITH, EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS]) OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS #~ FIELD = psg.Word(psg.alphanums + '/').setParseAction(lambda s, loc, toks: Field(toks[0])) FIELD = psg.Word(psg.alphanums + '/')#.setParseAction(lambda s, loc, toks: Field(toks[0])) VALUE = psg.quotedString.setParseAction(psg.removeQuotes) ATOM = FIELD + OPERATOR + VALUE #~ LOGIC_OP = psg.Or([ #~ psg.Keyword('and').setParseAction(lambda *args: And()), #~ psg.Keyword('or').setParseAction(lambda *args: Or()) #~ ]) #~ LOGIC_OP = psg.Keyword('and').setParseAction(lambda *args: And()) | \ #~ psg.Keyword('or').setParseAction(lambda *args: Or()) LOGIC_OP = psg.Keyword('and') | \ psg.Keyword('or') BIN_EXPR = psg.Forward() #~ UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR]) #~ UNABIN_EXPR = psg.MatchFirst([BIN_EXPR,ATOM]) UNABIN_EXPR = BIN_EXPR | ATOM # this is the culprit - change operator to '^' and this takes forever! BIN_EXPR << psg.Group(LOGIC_OP + '(' + UNABIN_EXPR + ',' + UNABIN_EXPR + ')') #~ GRAMMAR = psg.Or([BIN_EXPR, ATOM]) #~ GRAMMAR = psg.MatchFirst([BIN_EXPR, ATOM]) GRAMMAR = UNABIN_EXPR # I used the Python textwrap module to get this under control! :) data = "or(/BI/BIB/BIBH = '00000509', or(/BI/BIB/BIBH = '00000769'," \ "or(/BI/BIB/BIBH = '00001673', or(/BI/BIB/BIBH = '00000058'," \ "or(/BI/BIB/BIBH = '00000764', or(/BI/BIB/BIBH = '00001238'," \ "or(/BI/BIB/BIBH = '00001592', or (/BI/BIB/BIBH = '00001017'," \ "or(/BI/BIB/BIBH = '00002676', or(/BI/BIB/BIBH = '00001554'," \ "or(/BI/BIB/BIBH = '00002193', or(/BI/BIB/BIBH = '00001907'," \ "or(/BI/BIB/BIBH = '00000366', or(/BI/BIB/BIBH = '00001161'," \ "or(/BI/BIB/BIBH = '00002991', or(/BI/BIB/BIBH = '00002820'," \ "or(/BI/BIB/BIBH = '00000610', or(/BI/BIB/BIBH = '00000521'," \ "or(/BI/BIB/BIBH = '00003143', or (/BI/BIB/BIBH = '00002869'," \ "or(/BI/BIB/BIBH = '00000410', or(/BI/BIB/BIBH = '00001926'," \ "or(/BI/BIB/BIBH = '00000061', or(/BI/BIB/BIBH = '00000165'," \ "or(/BI/BIB/BIBH = '00000669', or(/BI/BIB/BIBH = '00002675'," \ "or(/BI/BIB/BIBH = '00000770', or(/BI/BIB/BIBH = '00000981'," \ "or(/BI/BIB/BIBH = '00001841', or(/BI/BIB/BIBH = '00002668'," \ "or(/BI/BIB/BIBH = '00001949', or (/BI/BIB/BIBH = '00002819'," \ "or(/BI/BIB/BIBH = '00000623', or(/BI/BIB/BIBH = '00002365'," \ "or(/BI/BIB/BIBH = '00000865', or(/BI/BIB/BIBH = '00000176'," \ "or(/BI/BIB/BIBH = '00002036', or(/BI/BIB/BIBH = '00002640'," \ "or(/BI/BIB/BIBH = '00001297', or(/BI/BIB/BIBH = '00000811'," \ "or(/BI/BIB/BIBH = '00001594', or(/BI/BIB/BIBH = '00001160'," \ "or(/BI/BIB/BIBH = '00000709', or (/BI/BIB/BIBH = '00001908'," \ "or(/BI/BIB/BIBH = '00002836', or(/BI/BIB/BIBH = '00001283'," \ "or(/BI/BIB/BIBH = '00002946', or(/BI/BIB/BIBH = '00001877'," \ "or(/BI/BIB/BIBH = '00002378', or(/BI/BIB/BIBH = '00000474'," \ "or(/BI/BIB/BIBH = '00002710', or(/BI/BIB/BIBH = '00001884'," \ "or(/BI/BIB/BIBH = '00000968', or(/BI/BIB/BIBH = '00000926'," \ "or(/BI/BIB/BIBH = '00001902', or (/BI/BIB/BIBH = '00001018'," \ "or(/BI/BIB/BIBH = '00002989', or(/BI/BIB/BIBH = '00001590'," \ "or(/BI/BIB/BIBH = '00001300', or(/BI/BIB/BIBH = '00002754'," \ "or(/BI/BIB/BIBH = '00001923', or(/BI/BIB/BIBH = '00002771'," \ "or(/BI/BIB/BIBH = '00000768', or(/BI/BIB/BIBH = '00002034'," \ "or(/BI/BIB/BIBH = '00001851', or(/BI/BIB/BIBH = '00001303'," \ "or(/BI/BIB/BIBH = '00002642', or (/BI/BIB/BIBH = '00002949'," \ "or(/BI/BIB/BIBH = '00000821', or(/BI/BIB/BIBH = '00000989'," \ "or(/BI/BIB/BIBH = '00001377', or(/BI/BIB/BIBH = '00001514'," \ "or(/BI/BIB/BIBH = '00000510', or(/BI/BIB/BIBH = '00000054'," \ "or(/BI/BIB/BIBH = '00001280', or(/BI/BIB/BIBH = '00002370'," \ "or(/BI/BIB/BIBH = '00001924', or(/BI/BIB/BIBH = '00000063'," \ "or(/BI/BIB/BIBH = '00001632', or (/BI/BIB/BIBH = '00000434'," \ "or(/BI/BIB/BIBH = '00003035', or(/BI/BIB/BIBH = '00001657'," \ "or(/BI/BIB/BIBH = '00002053', or(/BI/BIB/BIBH = '00000966'," \ "or(/BI/BIB/BIBH = '00002699', or(/BI/BIB/BIBH = '00003178'," \ "or(/BI/BIB/BIBH = '00002602', or(/BI/BIB/BIBH = '00002194'," \ "or(/BI/BIB/BIBH = '00001591', or(/BI/BIB/BIBH = '00001910'," \ "or(/BI/BIB/BIBH = '00000174', or (/BI/BIB/BIBH = '00002132'," \ "/BI/BIB/BIBH = '00000665')))))))))))))))))))))))))))))))))))))))))))))" \ ")))))))))))))))))))))))))))))))))))))))))))))))" import time start = time.time() tokens = GRAMMAR.parseString(data) print time.time() - start import pprint pprint.pprint(tokens.asList()) |