Re: [Pyparsing] grammar with bad performance (with examples :))

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

David -

Here is a modified version of your program, that runs for me in 0.05 seconds
or so.

The main problem was your universal use of Or instead of MatchFirst.  Using
Or in the definition of OPERATOR is not so big a problem, but you can
convert to MatchFirst if you are careful about the ordering of your
operators.  And the operators may as well be Literals, and not Keywords,
since Keyword only looks for boundary alphanumeric characters, not symbols
or whitespace.

Is there any reason you choose the 

OPERATOR = psg.Or([EQUAL, NOT_EQUAL, CONTAINS, NOT_CONTAINS, STARTWITH])

style, instead of using the built-in operators:

OPERATOR = EQUAL ^ NOT_EQUAL ^ CONTAINS ^ NOT_CONTAINS ^ STARTWITH

or even better:

OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS

??

But the real performance culprit is this:

UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR])

Since BIN_EXPR has a recursive component, and since Or has to evaluate both
expressions to test for a longest match, things end up taking a Really.
Long. Time.

Changing this item to a MatchFirst makes things move right along.  And there
is no chance of misparsing, since an ATOM definition cannot be a leading
part of a BIN_EXPR (which is when MatchFirst has issues - note how I
reordered the items in the OPERATOR list, so that '~>' would be tested ahead
of '~').

Lastly, I added a Group definition to LOGIC_EXPR.  This will structure the
results in a hierarchy that matches any nesting of and's and or's.  pprint
will print out the data in this hierarchy.  Try it with and without the
Group and you can see the difference.

Cheers,
-- Paul

import pyparsing as psg

EQUAL = psg.Literal('=')
NOT_EQUAL = psg.Literal('!=')
CONTAINS = psg.Literal('~')
NOT_CONTAINS = psg.Literal('!~')
STARTWITH = psg.Literal('~>')
#~ OPERATOR = psg.MatchFirst([STARTWITH, EQUAL, NOT_EQUAL, CONTAINS,
NOT_CONTAINS])
OPERATOR = STARTWITH | EQUAL | NOT_EQUAL | CONTAINS | NOT_CONTAINS

#~ FIELD = psg.Word(psg.alphanums + '/').setParseAction(lambda s, loc, toks:
Field(toks[0]))
FIELD = psg.Word(psg.alphanums + '/')#.setParseAction(lambda s, loc, toks:
Field(toks[0]))
VALUE = psg.quotedString.setParseAction(psg.removeQuotes)

ATOM = FIELD + OPERATOR + VALUE

#~ LOGIC_OP = psg.Or([
    #~ psg.Keyword('and').setParseAction(lambda *args: And()),
    #~ psg.Keyword('or').setParseAction(lambda *args: Or())
#~ ])
#~ LOGIC_OP = psg.Keyword('and').setParseAction(lambda *args: And()) | \
           #~ psg.Keyword('or').setParseAction(lambda *args: Or())
LOGIC_OP = psg.Keyword('and') | \
           psg.Keyword('or')

BIN_EXPR = psg.Forward()

#~ UNABIN_EXPR = psg.Or([ATOM, BIN_EXPR])
#~ UNABIN_EXPR = psg.MatchFirst([BIN_EXPR,ATOM])
UNABIN_EXPR = BIN_EXPR | ATOM  # this is the culprit - change operator to
'^' and this takes forever!
BIN_EXPR << psg.Group(LOGIC_OP + '(' + UNABIN_EXPR + ',' + UNABIN_EXPR +
')')

#~ GRAMMAR = psg.Or([BIN_EXPR, ATOM])
#~ GRAMMAR = psg.MatchFirst([BIN_EXPR, ATOM])
GRAMMAR = UNABIN_EXPR

# I used the Python textwrap module to get this under control! :)
data = "or(/BI/BIB/BIBH = '00000509', or(/BI/BIB/BIBH = '00000769'," \
"or(/BI/BIB/BIBH = '00001673', or(/BI/BIB/BIBH = '00000058'," \
"or(/BI/BIB/BIBH = '00000764', or(/BI/BIB/BIBH = '00001238'," \
"or(/BI/BIB/BIBH = '00001592', or (/BI/BIB/BIBH = '00001017'," \
"or(/BI/BIB/BIBH = '00002676', or(/BI/BIB/BIBH = '00001554'," \
"or(/BI/BIB/BIBH = '00002193', or(/BI/BIB/BIBH = '00001907'," \
"or(/BI/BIB/BIBH = '00000366', or(/BI/BIB/BIBH = '00001161'," \
"or(/BI/BIB/BIBH = '00002991', or(/BI/BIB/BIBH = '00002820'," \
"or(/BI/BIB/BIBH = '00000610', or(/BI/BIB/BIBH = '00000521'," \
"or(/BI/BIB/BIBH = '00003143', or (/BI/BIB/BIBH = '00002869'," \
"or(/BI/BIB/BIBH = '00000410', or(/BI/BIB/BIBH = '00001926'," \
"or(/BI/BIB/BIBH = '00000061', or(/BI/BIB/BIBH = '00000165'," \
"or(/BI/BIB/BIBH = '00000669', or(/BI/BIB/BIBH = '00002675'," \
"or(/BI/BIB/BIBH = '00000770', or(/BI/BIB/BIBH = '00000981'," \
"or(/BI/BIB/BIBH = '00001841', or(/BI/BIB/BIBH = '00002668'," \
"or(/BI/BIB/BIBH = '00001949', or (/BI/BIB/BIBH = '00002819'," \
"or(/BI/BIB/BIBH = '00000623', or(/BI/BIB/BIBH = '00002365'," \
"or(/BI/BIB/BIBH = '00000865', or(/BI/BIB/BIBH = '00000176'," \
"or(/BI/BIB/BIBH = '00002036', or(/BI/BIB/BIBH = '00002640'," \
"or(/BI/BIB/BIBH = '00001297', or(/BI/BIB/BIBH = '00000811'," \
"or(/BI/BIB/BIBH = '00001594', or(/BI/BIB/BIBH = '00001160'," \
"or(/BI/BIB/BIBH = '00000709', or (/BI/BIB/BIBH = '00001908'," \
"or(/BI/BIB/BIBH = '00002836', or(/BI/BIB/BIBH = '00001283'," \
"or(/BI/BIB/BIBH = '00002946', or(/BI/BIB/BIBH = '00001877'," \
"or(/BI/BIB/BIBH = '00002378', or(/BI/BIB/BIBH = '00000474'," \
"or(/BI/BIB/BIBH = '00002710', or(/BI/BIB/BIBH = '00001884'," \
"or(/BI/BIB/BIBH = '00000968', or(/BI/BIB/BIBH = '00000926'," \
"or(/BI/BIB/BIBH = '00001902', or (/BI/BIB/BIBH = '00001018'," \
"or(/BI/BIB/BIBH = '00002989', or(/BI/BIB/BIBH = '00001590'," \
"or(/BI/BIB/BIBH = '00001300', or(/BI/BIB/BIBH = '00002754'," \
"or(/BI/BIB/BIBH = '00001923', or(/BI/BIB/BIBH = '00002771'," \
"or(/BI/BIB/BIBH = '00000768', or(/BI/BIB/BIBH = '00002034'," \
"or(/BI/BIB/BIBH = '00001851', or(/BI/BIB/BIBH = '00001303'," \
"or(/BI/BIB/BIBH = '00002642', or (/BI/BIB/BIBH = '00002949'," \
"or(/BI/BIB/BIBH = '00000821', or(/BI/BIB/BIBH = '00000989'," \
"or(/BI/BIB/BIBH = '00001377', or(/BI/BIB/BIBH = '00001514'," \
"or(/BI/BIB/BIBH = '00000510', or(/BI/BIB/BIBH = '00000054'," \
"or(/BI/BIB/BIBH = '00001280', or(/BI/BIB/BIBH = '00002370'," \
"or(/BI/BIB/BIBH = '00001924', or(/BI/BIB/BIBH = '00000063'," \
"or(/BI/BIB/BIBH = '00001632', or (/BI/BIB/BIBH = '00000434'," \
"or(/BI/BIB/BIBH = '00003035', or(/BI/BIB/BIBH = '00001657'," \
"or(/BI/BIB/BIBH = '00002053', or(/BI/BIB/BIBH = '00000966'," \
"or(/BI/BIB/BIBH = '00002699', or(/BI/BIB/BIBH = '00003178'," \
"or(/BI/BIB/BIBH = '00002602', or(/BI/BIB/BIBH = '00002194'," \
"or(/BI/BIB/BIBH = '00001591', or(/BI/BIB/BIBH = '00001910'," \
"or(/BI/BIB/BIBH = '00000174', or (/BI/BIB/BIBH = '00002132'," \
"/BI/BIB/BIBH = '00000665')))))))))))))))))))))))))))))))))))))))))))))" \
")))))))))))))))))))))))))))))))))))))))))))))))"
import time
start = time.time()
tokens = GRAMMAR.parseString(data)
print time.time() - start

import pprint
pprint.pprint(tokens.asList())