Thread: [Pyparsing] Slow parsing with indentedBlock()

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi guys,

I have a problem with indentedBlock() and a relatively simple grammar that
relies on indentation and recursion. I'm seeing extremely long execution
times. On one example that I've attached it doesn't terminate within an
hour. 

The trick suggested elswehere with commenting out a FollowedBy statement in
Pyparsing (line 3633) doesn't work, as it breaks the grammar. Packrat
parsing doesn't work either; it appears to be incompatible with
indentedBlock(). I've attached a test case that gets broken by enabling
Packrat. I'm using Pyparsing 1.5.2 on Python 2.6.4.

I'd be extremely grateful for hints how to redesign my grammar so that I get
more controlled execution times. The grammar itself isn't really intricate.
It's for an obscure data representation format for entity-relationship data.
It works around nested entity and relationship declarations.
Entity-relationship dependencies are marked by indentation. Entities are
declared by entity names followed by a number. In addition, URI references
can appear as entities, either inline in angle brackets, or using a one-time
external URI declaration. The grammar is fairly self-explanatory; here's the
basics in BNF:

entity-reference:   (entity-name number) | (<inlineURI>)
external-URI-decl:  entity-reference = <inlineURI>
entity-declaration: entity-reference ["parameter"] [par_n:value_n]+

inline-relation:    @relation-name [par_n:value_n] entity-reference
multiline-relation: @relation_name [par_n:value_n] NEWLINE INDENT
                        entity-declaration-block UNDENT

entity-declaration-block: entity-declaration NEWLINE INDENT
                           [inline-relation|multiline-relation]* UNDENT
relationClause: entity-reference [inline-relation|multiline-relation]+

suite: [entity-declaration-block | relationClause | external-URI-decl]+

I've attached a (simple but rather verbose) Pyparsing version of the
grammar, as well as a couple of test cases. Execution time seems to go up
extremaly fast as things become more indented.

Thanks, Philipp

# ================= GRAMMAR DEFINITION ====================
import pyparsing as pp
#pp.ParserElement.enablePackrat()

indentStack = [1]

colon = pp.Literal(u":")
equal = pp.Literal(u"=")
relMarker = pp.Literal(u"@")
srelMarker = pp.Literal(u"$")
subrelMarker = pp.Literal(u"_")

entityName = pp.Word(pp.alphas)
entityNumber = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))
entityRef = pp.Group(pp.Combine(entityName + entityNumber)
                    ).setResultsName("entityRef")
inlineURI = pp.Group(pp.QuotedString(u'<', escQuote=None, multiline=False,
                    endQuoteChar=u">")).setResultsName("inlineURI")

externalURI = pp.Group( entityRef + equal.suppress() + inlineURI
    ).setResultsName("externalURI")

relationName = pp.Word(pp.alphanums+"-")
relationRef = pp.Combine(relMarker + relationName)
superrelationRef = pp.Combine(srelMarker + relationName)
subrelationRef = pp.Combine(subrelMarker + relationName)

parameterName = pp.Word(pp.alphanums+"-").setResultsName("parameterName")
parValueQuote = pp.QuotedString( u'"',u"\\",escQuote=None, multiline=True
                            ).setResultsName("parameterValue")
parValueDirect = pp.Word( pp.alphanums+"-_"
                            ).setResultsName("parameterValue")
parameterStmt = pp.Group( parameterName + colon.suppress()
                            + (parValueQuote | parValueDirect)
                            ).setResultsName("parameterStmt")
parameterBlock = pp.Group( pp.OneOrMore( parameterStmt )
                         ).setResultsName("parameterBlock")
superparameter = pp.QuotedString( u'"',u"\\",escQuote=None, multiline=True
                            ).setResultsName("superparameter")

entityDecl = pp.Group( entityRef
    + (parameterBlock | superparameter + pp.Optional(parameterBlock))
    ).setResultsName("entityDecl")

entityDeclBlock = pp.Forward()

relationInline =  pp.Group( relationRef
    + pp.Optional(parameterBlock)
    + (inlineURI | entityRef)
    ).setResultsName("relationInline")

relationMultiline = pp.Group( relationRef
    + pp.Optional(parameterBlock)
    + pp.indentedBlock(
        entityDeclBlock,
      indentStack).setResultsName("indentedBlock")
    ).setResultsName("relationMultiline")

entityDeclBlock << pp.Group( entityDecl
    + pp.Optional (
       pp.indentedBlock(
         relationMultiline |
         relationInline,
       indentStack).setResultsName("indentedBlock")
       )
    ).setResultsName("entityDeclBlock")

relationClause = pp.Group( (inlineURI | entityRef)
    + (relationInline |
       relationMultiline #
       )
    ).setResultsName("relationClause")

suite = pp.OneOrMore (
    relationClause | externalURI | entityDeclBlock
    ).setResultsName("suite").setDebug()

# ================= TEST CASES ====================

TESTCASES = {
"basic":
    u"""\
A1 par:value
  @rel relpar:value A2
  @rel
    A3 entpar:value
  @rel relpar:value
    A4 entpar:value
  @rel <some:URI/reference>
A2=<some:URI/reference>
""",
"packrat":
    u"""\
C1 id:"mutawalli" type:"appointment"
  @refersToAppointmentTarget <waqf:ent/office/d408:d408-mutawalli>
  @regulatesSuccession
    AO1 id:"muezzinAppointment"
      @hasAppointee <waqf:ent/office/world:muezzin-ayyub>
C2 id:"revenue" type:"revenue"
  @refersToIncome <waqf:ent/payable/d408:waqfIncome>
  @regulatesDistribution
    D1 id:"distStart"
      @hasChildNode
         DF1 id:"haqq at-tawliya" denom:"10" num:"1"
           @hasRecipient <waqf:ent/office/d408:d408-mutawalli>
         DR1 id:"remaining90percent"
           @hasRecipient <waqf:ent/office/d408:d408-mutawalli>
           @inExchangeFor <waqf:ent/duty/d408:d408-light>
           @inExchangeFor <waqf:ent/duty/d408:d408-furniture>""",
"slow":
    u"""\
DT1 id:"dist-taxation" tax:"xaraj"
  @hasChildNode
    D1 id:"dist-repair"
      @hasRecipient <waqf:ent/office/d405:mutawalli-405>
      @inExchangeFor
        X1 id:"graverepair" type:"repair"
          @dutyTarget <waqf:ent/structure/d405:grave-waqif>
  @hasChildNode
    DR1 id:"dist-remainder"
      @hasChildNode
        DF1 id:"dist-haqqattawliya" denom:"10" num:"1"
          @hasRecipient <waqf:ent/office/d405:mutawalli-405>
      @hasChildNode
        DR2 id:"dist-remainder2"
          @hasChildNode
            DF2 id:"dist-imam" denom:"3" num:"2"
              @hasRecipient <waqf:ent/office/world:imam-ayyub>
              @inExchangeFor
                X2 id:"prayer-imam" type:"prayer"
                  @dutyTarget
                    Y1 "text" id:"pid1" schedule:"sat" type:"m"
          @hasChildNode
            DF3 id:"dist-muezzin" denom:"3" num:"1"
              @hasRecipient <waqf:ent/office/world:muezzin-ayyub>
              @inExchangeFor
              X2 id:"prayer-muezzin" type:"prayer"
                @dutyTarget
                  Y2 "text" id:"pid2" schedule:"fri" type:"m"
          @hasChildNode
            DR3 id:"dist-fuqara-symbolic"
              @hasRecipient <waqf:ent/group/world:fuqara>
"""}

suite.parseString(TESTCASES["basic"], parseAll = True)
# Packrat breaks the following
suite.parseString(TESTCASES["packrat"], parseAll = True)
# The following takes extremely long
suite.parseString(TESTCASES["slow"], parseAll = True) 

Thread: [Pyparsing] Slow parsing with indentedBlock()

pyparsing-users