Thread: [Pyparsing] Slow parsing with indentedBlock()
Brought to you by:
ptmcg
From: Philipp R. <phi...@gm...> - 2009-11-15 16:55:22
|
Hi guys, I have a problem with indentedBlock() and a relatively simple grammar that relies on indentation and recursion. I'm seeing extremely long execution times. On one example that I've attached it doesn't terminate within an hour. The trick suggested elswehere with commenting out a FollowedBy statement in Pyparsing (line 3633) doesn't work, as it breaks the grammar. Packrat parsing doesn't work either; it appears to be incompatible with indentedBlock(). I've attached a test case that gets broken by enabling Packrat. I'm using Pyparsing 1.5.2 on Python 2.6.4. I'd be extremely grateful for hints how to redesign my grammar so that I get more controlled execution times. The grammar itself isn't really intricate. It's for an obscure data representation format for entity-relationship data. It works around nested entity and relationship declarations. Entity-relationship dependencies are marked by indentation. Entities are declared by entity names followed by a number. In addition, URI references can appear as entities, either inline in angle brackets, or using a one-time external URI declaration. The grammar is fairly self-explanatory; here's the basics in BNF: entity-reference: (entity-name number) | (<inlineURI>) external-URI-decl: entity-reference = <inlineURI> entity-declaration: entity-reference ["parameter"] [par_n:value_n]+ inline-relation: @relation-name [par_n:value_n] entity-reference multiline-relation: @relation_name [par_n:value_n] NEWLINE INDENT entity-declaration-block UNDENT entity-declaration-block: entity-declaration NEWLINE INDENT [inline-relation|multiline-relation]* UNDENT relationClause: entity-reference [inline-relation|multiline-relation]+ suite: [entity-declaration-block | relationClause | external-URI-decl]+ I've attached a (simple but rather verbose) Pyparsing version of the grammar, as well as a couple of test cases. Execution time seems to go up extremaly fast as things become more indented. Thanks, Philipp # ================= GRAMMAR DEFINITION ==================== import pyparsing as pp #pp.ParserElement.enablePackrat() indentStack = [1] colon = pp.Literal(u":") equal = pp.Literal(u"=") relMarker = pp.Literal(u"@") srelMarker = pp.Literal(u"$") subrelMarker = pp.Literal(u"_") entityName = pp.Word(pp.alphas) entityNumber = pp.Word(pp.nums).setParseAction(lambda t: int(t[0])) entityRef = pp.Group(pp.Combine(entityName + entityNumber) ).setResultsName("entityRef") inlineURI = pp.Group(pp.QuotedString(u'<', escQuote=None, multiline=False, endQuoteChar=u">")).setResultsName("inlineURI") externalURI = pp.Group( entityRef + equal.suppress() + inlineURI ).setResultsName("externalURI") relationName = pp.Word(pp.alphanums+"-") relationRef = pp.Combine(relMarker + relationName) superrelationRef = pp.Combine(srelMarker + relationName) subrelationRef = pp.Combine(subrelMarker + relationName) parameterName = pp.Word(pp.alphanums+"-").setResultsName("parameterName") parValueQuote = pp.QuotedString( u'"',u"\\",escQuote=None, multiline=True ).setResultsName("parameterValue") parValueDirect = pp.Word( pp.alphanums+"-_" ).setResultsName("parameterValue") parameterStmt = pp.Group( parameterName + colon.suppress() + (parValueQuote | parValueDirect) ).setResultsName("parameterStmt") parameterBlock = pp.Group( pp.OneOrMore( parameterStmt ) ).setResultsName("parameterBlock") superparameter = pp.QuotedString( u'"',u"\\",escQuote=None, multiline=True ).setResultsName("superparameter") entityDecl = pp.Group( entityRef + (parameterBlock | superparameter + pp.Optional(parameterBlock)) ).setResultsName("entityDecl") entityDeclBlock = pp.Forward() relationInline = pp.Group( relationRef + pp.Optional(parameterBlock) + (inlineURI | entityRef) ).setResultsName("relationInline") relationMultiline = pp.Group( relationRef + pp.Optional(parameterBlock) + pp.indentedBlock( entityDeclBlock, indentStack).setResultsName("indentedBlock") ).setResultsName("relationMultiline") entityDeclBlock << pp.Group( entityDecl + pp.Optional ( pp.indentedBlock( relationMultiline | relationInline, indentStack).setResultsName("indentedBlock") ) ).setResultsName("entityDeclBlock") relationClause = pp.Group( (inlineURI | entityRef) + (relationInline | relationMultiline # ) ).setResultsName("relationClause") suite = pp.OneOrMore ( relationClause | externalURI | entityDeclBlock ).setResultsName("suite").setDebug() # ================= TEST CASES ==================== TESTCASES = { "basic": u"""\ A1 par:value @rel relpar:value A2 @rel A3 entpar:value @rel relpar:value A4 entpar:value @rel <some:URI/reference> A2=<some:URI/reference> """, "packrat": u"""\ C1 id:"mutawalli" type:"appointment" @refersToAppointmentTarget <waqf:ent/office/d408:d408-mutawalli> @regulatesSuccession AO1 id:"muezzinAppointment" @hasAppointee <waqf:ent/office/world:muezzin-ayyub> C2 id:"revenue" type:"revenue" @refersToIncome <waqf:ent/payable/d408:waqfIncome> @regulatesDistribution D1 id:"distStart" @hasChildNode DF1 id:"haqq at-tawliya" denom:"10" num:"1" @hasRecipient <waqf:ent/office/d408:d408-mutawalli> DR1 id:"remaining90percent" @hasRecipient <waqf:ent/office/d408:d408-mutawalli> @inExchangeFor <waqf:ent/duty/d408:d408-light> @inExchangeFor <waqf:ent/duty/d408:d408-furniture>""", "slow": u"""\ DT1 id:"dist-taxation" tax:"xaraj" @hasChildNode D1 id:"dist-repair" @hasRecipient <waqf:ent/office/d405:mutawalli-405> @inExchangeFor X1 id:"graverepair" type:"repair" @dutyTarget <waqf:ent/structure/d405:grave-waqif> @hasChildNode DR1 id:"dist-remainder" @hasChildNode DF1 id:"dist-haqqattawliya" denom:"10" num:"1" @hasRecipient <waqf:ent/office/d405:mutawalli-405> @hasChildNode DR2 id:"dist-remainder2" @hasChildNode DF2 id:"dist-imam" denom:"3" num:"2" @hasRecipient <waqf:ent/office/world:imam-ayyub> @inExchangeFor X2 id:"prayer-imam" type:"prayer" @dutyTarget Y1 "text" id:"pid1" schedule:"sat" type:"m" @hasChildNode DF3 id:"dist-muezzin" denom:"3" num:"1" @hasRecipient <waqf:ent/office/world:muezzin-ayyub> @inExchangeFor X2 id:"prayer-muezzin" type:"prayer" @dutyTarget Y2 "text" id:"pid2" schedule:"fri" type:"m" @hasChildNode DR3 id:"dist-fuqara-symbolic" @hasRecipient <waqf:ent/group/world:fuqara> """} suite.parseString(TESTCASES["basic"], parseAll = True) # Packrat breaks the following suite.parseString(TESTCASES["packrat"], parseAll = True) # The following takes extremely long suite.parseString(TESTCASES["slow"], parseAll = True) |
From: spir <den...@fr...> - 2009-11-15 19:38:13
|
Le Sun, 15 Nov 2009 17:51:00 +0100, Philipp Reichmuth <phi...@gm...> s'exprima ainsi: > I have a problem with indentedBlock() and a relatively simple grammar that > relies on indentation and recursion. I'm seeing extremely long execution > times. On one example that I've attached it doesn't terminate within an > hour. May be useful: I do not parse anymore indented structure, instead systematically preprocess to transform it into delimited structure (say, C style). The reason is complication of the grammar and state-dependance. I have a pair of tool funcs that "transcode" in both directions (can send if you like). It's easy as long as you can rely on indentation to be consistent (which is not necessary true in eg python code). So, 1 1 2 2 3 2 ==> 1 1 { 2 2 { 3 } 2 } or 1 1 { 2 2 { 3 } 2 } Just need to find delimiters that cannot clash against possible content. (I chose to have delimiters on their own line for this reason and also cause it's easier to check visually, maybe also to parse, I guess.) But think that your performance issues may come in great part from recursive patterns, not only from indented structure. Denis -------------------------------- * la vita e estrany * http://spir.wikidot.com/ |
From: Paul M. <pt...@au...> - 2009-11-16 08:02:13
|
> I have a problem with indentedBlock() and a relatively simple grammar that > relies on indentation and recursion. I'm seeing extremely long execution > times. On one example that I've attached it doesn't terminate within an > hour. > Yikes! If this is a simple grammar, you do some crazy parsing! You have so much going on in this grammar, have you thought of using something a little simpler, but just as self-descriptive, like JSON perhaps? I know it uses delimiters instead of indentation, but pyparsing seems to be really struggling with this input text. You might also look at some of the indented examples on the pyparsing wiki, and roll your own version of indentedBlock, to see if you can make some better headway. -- Paul |
From: Philipp R. <phi...@gm...> - 2009-11-16 14:53:46
|
Am Mon, 16 Nov 2009 02:02:01 -0600 schrieb Paul McGuire: > You have so much going on in this grammar, have you thought of using > something a little simpler, but just as self-descriptive, like JSON perhaps? It's a legacy format, unfortunately. I actually worked through most of the indented examples already. I'm also somewhat new to pyparsing, been doing some tricky things with it, but not to the point where I feel really confident. I was kinda hoping for some pointers to tweak the existing grammar, but that seems to be difficult. Either way you now have a test case for indentedBlock() breaking Packrat, and for those indentedBlock() recursion problems that people were reporting some time ago ;) Philipp |
From: Philipp R. <phi...@gm...> - 2009-11-16 14:59:14
|
Am Sun, 15 Nov 2009 20:37:59 +0100 schrieb spir: > May be useful: I do not parse anymore indented structure, instead > systematically preprocess to transform it into delimited structure (say, > C style). The reason is complication of the grammar and > state-dependance. I see the point. I'll think if I can preprocess the source text to avoid using indentedBlock(). > I have a pair of tool funcs that "transcode" in both directions (can send > if you like). It's easy as long as you can rely on indentation to be > consistent (which is not necessary true in eg python code). If you could send me those, I'd be grateful. From what I've seen so far, indentation seems to be fairly consistent. I have some cases that look like this: entity 1... @relation 1... entity 2... @relation 3... @relation 4... @relation 5... But those should be easy to catch. The problem seems indeed to be the combination of indentedBlock() and recursion - indentedBlock() currently uses a lookahead mechanism that seems to lead to exponential branching in the parse tree under some conditions. Philipp |
From: spir <den...@fr...> - 2009-11-17 08:38:32
|
Le Mon, 16 Nov 2009 15:58:35 +0100, Philipp Reichmuth <phi...@gm...> stated: > Am Sun, 15 Nov 2009 20:37:59 +0100 schrieb spir: > > May be useful: I do not parse anymore indented structure, instead > > systematically preprocess to transform it into delimited structure (say, > > C style). The reason is complication of the grammar and > > state-dependance. > > I see the point. I'll think if I can preprocess the source text to avoid > using indentedBlock(). > > > I have a pair of tool funcs that "transcode" in both directions (can send > > if you like). It's easy as long as you can rely on indentation to be > > consistent (which is not necessary true in eg python code). > > If you could send me those, I'd be grateful. From what I've seen so far, > indentation seems to be fairly consistent. > I have some cases that look like > this: > > entity 1... > @relation 1... > entity 2... > @relation 3... > @relation 4... > @relation 5... > > But those should be easy to catch. How should it be? (what should be indented in respect to what?) > The problem seems indeed to be the combination of indentedBlock() and > recursion - indentedBlock() currently uses a lookahead mechanism that seems > to lead to exponential branching in the parse tree under some conditions. > > Philipp Here is the tool. Try it first on various typical substrings of your source. If works as expected, should be a major boost (and simplication of your grammar as well). (Note: the funcs expects indent level 0 at start of source -- just realize this now.) Denis =================================================== ### indented <--> wrapped structure # tool funcs def howManyAtStart(text, string): ''' how many times a (sub)string appears at start of text ''' pos = 0 n = 0 length = len(string) while text[pos:].startswith(string): pos += length n += 1 return n def indentMark(lines): ''' find & return indentation mark ~ either TAB or n spaces ~ must be consistent ''' for line in lines: if line.strip() == '': continue if line[0] == TAB: return TAB n = howManyAtStart(line, SPC) if n > 0: return n * SPC return None def WrapIndentedStructure( source, INDENT=None, OPEN="{\n", CLOSE="}\n", keepIndent=False ): ''' Transform indented to wrapped structure. ~ Indentation must be consistent! ~ If INDENT not given, set to the first start-of-line whitespace. ~ Indentation can be kept: nicer & more legible result but needs to be coped with during parsing. ~ Blank lines are ignored & left as is (else problematic). ''' level = 0 # current indent level # add artificial EOFile marker source += EOF + EOL lines = source.splitlines() # find 'INDENT' indentation mark if not given if INDENT is None: INDENT = indentMark(lines) # case no indent at all in source if INDENT is None: return source # find indent level *changes* & replace them with tokens result = "" length = len(INDENT) for (i,line) in enumerate(lines): # skip blank line if line.strip() == '': if keepIndent: result += level*INDENT + EOL else: result += EOL continue # get offset: difference of indentation if line == EOF: line = '' offset = howManyAtStart(line, INDENT) - level # case no indent level change if offset == 0: result += line + EOL # case indent level increment (+1) elif offset == 1: level += 1 open_mark = (INDENT*level + OPEN) if keepIndent else OPEN if not keepIndent: line = line[length:] result += open_mark + line + EOL # case indent level decrement (<= current level) elif offset < 0: offset = -offset level -= offset if keepIndent: close_marks = "" for n in range(level+offset, level, -1): close_marks += (n*INDENT + CLOSE) else: close_marks = offset * CLOSE line = line[offset*length:] result += close_marks + line + EOL else: # case indent level inconsistency (increment > 1) message = "Inconsistent indentation at line #%s" \ " (increment > 1):\n%s" % (i,line) raise ValueError(message) return result def IndentWrappedStructure(source, INDENT=' ', open="{",close="}"): ''' Transform wrapped to indented structure. ~ Wrapping must be consistent! ~ open/close tokens must be on their own line! ''' EOL = '\n' result = "" (pos,level) = (0,0) # current pos in text & indentation level lines = source.splitlines() for (i,line) in enumerate(lines): # case open if line.strip() == open: level += 1 # case close elif line.strip() == close: if level == 0: message = "Inconsistent indentation at line #%s" \ " (decrement under zero):\n%s" % (i,line) raise ValueError(message) level -= 1 # else record line with proper indentation else: result += level*INDENT + line.lstrip() + EOL return result ####### test ####### def testWrapIndent(): # erroneous example source = """\ 0 0 1 3 2 1 0 """ print "\n=== wrap indented blocks (erroneous case) in source:\n%s\n"\ % (source) try: print WrapIndentedStructure(source, INDENT=None, keepIndent=True) except ValueError,e: print e # correct example source = """\ 0 0 1 1 2 2 3 3 4 5 6 3 3 1 1 0 0 """ print "\n=== wrap indented blocks (keeping indent) in source:\n%s\n"\ % (source) result= WrapIndentedStructure(source, keepIndent=True) print result print "\n=== reindent same source" print IndentWrappedStructure(result) def test(): #~ testNormalize() #~ print RULER testWrapIndent() =================================================== -------------------------------- * la vita e estrany * http://spir.wikidot.com/ |
From: Philipp R. <phi...@gm...> - 2009-11-20 20:46:51
|
Am Mon, 16 Nov 2009 18:06:31 +0100 schrieb spir: > Here is the tool. Try it first on various typical substrings of your > source. If works as expected, should be a major boost (and simplication > of your grammar as well). Thanks. I rewrote it to support variable-length indentation using an indent stack approach, as I apparently have quite a bit of data in formats such as this: 1 2 3 3 3 2 That seems to be just as stable and catches the indentation at level 0 thing as well. The grammar is more or less the same, replacing indentedBlock() instances with OPEN + OneOrMore + CLOSE instances, but parsing times are orders of magnitude better. Thanks, Philipp |