Re: [Pyparsing] Slow parsing with indentedBlock()
From: spir <den...@fr...> - 2009-11-17 08:38:32
On Mon, 16 Nov 2009 15:58:35 +0100, Philipp Reichmuth <phi...@gm...> wrote:

> On Sun, 15 Nov 2009 20:37:59 +0100, spir wrote:
> > May be useful: I no longer parse indented structure; instead I
> > systematically preprocess to transform it into delimited structure (say,
> > C style). The reason is complication of the grammar and
> > state dependence.
>
> I see the point. I'll think about whether I can preprocess the source text
> to avoid using indentedBlock().
>
> > I have a pair of tool funcs that "transcode" in both directions (can send
> > if you like). It's easy as long as you can rely on indentation to be
> > consistent (which is not necessarily true in e.g. Python code).
>
> If you could send me those, I'd be grateful. From what I've seen so far,
> indentation seems to be fairly consistent.
> I have some cases that look like
> this:
>
> entity 1...
> @relation 1...
> entity 2...
> @relation 3...
> @relation 4...
> @relation 5...
>
> But those should be easy to catch.

How should it be? (What should be indented with respect to what?)

> The problem seems indeed to be the combination of indentedBlock() and
> recursion - indentedBlock() currently uses a lookahead mechanism that seems
> to lead to exponential branching in the parse tree under some conditions.
>
> Philipp

Here is the tool. Try it first on various typical substrings of your source. If it works as expected, it should be a major boost (and a simplification of your grammar as well).
(Note: the funcs expect indent level 0 at the start of the source -- I just realized this now.)

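Since the funcs rely on consistent indentation and on the source starting at indent level 0, a quick pre-flight check on a substring may be worth running first. What follows is only a sketch: `checkIndentation` is a hypothetical helper, not part of the tool, and it assumes that indentation should use a single kind of whitespace character (TABs only or spaces only).

```python
def checkIndentation(source):
    '''
    Hypothetical pre-flight check (not part of the tool).
    Return a list of (line number, problem) pairs; empty if nothing found.
    Assumes indentation uses one kind of whitespace character only.
    '''
    problems = []
    seen_first = False      # has the first non-blank line been seen?
    used = set()            # distinct whitespace chars found in indentation
    for (i, line) in enumerate(source.splitlines(), 1):
        if line.strip() == '':
            continue        # blank lines carry no structure
        indent = line[:len(line) - len(line.lstrip())]
        if not seen_first:
            seen_first = True
            if indent:
                problems.append(
                    (i, "source does not start at indent level 0"))
        used.update(indent)
        if len(used) > 1:
            problems.append(
                (i, "indentation mixes different whitespace characters"))
            break           # one report of mixing is enough
    return problems
```

An empty list only means that these two assumptions hold; an increment of more than one level is still for the wrapping func itself to report.
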
Denis

===================================================
### indented <--> wrapped structure

# Constants used by the funcs below. Their values are not given in this
# message, so the ones here are assumptions.
TAB = '\t'
SPC = ' '
EOL = '\n'
EOF = '<EOF>'   # artificial end-of-file marker; must not occur as a source line

# tool funcs
def howManyAtStart(text, string):
    ''' how many times a (sub)string appears at start of text '''
    pos = 0
    n = 0
    length = len(string)
    while text.startswith(string, pos):
        pos += length
        n += 1
    return n

def indentMark(lines):
    ''' find & return indentation mark
        ~ either TAB or n spaces
        ~ must be consistent
        '''
    for line in lines:
        if line.strip() == '':
            continue
        if line[0] == TAB:
            return TAB
        n = howManyAtStart(line, SPC)
        if n > 0:
            return n * SPC
    return None

def WrapIndentedStructure(source, INDENT=None,
                          OPEN="{\n", CLOSE="}\n", keepIndent=False):
    ''' Transform indented to wrapped structure.
        ~ Indentation must be consistent!
        ~ If INDENT not given, set to the first start-of-line whitespace.
        ~ Indentation can be kept: nicer & more legible result
          but needs to be coped with during parsing.
        ~ Blank lines are ignored & left as is (else problematic).
        '''
    level = 0       # current indent level
    lines = source.splitlines()
    # find 'INDENT' indentation mark if not given
    if INDENT is None:
        INDENT = indentMark(lines)
    # case no indent at all in source: return it unchanged
    if INDENT is None:
        return source
    # add artificial EOFile marker as a last line of its own,
    # so that blocks still open at the end get closed
    lines.append(EOF)
    # find indent level *changes* & replace them with tokens
    result = ""
    length = len(INDENT)
    for (i, line) in enumerate(lines):
        # skip blank line
        if line.strip() == '':
            if keepIndent:
                result += level*INDENT + EOL
            else:
                result += EOL
            continue
        # get offset: difference of indentation
        if line == EOF:
            line = ''
        indent = howManyAtStart(line, INDENT)
        offset = indent - level
        # unless indentation is kept, drop exactly the line's own indentation
        text = line if keepIndent else line[indent*length:]
        # case no indent level change
        if offset == 0:
            result += text + EOL
        # case indent level increment (+1)
        elif offset == 1:
            level += 1
            open_mark = (INDENT*level + OPEN) if keepIndent else OPEN
            result += open_mark + text + EOL
        # case indent level decrement (<= current level)
        elif offset < 0:
            offset = -offset
            level -= offset
            if keepIndent:
                close_marks = ""
                for n in range(level+offset, level, -1):
                    close_marks += (n*INDENT + CLOSE)
            else:
                close_marks = offset * CLOSE
            result += close_marks + text + EOL
        else:
            # case indent level inconsistency (increment > 1)
            message = "Inconsistent indentation at line #%s" \
                      " (increment > 1):\n%s" % (i+1, line)
            raise ValueError(message)
    return result

def IndentWrappedStructure(source, INDENT=' ', open="{", close="}"):
    ''' Transform wrapped to indented structure.
        ~ Wrapping must be consistent!
        ~ open/close tokens must be on their own line!
        '''
    EOL = '\n'
    result = ""
    level = 0       # current indentation level
    lines = source.splitlines()
    for (i, line) in enumerate(lines):
        # case open
        if line.strip() == open:
            level += 1
        # case close
        elif line.strip() == close:
            if level == 0:
                message = "Inconsistent indentation at line #%s" \
                          " (decrement under zero):\n%s" % (i+1, line)
                raise ValueError(message)
            level -= 1
        # else record line with proper indentation
        else:
            result += level*INDENT + line.lstrip() + EOL
    return result

####### test #######
def testWrapIndent():
    # erroneous example
    # (each number is the indent level of its line;
    #  an indent unit of 4 spaces is assumed here)
    source = """\
0
0
    1
            3
        2
    1
0
"""
    print("\n=== wrap indented blocks (erroneous case) in source:\n%s\n"
          % (source))
    try:
        print(WrapIndentedStructure(source, INDENT=None, keepIndent=True))
    except ValueError as e:
        print(e)
    # correct example
    source = """\
0
0
    1
    1
        2
        2
            3
            3
                4
                    5
                        6
            3
            3
    1
    1
0
0
"""
    print("\n=== wrap indented blocks (keeping indent) in source:\n%s\n"
          % (source))
    result = WrapIndentedStructure(source, keepIndent=True)
    print(result)
    print("\n=== reindent same source")
    print(IndentWrappedStructure(result))

def test():
    #~ testNormalize()
    #~ print RULER
    testWrapIndent()
===================================================

--------------------------------
* la vita e estrany *

http://spir.wikidot.com/
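
Once the source is wrapped, the nesting can be read back without tracking any indentation state, which is the point of the preprocessing. As a rough illustration only (plain Python rather than a pyparsing grammar; `readWrapped` is a hypothetical name, not part of the tool), assuming the default `{` and `}` tokens each sit alone on their own line, as the wrapping func writes them:

```python
def readWrapped(text, open_mark="{", close_mark="}"):
    '''
    Hypothetical reader (not part of the tool):
    turn wrapped text into nested lists, one list per block.
    Assumes open/close marks each sit alone on their own line.
    '''
    root = []
    stack = [root]                      # stack[-1] is the block being filled
    for line in text.splitlines():
        item = line.strip()
        if item == '':
            continue                    # blank lines carry no structure
        if item == open_mark:
            block = []
            stack[-1].append(block)     # nest a new block in the current one
            stack.append(block)
        elif item == close_mark:
            if len(stack) == 1:
                raise ValueError("close mark without matching open mark")
            stack.pop()
        else:
            stack[-1].append(item)
    if len(stack) != 1:
        raise ValueError("open mark without matching close mark")
    return root

wrapped = "a\n{\nb\n{\nc\n}\nd\n}\ne\n"
print(readWrapped(wrapped))     # prints ['a', ['b', ['c'], 'd'], 'e']
```

The reader only ever looks at the current line and a stack, so no lookahead is involved; a real grammar for the wrapped text should be able to rely on the same property.
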