Re: [Pyparsing] Slow parsing with indentedBlock()

On Mon, 16 Nov 2009 15:58:35 +0100,
Philipp Reichmuth <phi...@gm...> wrote:

> On Sun, 15 Nov 2009 20:37:59 +0100, spir wrote:
> > This may be useful: I no longer parse indented structure directly; instead
> > I systematically preprocess it into a delimited structure (say,
> > C style). The reason is that indentation complicates the grammar and
> > makes it state-dependent.
> 
> I see the point. I'll check whether I can preprocess the source text to
> avoid using indentedBlock().
> 
> > I have a pair of tool funcs that "transcode" in both directions (I can send
> > them if you like). It's easy as long as you can rely on indentation being
> > consistent (which is not necessarily true in e.g. Python code).
> 
> If you could send me those, I'd be grateful. From what I've seen so far,
> indentation seems to be fairly consistent.
> I have some cases that look like
> this:
> 
> entity 1...
>  @relation 1...
>   entity 2...
>       @relation 3...
>      @relation 4...
>    @relation 5...
> 
> But those should be easy to catch.

How should this be structured? (What should be indented with respect to what?)

> The problem seems indeed to be the combination of indentedBlock() and
> recursion - indentedBlock() currently uses a lookahead mechanism that seems
> to lead to exponential branching in the parse tree under some conditions.
> 
> Philipp

Here is the tool. Try it first on various typical fragments of your source. If it works as expected, it should give a major speed boost (and simplify your grammar as well).
(Note: the funcs expect indent level 0 at the start of the source; I just realized this now.)
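
If a fragment does not start at level 0, one option is to strip the common leading whitespace before calling the tool. A minimal sketch using the standard library's textwrap.dedent; the helper name to_level_zero is made up for illustration, and it assumes the least-indented line of the fragment is its top level:

```python
import textwrap

def to_level_zero(fragment):
    """Remove the whitespace prefix common to all non-blank lines,
    so that the least-indented line ends up at indent level 0."""
    return textwrap.dedent(fragment)

# a fragment cut out of the middle of a larger source (hypothetical data)
fragment = "    entity 2\n      @relation 3\n      @relation 4\n"
print(to_level_zero(fragment))
```

Note that dedent only removes a prefix shared by every non-blank line, so a fragment mixing TABs and spaces may come back unchanged.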

Denis

===================================================
### indented <--> wrapped structure

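# constants (not included in the original post; values assumed here)
TAB = '\t'
SPC = ' '
EOL = '\n'
EOF = '<EOF>'   # artificial end-of-file marker; must not occur in source

# tool funcs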
def howManyAtStart(text, string):
    ''' how many times a (sub)string appears at start of text '''
    pos = 0
    n = 0
    length = len(string)
    while text[pos:].startswith(string):
        pos += length
        n += 1
    return n

def indentMark(lines):
    ''' find & return indentation mark
        ~ either TAB or n spaces
        ~ must be consistent '''
    for line in lines:
        if line.strip() == '':
            continue
        if line[0] == TAB:
            return TAB
        n = howManyAtStart(line, SPC)
        if n > 0:
            return n * SPC
    return None

def WrapIndentedStructure(  source,
                            INDENT=None,
                            OPEN="{\n", CLOSE="}\n",
                            keepIndent=False ):
    ''' Transform indented to wrapped structure.
        ~ Indentation must be consistent!
        ~ If INDENT not given, set to the first start-of-line whitespace.
        ~ Indentation can be kept: nicer & more legible result
          but needs to be coped with during parsing.
        ~ Blank lines are ignored & left as is (else problematic). '''
    level = 0       # current indent level

    lines = source.splitlines()

    # find 'INDENT' indentation mark if not given
    if INDENT is None:
        INDENT = indentMark(lines)
    # case no indent at all in source: return it unchanged
    if INDENT is None:
        return source

    # add artificial EOFile marker as a last line of its own,
    # so that blocks still open at end of source get closed
    lines.append(EOF)

    # find indent level *changes* & replace them with tokens
    result = ""
    length = len(INDENT)
    for (i,line) in enumerate(lines):
        # skip blank line
        if line.strip() == '':
            if keepIndent:
                result += level*INDENT + EOL
            else:
                result += EOL
            continue
        # get offset: difference of indentation
        if line == EOF: line = ''
        offset = howManyAtStart(line, INDENT) - level
        # case no indent level change
        if offset == 0:
            if not keepIndent:
                # strip the whole indentation of the current level
                line = line[level*length:]
            result += line + EOL
        # case indent level increment (+1)
        elif offset == 1:
            level += 1
            open_mark = (INDENT*level + OPEN) if keepIndent else OPEN
            if not keepIndent:
                line = line[level*length:]
            result += open_mark + line + EOL
        # case indent level decrement (<= current level)
        elif offset < 0:
            offset = -offset
            level -= offset
            if keepIndent:
                close_marks = ""
                for n in range(level+offset, level, -1):
                    close_marks += (n*INDENT + CLOSE)
            else:
                close_marks = offset * CLOSE
                # strip only the indentation actually present (new level)
                line = line[level*length:]
            result += close_marks + line + EOL
        else:
            # case indent level inconsistency (increment > 1)
            message = "Inconsistent indentation at line #%s" \
                        " (increment > 1):\n%s" % (i,line)
            raise ValueError(message)
    return result

def IndentWrappedStructure(source, INDENT='    ', open="{",close="}"):
    ''' Transform wrapped to indented structure.
        ~ Wrapping must be consistent!
        ~ open/close tokens must be on their own line! '''
    EOL = '\n'
    result = ""
    (pos,level) = (0,0)         # current pos in text & indentation level
    lines = source.splitlines()
    for (i,line) in enumerate(lines):
        # case open
        if line.strip() == open:
            level += 1
        # case close
        elif line.strip() == close:
            if level == 0:
                message = "Inconsistent indentation at line #%s" \
                            " (decrement under zero):\n%s" % (i,line)
                raise ValueError(message)
            level -= 1
        # else record line with proper indentation
        else:
            result += level*INDENT + line.lstrip() + EOL
    return result

####### test #######
def testWrapIndent():

    # erroneous example
    source = """\
0
0
  1
      3
    2
  1
0
"""
    # print(...) with one argument and "except ... as" run on Python 2.6+ and 3
    print("\n=== wrap indented blocks (erroneous case) in source:\n%s\n"
          % (source))
    try:
        print(WrapIndentedStructure(source, INDENT=None, keepIndent=True))
    except ValueError as e:
        print(e)

    # correct example
    source = """\
0
0
  1
  1
    2
    2
      3
      3
        4
          5
            6
      3
      3
  1
  1
0
0
"""
    print("\n=== wrap indented blocks (keeping indent) in source:\n%s\n"
          % (source))
    result = WrapIndentedStructure(source, keepIndent=True)
    print(result)
    print("\n=== reindent same source")
    print(IndentWrappedStructure(result))

def test():
#~     testNormalize()
#~     print RULER
    testWrapIndent()

if __name__ == "__main__":
    test()

===================================================
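
Once the source is in wrapped form, the nesting can be read back in a single pass with a stack, with no lookahead involved. A rough sketch, not part of the tool above: parse_wrapped is a name made up for illustration, and it assumes every open/close token sits on its own line, as WrapIndentedStructure produces with its default OPEN/CLOSE.

```python
def parse_wrapped(text, open_tok="{", close_tok="}"):
    """Turn wrapped text into nested lists: each block becomes a sub-list."""
    root = []
    stack = [root]                  # stack[-1] is the block being filled
    for number, line in enumerate(text.splitlines(), 1):
        item = line.strip()
        if item == "":
            continue                # blank lines carry no structure
        if item == open_tok:
            block = []
            stack[-1].append(block)
            stack.append(block)
        elif item == close_tok:
            if len(stack) == 1:
                raise ValueError("unmatched close token at line %d" % number)
            stack.pop()
        else:
            stack[-1].append(item)
    if len(stack) != 1:
        raise ValueError("unclosed block at end of text")
    return root

# hand-written wrapped sample (hypothetical data)
wrapped = "entity 1\n{\n@relation 1\nentity 2\n{\n@relation 3\n}\n}\nentity 3\n"
print(parse_wrapped(wrapped))
```

In a pyparsing grammar the same role would be played by an ordinary delimited, recursive rule instead of indentedBlock(); the sketch only shows that the delimited form needs no indentation state.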

--------------------------------
* la vita e estrany *

http://spir.wikidot.com/