how to keep track of indentation

  • Julian

    Julian - 2007-03-10

    Hi there-

    I've used pyparsing in the past (~2 years ago), and it did a fantastic job (thanks!)
    However, that time I didn't care about white space.
    Now the revenge of the new log files...

    I have a format like:
    top level value : 100
      sub section value 1 = 22
      sub section value 2 = 23
        sub-sub section value 1: 19

    top level value b = 23
      sub section value 3 : 12
      sub section value 1 = 22

    I need to extract each section (back to zero indentation), and then the 'level number' for each enclosed lines.  There are no rules for indentation, other than if it increases compared to the previous line, it's a new level.  (And it never jumps/skips levels.)
    I also need to extract the text stripped of beginning/end white space, but preserve white space inside the value label.  I also need the same for the values after the ":" or "=".

    A top level section may have no sub sections, so if the parser saw:
    statistic one: 1
    statistic two=23

    It needs to see each line as independant sections.

    The parser itself doesn't have to give me numeric values for indent/level: structures inside structures are fine, so long as I can infer this information by traversing the data structure(s).

    I think I could figure things out from last time I had to parse log files... other than the indentation management.

    If anyone can supply some hints (even just knowing that this is really outside of pyparsing's domain), I'd be very grateful.

    (I could just write a custom method, but my past experience was once I got the basic pyparsing grammar working, it turned out to be more self documenting and easier to maintain/extend.)

    Again, any help greatly appreciated.



    • Eike Welk

      Eike Welk - 2007-03-11

      Hello Julian!

      I haven't used grammars where indentation matters myself (yet). There is however an example of such a grammar in the 'examples' directory of the source package:

      Regards, Eike.

      • Paul McGuire

        Paul McGuire - 2007-03-11

        Eike -

        Thanks for reminding me that that example exists.  I was just going through my drawer of pyparsing examples, and was looking for one to post on the wiki (for some reason, isn't out there - I'll fix that posthaste).  Plus, I'm really happy that pyparsing is starting to get something of a community going, where other pyparsing users chime in on posted questions.

        Julian - yes, this example is about the state-of-the-art for indentation-based grammars with pyparsing, and requires parse actions to do the indentation validation.  Note the use of ParseFatalException to flag illegal nesting.  Unlike ParseException, ParseFatalException stops all parsing immediately.

        The biggest trick in this type of parser is getting the whitespace-skipping right, and this is handled in the example using the magic empty's and parse actions in INDENT and UNDENT.  If this doesn't work in your application, try disabling \n as whitespace by setting the default whitespace set to just '\t ' using ParserElement.setDefaultWhitespace before defining any parser expressions.  Then put explicit LineEnd()'s in your grammar where appropriate.  You'll still need to validate the indent level with parse actions, though.

        I've considered adding Indent and Dedent as pyparsing classes, but I am still not completely happy enough with them to make them part of the distributed class library, so I'm just leaving this in the examples directory for now (and will post to the wiki in just a minute).

        Post back and let us know how this works out for you, and if you end up having another good example, I'd be happy to include it in the distribution.

        -- Paul

    • Paul McGuire

      Paul McGuire - 2007-03-11

      (Also, I took the liberty of adding your closing parenthetical comment to the pyparsing wiki Feedback page - you are my latest poster child!)

      -- Paul

    • Julian

      Julian - 2007-03-14

      Thanks for the suggestions Eike and Paul.

      I have a couple of questions about the example, and also a couple about my specific task.

      First, from the example:


      indentStack = [1]

      def checkPeerIndent(s,l,t):
          curCol = col(l,s)
          if curCol != indentStack[-1]:
      ****    if (not indentStack) or curCol > indentStack[-1]:
                  raise ParseFatalException(s,l,"illegal nesting")
              raise ParseException(s,l,"not a peer entry")

      In the *'d line, is the (not indentStack) redundant- indentStack can only be False if empty, in which case the previous line would have raised an exception, wouldn't it?  This would also imply you've un-indented beyond the first indent value (in this case, column 1.)  Or (just as likely ;-), what have I missed?


      def checkUnindent(s,l,t):
          if l >= len(s): return
          curCol = col(l,s)
          if not(curCol < indentStack[-1] and curCol <= indentStack[-2]):
              raise ParseException(s,l,"not an unindent")

      Took me quite a while to understand how this and checkPeerIndent() work together.
      I wasn't sure how this grammar was ensuring that a un-indent aligns with a previously seen indentation level.
      e.g. I was wondering why there wasn't a search against the values in indentStack[].  It wasn't until I realized that that part is really handled by checkPeerIndent() that it became clear(er).  Could a comment explaining how the two work together be added?


      INDENT = lineEnd.suppress() + empty + empty.copy().setParseAction(checkSubIndent)
      UNDENT = FollowedBy(empty).setParseAction(checkUnindent)

      You're right- the empty's are quite magical ;-).  Could you explain what/where they match on/how they're working?
      I can't figure out if they're matching on whitespace or not...

      Excellent example by the way, shows what can be done quite succinctly.

      Now in my particular case, things are a little simpler, as there is no gauranteed peer.  I just need to keep track of the (relative) indent level, and make sure they match levels that have been set before.
      Matching should start at a line that starts at col 0, and should continue up-until the next line that starts at col 0.

      But to make things 'more interesting' I'm going to be supplied a table that looks something like:
      id  string
      1   text for this message
      2   text for some other message


      I want to explicitly match only the specified ids, and also return (possibly only return) the associated id number.

      I think I can set see how I can do it with a generic setParseAction to handle the indentation tracking.
      But is there anyway to include user defined extra elements to setParseAction? i.e. I'll need to dynamically create a parse element for each string above, and want to pass in the id to the parse action call-back.

      Suggestion appreciated!



      PS re adding the comment to the wiki, I'm honored.  Part of my 15 minutes of internet fame ;-).
      And feel free to take away the parenthesis.
      Also,thinking about that comment, I think part of the difficulty is that you see an example that might be close to what you want to implement, with no 'warning' about it's complexity.  Perhaps it'd be good to give the examples a complexity score, just to indicate which ones are better to start with when learning.

      • Paul McGuire

        Paul McGuire - 2007-03-14

        Julian -

        This could actually simplify things considerably.  But first, let me respond to your specific questions:

        1. Yes, the (not indentStack) is redundant in the *'d line.  I think it is a legacy from an earlier version of this grammar, in which I was still deciding whether to initialize indentStack with [1] or just [], so I had to test for a possibly empty indentStack.  I'm inclined to leave it in from a defensive coding standpoint, but you are correct, indentStack should never actually get empty.

        2. Sadly, no one ever accused me of over-commenting my code.  I find that my thinking races ahead when writing Python, and I end up relying pretty heavily on code readability.  I'll try to add some comments to this example before the next release.

        3. empty's are a way to advance past whitespace to the next non-empty character, without actually consuming that character, so you can use empty as a placeholder for checking the column of the next non-whitespace char.  empty always matches, so it's also a good choice for arbitrarily inserting some sort of validating parse action into the grammar.  So here is how the parser works through INDENT:
        a. starting with the end of the previous line, read and suppress the line-ending \n.
        b. the first empty advances past any intervening whitespace
        c. the second empty has no more whitespace to skip, so it matches immediately - specifically, it matches with the correct current location within the input string
        d. the parse action is called to see if the current location represents an subindent beyond the current indentation column; if so, the new indentation column is added to the indentStack; if not, INDENT fails, and reverts the parser back to the previous end of line

        So it looks like your input file is in two parts, the first looks like:

        id  string
        1   text for this message
        2   text for some other message

        perhaps ending with a blank line.  Is this then followed by something like (since SF collapses out leading whitespace, I'm putting in leading '.'s for spaces):

        This should be fairly easy to handle, and may not even require all the parse action intrigue.  You may be better off parsing this second part with just a Word(nums).scanString call, which would report each integer token, and its parsing location, which you could then attach to a hierarchical data structure.  For instance, if this text were in a Python variable 'outline', this code:

        for t,s,e in Word(nums).scanString(outline):
        ....print t[0],col(s,outline)

        will print out:
        1 1
        2 3
        3 3
        4 1
        5 1
        6 3
        3 3
        8 3
        10 5
        2 5
        5 3
        9 5
        11 1

        Hope this fills in some blanks for you, and gives you some additional ideas.

        -- Paul

        • Julian

          Julian - 2007-03-14

          Paul, thanks for your *very* clear explanation.  Empty is a lot more interesting/useful/cool than it's name belies :-)

          (We should also twist your arm to write more documentation, you're good at it ;-).  OTOH,  perhaps that's something we in the user community could also lend a hand in doing... (note to self!))

          Unfortunately my problem hasn't reduced itself to being quite that simple, and I wasn't as clear in my description as I should have been.

          I now will have two files.  The first one will be the table of unique-strings and their id.  The other file is the one I need to parse.  I need to create a grammar that will only accept strings defined in the table, and that have (optional) associated values, and follow the indentation scheme I outlined.  This will be applied to the second file.

          I'd like to programatically define the strings in the grammar by simply iterating over the contents of the string definition table.  The grammar will be required to return a structure indicating the found strings, and their relative indentation levels.  One further, I'd very much like it to also return a structure containing the indentation level and the relative indentation level.  (And that's why I was wondering if  you can include a user-specified parameter to a parse-action, set at the point where the parse action is added.  I was thinking of using that to pass in the associated string-id number.)

          Currently the goal is to parse/reject a summary 'frame' of information that follows the se loose rules.  However I want to ensure that if there are any other strings than those defined in the 'string definition table', the grammar will not accept the frame--- indicating either corruption, or that the table is out of date (and that I need to go talk to one of the other software teams ;-) ).

          Longer term, I want to apply this to the 'real' log file, and pull out successive 'frames', using the grammar to skip regions of junk (other random data/instructions to users/etc).  I remember doing something similar using scanString(?) last time I used pyparsing... (need to find where those scripts I wrote two years ago ended up!)

          Rather than storing all the text, I'm thinking I can instead create a (sqlite based) database like:
          Frame No.|StringId|IndentationLvl|Value

          etc.  (NB- completely made up example.  I have no idea yet how many unique string ids there are.)
          Then we can look for trends etc using straight forward sql queries.  I think could/will open up a lot of potential analysis we just can't really do (reliably) now.

          Plus, they should be quick,after intial parsing, as there's at least a roughly 20:1 compression factor (complete guess)! 8^)

          But the key is... getting the parsing reliable, extendable, and maintainable.

          I hope this makes more sense, it's late, only Tuesday, and already a long week :-/

          Thanks in advance for any thoughts and/or suggestions-




Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

JavaScript is required for this form.

No, thanks