How to limit matches

Julian
2005-02-18
2013-05-14
  • Julian

    Julian - 2005-02-18

    Hi again- this is addictive (if at times frustrating :-) ).

    I've got another type of file I need to capture into data structures for manipulation.

    It looks like:

    [constants]
    name value
    const1 value with spaces
    const2 value with spaces //and a comment
    const3 "value with embedded //"
    const4 "value with embedded //" //and a comment
    const5 "value"//comment with no white space
    const6 "value with embedded //" /* another comment */
    // comment
    /* comment */
    /* comment
    [endconstants]

    White space is irrelevant, except within quotes.
    The application that reads this actually strips all comments first, and then parses the result.  It doesn't handle multi-line C-style comments; EOL is an implicit '*/'.

    To handle quotes, it first checks for a '"' (double quote) character in the current line, then for a second, and then for any '//' characters, and doesn't remove them (and the following text) if they're within the bounds of the quotes.
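    If it helps to see it spelled out, that stripping rule might look something like this in Python (my own sketch of the described behaviour, not the framework's actual code; strip_comments is a name I made up):

```python
def strip_comments(line):
    """Mimic the legacy reader: drop '//...' and '/*...' comments,
    but leave markers alone when they fall between the first pair of
    double quotes.  EOL is an implicit '*/', so one line at a time."""
    first = line.find('"')
    second = line.find('"', first + 1) if first != -1 else -1
    quoted = (first, second) if second != -1 else None

    for marker in ('//', '/*'):
        pos = line.find(marker)
        # Skip any marker that sits inside the quoted span.
        while pos != -1 and quoted and quoted[0] < pos < quoted[1]:
            pos = line.find(marker, pos + 1)
        if pos != -1:
            line = line[:pos]
    return line.rstrip()

print(strip_comments('const4 "value with embedded //" //and a comment'))
# → const4 "value with embedded //"
```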

    I think it should be possible to parse this directly using pyparsing (without stripping comments first).

    Ignoring quotes and comments on declaration lines, I've got:

    constantSection = ( CaselessLiteral("[constants]") +
                        ZeroOrMore( ((Literal('//') | Literal('/*')) + restOfLine).suppress() |
                                    (Word(printables) + restOfLine) ) +
                        CaselessLiteral("[endconstants]") )

    Of course this always fails to parse - the ZeroOrMore expression gobbles the "[endconstants]".
    I've tried lots of variations: adding "[endconstants]" as an Optional item, LineEnd(), and all sorts of other things.

    I'm pretty sure I'm just missing how to frame this problem properly- suggestions appreciated!

    Thanks in advance-

    Julian

    PS- the examples are good, but some are not exactly simple python (for a beginner, anyway), or deal with non-trivial parsers.
    Perhaps a few extra comments?
    Any chance you have access to a Wiki?  I'd be willing to participate in documenting some of the examples/create some introductory works (which is where I'm at ;-) ).

     
    • Julian

      Julian - 2005-02-18

      OK, solved this myself by re-browsing some of the previous posts.

      I added ~CaselessLiteral("[endconstants]") + at the beginning of the ZeroOrMore().
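      A minimal sketch of that fix against the original grammar (the test string here is my own, not real data):

```python
from pyparsing import (CaselessLiteral, Literal, Word, ZeroOrMore,
                       printables, restOfLine)

end = CaselessLiteral("[endconstants]")
comment = ((Literal('//') | Literal('/*')) + restOfLine).suppress()
# ~end is a negative lookahead: it fails (without consuming input)
# whenever the next token is the section terminator, so the
# ZeroOrMore stops instead of gobbling "[endconstants]" as a name.
declaration = ~end + (comment | (Word(printables) + restOfLine))
constantSection = CaselessLiteral("[constants]") + ZeroOrMore(declaration) + end

result = constantSection.parseString(
    "[constants]\nconst1 some value\n// a comment\n[endconstants]")
print(result.asList())
```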

      Although if you see any other flaws or potential gotchas in what I'm trying, please let me know.

      I too got tripped up on thinking of this as super regexes - from a previous post, "pyparsing doesn't look ahead for literals".

      Although I'm sure I tried the above this morning... I guess I must have messed up some other part of the grammar when I tried it.

      I'm also finding it a bit hard to remember which items are strings (e.g. printables, restOfLine), and which are really objects (e.g. LineEnd()) - ending up in Python compile errors.  I know it's more typing, but perhaps a 'const', or 'str', or somesuch prefix for the constants vs. helper objects?

      Another comment - although the final resulting code is particularly clear (cf. regexes), and extremely powerful, I'm finding it takes a lot of time to create the grammars.  Then again, they appear to be far more robust than regexes alone...  A list of differences between pyparsing's behaviour and regexes would be useful.  Also a cookbook - e.g. to parse this, try this; with some amount of explanation.

      BTW - at the moment I'm successfully using pyparsing to interpret test report logs and create easy-to-use summaries.  It's working really well.

      I'm planning to expand this to managing the actual test scripts themselves.  First to manage testcases (e.g. automatically create limited 'acceptance test' scripts from complete regression suites), and perhaps as a front-end lint/script checker.  This will be more challenging, as it will have to have a means of extending the list of accepted keywords at runtime, and be able to specify sub-grammars for these new keywords - our framework is extensible.  The nice thing is, this will be about my third pet project in Python, and it seems quite doable part time.  Python (and pyparsing) are quite easy to remember, even if you don't get to use them every day!

      I'm guessing that may even be quite a common usage pattern of pyparsing- first as a log file scanner, then as a real parser for 'tiny languages'.

      Cheers!

      Julian

       
    • Paul McGuire

      Paul McGuire - 2005-02-18

      Julian -

      I think you may be too tolerant of variability in your input data.  For instance, if a constant's value contains spaces, it is not unreasonable to expect people to enclose it in quotes.  And using Word(printables) for constant names also allows constants named "@#$(@#$".  If you just constrain the allowed characters in a constant name to the more typical alphanums+"_$", then you don't need any special handling to reject "[endconstants]" as a potential constant name - it starts with a '[' after all.  And lastly, pyparsing provides a different mechanism for skipping comments, mostly because there is *no way* of knowing where comments will crop up, and it really junks up the grammar to put in all those things to be ignored.  So pyparsing uses the ignore() method to identify patterns that are to be ignored between valid parse elements.

      Please look over this refinement to your original grammar - it also uses results names and gives you an example of the Dict class.  But mostly, look how clean the grammar is without having to specify comments within the grammar pattern - instead they are just ignored at a global level.  Even the comment inside the value for const1 is correctly ignored.

      Good luck,
      -- Paul

      from pyparsing import *

      testdata = """
      [constants]
      name value
      const1 value with /* an embedded comment */ spaces
      const2 value with spaces //and a comment
      const3 "value with embedded //" 
      const4 "value with embedded //" //and a comment
      const5 "value"//comment with no white space
      const6 "value with embedded //" /* another comment */
      // comment
      /* comment */
      /* comment */
      [endconstants]
      """

      # Original grammar
      #constantSection = CaselessLiteral("[constants]") +ZeroOrMore((((Literal('//')|Literal('/*'))+restOfLine).suppress() | (Word(printables) + restOfLine)) ) + (CaselessLiteral("[endconstants]"))

      constantLine = Group( ~CaselessLiteral("[endconstants]") + Word(printables) +
                            ( quotedString |
                              Combine( OneOrMore(~LineEnd() + Word(printables)),
                                       joinString=" ", adjacent=False ) ) )
      constantSection = ( CaselessLiteral("[constants]").suppress() +
                                  ZeroOrMore(constantLine ).setResultsName("constants") +
                                  CaselessLiteral("[endconstants]").suppress() )
                                 
      doubleSlashComment = Literal("//") + restOfLine
      constantSection.ignore( doubleSlashComment )
      constantSection.ignore( cStyleComment )

      print constantSection.parseString( testdata ).constants

      print
      print "Now make the constants into a dictionary"
      constantSection = ( CaselessLiteral("[constants]").suppress() +
                                  Dict(ZeroOrMore(constantLine )) +
                                  CaselessLiteral("[endconstants]").suppress() )
                                 
      doubleSlashComment = Literal("//") + restOfLine
      constantSection.ignore( doubleSlashComment )
      constantSection.ignore( cStyleComment )

      res = constantSection.parseString( testdata )
      print res
      print res.keys()
      print res["const2"]
      print res.const6

       
      • Julian

        Julian - 2005-02-18

        Thanks Paul-

        your guidance is much appreciated.  I completely agree with your comments; unfortunately I'm dealing with something I can't change - the definition of constants and how they're handled is already in place in the framework.  I did my best to request this wasn't so, but it basically strips comments, then stores the first 'printables' string and the rest of the line in a structure in an array.  When it interprets an actual test case, it pulls out a 'printables' string and checks the table; if a match is found, it replaces the occurrence with the stored 'value', re-grabs the 'printable' at the current point in the line, and finally decodes it into a method call.
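        As I read that description, the substitution step amounts to something like the following (entirely my sketch - the function name and the constants table are made up, not the framework's real code):

```python
def expand_first_token(line, constants):
    """Replace the first whitespace-delimited token with its stored
    value if it names a constant, then re-grab the token at that spot."""
    token, _, rest = line.partition(' ')
    if token in constants:
        line = constants[token] + (' ' + rest if rest else '')
        # Re-read the leading token now that substitution changed it.
        token = line.split(None, 1)[0]
    return token, line

# hypothetical constants table
table = {"OPEN": "open_valve 3"}
print(expand_first_token("OPEN slowly", table))
# → ('open_valve', 'open_valve 3 slowly')
```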

        I can create all sorts of constants that at run time will lead to absolutely bizarre errors when parsed.  And I get to support this... but thankfully (so far), and surprisingly, our users have neither created such nightmares nor complained - mostly they've stuck to the kind of common-sense guidelines you suggest.  But there's always one or two ;-)

        Sorry to ask again, but do you have any thoughts re a Wiki for pyparsing?  You've mentioned you'd like to create better documentation- is there anything I can do to help?

        I'm trying to learn a bit more about the theory behind how this all works.  I've skimmed the following, and think they look quite interesting:
        http://www.garshol.priv.no/download/text/bnf.html#id1.
        http://pages.cpsc.ucalgary.ca/~aycock/spark/
        http://www.cs.vu.nl/~dick/PTAPG.html
        http://systems.cs.uchicago.edu/ply/ply.html
        http://www.ietf.org/rfc/rfc2234.txt

        Just thought you or some other users may be interested.

        Finally, have you thought about writing a simple guide to how pyparsing's internals work?

        Thanks again!

        Julian

         
