exclude newline from whitespaces

2005-03-23
2013-05-14
  • Klaas Hofstra
    2005-03-23

    Hi,

    I really like pyparsing and my grammar is getting more and more complex ;)

    In my grammar I'd like to remove the need to terminate expressions with a ";". I'd like expressions to be terminated by a newline, just like in Python. If I understand correctly, the default definition of whiteChars makes this difficult:

    whiteChars = " \n\t\r"

    I could use leaveWhitespace(), but then I have to specify all the whitespace myself. I'd like to set whiteChars = " \t", but I could not find a function like setWhiteSpaces() to do this. How do I go about this?

    Thanks in advance,

    Klaas

     
    • Klaas Hofstra
      2005-03-24

      Just to let you all know, I got it working the way I want by doing some preprocessing. I replaced every "\n" in the input string with "\254\n". I then defined "\254" as my end-of-line in my grammar. This way, parsing elements like restOfLine still work.
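A minimal sketch of this preprocessing workaround, assuming Python 3 and a pyparsing release that still accepts the camelCase method names. The names EOL_MARKER and mark_line_ends and the toy statement grammar are made up for illustration; they are not the code from the original grammar.

```python
from pyparsing import Group, Literal, OneOrMore, Word, alphas

# "\254" is an octal escape; in Python 3 it is the single character U+00AC.
EOL_MARKER = "\254"


def mark_line_ends(text):
    """Put an explicit end-of-line marker in front of every newline."""
    return text.replace("\n", EOL_MARKER + "\n")


# The default whitespace still includes "\n", so the marker, not the
# newline itself, is what terminates a statement.
eol = Literal(EOL_MARKER).suppress()
statement = Group(OneOrMore(Word(alphas))) + eol
program = OneOrMore(statement)

marked = mark_line_ends("a b\nc d\n")
parsed = program.parseString(marked).asList()
print(parsed)
```

The marker has to be a character that can never occur in real input, which is the main weakness of this approach.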

      -Klaas

       
      • Paul McGuire
        2005-03-25

        Klaas -

        Sorry for not getting back to you sooner.  Currently, you can update the whitespaceChars attribute of any ParserElement that should not skip newlines.

        You really got me thinking though, and I've added two routines for the next release to make this easier: setWhitespaceChars and setDefaultWhitespaceChars.  setWhitespaceChars is just an accessor for the whitespaceChars attribute of a ParserElement; setDefaultWhitespaceChars lets you define what *all* subsequently created ParserElements should use (in recognition of how big a pain it would be to call setWhitespaceChars on every Literal, And, Or, etc. element).
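As a rough illustration of the difference between the two routines, here is a sketch assuming a pyparsing release where both are available under these camelCase names. The helper accepts() and the variable names are hypothetical, and restoring the default in a finally block is a choice of this sketch, not something the thread prescribes.

```python
from pyparsing import ParseException, ParserElement, Word, alphas


def accepts(expr, text):
    """Return True if expr parses the start of text, False otherwise."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


# Per-element: only this Word stops treating "\n" as skippable whitespace.
same_line_word = Word(alphas).setWhitespaceChars(" \t")
per_element_pair = same_line_word + same_line_word
per_element = (
    accepts(per_element_pair, "a \t b"),
    accepts(per_element_pair, "a\nb"),
)

# Global: elements created after this call pick up the new default.
ParserElement.setDefaultWhitespaceChars(" \t")
try:
    global_pair = Word(alphas) + Word(alphas)
    global_default = (
        accepts(global_pair, "a b"),
        accepts(global_pair, "a\nb"),
    )
finally:
    # Restore pyparsing's usual default so unrelated grammars are unaffected.
    ParserElement.setDefaultWhitespaceChars(" \n\t\r")

# Elements created after the restore skip newlines again.
restored = accepts(Word(alphas) + Word(alphas), "a\nb")
print(per_element, global_default, restored)
```

The default is a process-wide setting, so it is usually safest to set it once, before any grammar elements are created.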

        I'll try to get this release out in the next few days.

        -- Paul

         
      • Paul McGuire
        2005-03-28

        Check out the latest 1.3 release - your wish is granted! (along with appropriate credit in the change log)

        -- Paul

         
        • Klaas Hofstra
          2005-04-02

          Lectori Salutem,

          I'm trying to use these new features with this simplified grammar:
          --------
              ParserElement.setDefaultWhitespaceChars(" \t")

              LE = OneOrMore(LineEnd())
              component_ = Keyword("component")
              end_ = Literal("end")
              in_ = Literal("in")
              out_ = Literal("out")
              lparen = Literal("(")
              rparen = Literal(")")   
              identifier = Word( alphas, alphanums + "_" )

              line = Group(identifier + OneOrMore(identifier)) + LE
              body = ZeroOrMore(line)
             
              comp = component_ + identifier + lparen + identifier + rparen + LE + \
                     out_ + lparen + identifier + rparen + LE + \
                     in_ + lparen + identifier + rparen + LE + \
                     body + end_ + LE
             
              bnf = comp

          --------

          The parser gets stuck in a loop when I use that grammar on the following string:
          --------
          component test(bla)
              out(bla)
              in(bla)

              qwe dfd dfgdf
              bla ffdg
              sdfhj dfg
          end
          --------

          Could you tell me what is wrong here?

          Another question: when I use setWhitespaceChars on a ParserElement, does that whitespace setting apply to both sides (left and right) of the element?

           
          • Paul McGuire
            2005-04-02

            Here's a little walkthrough on some pyparsing troubleshooting:

            I copied your code, and sure enough, it just hangs with no output.  So to get a little more insight into the parsing process, I added a .setDebug() to your definition of LE:
            LE = OneOrMore(LineEnd()).setDebug()

            so that I could watch the output as each LE was matched or not.  With this one change, we get this output:

            Match {LineEnd}... at loc 19 (1,20)
            Matched {LineEnd}... -> ['\n']
            Match {LineEnd}... at loc 28 (2,9)
            Matched {LineEnd}... -> ['\n']
            Match {LineEnd}... at loc 36 (3,8)
            Matched {LineEnd}... -> ['\n', '\n']
            Match {LineEnd}... at loc 51 (5,14)
            Matched {LineEnd}... -> ['\n']
            Match {LineEnd}... at loc 60 (6,9)
            Matched {LineEnd}... -> ['\n']
            Match {LineEnd}... at loc 70 (7,10)
            Matched {LineEnd}... -> ['\n']
            Match {LineEnd}... at loc 74 (8,4)

            So we see that each line end is matched successfully, but when we reach the end of the whole string, we just hang.

            It turns out that LineEnd also successfully matches the end of the input string, but does not advance the parse position.  Because the grammar defined LE as OneOrMore(LineEnd()), we just loop forever at the end of the string.

            So I was able to prevent this infinite looping by modifying the definition of LE to:
            LE = OneOrMore(~StringEnd() + LineEnd())

            This way, we prevent the end of string from causing our LE to loop forever.
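Putting the guarded LE together with the grammar from earlier in the thread, a sketch that should run on a current pyparsing release might look like this. Three things are assumptions of the sketch rather than part of the thread: the test string ends with a newline, LE is suppressed, and the default whitespace is restored in a finally block. Newer releases may also treat LineEnd at the end of the string differently from the 2005 behaviour described above; the guard should be harmless either way.

```python
from pyparsing import (
    Group, Keyword, LineEnd, Literal, OneOrMore, ParserElement,
    StringEnd, Word, ZeroOrMore, alphanums, alphas,
)

test_data = (
    "component test(bla)\n"
    "    out(bla)\n"
    "    in(bla)\n"
    "\n"
    "    qwe dfd dfgdf\n"
    "    bla ffdg\n"
    "    sdfhj dfg\n"
    "end\n"
)

# Newlines are significant in this grammar, so only blanks and tabs are skipped.
ParserElement.setDefaultWhitespaceChars(" \t")
try:
    # One or more line ends, but never the end of the whole string.
    LE = OneOrMore(~StringEnd() + LineEnd()).suppress()
    component_ = Keyword("component")
    end_ = Literal("end")
    in_ = Literal("in")
    out_ = Literal("out")
    lparen = Literal("(")
    rparen = Literal(")")
    identifier = Word(alphas, alphanums + "_")

    line = Group(identifier + OneOrMore(identifier)) + LE
    body = ZeroOrMore(line)

    comp = (
        component_ + identifier + lparen + identifier + rparen + LE
        + out_ + lparen + identifier + rparen + LE
        + in_ + lparen + identifier + rparen + LE
        + body + end_ + LE
    )

    tokens = comp.parseString(test_data).asList()
finally:
    # Put the library default back for any other grammar in the process.
    ParserElement.setDefaultWhitespaceChars(" \n\t\r")

print(tokens)
```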

            Some other suggestions:
            - Add .suppress() to the LE expression, to keep the \n tokens from cluttering up the returned results.
            - Add setResultsName to body and the component identifier, as in:

            body = ZeroOrMore(line).setResultsName("body")

            comp = component_ + identifier.setResultsName("compName") + lparen + identifier + rparen + LE + \
                   out_ + lparen + identifier + rparen + LE + \
                   in_ + lparen + identifier + rparen + LE + \
                   body + end_ + LE

            This makes it easier to extract those tokens or token groups after parsing:

            results = bnf.parseString( testdata )
            print(results.compName)
            print(results.body)

            Prints out:
            test
            [['qwe', 'dfd', 'dfgdf'], ['bla', 'ffdg'], ['sdfhj', 'dfg']]

            Good luck,
            -- Paul

             
            • Klaas Hofstra
              2005-04-03

              Paul,

              Thanks for your response, much appreciated!

              I've changed the definition of LE as you suggested. This gives me the following output with .setDebug() enabled:
              --begin --
              Exception raised: Found unexpected token, StringEnd (121), (8,4)
              end
                 ^
              --end--

              This is caused by an extra newline at the end of the file/string. However, the output tokens are OK. When I remove .setDebug() from LE I don't get exceptions.

              Is this exception in debug mode a problem or is this "by design" and can I just ignore it?

              Cheers,

              Klaas

               
              • Paul McGuire
                2005-04-03

                parseString parses until it reaches the end of the string or the end of the grammar.  If the end of grammar is reached before the end of the string, the remaining text is ignored.  If you have to ensure that there is no additional text after the parsed text, add a StringEnd() to the end of your grammar.
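To make the difference concrete, here is a small sketch, assuming the default whitespace settings; the variable names are illustrative only.

```python
from pyparsing import OneOrMore, ParseException, StringEnd, Word, alphas

words = OneOrMore(Word(alphas))

# Without StringEnd the grammar stops at "123" and silently ignores it.
lenient = words.parseString("abc def 123").asList()

# With StringEnd appended, leftover text becomes a parse error.
strict = words + StringEnd()
try:
    strict.parseString("abc def 123")
    trailing_text_rejected = False
except ParseException:
    trailing_text_rejected = True

print(lenient, trailing_text_rejected)
```

Recent pyparsing versions also accept parseString(text, parseAll=True), which has the same effect as appending StringEnd().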

                When you use setDebug() every match is flagged at the beginning and end, and a match can end with a successful match or with an exception.  If you are getting an exception only with setDebug() enabled, you can safely ignore it.

                -- Paul

                 
    • Paul McGuire
      2005-04-02

      Oops, forgot to answer the second question:
      -------------------
      Another question: when I use setWhitespaceChars on a ParserElement, does that whitespace setting apply to both sides (left and right) of the element?
      -------------------
      The left, since this is the whitespace that is "skipped" prior to trying to match the expression.  Once the expression is matched, pyparsing moves on to the next expression in the grammar, and skips *its* whitespace in preparation for trying to match that expression.
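A short sketch of that left-side behaviour, assuming the library default whitespace for every element except the one that is explicitly changed. The names name and colon are hypothetical.

```python
from pyparsing import Literal, ParseException, Word, alphas

# Only `name` refuses to skip newlines; `colon` keeps the default whitespace.
name = Word(alphas).setWhitespaceChars(" \t")
colon = Literal(":")

# The newline sits to the right of `name`, so it is leading whitespace
# for `colon`, which skips it as usual.
newline_on_right = (name + colon).parseString("abc\n:").asList()

# The newline sits to the left of `name`, which will not skip it.
try:
    (colon + name).parseString(":\nabc")
    newline_on_left_ok = True
except ParseException:
    newline_on_left_ok = False

print(newline_on_right, newline_on_left_ok)
```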

      -- Paul