Questions: parentheses, restOfLine, Keyword

rzhanka
2005-04-03
2013-05-14
  • rzhanka
    2005-04-03

    1) I tried setting up a composite parsing element factory as follows (this is a simplified version):

      COLON = Literal(':').suppress()

      def Field(name): return (Literal(name).suppress() + COLON + restOfLine).setResultsName(name)

      record = Group(Field('Name') + OneOrMore(Field('Description'))).setResultsName('record')

      print OneOrMore(record).parseString(data).asXML()

    For data like this:

    Name: a name
    Description: some description
    Description: and some more
    Name: ...
    ...

    the results come out like this:

    <record>
      <record>
        <ITEM> a name</ITEM>
        <Description> some description</Description>
        <Description> and some more</Description>
      </record>
      <record>
        ...
      </record>
    </record>

    There are a couple of things that seem strange to me here. One is that the outermost tag is <record> not <ITEM>. The second is that the tag that should be <Name> does appear as <ITEM>.

    The first problem can be dealt with by explicitly calling setResultsName on the OneOrMore in the print statement. The name thus set will become the outermost tag. However, why was any special value accorded to that tag when it was not explicitly set?

    The second problem can be solved by removing the parentheses in Field like so:

      def Field(name): return Literal(name).suppress() + COLON + restOfLine.setResultsName(name)

    This works ok as all tokens except the restOfLine are suppressed. I realize that in most situations the parser elements that were in the parentheses would need to be in a Group or Combine element, but I'm not sure why the tag gets messed up for the first element of each record, but not the subsequent ones.

    I'm not sure what parts of the above behavior would be considered bugs, and which parts are just a result of me committing wanton syntax abuse.

    2) My second question concerns the use of restOfLine in situations like the one above where it is desirable that restOfLine skip leading whitespace. The first workaround I thought of was:

    restOfLine = Optional(White(' ')).suppress() + pyparsing.restOfLine

    Later, I had a second idea which I thought might be more efficient:

    restOfLine = Optional(CharsNotIn('\n\r'), default='').setParseAction(lambda s,l,t: [ t[0].lstrip() ])

    Can someone who knows how the guts of pyparsing work say whether one of these is definitively more efficient? Or is there some better way of dealing with this issue that I didn't think of?

    3) It would be nice if there were an analog of setDefaultWhitespaceCharacters for the Keyword identChars. If the keywords of a grammar have a different character set than the default, it would be nice if it were not required to specify them for every Keyword object.

    This is my current workaround:

    def Keyword(matchString, identChars=myCharSet, caseless=False):
        return pyparsing.Keyword(matchString, identChars, caseless)

    Also, for the sake of consistency, might it not make sense to have Literals and Keywords deal with case sensitivity in the same fashion? I.e. either have a CaselessKeyword class like the CaselessLiteral class or allow the case sensitivity of Literals to be set via a constructor argument? (Or am I just being hopelessly picky?)

    Many thanks for providing pyparsing. Using it has been much more fun than arguing with regular expressions -- and, after all, programming in Python is supposed to be fun ;)

     
    • Paul McGuire
      2005-04-03

      Rzhanka -

      I'll do my best.
      1. asXML() is still touchy/unpredictable.  Try changing:
      OneOrMore(Field('Description'))).setResultsName('record')
      to:
      Group(OneOrMore(Field('Description')))).setResultsName('record')
      Otherwise, all I can suggest is trial-and-error to get the XML results that you want.

      2. Here's how to skip whitespace before restOfLine:
      restOfLineWithNoLeadingWhitespace = Empty() + restOfLine
      The Empty() will skip whitespace, then match a null string at the beginning of the restOfLine string.
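
A minimal sketch of why this works, using toy parser functions written only for illustration (the names `skip_ws`, `literal`, `empty`, `rest_of_line` and `sequence` are hypothetical and are not pyparsing's implementation). It assumes, as described above, that ordinary elements skip leading whitespace before matching, that the rest-of-line element does not, and that an empty element skips whitespace and then matches nothing:

```python
def skip_ws(s, loc):
    """Advance past spaces and tabs, as a whitespace-skipping element would."""
    while loc < len(s) and s[loc] in " \t":
        loc += 1
    return loc


def literal(text):
    """Toy suppressed literal: skips whitespace, matches text, emits no tokens."""
    def parse(s, loc):
        loc = skip_ws(s, loc)
        if not s.startswith(text, loc):
            raise ValueError("expected %r at %d" % (text, loc))
        return loc + len(text), []
    return parse


def rest_of_line(s, loc):
    # Deliberately does NOT skip whitespace: takes everything up to the newline.
    end = s.find("\n", loc)
    if end == -1:
        end = len(s)
    return end, [s[loc:end]]


def empty(s, loc):
    # Skips whitespace, then matches the null string.
    return skip_ws(s, loc), []


def sequence(*parsers):
    """Run the parsers in order and collect their tokens in one flat list."""
    def parse(s, loc=0):
        out = []
        for p in parsers:
            loc, toks = p(s, loc)
            out.extend(toks)
        return out
    return parse


line = "Name: a name"
without_empty = sequence(literal("Name"), literal(":"), rest_of_line)(line)
with_empty = sequence(literal("Name"), literal(":"), empty, rest_of_line)(line)

print(without_empty)  # the leading space is kept
print(with_empty)     # the leading space is gone
```

In this model the empty element contributes no tokens of its own; its only effect is to move the parse position past the whitespace before the rest-of-line element runs.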

      3. Great idea to have a setDefaultKeywordChars method. I worked on Keyword and setDefaultWhitespace at very different times and didn't make the connection. I'll add it to the Keyword class in the next release.

      -- Paul

       
    • rzhanka
      2005-04-04

      Hi Paul,

      Thanks very much for the response.

      re 3) I'm glad to be of help.

      re 2) Now that you say this, I have a vague memory of seeing this idea somewhere, i.e. that Empty can be used when it is necessary to skip whitespace without taking any other action (it must have been in one of the examples, as I don't see it anywhere obvious in the documentation). Perhaps it would be good to show this concept in the 'Using pyparsing' document.

      re 1) Hmm... I seem to have been completely unclear in what I was asking here. (The lack of code formatting here didn't help either.)

      My question wasn't really concerning asXML per se, rather the effect of setResultsName, or possibly the interaction of setResultsName with OneOrMore. I was just using asXML as a way of displaying the resultsNames. (I realize this is not obvious in my earlier post.)

      Here's an example that I hope will be clearer:

      w1 = Word(alphas).setResultsName('word1')
      w2 = Word(alphas).setResultsName('word2')

      x1 = w1.parseString('yaddayadda')
      y1 = OneOrMore(w1).parseString('foo bar gronk')

      y2 = OneOrMore(w2).parseString('foo bar gronk')

      print y1.asXML()
      print y2.asXML()

      Now presumably the display of y1 and y2 should differ only in the name given to the XML elements. However, the actual printout is:

      <word1>
        <ITEM>foo</ITEM>
        <word1>bar</word1>
        <word1>gronk</word1>
      </word1>

      <word2>
        <word2>foo</word2>
        <word2>bar</word2>
        <word2>gronk</word2>
      </word2>

      Printing out repr(y1) and repr(y2) shows that the internal representations in y1 and y2 are also different.

      Another question is what should be returned by (still using the above example) y1.getName()? The results stored in y1 are not a match to a w1, but a OneOrMore(w1) which has not been given a resultsName, yet this object returns the resultsName of w1.

      Suppose the OneOrMore is given an explicit name:

      w3 = Word(alphas).setResultsName('word3')
      y3 = OneOrMore(w3).setResultsName('set').parseString('foo bar gronk')
      print y3.asXML()

      The results are now:

      <set>
        <word3>foo</word3>
        <word3>bar</word3>
        <word3>gronk</word3>
      </set>

      But does this make sense? A OneOrMore is not a Group -- OneOrMore puts results into a container but is not itself a container -- so why should its resultsName be applied to the container which the OneOrMore has filled? In fact it's not clear that there can be any meaningful interpretation for a resultsName attached to a OneOrMore.

      Another way in which OneOrMore acts like a container is that tokens it parses are not placed in ParseResults objects. Thus the individual objects cannot be checked for a resultsName.

      Suppose a parser was constructed like this:

      p = OneOrMore(p1) + OneOrMore(p2) + OneOrMore(p3)
      result = p.parseString(data)

      If the objects p1,p2,p3 each have a resultsName, then you should theoretically be able to process result as follows:

      for token in result:
          if token.getName() == 'name1': action1(token)
          if token.getName() == 'name2': action2(token)
          if token.getName() == 'name3': action3(token)

      But doing this will cause an error because, at least if p1 is as simple as w1 above, token will be a string, not a ParseResults object, and thus getName will not be a valid method.

      Now this is not likely to be a good way to set things up, but the abstraction of pyparsing seems to imply that it ought to be possible (and maybe there are other cases where something similar would make more sense).

      I have no expertise with the theory of how parsers are written; I'm just basing all this on how the abstractions in pyparsing seem to be structured. Above I stated that I thought my question was about setResultsName. After writing all this, I realize that what I'm talking about is really the behavior of OneOrMore as a sort of quasi-container in a way that seems inconsistent with the underlying abstraction. I'm interpreting the abstraction of pyparsing as saying that OneOrMore (as I said above) puts results into a container but is not itself a container (Group, for example, being a pyparsing abstraction that is explicitly a container). What I have tried to display is the places where this abstraction does not seem to work, places where OneOrMore tries to have it both ways, acting sort of like a container and also sort of not.

      I hope all the above makes some sense. And I also hope it does not sound like I'm being randomly critical for its own sake -- I assure you this is not my intention. In the end, there is no real question here that I'm asking concerning something that I, myself, am implementing, but rather (assuming I'm not just seeing patterns where none exist) what I hope is a pointer to a detail where the abstraction and implementation of pyparsing could be brought into closer accord.

      And again, thank you for providing such a fine addition to the Python toolkit.

      --rzhanka

       
    • Paul McGuire
      2005-04-04

      Rzhanka -

      Thanks again for your kind remarks about pyparsing.  And no offense taken, I really appreciate your comments regarding results names and OneOrMore.  pyparsing has not really had a rigorous review of its internals, rather it fairly organically emerged/evolved from my past experiences doing hand-coded parsers, and my attempts at "objectizing" individually assemblable grammar fragments, for developers to create their grammars using object operations.  In doing more research after the fact, I find that there is a family of parsers known as "combinators," and pyparsing fits into this group pretty well.  The combinator I have read most about is called Parsec, and it is written in Haskell.

      On the flip side of grammar definition and creation by "combinating," the other aspect of pyparsing that I wanted to flesh out was the various options for extracting the parsed results.  I specifically wanted *more* than to simply return a list of tokens, requiring some parse action handler to index to the n'th element for, say, a phone number or zip code or such.  This is an extremely fragile mechanism over time, as grammars and input text requirements change.  The insertion of an optional minus sign ahead of a token will throw off all the subsequent list index references.  So that is why I introduced setResultsName().  The names stay the same even if other tokens appear ahead in the list.
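
The fragility of positional indexing can be illustrated without pyparsing. The sketch below uses the standard library's `re` module as a stand-in for a parser (an assumption made so the example runs on its own; the two patterns and the field names `sign`, `amount` and `unit` are hypothetical):

```python
import re

# Version 1 of a hypothetical grammar: an amount followed by a unit.
v1 = re.compile(r"(?P<amount>\d+)\s+(?P<unit>\w+)")
# Version 2 adds an optional leading minus sign as its own group.
v2 = re.compile(r"(?P<sign>-)?(?P<amount>\d+)\s+(?P<unit>\w+)")

m1 = v1.match("42 km")
m2 = v2.match("-42 km")

# Positional access: group 1 meant "amount" in v1, but it is the sign in v2.
positional_v1 = m1.group(1)
positional_v2 = m2.group(1)

# Named access keeps working after the grammar changes.
named_v1 = m1.group("amount")
named_v2 = m2.group("amount")

print(positional_v1, positional_v2)
print(named_v1, named_v2)
```

Code written against index 1 silently starts reading the sign once the grammar grows, while code written against the name is unaffected, which is the motivation given above for results names.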

      Now OneOrMore (and its analog ZeroOrMore) have been somewhat problematic to me.  As you observe, the default behavior for any parsing expression is to simply return a "list" of tokens, which are appended in place to the accumulated "list" of parsed tokens.  (As you can guess, I put "list" in quotes because we actually create a ParseResults object, which is most simply conceptualized as a list, but as you know, can also behave as a dictionary or as an object with named attribute fields.)  So OneOrMore() is an odd construct for associating with a name - it really just represents some ungrouped subset of the tokens in the list, without any real clear boundary.  I've not really explicitly looked at it this way before, but I think I've had the idea kicking around, which is why I intuitively suggested you Group() the OneOrMore that you are trying to name.
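
The difference between appending tokens in place and grouping them can be shown with plain Python lists. This is only a toy model of the behavior described above, not pyparsing code; `parse_words` is a hypothetical stand-in for a repetition of words:

```python
def parse_words(text):
    """Toy repetition of words: returns the matched words as a plain list."""
    return text.split()


# Ungrouped: tokens are spliced into the accumulated results in place,
# so nothing marks where the repetition started or ended.
flat = ["Name"]
flat.extend(parse_words("foo bar gronk"))
flat.append("End")

# Grouped: the repetition's tokens arrive as one nested item,
# which gives a name something definite to refer to.
grouped = ["Name"]
grouped.append(parse_words("foo bar gronk"))
grouped.append("End")

print(flat)
print(grouped)
```

In the flat list there is no single item that a name for the repetition could point at, whereas the grouped list has exactly one.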

      OneOrMore also gives me some fits in reporting ParseExceptions.  If a grammar containing OneOrMore matches 'n' elements, and fails on the 'n+1'th, I don't really have a mechanism for propagating that exception back up the stack - the OneOrMore succeeded, so the exception gets discarded, and we advance to the next expression in the grammar.  Then what we get is some less-than-helpful exception message about an unrelated grammar element.
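
A small sketch of how a repetition can hide the real error location. The combinators below (`literal`, `sequence`, `one_or_more`, and the `ParseFail` exception) are toy functions written for this illustration and are not pyparsing's internals; they only assume the behavior described above, where a failure after at least one successful match is discarded:

```python
class ParseFail(Exception):
    """Toy parse error carrying the location where matching failed."""
    def __init__(self, loc, msg):
        super().__init__(msg)
        self.loc = loc
        self.msg = msg


def literal(text):
    def parse(s, loc):
        if s.startswith(text, loc):
            return loc + len(text), [text]
        raise ParseFail(loc, "expected %r" % text)
    return parse


def sequence(*parsers):
    def parse(s, loc):
        out = []
        for p in parsers:
            loc, toks = p(s, loc)
            out.extend(toks)
        return loc, out
    return parse


def one_or_more(p):
    def parse(s, loc):
        loc, out = p(s, loc)           # the first match is mandatory
        while True:
            try:
                loc, toks = p(s, loc)
            except ParseFail:
                break                  # the failure on item n+1 is discarded here
            out.extend(toks)
        return loc, out
    return parse


item = sequence(literal("a"), literal("=1;"))
grammar = sequence(one_or_more(item), literal("END"))

# The second item is malformed ("=2;" instead of "=1;"), at location 5.
try:
    grammar("a=1;a=2;END", 0)
except ParseFail as err:
    reported = (err.loc, err.msg)

print(reported)
```

The error that surfaces complains about the missing `END` at location 4, even though the input actually went wrong inside the second item at location 5.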

      This also affects asXML(), since I use the same name organization to create the XML tags.  So if we can reconcile these other issues, asXML is likely to benefit also.

      My one difficulty with making improvements to OneOrMore is backward compatibility.  For instance, if OneOrMore were modified to implicitly Group its returned tokens, I fear this would break a number of existing applications.  Still, if this is what it takes to really nail this down, I suppose such a change could be made in a major release, like 2.0.

      So please, continue to send your comments and suggestions, I really welcome them, especially when made in so constructive a manner (I've had other suggestions that were less gracefully posed, and they do rub a bit raw in the reading...).

      And I'd be happy to hear about any "success stories" you might have, especially ones you are comfortable having posted on my SourceForge project home page.

      Sincerely,
      -- Paul

       
    • Paul McGuire
      2005-04-04

      I'm now looking more closely at your example with parsing foo, bar, and gronk, and this truly is an odd/buggy behavior.  It appears that w1 behaves differently after having parsed 'yaddayadda'.  If you omit the x1=... statement, both y1 and y2 *do* yield the same values.  I'll do some more investigation and let you know what I find.

      -- Paul

       
    • rzhanka
      2005-04-07

      Paul --

      One clarification before I go back to rambling on about abstractions:

      >> which is why I intuitively suggested you Group() the OneOrMore that you are trying to name

      The particular instance of setResultsName you're referring to -- in the definition of 'record' back in my initial post -- is actually applied to the outermost Group object in the definition, not the OneOrMore inside the Group. The OneOrMore I talked about applying a name to is the one in the print statement following that definition.

      ***

      Overall, I would expect that OneOrMore applied to a parser p would be the same as And applied to p an arbitrary number of times, i.e.

      OneOrMore(p)

      should produce the same results as

      p + p + p + ...

      My tests indicate that this does, in fact, seem to be true -- both as regards desired behavior and as regards the odd behavior arising when p has been used previously. Here is my earlier example rewritten with Ands:

      w1 = Word(alphas).setResultsName('word1')
      w2 = Word(alphas).setResultsName('word2')

      x1 = w1.parseString('yaddayadda')
      y1 = (w1 + w1 + w1).parseString('foo bar gronk')

      y2 = (w2 + w2 + w2).parseString('foo bar gronk')

      print y1.asXML()
      print y2.asXML()

      The output for these two expressions is identical to the corresponding output in the earlier example. The corresponding internal representations (at least as displayed by repr) are also identical.

      However, while y1 in each case produces the erroneous output in asXML, the internal representation for y1 also seems closer to what I would expect from the abstraction. For example, y1.word1 returns a ParseResults object, while y2.word2 returns a string object.

      When a subscript is used (i.e. y1[1] or y2[1]), the value returned is always a string object, which means that getName cannot be executed on these results. While an argument could be made that this behavior is ok for OneOrMore (depending on how much "container-ness" one might attach to it), it seems to me to make no sense for the case of the parser constructed from And. (Each of these And-based parsers also attaches its resultsName to the outermost XML tag, which makes even less sense than it did in the case of OneOrMore.)

      (Rereading your post I realized that some of the next paragraph is addressed by your discussion of results above, but I thought I'd leave it as is, representing as it does a sort of unfiltered reaction.)

      My feeling is that there is something unresolved about the degree to which the ParseResults abstraction operates as a hierarchical structure (i.e. like an XML parse tree) and the degree to which it operates like a flat list (i.e. more like numbered regexp sub-matches), but I'm not sure how much of this is due to any actual inconsistency in the pyparsing implementation, and how much is due to my brain filling in the lacunae in the documentation with what I want to be there rather than what is there. So it's possible this will just resolve itself. Plus, on a purely practical level, any difficulty arising from this issue can be solved fairly easily by incrementally attacking the results with parseActions. (Thinking about this further: perhaps what my concern arises from is the sense that the hierarchy of the parse elements themselves and the way in which they are used in processing is overall more concrete than that same hierarchy as it appears in the results?)

      I'm not quite sure how to offer much more than these kinds of descriptive observations, as I don't know what bits of behavior are a necessary result of internal optimizations, and sometimes even which version of the behavior was the intended one and which version the anomalous one.

      In the case where a OneOrMore has been given a resultsName (as in the y3 example in the previous post), the ParseResults object that is produced when this is used does assume more aspects of what might be called an "invisible container" (to distinguish it both from the explicit Group-like container and from the idea of a simple source of contents mentioned previously). In the y3 example, for instance, the expression y3.set does return the entire list of results that the OneOrMore matched. This suggests turning OneOrMore more definitively into an "invisible container" might be possible without breaking backward compatibility.

      Included below is a script that produces all the results I have been discussing. (Note, however, that subscripts in the script do not correspond to the numbering in my earlier examples at all: the results numbered 0, 2, and 4 show the And-based parser, the OneOrMore-based parser, and the OneOrMore parser with resultsName respectively; the results numbered 1, 3, and 5 show these same three ideas when the underlying Word parser has been previously executed.)

      Thanks again,
      -- rzhanka

      -------------------------------------------------------------------

      from pyparsing import *
      from pprint import pformat

      def pf(exp): return pformat(eval(repr(exp)))

      def makeWord(i): return Word(alphas).setResultsName('word'+str(i))

      count = 6
      w = map(makeWord, range(count))

      x = [ w[n].parseString('yaddayadda') for n in (1,3,5) ]

      y = [ w[0] + w[0] + w[0],
            w[1] + w[1] + w[1],
            OneOrMore(w[2]),
            OneOrMore(w[3]),
            OneOrMore(w[4]).setResultsName('set4'),
            OneOrMore(w[5]).setResultsName('set5') ]

      text = 'foo bar gronk'
      result = [ yn.parseString(text) for yn in y ]

      for n,r in zip(range(count), result):
          print '*** Result', n, '***\n', r.asXML(), '\n\n', pf(r), '\n\n',
          name = 'word' + str(n)
          print 'result[%s].%s ='%(n,name), repr(getattr(r, name)), 'with type', type(getattr(r, name))
          print 'result[%s].%s.getName()'%(n,name),
          try:
              print '= %s' % getattr(r, name).getName()
          except AttributeError:
              print 'does not exist'
          print 'result[%s][2] ='%n, repr(r[2]), 'with type', type(r[2])
          if n>3:
              name2 = 'set' + str(n)
              print 'result.%s =\n'%name2, pf(getattr(r, name2))
          print '\n\n',

       
    • rzhanka
      2005-04-07

      Drat, I forgot that the formatting would get annihilated in the script I posted. The only part that is really affected is the final for-loop section. Every line after the for statement should be indented one level, except for the following which are indented two levels:

      * 1 statement each following the try and the except
      * 2 statements following 'if n>3'

      -- rzhanka

       
    • Paul McGuire
      2005-04-07

      Rzhanka -

      Another side-effect of parsing a string is the implicit call to streamline() done within parseString().  When And'ing expressions together using '+' operators, A + B + C becomes And([And([A, B]), C]), since the __add__ function only looks at two elements at a time.  streamline() collapses these degenerate And's into the more optimal And([A,B,C]).  I didn't want to impose the calling of streamline() on the caller, so I implicitly call it as part of the setup for parseString().  streamline() recursively traverses the whole grammar looking for degenerate And's, Or's, and MatchFirst's, so we only need to invoke it on the root grammar expression.  I've seen people invoke streamline explicitly in their code, and I tell them to remove it.  Both streamline()'ed and un-streamline()'ed forms should return the same results, but I thought this might give a clue toward the behavior you're seeing.
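
The flattening step can be sketched with two toy classes. These `And` and `Tok` classes are hypothetical stand-ins written for this illustration; they model only the nesting produced by a binary '+' and the collapse described above, and they ignore anything else the real streamline() may take into account:

```python
class And:
    """Toy stand-in for a sequence expression; not pyparsing's class."""
    def __init__(self, exprs):
        self.exprs = list(exprs)

    def __add__(self, other):
        # A binary '+' only ever sees two operands at a time.
        return And([self, other])

    def streamline(self):
        # Recursively splice nested And children into this one.
        flat = []
        for e in self.exprs:
            if isinstance(e, And):
                e.streamline()
                flat.extend(e.exprs)
            else:
                flat.append(e)
        self.exprs = flat
        return self


class Tok:
    """Toy leaf expression identified by a name."""
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return And([self, other])

    def __repr__(self):
        return self.name


expr = Tok("A") + Tok("B") + Tok("C")
nested_shape = [type(e).__name__ for e in expr.exprs]  # shape before flattening
expr.streamline()
flat_shape = [repr(e) for e in expr.exprs]             # shape after flattening

print(nested_shape)
print(flat_shape)
```

Before flattening, the outer `And` holds an inner `And` plus one leaf; afterwards it holds the three leaves directly.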

      (Still haven't had a crack at debugging the actual problem yet, I have a pressing project at work which is looking to take up the next couple of weeks.)

      -- Paul

       
    • rzhanka
      2005-04-08

      I'm getting a strange result when I define restOfLine to skip whitespace using Empty.

      Here's a test script: -------------------------------------

      import pyparsing
      from pyparsing import Optional, CharsNotIn, Empty, Literal, restOfLine

      rol1  = Optional(CharsNotIn('\n\r'), default='').setParseAction(lambda s,l,t: [ t[0].lstrip() ])
      rol2 = Empty() + restOfLine

      def Field1(name): return Literal(name).suppress() + rol1.setResultsName(name.lower())
      def Field2(name): return Literal(name).suppress() + rol2.setResultsName(name.lower())

      text = 'ReturnType   char'

      f1 = Field1('ReturnType').parseString(text)
      f2 = Field2('ReturnType').parseString(text)

      print 'repr(f1) =', repr(f1), '\nf1.returntype =', f1.returntype, '\nf1[0] =', f1[0]
      print
      print 'repr(f2) =', repr(f2), '\nf2.returntype =', f2.returntype, '\nf2[0] =', f2[0]

      -------------------------------------------------------------------

      And here are the results:

      repr(f1) = (['char'], {'returntype': [('char', 0)]})
      f1.returntype = char
      f1[0] = char

      repr(f2) = (['char'], {'returntype': [((['char'], {}), -1)]})
      f2.returntype = ['char']
      f2[0] = char

      I assume that f1 is the normal behavior, i.e. that f2.returntype should also be identical to f2[0] as is the case for f1. Is this another bug, or am I just missing something again?

      Thanks,
      -- rzhanka