
Need some help on one-line parser.

2008-06-24
2013-05-14
  • concentricpuddle

    Hi, I'm trying to write a parser for my program. It's pretty integral to my project, so if you can, please help.

    Anyway, let me get to it. Take this string: "$function(random words %tag%, $function2(!@#$^, %tag2%))"

    I want to parse this string, where %tag% marks a tag (any number of tags may be specified) and functions are written as $functionname(arguments). I've been trying this for days now and haven't been able to get a robust solution. Here's the code:

    --------------------------------------
    from pyparsing import Word, alphas, alphanums, nums, ZeroOrMore, Literal,Forward,delimitedList, Combine, NotAny, OneOrMore
    import functions

    identifier = Combine(ZeroOrMore("\$") + Word(alphanums + "_ \\($%!@"))
    integer  = Word( nums )
    funcstart =  Combine(Literal("$") + ZeroOrMore(Word(alphanums + "_")) + "(")
    arg = identifier | integer
    args = arg + ZeroOrMore("," + arg)
    #expression = functor + lparen + args + rparen

    def what(s,loc,tok):
        return "Got Here"

    def tagparse(s,loc,tok):
        return "tagget it"

    ignored = ZeroOrMore(Word(alphanums + " !@#$^&*(){}|\\][?+=/_~`"))
    tag = Combine(NotAny(r"\\") + Literal("%") + OneOrMore(Word(alphas)) + Literal("%")).setParseAction(tagparse)
    tagger = ignored + tag + ignored

    foo = Forward()
    foo << tagger
    z = foo.transformString("$what(%shit% aoeiu % taoeu %nottheshit%)")
    print z

    content = Forward()
    expression = funcstart + delimitedList(content) + Literal(")").suppress()
    expression.setParseAction(what)
    content << (expression | identifier | integer)
    parsedContent = content.transformString("$what(aouaou 3425 45 % ^!@\\aoeusthaou)")
    print parsedContent

    ------------------------------------
    When I run it, I get $what(tagged itaoeiu % ^&^taoeu tagged it) instead of the expected "Got heretagged itaoeiu % ^&^taoeu tagged it". Any ideas? Note, though, that I have no experience at all with parsing or writing compilers :)

     
    • Paul McGuire

      Paul McGuire - 2008-06-26

      Well, it's not entirely clear to me what you are trying to do. Your program contains a mixture of debugging statements that may be obscuring your actual intent. For instance, your parse action that returns "Got Here" replaces the text matched by the expression, but I think you just want to print out that you successfully matched.

      Here are a couple of debugging and usage tips for pyparsing.  Consider this sample program:

          from pyparsing import *

          integer = Word(nums)
          alpha = Word(alphas)

          # parse action to reverse matched tokens
          def revToken(t):
              return t[0][::-1]
          def doubleTokenChars(t):
              return "".join(a+b for a,b in zip(t[0],t[0]))
          alpha.setParseAction(revToken)
          integer.setParseAction(doubleTokenChars, revToken)

          text = "1234 5678 xyz hubba hubba"
          print((alpha | integer).transformString(text))

      which prints out:
         
          44332211 88776655 zyx abbuh abbuh

         
      Just as your program did, this sample uses parse actions to replace the matched tokens with something else. In this case, alpha words are reversed, and numeric words have their digits doubled and then reversed, too.
         
      If you just want to print out a message when a particular expression was matched, you can call setDebug().  In our sample program, we would add:

          integer.setDebug()
          alpha.setDebug()

      which will cause pyparsing to print a message every time the integer or alpha expression is tried, followed by either the tokens it matched or the exception that was raised when the match failed. Here is the debugging output created by adding these lines:

          Match W:(abcd...) at loc 0(1,1)
          Exception raised:Expected W:(abcd...) (at char 0), (line:1, col:1)
          Match W:(0123...) at loc 0(1,1)
          Matched W:(0123...) -> ['44332211']
          Match W:(abcd...) at loc 5(1,6)
          Exception raised:Expected W:(abcd...) (at char 5), (line:1, col:6)
          Match W:(0123...) at loc 5(1,6)
          Matched W:(0123...) -> ['88776655']
          Match W:(abcd...) at loc 10(1,11)
          Matched W:(abcd...) -> ['zyx']
          Match W:(abcd...) at loc 14(1,15)
          Matched W:(abcd...) -> ['abbuh']
          Match W:(abcd...) at loc 20(1,21)
          Matched W:(abcd...) -> ['abbuh']
          Match W:(abcd...) at loc 25(1,26)
          Exception raised:Expected W:(abcd...) (at char 25), (line:1, col:26)
          Match W:(0123...) at loc 25(1,26)
          Exception raised:Expected W:(0123...) (at char 25), (line:1, col:26)

      This is a little hard to read, because our alpha and integer expressions are not named.  If we change their definitions to:

          integer = Word(nums).setName("integer")
          alpha = Word(alphas).setName("alphaword")

      the debugging output starts to look a little more sensible:

          Match alphaword at loc 0(1,1)
          Exception raised:Expected alphaword (at char 0), (line:1, col:1)
          Match integer at loc 0(1,1)
          Matched integer -> ['44332211']
          Match alphaword at loc 5(1,6)
          Exception raised:Expected alphaword (at char 5), (line:1, col:6)
          Match integer at loc 5(1,6)
          Matched integer -> ['88776655']
          Match alphaword at loc 10(1,11)
          Matched alphaword -> ['zyx']
          Match alphaword at loc 14(1,15)
          Matched alphaword -> ['abbuh']
          Match alphaword at loc 20(1,21)
          Matched alphaword -> ['abbuh']
          Match alphaword at loc 25(1,26)
          Exception raised:Expected alphaword (at char 25), (line:1, col:26)
          Match integer at loc 25(1,26)
          Exception raised:Expected integer (at char 25), (line:1, col:26)

      (Note that setName gives a name to the expression itself - setResultsName is used to give a name to the tokens matched and returned *by* the expression.)
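
      A minimal sketch of that distinction, reusing the integer and alpha expressions from above. The "key" and "value" results names and the sample input are made up for illustration:

```python
from pyparsing import Word, alphas, nums

# setName labels the expression itself (shown in debug output and error messages).
integer = Word(nums).setName("integer")
alpha = Word(alphas).setName("alphaword")

# setResultsName labels the tokens the expression returns, so they can be
# looked up by name in the parsed results.
pair = alpha.setResultsName("key") + integer.setResultsName("value")
result = pair.parseString("width 42")

print(result.asList())   # ['width', '42']
print(result["key"])     # width
print(result["value"])   # 42
```

      The expression names never appear in the parsed results; only the results names do.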

      Lastly, you can troubleshoot parse actions by using the @traceParseAction decorator.  If we precede both of our parse actions with this decorator, we get this debugging output (I have removed the calls to setDebug):

          >>entering doubleTokenChars(line: '1234 5678 xyz hubba hubba', 0, ['1234'])
          <<leaving doubleTokenChars (ret: 11223344)
          >>entering revToken(line: '1234 5678 xyz hubba hubba', 0, ['11223344'])
          <<leaving revToken (ret: 44332211)
          >>entering doubleTokenChars(line: '1234 5678 xyz hubba hubba', 5, ['5678'])
          <<leaving doubleTokenChars (ret: 55667788)
          >>entering revToken(line: '1234 5678 xyz hubba hubba', 5, ['55667788'])
          <<leaving revToken (ret: 88776655)
          >>entering revToken(line: '1234 5678 xyz hubba hubba', 10, ['xyz'])
          <<leaving revToken (ret: zyx)
          >>entering revToken(line: '1234 5678 xyz hubba hubba', 14, ['hubba'])
          <<leaving revToken (ret: abbuh)
          >>entering revToken(line: '1234 5678 xyz hubba hubba', 20, ['hubba'])
          <<leaving revToken (ret: abbuh)

      @traceParseAction shows the actual input text, the current parse location, and the tokens being sent to the parse action.
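
      For reference, applying the decorator might look like the sketch below. It uses the three-argument form of a parse action and a shortened input string; the trace lines are written to stderr, so they appear alongside the normal output:

```python
from pyparsing import Word, alphas, traceParseAction

# The decorator logs each call to the parse action (input line, location,
# tokens) and its return value, without changing what the action returns.
@traceParseAction
def revToken(s, loc, toks):
    return toks[0][::-1]

alpha = Word(alphas).setParseAction(revToken)
result = alpha.transformString("xyz hubba")
print(result)   # zyx abbuh
```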

      Of course, you can also just include your own print statements within a parse action, too.  My point is that, if you want to flag that a particular parse action was reached, use this decorator or a print statement - returning a string will cause the original text to be replaced by your debugging text.
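
      To make the difference concrete, here is a small sketch with two hypothetical parse actions, note_match and replace_match. The first only records that it was reached and returns nothing, so the text passes through unchanged; the second returns a string, which replaces every match:

```python
from pyparsing import Word, alphas

seen = []

def note_match(s, loc, toks):
    # Record the match; returning None leaves the matched tokens unchanged.
    seen.append((loc, toks[0]))

def replace_match(s, loc, toks):
    # Returning a value replaces the matched tokens in the output.
    return "Got Here"

text = "abc def"
noted = Word(alphas).setParseAction(note_match).transformString(text)
replaced = Word(alphas).setParseAction(replace_match).transformString(text)

print(noted)      # abc def
print(seen)       # [(0, 'abc'), (4, 'def')]
print(replaced)   # Got Here Got Here
```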

      Ok, let's use these debugging techniques to find out what's happening with your parser.  I added these lines just before calling transformString:

          expression.setName("expr").setDebug()
          content.setName("content").setDebug()

      This gives me this debugging output:

          Match content at loc 0(1,1)
          Match expr at loc 0(1,1)
          Match content at loc 6(1,7)
          Match expr at loc 6(1,7)
          Exception raised:Expected "$" (at char 6), (line:1, col:7)
          Matched content -> ['aouaou 3425 45% ']
          Exception raised:Expected ")" (at char 22), (line:1, col:23)
          Matched content -> ['$what(aouaou 3425 45% ']
          ...

      Well, our matched content doesn't look like what we want; we would have expected to get back something like '$what(aouaou 3425 45% ^!@\aoeusthaou)'. The match stops at char 22, which is the '^' character. Since content matches expressions (which start with a '$'), identifiers, or integers, and '^' is not in the identifier's character set, we need to expand the definition of an identifier. Let's add '^' to the set of characters for an identifier:

          identifier = Combine(ZeroOrMore("\$") + Word(alphanums + "_ \\($%!@^"))

      We now get:

          Match content at loc 0(1,1)
          Match expr at loc 0(1,1)
          Match content at loc 6(1,7)
          Match expr at loc 6(1,7)
          Exception raised:Expected "$" (at char 6), (line:1, col:7)
          Matched content -> ['aouaou 3425 45% ^!@\\aoeusthaou']
          >>entering what(line: '$what(aouaou 3425 45% ^!@\aoeusthaou)', 0, ['$what(', 'aouaou 3425 45% ^!@\\aoeusthaou'])
          <<leaving what (ret: None)
          Matched expr -> ['$what(', 'aouaou 3425 45% ^!@\\aoeusthaou']
          Matched content -> ['$what(', 'aouaou 3425 45% ^!@\\aoeusthaou']
          Match content at loc 37(1,38)
          Match expr at loc 37(1,38)
          Exception raised:Expected "$" (at char 37), (line:1, col:38)
          Exception raised:Expected "$" (at char 37), (line:1, col:38)
          $what(aouaou 3425 45% ^!@\aoeusthaou

      Now we are successfully matching your input string.

      Here are some other comments on your parser.

      - In general, I try to avoid including whitespace in the set of valid characters in a Word expression.  If I am going to parse a list of integers separated by whitespace like '123 456 789', instead of using:

          integers = Word(nums+" ")

      I'd recommend using OneOrMore:

          integers = OneOrMore(Word(nums))

      This lets pyparsing hassle with the whitespace, and you get back your list of integers in nice separate tokens.

      - It looks like you are using delimitedList for a sequence of items that are separated by whitespace. Such a sequence is *not* a candidate for delimitedList; it is just a OneOrMore. Use delimitedList when items in a sequence are delimited by some other text, such as commas or semicolons. For instance, here is a nice simple parser to get the items in a comma-separated list, even if some of the items are quoted strings that may contain commas:

          item = Word(alphanums+"`~!@#$%^&*()_-+={}[]|:;<>.") | quotedString
          itemList = delimitedList(item)

          print(itemList.parseString("abc, 123, 'item, I say', blah"))

      giving:

          ['abc', '123', "'item, I say'", 'blah']

      The intervening comma delimiters are stripped out, and the quoted strings safely preserve any contained whitespace or commas.
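
      The two tips above can be compared side by side. This is only a sketch with made-up input strings, using the camelCase names from this thread (newer pyparsing releases also provide snake_case equivalents):

```python
from pyparsing import OneOrMore, Word, delimitedList, nums

# Whitespace inside the Word character set: the whole run is one token.
lumped = Word(nums + " ").parseString("123 456 789").asList()

# OneOrMore: pyparsing skips the whitespace and returns separate tokens.
spaced = OneOrMore(Word(nums)).parseString("123 456 789").asList()

# delimitedList: for items separated by explicit text (a comma by default).
comma_separated = delimitedList(Word(nums)).parseString("123, 456, 789").asList()

print(lumped)            # ['123 456 789']
print(spaced)            # ['123', '456', '789']
print(comma_separated)   # ['123', '456', '789']
```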

      Well, that's about it for now, I'm sure this is plenty to chew on!  Please look over the examples in the pyparsing wiki and in the published documentation and articles for other suggestions about developing parsers with pyparsing.  I tried to make it easy to get started with simple parsers, without having to learn the whole thing up front. 

      Please write back if you have more questions (sorry this reply was delayed, I've been traveling on business and just saw your message).

      Cheers,
      -- Paul

       
    • concentricpuddle

      I realized (after reading your post) that I was going about it the wrong way and that the parser would probably wind up being very error-prone.

      I have seen the light. Thanks a lot.

       
