HtmlTags and ParseAction problem

Tim Arnold
2006-07-07
2013-05-14
  • Tim Arnold
    2006-07-07

    hi,
    I get an error from transformString (expected string, list found) when using the code below. Sure enough, there is a nested list in the 'out' list that transformString tries to return. But if I use the lambda instead (commented out in the code below), it works. I'm baffled! Any help appreciated.
    --Tim
    from pyparsing import *

    def rfn(s,l,t):
        t.src

    start, imge = makeHTMLTags('IMG')
    start.setParseAction(rfn)
    #start.setParseAction(lambda s,l,t:t.src)
    text = '<img src="testme.gif">'
    s = start.transformString(text)
    print s

     
    • Paul McGuire
      2006-07-08

      Tim -

      This is pretty amazing!  I tripped over this exact same problem about 2 days ago when I was looking at responding to a question on c.l.py.

      The problem here is that the makeHTMLTags helper maybe helps a little too much!  The expression returned contains several embedded parse actions to make the resulting tokens easier to work with (converting attributes to consistent lowercase, stripping quotes from quoted strings, etc.).  Unfortunately, in the process of doing this, the tokens passed back don't exactly match the original text.  Plus, due to pyparsing's further helpfulness in skipping over whitespace, some crucial space separators also get lost in this shuffle.  For usages where we are scraping data out of HTML, this is no big deal - but as you and I discovered, transformString gets screwed up in this process.

      So I wrote an additional helper parse action to set things aright, called keepOriginalText.  Here it is:

      import inspect
      def keepOriginalText(s,startLoc,t):
          """Helper parse action to preserve original parsed text,
             overriding any nested parse actions."""
          f = inspect.stack()[1][0]
          try:
              endloc = f.f_locals["loc"]
          finally:
              del f
          return s[startLoc:endloc]

      Yes, it is rather nasty-looking, since it peeks up the stack frame at the caller's locals environment.  Fortunately, since this is a parse action, I *know* who the caller always will be, so this is not totally unsafe.
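As a rough illustration of the stack-peeking trick described above, the sketch below reads a caller's local variable through inspect.stack(). The names peek_caller_local and fake_parse are hypothetical stand-ins, not pyparsing functions. In keepOriginalText the caller is pyparsing's own parsing code, which (per the post) keeps the current position in a local named loc.

```python
import inspect

def peek_caller_local(name):
    """Return the value of a local variable in the calling function's frame."""
    frame = inspect.stack()[1][0]   # [0] is this function, [1] is its caller
    try:
        return frame.f_locals[name]
    finally:
        del frame                   # drop the frame reference promptly

def fake_parse(text):
    # Hypothetical stand-in for the parser internals: 'loc' is just a local.
    loc = len(text)
    return peek_caller_local("loc")

print(fake_parse("abcd"))  # prints 4
```

The technique only works while the caller really does have a local with that name, which is why it is safe inside a parse action and fragile anywhere else.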

      In the posting to c.l.py, the OP wanted to change "sleeping" to "dead" everywhere in the HTML, except in the href attributes of <A> anchor tokens.  (As you found, <IMG> tokens have similar issues with their src attributes.)  The way to do this is to define a converter expression, and just include a definition of the patterns that contain unwanted matches.  Something like:

      from pyparsing import *

      # define search-and-replace target
      converter = Keyword("sleeping",caseless=True)\
                  .setParseAction( replaceWith( "dead" ) )

      makeHTMLStartTag = lambda tag: makeHTMLTags(tag)[0]

      aStartTag = makeHTMLStartTag("A")
      imgStartTag = makeHTMLStartTag("IMG")

      # define "parser", testing for tags ahead of target - this
      # will recognize tags and process (i.e., skip over) them
      # instead of looking for targets internally
      searchAndReplaceParser = ( aStartTag | imgStartTag | converter )

      text = """This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it is sleeping."""
      print text
      print searchAndReplaceParser.transformString(text)

      Well, just as you found, this raises an error in transformString, since the aStartTag and imgStartTag expressions don't return nice strings, but instead they return lists of tokens.
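The wording of the error ("expected string, list found") is what Python reports when a string join meets a non-string item. Assuming transformString assembles its result by joining a list of output pieces (an assumption based on the error text and the 'out' list mentioned in the first post, not on the pyparsing source), the failure can be reproduced in plain Python:

```python
# The middle item is a list of tokens rather than a string.
pieces = ["<p>", ["img", "src", "testme.gif"], "</p>"]

try:
    "".join(pieces)
except TypeError as exc:
    print("join failed:", exc)

# Once every piece is a plain string, the join succeeds.
flat = ["<p>", "testme.gif", "</p>"]
print("".join(flat))
```

This is why a parse action that hands back a single string lets transformString finish, while one that leaves the token list in place does not.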

      But if you include the keepOriginalText parse action (I plan to include this in the next pyparsing release), you can change the makeHTMLStartTag lambda:

      makeHTMLStartTag = lambda tag: makeHTMLTags(tag)[0].setParseAction(keepOriginalText)

      and now things start to look a lot better!

      Unfortunately for the c.l.py poster, I thought this was a bit too unwieldy to show off how "easy" this solution would be, so I didn't post this whole program.  Maybe some time in the future...
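The idea of matching tags ahead of the target, so that the target is never searched for inside them, can also be sketched with the standard re module. This swaps regular expressions in for pyparsing, and the tag pattern is a deliberately crude assumption: it treats anything between < and > as a tag, rather than only <A> and <IMG> start tags.

```python
import re

# Alternation order matters: a whole tag is tried first, so any "sleeping"
# inside it is consumed as part of the tag and never seen as a target.
pattern = re.compile(r"(<[^>]*>)|\bsleeping\b", re.IGNORECASE)

def replace(match):
    if match.group(1) is not None:
        return match.group(1)   # a tag: put it back unchanged
    return "dead"               # a bare target word

text = ('This parrot <a href="sleeping.htm" target="new">is sleeping</a>. '
        'Really, it is sleeping.')
print(pattern.sub(replace, text))
# This parrot <a href="sleeping.htm" target="new">is dead</a>. Really, it is dead.
```

Note that the tag branch returns the matched text as one string, which is the same job keepOriginalText does for the pyparsing tag expressions.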

      -- Paul

       
      • Tim Arnold
        2006-07-11

        Hey Paul,
        thanks very much for the quick reply and the fix-up. I worked with it last night and I'm still having trouble. Here's a test case, below. Again, using the lambda parse action, I get a good result, but with the 'rfn' parse action, I get the same error as before (expected string, list found).

        must be something weird in my source file img tags I guess.
        ------------
        from pyparsing import *
        import inspect
        def keepOriginalText(s,startLoc,t):
            """Helper parse action to preserve original parsed text,
               overriding any nested parse actions. (p_mg)"""
            f = inspect.stack()[1][0]
            try:
                endloc = f.f_locals["loc"]
            finally:
                del f
            return s[startLoc:endloc]

        def rfn(s,l,t):
            t.src

        makeHTMLStartTag = lambda tag: makeHTMLTags(tag)[0].setParseAction(keepOriginalText)
        start, imge = makeHTMLTags('IMG')
        start.setParseAction(rfn)
        #start.setParseAction(lambda s,l,t:t.src)
        text = ''' <img src="images/cal.png"
        alt="cal image" width="16" height="15"> '''
        s = start.transformString(text)
        print s

         
        • Paul McGuire
          2006-07-11

          Nope, to misquote Shakespeare, the problem is not in our stars, but in our parse actions - both yours and mine.

          This version has a better form of keepOriginalText, that also preserves the named fields.  See the embedded comments for other corrections.  This version prints the <img> tag src attribute instead of the tag itself.

          -- Paul

          from pyparsing import *
          import inspect
          # new and improved version of keepOriginalText - preserves
          # named fields
          def keepOriginalText(s,startLoc,t):
              """Helper parse action to preserve original parsed text,
                 overriding any nested parse actions. (p_mg)"""
              f = inspect.stack()[1][0]
              try:
                  endloc = f.f_locals["loc"]
              finally:
                  del f
              # don't just return the original string, modify the
              # given ParseResults in place, and then return them
              #~ return s[startLoc:endloc]
              del t[:]
              t += ParseResults(s[startLoc:endloc])
              return t

          def rfn(t):
              # oops, this isn't a lambda any more, have to *return* the value
              #~ t.src
              return t.src

          makeHTMLStartTag = lambda tag: makeHTMLTags(tag)[0].setParseAction(keepOriginalText)
             
          # use the lambda, Luke
          #~ start, imge = makeHTMLTags('IMG') 
          start = makeHTMLStartTag('IMG')

          # don't replace our fancy parse action with rfn,
          # append rfn to the list of parse actions
          #~ start.setParseAction(rfn)
          start.addParseAction(rfn)

          #start.setParseAction(lambda s,l,t:t.src)
          text = ''' <img src="images/cal.png"
          alt="cal image" width="16" height="15"> '''
          s = start.transformString(text)
          print s
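The "have to *return* the value" comment is the heart of the original problem, and it can be shown without pyparsing. A lambda returns the value of its expression, while a def whose body is a bare expression returns None, and pyparsing treats a parse action that returns None as leaving the tokens unchanged. The Tokens class below is a hypothetical stand-in for the parsed results.

```python
class Tokens:
    """Hypothetical stand-in for parsed tokens with a named 'src' field."""
    src = "images/cal.png"

def rfn_no_return(s, l, t):
    t.src                  # evaluated and discarded; the function returns None

def rfn_with_return(s, l, t):
    return t.src           # the value is handed back to the caller

as_lambda = lambda s, l, t: t.src   # a lambda returns its expression

t = Tokens()
print(rfn_no_return("", 0, t))      # None
print(rfn_with_return("", 0, t))    # images/cal.png
print(as_lambda("", 0, t))          # images/cal.png
```

With the None-returning version, the token list produced by makeHTMLTags is passed through untouched, which is what transformString then fails on.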

           
    • Paul McGuire
      2006-07-11

      BTW, notice the new abbreviated arg list for parse actions.  Since rfn only needs the incoming tokens, but not the string or parse location, I can just define it as "rfn(t)", and pyparsing fixes up the arg list as necessary.

      Of course, the old format still works, and I have no plans of getting rid of it - this just makes the code look a little cleaner.
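One way a library could accept the shorter argument lists is to count the parameters a callback declares and pass only the trailing arguments. The sketch below is a hypothetical illustration of that idea. The adapt helper is not part of pyparsing, and this is not claimed to be how pyparsing does it internally. It assumes the callback declares between zero and three positional parameters.

```python
import inspect

def adapt(action):
    """Wrap 'action' so it can always be called as wrapped(s, loc, toks)."""
    nparams = len(inspect.signature(action).parameters)
    if nparams > 3:
        raise TypeError("parse action takes at most 3 arguments")

    def wrapped(s, loc, toks):
        full = (s, loc, toks)
        # Pass only the trailing arguments the action declares:
        # 3 -> (s, loc, toks), 2 -> (loc, toks), 1 -> (toks,), 0 -> ()
        return action(*full[3 - nparams:])

    return wrapped

def rfn(t):
    return t[0]

print(adapt(rfn)("ignored", 0, ["images/cal.png"]))  # images/cal.png
```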

      -- Paul

       
      • Tim Arnold
        2006-07-12

        Hi Paul,
        Thanks so much for this fix. arggg, I wasn't thinking of my return function correctly--of course I should have returned the value, not printed it! Just going too fast I guess.

        The change you made for the parseaction fn to accept just the 't' is great--pychecker always warned me that 's' and 'l' were never used.

        Just to let you know, I'm about 1500 lines into my LaTeX-parsing project and it's looking pretty good. When I get out of this emergency-mode speed, I'll put together a little tutorial/tipsheet for working with Forward.

        thanks again,
        --Tim Arnold