Deleting named results from ParseResults

Rob
2009-01-09
2013-05-14
  • Rob
    2009-01-09

    I want to parse Firefox bookmark.html files and ultimately merge various files, remove duplicates and write a new file. I have a folder with about 200 such files, collected over 5 years! I probably could have done it by hand by now, but I'm having fun with pyparsing.

    Anyway, I have written a parser that uses 'listAllMatches=True' to create a list of hyperlinks for each bookmark folder. This works fine.

    As an example of the output:

    >>> tokens[8]
    (['<DT><H3 ADD_DATE="1106333529" ID="rdf:#$HGCRi2">Linux Audio </H3>', '<DT><A HREF="http://www.ladspa.org/" ADD_DATE="1106333529" LAST_CHARSET="ISO-8859-1" ID="rdf:#$IGCRi2">Linux Audio Developer\'s Simple Plugin API (LADSPA)</A>', '<DT><A HREF="http://www.djcj.org/LAU/guide/index.php" ADD_DATE="1106333529" LAST_VISIT="1112727843" LAST_CHARSET="ISO-8859-1" ID="rdf:#$JGCRi2">Linux Audio Users Guide</A>', '<DT><A HREF="http://www.google.co.uk/search?q=ladspa+surround+sound&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official" ADD_DATE="1106333529" LAST_CHARSET="UTF-8" ID="rdf:#$KGCRi2">Google Search: ladspa surround sound</A>', '<DT><A HREF="http://plugin.org.uk/" ADD_DATE="1106333529" LAST_VISIT="1106334523" LAST_CHARSET="UTF-8" ID="rdf:#$LGCRi2">plugin.org.uk</A>', '<DT><A HREF="http://www.oreillynet.com/pub/au/101" ADD_DATE="1106333529" LAST_CHARSET="ISO-8859-1" ID="rdf:#$MGCRi2">Dave Phillips</A>'], {'Folder': [('Linux Audio ', 0)], 'HyperLink': [('http://www.ladspa.org/', 1), ('http://www.djcj.org/LAU/guide/index.php', 2), ('http://www.google.co.uk/search?q=ladspa+surround+sound&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official', 3), ('http://plugin.org.uk/', 4), ('http://www.oreillynet.com/pub/au/101', 5)]})

    So I have a list of strings where the first is the folder name, followed by one string per hyperlink entry. Each string is the original line from the html file. These strings will eventually be used to create a new bookmark.html. Then there is a named result for the folder name and accumulated result names for the hyperlinks; these are used to inspect the structure.

    Now I want to merge two pyparsing.ParseResults structures by deleting duplicate hyperlinks but I am finding that deleting tokens from the structure doesn't delete the associated named result. This makes it hard to track what has been done to the structure.

    I have also found that I can't delete an item from the accumulated list of named results because ParseResults['HyperLink'] returns a newly created ParseResults rather than giving access to the internal dictionary.

    What can I do? I can't think of any other way to delete a single occurrence from a ListAllMatches named result without changing the api to allow access to the internal __tokdict where the information is stored.

    I could delete the named result list and then recreate it using the class _ParseResultsWithOffset every time I delete a token.

    I can access the structure's internal dictionary through myinstance._ParseResults__tokdict but that is considered bad python.
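
    A minimal sketch of the behaviour described above, using a toy class rather than pyparsing itself. The names Results and __tokdict here are illustrative stand-ins; the real ParseResults internals may differ between pyparsing versions.

```python
# Toy stand-in for a class with a "private" attribute. This is NOT
# pyparsing's implementation; the names are assumptions for illustration.
class Results:
    def __init__(self):
        # Inside the class body, __tokdict is stored as _Results__tokdict.
        self.__tokdict = {
            "HyperLink": [("http://www.ladspa.org/", 1),
                          ("http://plugin.org.uk/", 4)],
        }

    def __getitem__(self, name):
        # Returns a freshly built list, so mutating it leaves the
        # instance's internal data untouched.
        return [value for value, _offset in self.__tokdict[name]]


r = Results()

links_copy = r["HyperLink"]
del links_copy[0]                      # only the copy changes
assert len(r["HyperLink"]) == 2

# Reaching in through the mangled name edits the real data.
internal = r._Results__tokdict["HyperLink"]
del internal[0]
assert r["HyperLink"] == ["http://plugin.org.uk/"]
```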

    I've considered changing ParseResults.__delitem__() to delete any associated named results when a token is deleted. Currently this interferes with Combine(). Combine() could be rewritten to recreate the __tokdict but only using the _ParseResultsWithOffset class. Otherwise special cases could be created in ParseResults.__delitem__(); for instance del toklist[:] or del toklist[slice()] wouldn't delete named results but del toklist[int] would. However maybe there would be times that people would want to delete the tokens but not the named results?

    For now I am going to use a hacked pyparsing so I can get my bookmark script to work but have I missed something obvious?

    Rob

     
    • Paul McGuire
      2009-01-09

      Very impressive!  And I'm glad pyparsing is in the "fun" category for you, that was certainly part of my goal in writing it.

      I looked briefly at the __set/del/getitem__ methods in the ParseResults class, and an easy in-place update doesn't leap out at me.  I understand about reaching into the _ParseResults__tokdict attribute as "bad python" - I really assume that the leading '_'s are supposed to mean "keep out!".

      On the other hand, this *is* a one-shot utility that you are going to use to merge these two large bookmark files, you seem to have studied the problem pretty thoroughly, and accessing the "private" field through its mangled name seems preferable to hacking the pyparsing module itself.

      But to be more Pythonic than that, I think the Pythonic Way to approach this would not be to update the ParseResults in place, but to use the two structures to create a third consolidated structure.  Just as the advice to new users on c.l.py or the tutor list, when asking how to delete entries from a list, is usually "create a new list that leaves out the entries you don't want."  And there is no requirement that this new structure needs to be a ParseResults (which I will call "PR"s from here because I am lazy).  If I were presented with your situation, I would take the two PR structures, and then create a defaultdict(set) for the consolidation.  Then go through each PR and add all of the hyperlinks for each foldername to the dict[foldername] set (any other filtering you want to do, do here).  Or if preserving order is important, use a defaultdict(list), and test each link for existence in the list before adding it.  After processing both PRs, the defaultdict will contain your consolidated structure, with no duplicates.
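
      The consolidation approach described above might look roughly like this. It is a sketch only: the function name consolidate and the assumption that each parse result has already been reduced to (folder_name, list_of_urls) pairs are not from the thread.

```python
from collections import defaultdict


def consolidate(*sources):
    """Merge (folder, links) pairs from several sources, dropping duplicates.

    Each source is assumed to be an iterable of (folder_name, list_of_urls)
    pairs already pulled out of a parse result; that shape is an assumption
    made for this sketch.
    """
    merged = defaultdict(list)
    for source in sources:
        for folder, links in source:
            for link in links:
                # A list keeps first-seen order. If order does not matter,
                # defaultdict(set) with .add() does the same job.
                if link not in merged[folder]:
                    merged[folder].append(link)
    return dict(merged)


file_a = [("Linux Audio", ["http://www.ladspa.org/",
                           "http://plugin.org.uk/"])]
file_b = [("Linux Audio", ["http://plugin.org.uk/",
                           "http://www.oreillynet.com/pub/au/101"])]

result = consolidate(file_a, file_b)
print(result["Linux Audio"])
```

      Neither input is modified; the merged structure is built fresh, which sidesteps the problem of deleting from a ParseResults altogether.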

      HTH,
      -- Paul

       
    • Rob
      2009-01-09

      Thanks for the quick response and good advice. I'll probably use the mangled name approach, but I was interested to see that __iadd__(self,other) can reach other's private attributes without spelling out the mangled name when 'other' is of the same class. So I might create a subclass of ParseResults with an overridden __iadd__(). Could that work? Or could I redefine __iadd__() directly?

      Rob

       
      • Paul McGuire
        2009-01-09

        The hitch in overriding or adding behavior in a derived class is that sometimes you get surprised and get back an instance of the superclass, which *doesn't* have your special behavior.  This is usually the case when people derive from, say, str, and then use one of the superclass methods and get back a str, not the derived class.  For instance:

        class RomanNumeral(str):
            def __init__(self,s):
                orig_s = s
                s = s.upper()
                self.value = 0
                for rn,v in (("M",1000),("CM",900),("D",500),("CD",400),
                              ("C",100),("XC",90),("L",50),("XL",40),
                              ("X",10),("IX",9),("V",5),("IV",4),("I",1)):
                    while s.startswith(rn):
                        self.value += v
                        s = s[len(rn):]
                if s:
                    raise ValueError(
                        "invalid Roman numeral string '%s' specified" %
                        orig_s)

        seventy_nine = RomanNumeral("LXXIX")
        print(seventy_nine.value)
        print(seventy_nine.lower())
        print(seventy_nine.lower().value)

        Prints:

        79
        lxxix
        Traceback (most recent call last):
          File "romanNumeralStr.py", line 14, in <module>
            print(seventy_nine.lower().value)
        AttributeError: 'str' object has no attribute 'value'

        lower() returns a str, which has no value attribute. 
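
        One way to avoid that surprise, sketched here rather than taken from the thread, is to override the inherited method in the subclass and re-wrap its result. The compacted class body and the overridden lower() below are illustrative assumptions, written for Python 3.

```python
class RomanNumeral(str):
    # (symbol, value) pairs, largest first, as in the example above.
    VALUES = (("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
              ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
              ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1))

    def __init__(self, s):
        remaining = s.upper()
        self.value = 0
        for symbol, amount in self.VALUES:
            while remaining.startswith(symbol):
                self.value += amount
                remaining = remaining[len(symbol):]
        if remaining:
            raise ValueError(
                "invalid Roman numeral string %r specified" % s)

    def lower(self):
        # str.lower() returns a plain str; wrap it back into the subclass
        # so the result still carries .value.
        return type(self)(super().lower())


seventy_nine = RomanNumeral("LXXIX")
lowered = seventy_nine.lower()
print(lowered, lowered.value)
```

        Every inherited method that returns a new string would need the same treatment, which is why this only helps if you control all the places where new instances get created.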

        I fear that ParseResults operations may happen behind the scenes on you, giving you an unadorned PR instead of your enhanced derived class.  But if you can keep this straight in your own code, then overriding __iadd__ or one of __set/get/delitem__ might do the job.

        Good luck,
        -- Paul

         
    • Paul McGuire
      2009-01-09

      Urk, I forgot SF doesn't handle leading whitespace well.  Here is the RomanNumeral sample code from the pyparsing pastebin: http://pyparsing.pastebin.com/f321124ea

      -- Paul

       
    • Paul McGuire
      2009-01-09

      Yikes!  It just dawned on me that you are parsing HTML with pyparsing - I hope the makeHTMLTags method is useful to you in this process.  makeHTMLTags takes care of a lot of surprises that crop up when trying to wade through HTML code (unexpected attributes, attributes out of the expected order, attributes with no quotes around their values, tag names and attributes of the wrong case, etc.), and returns helpful PR structures containing the values of tag attributes if any are given.

      -- Paul

       
    • Rob
      2009-01-12

      Thanks for all of the advice! I didn't have internet over the weekend so couldn't see it until now.

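      I tried to inherit from ParseResults, but I found that when you inherit from a class you still don't get access to its private attributes/methods, because names in the subclass are mangled with the subclass's name rather than the parent's.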

      I tried to add/change the class' bound methods. This can be done by assigning methods to the class (rather than assigning to an instance which doesn't lead to bound methods). However, name mangling does not get done on the assigned method.

      All of which means that I just have to manually use name mangling.
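
      The three cases discussed above (a method defined inside the class, a method in a subclass, and a function assigned to the class afterwards) can be seen side by side in a toy example. Base, Derived and patched are hypothetical names and this is not pyparsing code.

```python
class Base:
    def __init__(self):
        self.__tokdict = {"HyperLink": []}   # stored as _Base__tokdict

    def __iadd__(self, other):
        # Defined inside Base, so other.__tokdict is also rewritten to
        # other._Base__tokdict and resolves on any Base instance.
        self.__tokdict["HyperLink"] += other.__tokdict["HyperLink"]
        return self


class Derived(Base):
    def peek(self):
        # Mangled with the subclass name: looks up _Derived__tokdict.
        return self.__tokdict


def patched(self):
    # Defined outside any class body, so no mangling happens at all.
    return self.__tokdict


Base.patched = patched   # becomes a bound method on instances

d = Derived()
failed = []
for method in (d.peek, d.patched):
    try:
        method()
    except AttributeError:
        failed.append(method.__name__)
print(failed)

# In-class access and the explicit mangled name both work.
extra = Base()
extra._Base__tokdict["HyperLink"].append(("http://plugin.org.uk/", 0))
d += extra
print(d._Base__tokdict)
```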

      I didn't use makeHTMLTags in my parser, although I knew that I should. I was just learning pyparsing for the first time and I was a bit overwhelmed by the returned list, so instead I used custom literals and pp.Combine() to split the file into bookmark tokens.

      I'd just come from trying to use lxml / Beautiful Soup to parse the files and I was getting frustrated that those parsers insisted that the <DT> <DD> <p> tags should be closed and did so in seemingly random places. So I went for a manual approach. At least with the current code, I can parse a bookmark file and then recreate the original string, if I desire. But I've never done much parsing, so I'm probably caring about the wrong things.

      The reason that I wanted to combine two folders and then remove duplicates was to cover those cases where I already had duplicate links within a folder of bookmarks, then I could remove all duplicates in one pass but that is kind of like premature optimisation? Now I've made a simple function that turns my instance of ParseResults into a set of nested dictionaries (using _ParseResults__tokdict ), so duplicates are not an issue (within any given folder). Should be easy to create a function that merges two of these dictionaries when I next have time?!
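
      A merge function for two such nested dictionaries could be as small as the sketch below. The {folder: {url: original_line}} shape and the name merge_bookmarks are assumptions for illustration; the actual structure in the pastebin code may differ.

```python
def merge_bookmarks(first, second):
    """Return a new {folder: {url: original_line}} dict combining both inputs.

    The nested-dict shape is an assumption for this sketch. When the same
    URL appears in both inputs, the entry from `first` is kept.
    """
    # Copy the inner dicts so neither input is modified.
    merged = {folder: dict(links) for folder, links in first.items()}
    for folder, links in second.items():
        target = merged.setdefault(folder, {})
        for url, line in links.items():
            target.setdefault(url, line)
    return merged


old = {"Linux Audio": {"http://www.ladspa.org/": "<DT>...old line..."}}
new = {"Linux Audio": {"http://www.ladspa.org/": "<DT>...new line...",
                       "http://plugin.org.uk/": "<DT>...plugin line..."},
       "News": {}}

combined = merge_bookmarks(old, new)
print(sorted(combined))
```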

       
    • Rob
      2009-01-12

      This is my code if you are interested

      http://pyparsing.pastebin.com/m63579778

      Rob

       
    • Rob
      2009-01-12

      Just for the record, I now know that you can add a bound method to a class instance (and not the class)

      either use the 'new' module (deprecated, Python 2 only):

      import new
      foo = new.instancemethod(function, instance, cls)
      setattr(instance, foo.func_name, foo)

      or the types module (the three-argument form is Python 2):

      import types
      instance.new_method = types.MethodType(function, instance, cls)

      Neither does name mangling though!
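
      For reference, in Python 3 the 'new' module no longer exists and types.MethodType takes just the function and the instance. A small sketch with a toy class (Holder and reveal are hypothetical names) shows the binding and the fact that the mangled name still has to be written out by hand:

```python
import types


class Holder:
    def __init__(self):
        self.__secret = 42          # stored as _Holder__secret


def reveal(self):
    # Written outside the class body, so the mangled name must be
    # spelled out explicitly.
    return self._Holder__secret


h = Holder()
h.reveal = types.MethodType(reveal, h)   # bound to this one instance only

other = Holder()
print(h.reveal())
print(hasattr(other, "reveal"))
```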