Hi, in another forum you (Paul) gave a small example for parsing HTML and making changes to URLs. In response to another person's comment about what would happen if the "href" came after other attributes, you suggested using "makeHTMLTags". I'm pretty new at this stuff, so I was wondering if you could give me an example using makeHTMLTags for something like this. Let's say I want to go to an arbitrary site (www.yahoo.com), go through all the links on the page, and make changes to each one, while accounting for the possibility of hrefs coming after other attributes. Thanks a lot.
Good question, I put a bit of magic into makeHTMLTags and makeXMLTags, and never really described them very well, so a sample program should help some. And, in researching this question, you helped me uncover a bug or two in makeHTMLTags, so I'll be releasing a new pyparsing version in the next day or two (I had a new version pretty much ready to go, but no urgency in putting it out - now I have a reason!).
Here's a simple test program to access all the <A HREF="xxx">lskklsjflj</A> patterns on a web site, and list out a cross-reference of labels to URLs:
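from pyparsing import *
import urllib
anchorStart,anchorEnd = makeHTMLTags("a")

# read HTML from a web page (urllib.urlopen is the Python 2 API)
serverListPage = urllib.urlopen( "http://www.yahoo.com" )
htmlText = serverListPage.read()
serverListPage.close()

# match <A ...>body</A>, saving the text between the tags as "body"
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd
# list each link label with the URL it points to
for tokens,start,end in anchor.scanString(htmlText):
    print("%s -> %s" % (tokens.body, tokens.startA.href))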
First off, note that makeHTMLTags returns two values. Predictably, these are expressions for the start and end tags. The first simplistic version of makeHTMLTags was just
return "<" + Literal(tagname) + ">", "</" + Literal(tagname) + ">"
But this falls pretty far short of what we really need out of these expressions.
For one thing, HTML is far more lax than XML is on matching of opening and closing tags, especially with respect to upper/lower case. So I really wanted these to be CaselessLiteral expressions, so that <B> could match up with a closing </b>.
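As a rough illustration of the caseless matching described above, here is a minimal sketch. The sample string and the variable names are made up for the example, and it assumes a reasonably recent pyparsing release:

```python
from pyparsing import makeHTMLTags, SkipTo

# Build start/end tag expressions for <b>; HTML tag names match in any case.
bold_start, bold_end = makeHTMLTags("b")
bold = bold_start + SkipTo(bold_end).setResultsName("body") + bold_end

# Hypothetical input that mixes upper- and lower-case opening and closing tags.
sample = "plain <B>first</b> and <b>second</B> text"

# Collect the text found between each matched pair of tags.
bodies = [tokens.body for tokens, start, end in bold.scanString(sample)]
print(bodies)
```

Both pairs should be found even though the case of the opening tag differs from that of its closing tag.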
I also realized that I was taking on some of the expression definition responsibility within makeHTMLTags, such as the addition of results names. So I added results names to both opening and closing tags, named "start<tagname>" and "end<tagname>", with the tagname in title case (capitalized first letter).
Next I wanted to handle the case where an opening tag was also its own closing tag, as in this construct: "<empty_tag/>". This is more of an XML-ish feature, but I wanted to be able to share most of the code between making HTML and XML tag expression pairs. So I allowed for the opening tag to include a trailing "/" before its closing ">" sign, and marked the slash with the results name "empty", so that client code could test for tokens.empty, which will return "/" or "".
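A small sketch of testing the "empty" results name follows. The sample text is invented, and the exact value stored under "empty" may differ between pyparsing versions (newer releases appear to store a boolean rather than "/" or ""), so this example only checks truthiness:

```python
from pyparsing import makeHTMLTags

# Only the start-tag expression is needed to detect self-closing tags.
br_start, br_end = makeHTMLTags("br")

# Hypothetical input with one self-closing tag and one plain opening tag.
sample = "one<br/>two<br>three"

# tokens.empty is truthy when the tag ended with "/>", falsy otherwise.
flags = [bool(tokens.empty) for tokens, start, end in br_start.scanString(sample)]
print(flags)
```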
Lastly, this simple form doesn't do anything about attributes in the opening tag. Here things get especially tricky. I accommodated the attributes by including an expression for zero or more "name=quotedString" expressions, returning the names and values as a Dict expression. This permits the caller to access the attributes as dictionary entries or object attributes. You can see this in the code example, when we access the href contents as "tokens.startA.href".
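To show that the position of href among the other attributes does not matter, here is a minimal sketch run against an inline string instead of a live web page. The URLs and attribute values are invented, and it assumes a recent pyparsing release, where attribute names are lower-cased and the surrounding quotes are stripped from the values. It reads the attribute directly off the match (tokens.href), which is a shorter path than the tokens.startA.href form mentioned above:

```python
from pyparsing import makeHTMLTags, SkipTo

anchor_start, anchor_end = makeHTMLTags("a")
anchor = anchor_start + SkipTo(anchor_end).setResultsName("body") + anchor_end

# Hypothetical HTML: the second link puts HREF after two other attributes.
sample = (
    '<a href="http://example.com/one">One</a> '
    '<A class="nav" target="_blank" HREF="http://example.com/two">Two</A>'
)

links = []
for tokens, start, end in anchor.scanString(sample):
    # Attribute access and dictionary access reach the same value.
    assert tokens.href == tokens["href"]
    links.append((tokens.body, tokens.href))

print(links)
```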
I have to sign off here for now. This sample partially works with the latest version of pyparsing, and will be fully functional with the next release (which will include the bugfixes to makeHTMLTags). I'll try to respond in a day or two with the example you asked for, making changes to the HREFs.
Hi. Thanks for the response. I tried your example and it seemed to work fairly well. The only problem is that on longer or more complicated web pages, when I try to sort out all the links, it starts giving me other stuff that's not in the links. I'm not sure if I'm doing something wrong. Here's an example:
The web page was a news article on cnn.com:
url = (http://www.cnn.com/2005/WEATHER/10/10/severe.flooding.ap/index.html)
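from pyparsing import *
from urllib2 import *
url = "http://www.cnn.com/2005/WEATHER/10/10/severe.flooding.ap/index.html"
feed = urlopen(url).read()
anchorStart, anchorEnd = makeHTMLTags("a")
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd
for tokens, start, end in anchor.scanString(feed):
    print(feed[start:end])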
When I print this, it gives me links but a lot of unwanted HTML as well. Are start and end supposed to be the indexes for the start and end of each link? When I just print out the tokens, it seems to give me all the <a ...>...</a> elements just fine. Or, as you suggested, using "tokens.startA.href" I get a listing of all the hrefs. This is sort of what I want, except I would like to be able to insert text into the links as well as just parse through all of them. Is there a way to do this with the tokens, or am I doing something wrong? Thanks for your time.
Yes, start and end are the start and end location of the found tokens within the input string. Note that pyparsing by default implicitly expands tabs, which changes the offsets of the token locations.
You can fix this in several ways; the simplest is to do
feed = feed.expandtabs()
just before your loop on scanString.
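To make the offset issue concrete, here is a minimal sketch with a made-up input string containing two leading tabs. It assumes the default tab handling described above, where scanString expands tabs before matching:

```python
from pyparsing import makeHTMLTags, SkipTo

anchor_start, anchor_end = makeHTMLTags("a")
anchor = anchor_start + SkipTo(anchor_end).setResultsName("body") + anchor_end

# Hypothetical input: two tabs ahead of the link shift the reported offsets.
raw = '\t\t<a href="/one">One</a> trailing text'
expanded = raw.expandtabs()

# Slicing the raw string with offsets computed on the tab-expanded text
# picks up the wrong characters.
raw_slices = [raw[start:end] for tokens, start, end in anchor.scanString(raw)]

# Expanding tabs first keeps the offsets and the sliced string in agreement.
good_slices = [expanded[start:end] for tokens, start, end in anchor.scanString(expanded)]

print(raw_slices)
print(good_slices)
```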
If you want to make changes, you can define the changes in a parse action. Parse actions receive the matched tokens as the third argument (after the input string and match location). In the body of your parse action, you can read the incoming tokens, and then return a modified form of them. Then instead of using scanString, use transformString, which internally uses scanString to piece together the parts of the input string with any values returned from parse actions.
In your case, you would attach a parse action to anchorStart, manipulate the href value, and then return a complete token string for the <A HREF...> tag, including your changes. (Unfortunately, you can't just update the .href attribute in place; it is a read-only value.)
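Here is one way this could look, as a sketch rather than a definitive recipe. The function name add_tracking, the "?ref=demo" suffix, and the sample HTML are all invented for illustration. It assumes a recent pyparsing release in which each attribute arrives as a [name, value] group with the quotes already stripped, so the tag is rebuilt with double quotes (a value that itself contains a double quote would need extra care):

```python
from pyparsing import ParseResults, makeHTMLTags

anchor_start, anchor_end = makeHTMLTags("a")

def add_tracking(s, loc, toks):
    """Rebuild the opening <a> tag, appending a hypothetical suffix to href."""
    parts = ["a"]
    for item in toks:
        # Attributes are [name, value] groups; the tag name and the
        # empty-tag flag are plain tokens and are skipped here.
        if isinstance(item, ParseResults) and len(item) == 2:
            name, value = item
            if name == "href":
                value = value + "?ref=demo"
            parts.append('%s="%s"' % (name, value))
    closing = " />" if toks.empty else ">"
    return "<" + " ".join(parts) + closing

# Attach the parse action to the start tag only; closing tags pass through.
anchor_start.addParseAction(add_tracking)

html = '<p><a class="nav" href="/one">One</a> and <A HREF="/two">Two</A></p>'
result = anchor_start.transformString(html)
print(result)
```

Because only the start tag is transformed, everything else in the input, including the link text and the original closing tags, is copied through unchanged.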