Hi,
I would like to parse the Yahoo News web page so that I can extract news items from it. The program has to:
1. Read the headlines
2. Display the link source of that headline
3. Display 2-3 lines of summary
This is something like an RSS feed.
After looking at the Yahoo News page source, I decided to locate the headlines with the help of the span tags. Here is a sample from the Yahoo News page source:
<h2 >
<a href="/s/ap/20051011/ap_on_re_as/pakistan_quake;_ylt=AsicfAeeh5d5R6wwQsMsM1us0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--">Desperate Pakistanis Await Earthquake Aid</a></h2> <em>AP - <span class=recenttimedate> 43 minutes ago</span></em>
I search for the span tag whose class attribute is recenttimedate and display the headline. I am able to get this far, but when I try to display the link (by moving to the parent node of the span tag) I cannot do it. Please explain why.
Here is the part of the code that I wrote:
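import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.Span;
import org.htmlparser.util.NodeList;

// sourceURL is the address of the Yahoo News page
Parser parser = new Parser (sourceURL);
NodeList list = parser.parse (null);
NodeList spanList = list.extractAllNodesThatMatch(new TagNameFilter ("SPAN"),true);
int i = 0;
while(i < spanList.size())
{
    Span spanTag = (Span)spanList.elementAt(i);
    // constant first: getAttribute returns null when the span has no class attribute
    if("recenttimedate".equals(spanTag.getAttribute("class")))
    {
        // prints only the plain text of the parent node, so the href is not shown
        System.out.println(spanList.elementAt(i).getParent().toPlainTextString());
    }
    i++;
}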
Can anyone point out what is wrong?
Riaz
What is the output?
Not seeing the link?
Try toHtml() instead of toPlainTextString().
This is the output when I use toHtml():
<li>
<a href="/s/ap/20051011/ap_on_re_mi_ea/iraq">Insurgents Kill More Than 40 Iraqis</a>
<em>AP - <span class=recenttimedate>1 hour, 29 minutes ago</span></em> </li>
So I only need the href value that goes with the headline text.
Looking at this output, it seems that <li> is the parent of the span tag, but how do I get from there to the link?
From the LI tag, get the children list and either go through them one by one looking for the link tag or use a TagNameFilter:
NodeList links = list_item_tag.getChildren ().extractAllNodesThatMatch (new TagNameFilter ("A"));
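Putting the pieces of this thread together, the whole lookup might be sketched roughly as below. This is only a sketch, not a tested solution: it assumes the HTML Parser library is on the classpath, that the link and the timestamp span share the same parent (as the toHtml() output above suggests), and it uses TagNameFilter, the filter class already used in the question's code. The class name HeadlineLinks and the method headlineLinks are made up for illustration, and the markup is parsed from a string rather than fetched from the live page.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, named here only for illustration.
public class HeadlineLinks {

    // For every <span class=recenttimedate>, look at the span's parent and
    // collect "headline text -> href" for each <a> directly under that parent.
    static List<String> headlineLinks(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");

        // Match span tags whose class attribute is exactly "recenttimedate".
        NodeList spans = parser.extractAllNodesThatMatch(
                new AndFilter(
                        new TagNameFilter("SPAN"),
                        new HasAttributeFilter("class", "recenttimedate")));

        List<String> result = new ArrayList<String>();
        for (int i = 0; i < spans.size(); i++) {
            Node parent = spans.elementAt(i).getParent();
            if (parent == null || parent.getChildren() == null) {
                continue; // span sits at the top level, nothing to search
            }
            // Non-recursive search: only the parent's direct children.
            NodeList links = parent.getChildren()
                    .extractAllNodesThatMatch(new TagNameFilter("A"));
            for (int j = 0; j < links.size(); j++) {
                // TagNameFilter only accepts Tag nodes, so the cast is safe.
                Tag link = (Tag) links.elementAt(j);
                // getAttribute returns the raw href exactly as written in the page.
                result.add(link.toPlainTextString().trim()
                        + " -> " + link.getAttribute("href"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws ParserException {
        // Sample markup copied from the toHtml() output earlier in the thread.
        String html = "<li>\n"
                + "<a href=\"/s/ap/20051011/ap_on_re_mi_ea/iraq\">"
                + "Insurgents Kill More Than 40 Iraqis</a>\n"
                + "<em>AP - <span class=recenttimedate>1 hour, 29 minutes ago</span></em> </li>";
        for (String line : headlineLinks(html)) {
            System.out.println(line);
        }
    }
}
```

The href comes back relative here because it is read as a plain attribute. If the span turns out not to share a parent with the link on some part of the page, the same search can be repeated one level further up with getParent().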