
Parsing Yahoo news - Questions

Help
GotJava
2005-10-11
2013-04-27
  • GotJava

    GotJava - 2005-10-11

    Hi,
    I would like to parse the Yahoo News web page so that I can extract news from it. The program has to:
    1. Read the headlines
    2. Display the link source of each headline
    3. Display 2-3 lines of summary

    This is something like an RSS feed.

    After looking at the Yahoo News page source, I decided to locate the headlines with the help of the span tags. Here is a sample from the Yahoo News page:
    <h2 >
    <a href="/s/ap/20051011/ap_on_re_as/pakistan_quake;_ylt=AsicfAeeh5d5R6wwQsMsM1us0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--">Desperate Pakistanis Await Earthquake Aid</a></h2> <em>AP - <span class=recenttimedate> 43 minutes ago</span></em>

    I search for the span tag whose class attribute is recenttimedate and display the headline. I am able to get up to this point, but when I try to display the link (by moving to the parent node of the span tag) I cannot do it. Please explain why.

    Here is the part of the code that I wrote:

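    import org.htmlparser.Parser;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.tags.Span;
    import org.htmlparser.util.NodeList;

    Parser parser = new Parser (sourceURL);
    NodeList list = parser.parse (null);
    // Collect every SPAN tag, searching recursively through the whole page
    NodeList spanList = list.extractAllNodesThatMatch (new TagNameFilter ("SPAN"), true);

    int i = 0;
    while (i < spanList.size ())
    {
        Span spanTag = (Span)spanList.elementAt (i);

        // Constant-first equals avoids a NullPointerException on spans with no class attribute
        if ("recenttimedate".equals (spanTag.getAttribute ("class")))
        {
            // Prints only the text of the parent node, so the href is not shown
            System.out.println (spanTag.getParent ().toPlainTextString ());
        }
        i++;
    }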

    Can anyone point out what is wrong??

    Riaz

     
    • Derrick Oswald

      Derrick Oswald - 2005-10-11

      What is the output?
      Not seeing the link?
      Try toHtml() instead of toPlainTextString.

       
    • GotJava

      GotJava - 2005-10-11

      This is the output when I use toHtml():
      <li>
               <a href="/s/ap/20051011/ap_on_re_mi_ea/iraq">Insurgents Kill More Than 40 Iraqis</a>

               <em>AP - <span class=recenttimedate>1 hour,  29 minutes ago</span></em>      </li>

      So I only need the href value that goes with the text string.

      From this output it seems that <li> is the parent of the span tag, but how do I move from there to the link value?

       
      • Derrick Oswald

        Derrick Oswald - 2005-10-11

        From the LI tag, get the children list and either go through them one by one looking for the link tag or use a TagNameFilter:
          NodeList links = list_item_tag.getChildren ().extractAllNodesThatMatch (new TagNameFilter ("A"));
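
For comparison, here is a minimal JDK-only sketch that pulls the href and headline out of the <li> fragment that toHtml() printed earlier in the thread. It deliberately swaps in a regular expression for the HTML Parser calls discussed above, so it is not the library's API. The class name HeadlineLink and the method firstLink are hypothetical, and the pattern assumes a double-quoted href, as in the sample markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTML Parser: a regex stand-in for walking the node tree.
class HeadlineLink {
    // Matches an <a> tag with a double-quoted href; group 1 is the href, group 2 the link text.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns {href, headline} for the first link in the fragment, or null if there is none. */
    static String[] firstLink(String liHtml) {
        Matcher m = ANCHOR.matcher(liHtml);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        // The <li> fragment from the toHtml() output quoted in the thread
        String li = "<li>\n"
            + "  <a href=\"/s/ap/20051011/ap_on_re_mi_ea/iraq\">Insurgents Kill More Than 40 Iraqis</a>\n"
            + "  <em>AP - <span class=recenttimedate>1 hour,  29 minutes ago</span></em> </li>";
        String[] link = firstLink(li);
        if (link != null) {
            System.out.println(link[1] + " -> " + link[0]);
        }
    }
}
```

A regular expression like this is only reasonable for a single, well-formed fragment. For whole pages, filtering the parsed node tree as described above is the more robust route.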

         
