Re: [Htmlparser-user] How to extract the content of certain html table only


Hi Henry,

Try this:

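import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public static String getKeywords(String file) {
    String keywords = "";
    try {
        // file may be a URL or a local file name
        Parser parser = new Parser(file);
        // keep only the <table> nodes, together with everything inside them
        NodeList list = parser.parse(new TagNameFilter("table"));
        keywords = list.toHtml();
        System.out.println(keywords);
    } catch (Exception pe) {
        pe.printStackTrace();
    }
    return keywords;
}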
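
If you only want some of the tables, one possible approach is to combine the filters with AndFilter and a recursive HasChildFilter. Below is a rough sketch, not a tested recipe: it assumes the relevant tables can be recognised by a <td class="propType"> cell, as in your sample, and the class name RelevantTables and its two helper methods are made up for illustration. It also shows why chaining toHtml() onto collectInto() fails: collectInto() returns void and only appends matches to the list you pass in, so toHtml() has to be called on the list afterwards.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class RelevantTables {

    // Accepts a <table> only if some descendant <td> has class="propType".
    // The marker attribute is an assumption taken from the sample table.
    static NodeFilter relevantTableFilter() {
        NodeFilter markerCell = new AndFilter(
                new TagNameFilter("td"),
                new HasAttributeFilter("class", "propType"));
        return new AndFilter(
                new TagNameFilter("table"),
                new HasChildFilter(markerCell, true)); // true = look at all descendants
    }

    // collectInto() returns void; it appends every match to the given list.
    static NodeList collectTables(Parser parser) throws ParserException {
        NodeList list = new NodeList();
        NodeFilter filter = relevantTableFilter();
        for (NodeIterator e = parser.elements(); e.hasMoreNodes(); ) {
            e.nextNode().collectInto(list, filter);
        }
        return list;
    }

    public static void main(String[] args) throws ParserException {
        // In-memory page so the sketch runs without a network connection.
        String html = "<html><body>"
                + "<table><tr><td class=\"propType\"><b>Address</b></td></tr></table>"
                + "<table><tr><td>Unrelated</td></tr></table>"
                + "</body></html>";
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList tables = collectTables(parser);
        System.out.println(tables.size() + " matching table(s)");
        System.out.println(tables.toHtml());
    }
}
```

Note that with nested tables an outer table also matches when only an inner table holds the marker cell, so the filter may need tightening for pages laid out that way.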

Narindra Jeethan
Office: 780.493.7211
Mobile: 780.288.5961

________________________________

From: htm...@li... [mailto:htm...@li...] On Behalf Of Henry Tran
Sent: Wednesday, March 19, 2008 12:36 AM
To: Htm...@li...
Subject: [Htmlparser-user] How to extract the content of certain html table only

Hi,

I would like to read the content of all the tables on a web page using HTML Parser. Below is an example of what makes up an HTML table:

<table>

    <tr>

    <td class="propType"><b>Address</b></td>
    <td class="propType"><b>Company</b></td>
    <td class="propType"><b>Department</b></td>
    <td class="propType" align="right"><b>Employee</b></td>
    <td colspan="6"><strong class="propType">
    <td><strong>Firstname</strong></td>
    <td><strong>Surname</strong></td>
    <td><strong>DOB</strong></td>
    <td><strong>Sex</strong></td>
    <td class="even">John</td>
    <td class="even">Smith</td>
     <td class="even">01/02/2001</td>
    <td class="even">Male</td>
  </tr>
</table>

I am using the following example from the HTML Parser filter page, but I am still not quite there yet:

 1 import java.io.*;
 2 import java.net.*;
 3 import org.htmlparser.*;
 4 import org.htmlparser.filters.TagNameFilter;
 5 import org.htmlparser.filters.NodeClassFilter;
 6 import org.htmlparser.filters.HasParentFilter;
 7 import org.htmlparser.filters.*;
 8 import org.htmlparser.util.*;
 9 
 10 public class DnldURL {
 11    public static void main (String[] args) throws ParserException {
 12        DnldURL dnldURL = new DnldURL();
 13    }
 14    public DnldURL() throws ParserException {
 15       try {
 16          Parser parser = new Parser ("http://www.abc.com");
 17          parser.parse (new HasParentFilter());
 18          NodeList list = new NodeList();
 19          NodeFilter filter = new OrFilter(
 20                         new TagNameFilter ("table"),
 21                         new HasChildFilter(
 22                         new TagNameFilter("tr")));     
 23          for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
 24    //        System.out.println(e.nextNode().toHtml());
 25              System.out.println(e.nextNode().collectInto(list, filter);
 26       } catch (MalformedURLException mue) {
 27          System.out.println("Ouch - a MalformedURLException happened.");
 28          mue.printStackTrace();
 29          System.exit(1);
 30       } catch (IOException ioe) {
 31          System.out.println("Oops- an IOException happened.");
 32          ioe.printStackTrace();
 33          System.exit(1);
 34       } 
 35    }

The important thing is to get lines 17 and 19-22 set up correctly so that the filter picks up the content, which is then printed on line 25.

Not only am I confused about how to set up the table filter dependencies (<table> ... <tr> ... <td> ...), but also about how to get line 25 to combine the filter and toHtml().

For instance:

    System.out.println(e.nextNode().collectInto(list, filter).toHtml());

which doesn't currently work.

I would also like to set up some dependency on what the content of <table>, <tr> and <td> should be, so that only the relevant tables are retrieved as opposed to all the tables.

Many thanks,
Jack
