[Htmlparser-user] How to extract the content of certain html table only
From: Henry T. <htr...@ya...> - 2008-03-19 06:35:44
Hi,

I would like to read the content of all the tables from a web page using HTML Parser. Below is an example of what makes up an HTML table:

    <table>
    <tr>
    <td class="propType"><b>Address</b></td>
    <td class="propType"><b>Company</b></td>
    <td class="propType"><b>Department</b></td>
    <td class="propType" align="right"><b>Employee</b></td>
    <td colspan="6"><strong class="propType">
    <td><strong>Firstname</strong></td>
    <td><strong>Surname</strong></td>
    <td><strong>DOB</strong></td>
    <td><strong>Sex</strong></td>
    <td class="even">John</td>
    <td class="even">Smith</td>
    <td class="even">01/02/2001</td>
    <td class="even">Male</td>
    </tr>
    </table>

I am using the following example, based on the one provided on the HTML Parser filter page, but I am not quite there yet:

     1 import java.io.*;
     2 import java.net.*;
     3 import org.htmlparser.*;
     4 import org.htmlparser.filters.TagNameFilter;
     5 import org.htmlparser.filters.NodeClassFilter;
     6 import org.htmlparser.filters.HasParentFilter;
     7 import org.htmlparser.filters.*;
     8 import org.htmlparser.util.*;
     9
    10 public class DnldURL {
    11   public static void main (String[] args) throws ParserException {
    12     DnldURL dnldURL = new DnldURL();
    13   }
    14   public DnldURL() throws ParserException {
    15     try {
    16       Parser parser = new Parser ("http://www.abc.com");
    17       parser.parse (new HasParentFilter());
    18       NodeList list = new NodeList();
    19       NodeFilter filter = new OrFilter(
    20         new TagNameFilter ("table"),
    21         new HasChildFilter(
    22           new TagNameFilter("tr")));
    23       for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
    24         // System.out.println(e.nextNode().toHtml());
    25         System.out.println(e.nextNode().collectInto(list, filter));
    26     } catch (MalformedURLException mue) {
    27       System.out.println("Ouch - a MalformedURLException happened.");
    28       mue.printStackTrace();
    29       System.exit(1);
    30     } catch (IOException ioe) {
    31       System.out.println("Oops- an IOException happened.");
    32       ioe.printStackTrace();
    33       System.exit(1);
    34     }
    35   }

The important thing is to get lines 17 and 19-22 set up correctly, so that the filter picks up the table content and line 25 prints it. I am confused not only about how to set up the table filter dependencies (<table> …<tr> …<td>…) but also about how to make line 25 combine the filter with toHtml(). For instance:

    System.out.println(e.nextNode().collectInto(list, filter).toHtml());

which currently doesn't work.

I would also like to set up some conditions on what the content of <table>, <tr> and <td> should be, so that only the relevant tables are retrieved rather than all of them.

Many thanks,
Jack
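One possible direction for the question above, offered as a sketch rather than a confirmed answer. It assumes the HTML Parser (org.htmlparser) jar is on the classpath, and that a "relevant" table can be recognised by containing a tag with class="propType", as in the sample markup. The class name TableExtractor, the method extractRows and the inline test markup are hypothetical and only there for illustration. As far as I can tell, collectInto() fills the NodeList passed to it and returns nothing, which would explain why chaining toHtml() onto it fails; the sketch uses Parser.extractAllNodesThatMatch() instead, which returns the matching nodes as a NodeList.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.tags.TableRow;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper; the class name and the "propType" criterion are assumptions.
public class TableExtractor {

    /** Returns the cell text of every row in each table that contains a class="propType" tag. */
    public static List<List<String>> extractRows(Parser parser) throws ParserException {
        // Accept <table> tags that have, at any depth, a descendant tag with class="propType".
        NodeFilter relevantTable = new AndFilter(
                new NodeClassFilter(TableTag.class),
                new HasChildFilter(new HasAttributeFilter("class", "propType"), true));

        // Walks the whole page once and collects every node the filter accepts.
        NodeList tables = parser.extractAllNodesThatMatch(relevantTable);

        List<List<String>> rows = new ArrayList<List<String>>();
        for (int i = 0; i < tables.size(); i++) {
            TableTag table = (TableTag) tables.elementAt(i);
            for (TableRow row : table.getRows()) {
                List<String> cells = new ArrayList<String>();
                for (TableColumn cell : row.getColumns()) {
                    // Strip the markup (<b>, <strong>) and keep only the text of the cell.
                    cells.add(cell.toPlainTextString().trim());
                }
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws ParserException {
        // Inline sample so the sketch runs without network access (made-up markup).
        String html = "<table><tr><td class=\"propType\"><b>Firstname</b></td>"
                + "<td class=\"propType\"><b>Surname</b></td></tr>"
                + "<tr><td class=\"even\">John</td><td class=\"even\">Smith</td></tr></table>"
                + "<table><tr><td>ignored</td></tr></table>";
        Parser parser = Parser.createParser(html, "UTF-8");
        for (List<String> row : extractRows(parser)) {
            System.out.println(row);
        }
    }
}
```

To run this against a live page, a parser built with new Parser("http://www.abc.com") could be passed to extractRows() in place of the inline one. If the raw markup is wanted instead of the cell text, calling toHtml() on the NodeList returned by extractAllNodesThatMatch() should give it. One caveat: because the child filter is recursive, an outer table that merely wraps a matching inner table would also be accepted, so nested layouts may need a stricter condition.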