Re: [Htmlparser-user] How to extract the content of certain html tableonly
From: Narindra J. <Nar...@te...> - 2008-03-19 13:55:09
Hi Henry,

Try this:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;

    public static String getKeywords(String file) {
        String keywords = "";
        try {
            Parser parser = new Parser(file);
            // Collect the <table> nodes and keep their HTML.
            NodeList list = parser.parse(new TagNameFilter("table"));
            keywords = list.toHtml();
            System.out.println(keywords);
        } catch (Exception pe) {
            pe.printStackTrace();
        }
        return keywords;
    }

Narindra Jeethan
Office: 780.493.7211
Mobile: 780.288.5961

________________________________
From: htm...@li... [mailto:htm...@li...] On Behalf Of Henry Tran
Sent: Wednesday, March 19, 2008 12:36 AM
To: Htm...@li...
Subject: [Htmlparser-user] How to extract the content of certain html tableonly

Hi,

I would like to read the content of all the tables from a web page using HTML Parser. Below is an example of what makes up an HTML table:

    <table>
    <tr>
    <td class="propType"><b>Address</b></td>
    <td class="propType"><b>Company</b></td>
    <td class="propType"><b>Department</b></td>
    <td class="propType" align="right"><b>Employee</b></td>
    <td colspan="6"><strong class="propType">
    <td><strong>Firstname</strong></td>
    <td><strong>Surname</strong></td>
    <td><strong>DOB</strong></td>
    <td><strong>Sex</strong></td>
    <td class="even">John</td>
    <td class="even">Smith</td>
    <td class="even">01/02/2001</td>
    <td class="even">Male</td>
    </tr>
    </table>

I am using the following example from the HTML Parser filter page, but I am not quite there yet:

     1  import java.io.*;
     2  import java.net.*;
     3  import org.htmlparser.*;
     4  import org.htmlparser.filters.TagNameFilter;
     5  import org.htmlparser.filters.NodeClassFilter;
     6  import org.htmlparser.filters.HasParentFilter;
     7  import org.htmlparser.filters.*;
     8  import org.htmlparser.util.*;
     9
    10  public class DnldURL {
    11      public static void main (String[] args) throws ParserException {
    12          DnldURL dnldURL = new DnldURL();
    13      }
    14      public DnldURL() throws ParserException {
    15          try {
    16              Parser parser = new Parser ("http://www.abc.com");
    17              parser.parse (new HasParentFilter());
    18              NodeList list = new NodeList();
    19              NodeFilter filter = new OrFilter(
    20                  new TagNameFilter ("table"),
    21                  new HasChildFilter(
    22                      new TagNameFilter("tr")));
    23              for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
    24                  // System.out.println(e.nextNode().toHtml());
    25                  System.out.println(e.nextNode().collectInto(list, filter));
    26          } catch (MalformedURLException mue) {
    27              System.out.println("Ouch - a MalformedURLException happened.");
    28              mue.printStackTrace();
    29              System.exit(1);
    30          } catch (IOException ioe) {
    31              System.out.println("Oops- an IOException happened.");
    32              ioe.printStackTrace();
    33              System.exit(1);
    34          }
    35      }

The important thing is to get lines 17 and 19-22 set up correctly so that the filter picks up the content, which is then printed on line 25. I am confused both about how to set up the table filter dependencies (<table> ... <tr> ... <td> ...) and about how to get line 25 to combine the filter with toHtml(). For instance:

    System.out.println(e.nextNode().collectInto(list, filter).toHtml());

which currently doesn't work.

I would also like to set up some conditions on what the content of <table>, <tr> and <td> should be, so that only the relevant tables are retrieved rather than all the tables.

Many thanks,
Jack
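
On the last part of the question (retrieving only the relevant tables), the general idea is to combine a tag-name test with a test on the table's descendants, then collect matches recursively. The sketch below illustrates that composition in plain Java only. The Node class, the helper names (tagName, hasClass, hasDescendant, collectInto, relevantTables, cellTexts) and the sample tree are all hypothetical stand-ins invented for this illustration; they are not HTML Parser classes. With the library itself, the same shape would presumably be assembled from its own filter classes, for example an AndFilter wrapping a TagNameFilter and a HasChildFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class TableFilterSketch {

    /** Hypothetical stand-in for a parsed tag; NOT an HTML Parser class. */
    static final class Node {
        final String tag;       // element name, e.g. "table"
        final String cssClass;  // value of the class attribute, or null
        final String text;      // text directly inside the element, or null
        final List<Node> children;

        Node(String tag, String cssClass, String text, Node... children) {
            this.tag = tag;
            this.cssClass = cssClass;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Accepts nodes with the given tag name (case-insensitive). */
    static Predicate<Node> tagName(String name) {
        return n -> n.tag.equalsIgnoreCase(name);
    }

    /** Accepts nodes whose class attribute equals the given value. */
    static Predicate<Node> hasClass(String value) {
        return n -> value.equals(n.cssClass);
    }

    /** Accepts nodes that have at least one descendant accepted by inner. */
    static Predicate<Node> hasDescendant(Predicate<Node> inner) {
        return n -> anyDescendantMatches(n, inner);
    }

    private static boolean anyDescendantMatches(Node node, Predicate<Node> inner) {
        for (Node child : node.children) {
            if (inner.test(child) || anyDescendantMatches(child, inner)) {
                return true;
            }
        }
        return false;
    }

    /** Walks the subtree rooted at node and appends every accepted node to out. */
    static void collectInto(Node node, Predicate<Node> filter, List<Node> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (Node child : node.children) {
            collectInto(child, filter, out);
        }
    }

    /** Tables that contain at least one <td class="propType"> cell. */
    static List<Node> relevantTables(Node root) {
        Predicate<Node> headerCell = tagName("td").and(hasClass("propType"));
        Predicate<Node> filter = tagName("table").and(hasDescendant(headerCell));
        List<Node> out = new ArrayList<>();
        collectInto(root, filter, out);
        return out;
    }

    /** Text of every <td> inside the given table, in document order. */
    static List<String> cellTexts(Node table) {
        List<Node> cells = new ArrayList<>();
        collectInto(table, tagName("td"), cells);
        List<String> texts = new ArrayList<>();
        for (Node cell : cells) {
            texts.add(cell.text);
        }
        return texts;
    }

    /** Invented sample: one layout table and one data table. */
    static Node sampleDocument() {
        Node layout = new Node("table", null, null,
            new Node("tr", null, null,
                new Node("td", null, "Menu")));
        Node data = new Node("table", null, null,
            new Node("tr", null, null,
                new Node("td", "propType", "Address"),
                new Node("td", "propType", "Company")),
            new Node("tr", null, null,
                new Node("td", "even", "John"),
                new Node("td", "even", "Smith")));
        return new Node("body", null, null, layout, data);
    }

    public static void main(String[] args) {
        for (Node table : relevantTables(sampleDocument())) {
            System.out.println(cellTexts(table)); // prints [Address, Company, John, Smith]
        }
    }
}
```

The layout table is skipped because none of its cells carries the propType class. Note that the collecting step only fills a list and returns nothing, so the output has to be taken from the list after the walk rather than chained onto the collecting call.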