Re: [Htmlparser-user] How to extract the content of certain html table only
From: Narindra J. <Nar...@te...> - 2008-03-19 13:55:09
Hi Henry,
Try this:
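public static String getKeywords(String file) {
    try {
        // 'file' can be a URL or a local file name
        Parser parser = new Parser(file);
        // Collect every <table> tag found in the page
        NodeList list = parser.parse(new TagNameFilter("table"));
        String html = list.toHtml();
        System.out.println(html);
        return html;
    } catch (Exception pe) {
        pe.printStackTrace();
        // Nothing could be extracted
        return "";
    }
}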
Narindra Jeethan
Office: 780.493.7211
Mobile: 780.288.5961
________________________________
From: htm...@li... [mailto:htm...@li...] On Behalf Of Henry Tran
Sent: Wednesday, March 19, 2008 12:36 AM
To: Htm...@li...
Subject: [Htmlparser-user] How to extract the content of certain html table only
Hi,
I would like to read the content of all the tables from a web page using HTML Parser. Below is an example of what makes up an HTML table:
<table>
<tr>
<td class="propType"><b>Address</b></td>
<td class="propType"><b>Company</b></td>
<td class="propType"><b>Department</b></td>
<td class="propType" align="right"><b>Employee</b></td>
<td colspan="6"><strong class="propType">
<td><strong>Firstname</strong></td>
<td><strong>Surname</strong></td>
<td><strong>DOB</strong></td>
<td><strong>Sex</strong></td>
<td class="even">John</td>
<td class="even">Smith</td>
<td class="even">01/02/2001</td>
<td class="even">Male</td>
</tr>
</table>
I am using the following example from the HTML Parser filter page, but I am not quite there yet:
1 import java.io.*;
2 import java.net.*;
3 import org.htmlparser.*;
4 import org.htmlparser.filters.TagNameFilter;
5 import org.htmlparser.filters.NodeClassFilter;
6 import org.htmlparser.filters.HasParentFilter;
7 import org.htmlparser.filters.*;
8 import org.htmlparser.util.*;
9
10 public class DnldURL {
11     public static void main (String[] args) throws ParserException {
12         DnldURL dnldURL = new DnldURL();
13     }
14     public DnldURL() throws ParserException {
15         try {
16             Parser parser = new Parser ("http://www.abc.com");
17             parser.parse (new HasParentFilter());
18             NodeList list = new NodeList();
19             NodeFilter filter = new OrFilter(
20                 new TagNameFilter ("table"),
21                 new HasChildFilter(
22                     new TagNameFilter("tr")));
23             for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
24                 // System.out.println(e.nextNode().toHtml());
25                 System.out.println(e.nextNode().collectInto(list, filter);
26         } catch (MalformedURLException mue) {
27             System.out.println("Ouch - a MalformedURLException happened.");
28             mue.printStackTrace();
29             System.exit(1);
30         } catch (IOException ioe) {
31             System.out.println("Oops- an IOException happened.");
32             ioe.printStackTrace();
33             System.exit(1);
34         }
35     }
The important thing is to set up lines 17 and 19-22 correctly so that the filter picks up the content, which is then printed on line 25.
I am confused not only about how to set up the table filter dependencies (<table> ... <tr> ... <td> ...) but also about how to get line 25 to combine the filter with toHtml().
For instance:
System.out.println(e.nextNode().collectInto(list, filter).toHtml()); which doesn't work currently.
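A likely reason the combined call fails: in the HTML Parser API, collectInto is declared to return void, so it only appends matches to the list and leaves nothing to call toHtml() on. Below is a minimal sketch of collecting first and printing afterwards. The class and method names are made up for illustration, and Parser.createParser with an inline string stands in for the real URL so the example needs no network access.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, not part of the original post.
public class CollectTables {

    // Collects every <table> in the given HTML and returns the markup of the matches.
    public static String tablesAsHtml(String html) {
        try {
            // Parse an in-memory string; new Parser(url) would fetch a page instead.
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList list = new NodeList();
            NodeFilter filter = new TagNameFilter("table");
            for (NodeIterator e = parser.elements(); e.hasMoreNodes(); ) {
                // collectInto returns void: it only adds matching nodes to 'list'.
                e.nextNode().collectInto(list, filter);
            }
            // toHtml() is called on the filled list, after the loop has finished.
            return list.toHtml();
        } catch (ParserException pe) {
            throw new IllegalStateException("could not parse HTML", pe);
        }
    }

    public static void main(String[] args) {
        System.out.println(tablesAsHtml(
            "<p>intro</p><table><tr><td>John</td></tr></table>"));
    }
}
```

The loop is essentially what parser.parse(filter) does internally, so the explicit iterator is only needed when something else has to happen per node.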
I would also like to add conditions on the content of <table>, <tr> and <td>, so that only the relevant tables are retrieved rather than all of them.
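This kind of restriction can probably be expressed by combining the library's filters. Below is a minimal sketch, assuming the relevant tables are the ones that contain a cell with class="propType", as in the sample above. The class name, the method name and the inline markup in main are hypothetical, and Parser.createParser again stands in for fetching the real page.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, not part of the original post.
public class RelevantTables {

    // Returns the markup of every <table> that has a <td class="propType"> somewhere inside it.
    public static String relevantTablesAsHtml(String html) {
        // A cell qualifies when it is a <td> AND carries class="propType" (assumed criterion).
        NodeFilter markerCell = new AndFilter(
            new TagNameFilter("td"),
            new HasAttributeFilter("class", "propType"));
        // A table qualifies when it is a <table> AND has such a cell at any depth
        // (the 'true' argument makes HasChildFilter look through <tr> to reach <td>).
        NodeFilter relevantTable = new AndFilter(
            new TagNameFilter("table"),
            new HasChildFilter(markerCell, true));
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList list = parser.parse(relevantTable);
            return list.toHtml();
        } catch (ParserException pe) {
            throw new IllegalStateException("could not parse HTML", pe);
        }
    }

    public static void main(String[] args) {
        String html =
            "<table><tr><td class=\"propType\"><b>Address</b></td></tr></table>"
            + "<table><tr><td>unrelated</td></tr></table>";
        System.out.println(relevantTablesAsHtml(html));
    }
}
```

Swapping HasAttributeFilter for another filter, such as one that matches on cell text, would change the selection criterion without touching the rest of the structure.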
Many thanks,
Jack