Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables
Brought to you by:
derrickoswald
From: Henry T. <htr...@ya...> - 2008-06-10 23:43:59
Hi Derrick,

I have tried the following table data filter, taking into account your suggestion to use TagNameFilter() for both <table> and <tr>, as opposed to TagNameFilter() for <table> and HasChildFilter() for <tr>, but it still does not match anything:

    new AndFilter (
        new AndFilter (
            new TagNameFilter ("table"),
            new AndFilter (
                new HasAttributeFilter ("border","0"),
                new AndFilter (
                    new HasAttributeFilter ("cellspacing","0"),
                    new HasAttributeFilter ("width","100%")))),
        new AndFilter (
            new TagNameFilter ("tr"),
            new AndFilter (
                new HasChildFilter (
                    new TagNameFilter ("td")),
                new OrFilter (
                    new HasAttributeFilter ("class","propType"),
                    new HasAttributeFilter ("class","even")))));

I still don't understand why we should treat <table> and <tr> as being on the same level even though <tr> is the child of <table>, which makes <td> the grandchild of <table>. The class attribute test should now pick up either the "propType" or the "even" value, but not both.

Thanks,
Henry

----- Original Message ----
From: Derrick Oswald <der...@ro...>
To: htmlparser user list <htm...@li...>
Sent: Tuesday, 10 June, 2008 9:12:02 PM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

The FilterBuilder project is off the trunk in SVN and I believe it is included in the download. Visitors are in the parser tree in SVN under trunk\parser\src\main\java\org\htmlparser\visitors. The link filters operate on the href text of a link, <a href="KKK">. NodeClassFilter is like TagNameFilter but uses the tag class instead of the tag name.

----- Original Message ----
From: Henry Tran <htr...@ya...>
To: htmlparser user list <htm...@li...>
Sent: Tuesday, June 10, 2008 12:03:01 AM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

Hi Derrick,

Where can I find a copy of the FilterBuilder, the visitors, and the tutorials on custom tags in conjunction with the PrototypicalNodeFactory?
I am also not sure how LinkRegexFilter, LinkStringFilter and NodeClassFilter work. Btw, I have worked out how to do question (ii) from earlier.

Thanks,
Henry

----- Original Message ----
From: Derrick Oswald <der...@ro...>
To: htmlparser user list <htm...@li...>
Sent: Monday, 9 June, 2008 8:56:47 PM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

It looks like you've got two HasAttribute filters looking for two different values in the same "class" attribute. How can a tag have a "class=proType" *and* a "class=even" at the same time?

GrandParents and GrandChildren are handled with subfilters. Here's an example for 'TABLE has a grand child TD':

    new AndFilter (new TagNameFilter ("TABLE"),
        new AndFilter (new TagNameFilter ("TR"),
            new HasChildFilter (new TagNameFilter ("TD"))))

You should probably play with the FilterBuilder application (it has a tutorial) to get the hang of it.

----- Original Message ----
From: Henry Tran <htr...@ya...>
To: htmlparser user list <htm...@li...>
Sent: Saturday, June 7, 2008 8:45:01 PM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

Hi Derrick,

It appears that I have made one step forward but two steps back in terms of parsing some of these html tables. I would like to read the following table:

    <table border="0" cellspacing="0" cellpadding="2" width="100%"> // HasGrandParent()...
    <tr> // HasParent()...
    <td class="propType"> </td> // HasAttributeFilter()...
    <td class="propType"><b>Patient</b></td>
    <td class="propType"><b>Firstname</b></td>
    <td class="propType"><b>Surname</b></td>
    <td class="propType" align="right"><b>Date of Birth</b></td>
    <td class="propType">Sex</td>
    </tr>
    </table>

Below are the various table data filters I have used in an attempt to pick out the table I want to read, without success:

(a)
    new AndFilter (
        new TagNameFilter ("td"),
        new AndFilter (
            new HasAttributeFilter("class", "proType"),
            new HasAttributeFilter("class", "even")));

(b)
    new AndFilter (
        new TagNameFilter ("td"),
        new AndFilter (
            new HasAttributeFilter("class", "proType"),
            new AndFilter (
                new HasAttributeFilter("class", "even"),
                new AndFilter (
                    new HasParentFilter (
                        new TagNameFilter ("tr")),
                    new AndFilter (
                        new HasParentFilter (
                            new TagNameFilter ("table")),
                        new AndFilter (
                            new HasAttributeFilter ("border","0"),
                            ( new HasAttributeFilter("width", "100%"))))))));

(c)
    new AndFilter (
        new TagNameFilter ("table"),
        new AndFilter (
            new HasAttributeFilter("border","0"),
            new AndFilter (
                new HasAttributeFilter("width", "100%"),
                new AndFilter (
                    new HasChildFilter (
                        new TagNameFilter ("tr")),
                    new AndFilter (
                        new HasChildFilter (
                            new TagNameFilter ("td")),
                        new AndFilter (
                            new HasAttributeFilter("class", "proType"),
                            new HasAttributeFilter("class", "even")))))));

None of the above filters matches the table data I wanted. Where have I gone wrong?

(i) Btw, does htmlparser support HasGrandParent() and HasGrandChild(), which would allow me to parse:

    <table border="0" cellspacing="0" cellpadding="2" width="100%"> // HasGrandParent()...
    <tr> // HasParent()...
    <td class="propType"> </td> // HasAttributeFilter()...

(ii) I would also like to save the whole content of the same web page to a file and then read it back, so that I can test various parsing needs without a direct Internet connection to the site for every test. Is this possible? If so, any idea how this can be done?
Many thanks again,
Jack

----- Original Message ----
From: Derrick Oswald <der...@ro...>
To: htmlparser user list <htm...@li...>
Sent: Thursday, 5 June, 2008 9:08:32 AM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

Create a node list:

    NodeList results = new NodeList ();

Then in your loop over each result, add the nodes to the list instead of printing them out:

    for (int i=0; i<len; i+=1)
    {
        TagNode tag = (TagNode)a1.elementAt(i);
        results.add (tag);
    }

Then when you've collected all the tables using whatever currenttabledatafilter values you have, all the tables will be in your results NodeList and you can iterate over them with the same type of loop that you have:

    int len = results.size();
    for (int i=0; i<len; i+=1)
    {
        TagNode tag = (TagNode)results.elementAt(i);
        // do what you want
    }

----- Original Message ----
From: Henry Tran <htr...@ya...>
To: htmlparser user list <htm...@li...>
Sent: Wednesday, June 4, 2008 5:40:29 PM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

Hi Derrick,

Can you explain a little more, perhaps with a few lines of example, if it is not too much of an effort? I thought I already had a NodeList a1, but the challenge is to distinguish which <TD> comes from which table. I am very new to using htmlparser and would appreciate a little guidance.

Thanks very much again,
Henry

----- Original Message ----
From: Derrick Oswald <der...@ro...>
To: htmlparser user list <htm...@li...>
Sent: Wednesday, 4 June, 2008 10:56:07 PM
Subject: Re: [Htmlparser-user] How to save <TD> value to unique variables from html tables

You should just add the tags you want to a NodeList of your own. Then later on process all the nodes in the list... filing them to a database for instance.

----- Original Message ----
From: Henry Tran <htr...@ya...>
To: Htm...@li...
Cc: htm...@li...
Sent: Wednesday, June 4, 2008 8:43:09 AM
Subject: [Htmlparser-user] How to save <TD> value to unique variables from html tables

Hi All,

I have been successful in extracting almost all the table data using the following htmlparser statements in Java:

    try
    {
        Parser parser = new Parser ("http://www.abc.com/...");
        NodeList nl = parser.parse(null);
        // Match <td> tags with class "even" or "odd", or with colspan "6" and a <Strong> child.
        NodeFilter currenttabledatafilter =
            new AndFilter (
                new TagNameFilter ("td"),
                new OrFilter (
                    new HasAttributeFilter("class","even"),
                    new OrFilter (
                        new HasAttributeFilter("class", "odd"),
                        new AndFilter (
                            new HasAttributeFilter("colspan","6"),
                            new HasChildFilter(new TagNameFilter ("Strong"))))));
        NodeList a1 = nl.extractAllNodesThatMatch(currenttabledatafilter,true);
        int len = a1.size();
        for (int i=0; i<len; i+=1)
        {
            TagNode tag = (TagNode)a1.elementAt(i);
            System.out.println(tag.toPlainTextString());
            // System.out.println(tag.toHtml());
        }
    }
    catch(Exception pe)
    {
        pe.printStackTrace();
    }

This is great for retrieving all the table data. However, I would like to save the value of each <td> to a unique variable so that the values can be used in the program and ultimately saved to a database. I am therefore looking to structure the program so that it assigns each value to a unique variable (or inserts it into the database, which I can do once the values are available) for as many html tables as there are on a web page. Each table has some distinct attributes, but the number of <td> elements in each varies.

In other words, I am looking for something similar to looping through a text file, as follows:

    While not end of line
    (i) identify a new table based on its unique attributes.
    (ii) assign the value/content of each <td> in the current table to a unique variable, for instance.
    (iii) repeat steps (i) and (ii) for the remaining tables.

Thanks a lot,
Henry

Send instant messages to your online friends http://au.messenger.yahoo.com
________________________________
Get the name you always wanted with the new y7mail email address.
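For the loop described in the original question, one possible structure is to collect cell text per table in a map rather than in individually named variables. The sketch below is only an illustration using the Java standard library: TableCellStore, addCell, cellsOf and the "table-0" style keys are hypothetical names, not part of htmlparser, and how a table is identified is left to the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder that groups cell text by table instead of using one variable per cell. */
class TableCellStore {
    // Table key (for example "table-0") mapped to that table's cell texts, in insertion order.
    private final Map<String, List<String>> cellsByTable = new LinkedHashMap<>();

    /** Records the trimmed text of one cell under the table it belongs to. */
    void addCell(String tableKey, String cellText) {
        cellsByTable.computeIfAbsent(tableKey, k -> new ArrayList<>()).add(cellText.trim());
    }

    /** Returns the cells recorded for a table, or an empty list if the key is unknown. */
    List<String> cellsOf(String tableKey) {
        return cellsByTable.getOrDefault(tableKey, Collections.emptyList());
    }

    /** Number of distinct tables seen so far. */
    int tableCount() {
        return cellsByTable.size();
    }

    public static void main(String[] args) {
        TableCellStore store = new TableCellStore();
        // In the real program these calls would sit inside the loop over the NodeList,
        // passing tag.toPlainTextString() as the cell text.
        store.addCell("table-0", " Patient ");
        store.addCell("table-0", "Firstname");
        store.addCell("table-1", "Sex");
        System.out.println(store.tableCount() + " tables, first: " + store.cellsOf("table-0"));
    }
}
```

Because each table's cells stay in order, a later step could walk the map and write one database row per table or per cell without knowing in advance how many <td> elements a table holds.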
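The question raised in the thread about <table> and <tr> being treated "on the same level" can be illustrated with a toy model. The sketch below assumes that an AND-style filter applies all of its sub-filters to the same node and that a has-child filter looks at a node's direct children; Tag, tagName, hasChild, sameLevel and nested are hypothetical stand-ins built on java.util.function.Predicate, not htmlparser classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of per-node filters; these types are illustrative and are not htmlparser classes. */
class FilterSketch {
    /** Minimal stand-in for a parsed tag: a name and its direct children. */
    static final class Tag {
        final String name;
        final List<Tag> children;

        Tag(String name, Tag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Accepts a tag whose name matches, ignoring case. */
    static Predicate<Tag> tagName(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    /** Accepts a tag if at least one direct child is accepted by the inner filter. */
    static Predicate<Tag> hasChild(Predicate<Tag> inner) {
        return tag -> tag.children.stream().anyMatch(inner);
    }

    /** "table AND tr" tested against one node: no single tag carries both names. */
    static Predicate<Tag> sameLevel() {
        return tagName("table").and(tagName("tr"));
    }

    /** "table whose child tr has a child td": the grandchild test is nested inside hasChild. */
    static Predicate<Tag> nested() {
        return tagName("table").and(hasChild(tagName("tr").and(hasChild(tagName("td")))));
    }

    public static void main(String[] args) {
        Tag table = new Tag("table", new Tag("tr", new Tag("td")));
        System.out.println("same level: " + sameLevel().test(table));
        System.out.println("nested:     " + nested().test(table));
    }
}
```

If htmlparser's AndFilter and HasChildFilter behave as this model assumes, the equivalent arrangement would wrap the <tr> test (and, inside it, the <td> test) in HasChildFilter instances attached to the <table> filter, rather than combining the <table> and <tr> name tests in one AndFilter.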