HTML Parser / Discussion / Help: Parsing HTML Problem

Pandu - 2007-07-31

Hi Everybody,

I am very new to HTML parsing.
I have to parse an HTML file that contains a TABLE tag with TRs and TDs. Each TD may or may not contain data. For example, my HTML looks like this:

<HTML>
<HEAD>
<TITLE>no title</TITLE>
</HEAD>
<BODY>
   <TABLE>
             <TR>
                   <TD>Id</TD>
                   <TD>Name</TD>
                   <TD>Age</TD>
                   <TD>Sex</TD>
                   <TD>Salary</TD>
             </TR>
             <TR>
                   <TD>1</TD>
                   <TD>one</TD>
                   <TD>20</TD>
                   <TD>Male</TD>
                   <TD>2000</TD>
             </TR>
             <TR>
                   <TD>2</TD>
                   <TD>Two</TD>
                   <TD>21</TD>
                   <TD>Female</TD>
                   <TD></TD>
             </TR>
             <TR>
                   <TD>3</TD>
                   <TD>Three</TD>
                   <TD>22</TD>
                   <TD>Male</TD>
                   <TD></TD>
             </TR>
             <TR>
                   <TD>4</TD>
                   <TD>Four</TD>
                   <TD>23</TD>
                   <TD></TD>
                   <TD></TD>

             </TR>
             <TR>
                   <TD>5</TD>
                   <TD>Five</TD>
                   <TD>24</TD>
                   <TD></TD>
                   <TD>30000</TD>

             </TR>
   </TABLE>
</BODY>
</HTML>

Now I have to go through all the TRs, get the text in the TDs, and make each row into an object. For example, the object for the first data row would look like this: (1 one 20 Male 2000). I have managed to write code that parses the HTML, but it is not giving exactly the result I want. My code is below.

            Parser parser = new Parser("D:/java/HtmlParse/Example.html");
            // All TABLE tags in the document
            NodeList tablesList = parser.parse(new TagNameFilter("table"));
            // All TR tags inside those tables (recursive search)
            NodeList tr_tagsList = tablesList.extractAllNodesThatMatch(new TagNameFilter("TR"), true);
            // All TD tags from all rows, collected into one flat list
            NodeList td_tagsList = tr_tagsList.extractAllNodesThatMatch(new TagNameFilter("TD"), true);
            System.out.println("found " + tr_tagsList.size() + " tr tags");
            System.out.println("found " + td_tagsList.size() + " td tags");


Please suggest any changes that are needed. Any help is appreciated. Thanks in advance.

- Derrick Oswald - 2007-07-31
  
  How is it "not giving the exact result as i wanted"?
  What are you getting that isn't right?
  
  If you want each row individually, you will need to extract the TD tags from each TR node separately:
  
  for (int i = 0; i < tr_tagsList.size(); i++) {
      Node tr = tr_tagsList.elementAt(i);
      NodeList tds = tr.getChildren().extractAllNodesThatMatch(new TagNameFilter("TD"), true);
  }
  
- Pandu - 2007-08-01
  
  Thanks for the help, Derrick.
  You understood what I meant: I want each row extracted individually, as you described.
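
Once the cell texts of a row are in hand, building the per-row object could look like the sketch below. It assumes the cell texts have already been pulled out as plain strings (the parsing step is not shown). The `RowMapper` and `Person` names, and the choice to store blank cells as `null`, are illustrative assumptions and not part of the HTML Parser library.

```java
import java.util.Arrays;
import java.util.List;

public class RowMapper {
    /** Hypothetical value object for one table row; fields follow the header row. */
    static final class Person {
        final String id, name, age, sex, salary;

        Person(String id, String name, String age, String sex, String salary) {
            this.id = id;
            this.name = name;
            this.age = age;
            this.sex = sex;
            this.salary = salary;
        }

        @Override
        public String toString() {
            return "(" + id + " " + name + " " + age + " " + sex + " " + salary + ")";
        }
    }

    /** Returns the trimmed cell text, or null when the cell is missing or blank. */
    static String cell(List<String> cells, int index) {
        if (index >= cells.size() || cells.get(index) == null) {
            return null;
        }
        String text = cells.get(index).trim();
        return text.isEmpty() ? null : text;
    }

    /** Maps the five cells of one row (Id, Name, Age, Sex, Salary) to a Person. */
    static Person toPerson(List<String> cells) {
        return new Person(cell(cells, 0), cell(cells, 1), cell(cells, 2),
                          cell(cells, 3), cell(cells, 4));
    }

    public static void main(String[] args) {
        // prints (1 one 20 Male 2000)
        System.out.println(toPerson(Arrays.asList("1", "one", "20", "Male", "2000")));
        // prints (4 Four 23 null null)
        System.out.println(toPerson(Arrays.asList("4", "Four", "23", "", "")));
    }
}
```

Mapping blank cells to `null` is only one option; keeping them as empty strings would work equally well as long as the rest of the code checks for that consistently.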
  
- Pandu - 2007-08-01
  
  That problem is solved. But now I am stuck on a new one: the TDs that do not have values aren't giving me anything, not even a null value or a space. It seems they are objects of some kind. I trimmed the value but still get nothing. Can you suggest anything for this? My code is like this:
  
  for (int i = 0; i < tr_tagsList.size(); i++)
  {
      NodeList nlist = tr_tagsList.elementAt(i).getChildren().extractAllNodesThatMatch(new TagNameFilter("TD"), true);
      for (int j = 0; j < nlist.size(); j++)
      {
          String str = nlist.elementAt(j).toPlainTextString().toString();
          str.trim();
          if (str == null)
          {
              System.out.println("Hi");
          }
          if (str != null)
          {
              System.out.println("Hello");
          }
      }
  }
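
One possible reading of the symptom, offered as a sketch and not a confirmed diagnosis: Java strings are immutable, so `str.trim()` returns a new string and leaves `str` unchanged, and an empty cell is most likely reported as an empty string, not as `null` (that `toPlainTextString()` behaves this way is an assumption here). An empty or whitespace-only string prints nothing visible, which would match "not getting anything". The helper name `isBlankCell` is hypothetical, and the example uses only plain Java.

```java
public class BlankCellCheck {
    /** True when the cell text is null, empty, or whitespace only. */
    static boolean isBlankCell(String text) {
        return text == null || text.trim().isEmpty();
    }

    public static void main(String[] args) {
        String str = "   ";                    // stand-in for the text of an empty cell

        str.trim();                            // result discarded: str is unchanged
        System.out.println(str.length());      // prints 3

        str = str.trim();                      // reassign to keep the trimmed value
        System.out.println(str.length());      // prints 0

        System.out.println(str == null);       // prints false: empty is not null
        System.out.println(isBlankCell(str));  // prints true
    }
}
```

In the loop above, replacing the `str == null` test with a check like `isBlankCell(str)` would separate empty cells from filled ones.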
  