Hi Everybody,
I am very new to HTML parsing.
I have to parse an HTML file that contains a table with TR and TD tags. Any TD may or may not contain data. For example, my HTML looks like this:
<HTML>
<HEAD>
<TITLE>no title</TITLE>
</HEAD>
<BODY>
<TABLE>
<TR>
<TD>Id</TD>
<TD>Name</TD>
<TD>Age</TD>
<TD>Sex</TD>
<TD>Salary</TD>
</TR>
<TR>
<TD>1</TD>
<TD>one</TD>
<TD>20</TD>
<TD>Male</TD>
<TD>2000</TD>
</TR>
<TR>
<TD>2</TD>
<TD>Two</TD>
<TD>21</TD>
<TD>Female</TD>
<TD></TD>
</TR>
<TR>
<TD>3</TD>
<TD>Three</TD>
<TD>22</TD>
<TD>Male</TD>
<TD></TD>
</TR>
<TR>
<TD>4</TD>
<TD>Four</TD>
<TD>23</TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD>5</TD>
<TD>Five</TD>
<TD>24</TD>
<TD></TD>
<TD>30000</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Now I have to go through all the TRs, get the text in the TDs, and build an object from each row. For example, an object would look like this: (1 one twenty male 2000). I have managed to write code that parses the HTML, but it is not giving the result I wanted. I am posting my code as well.
Parser parser = new Parser("D:/java/HtmlParse/Example.html");
NodeList tablesList = parser.parse (new TagNameFilter("table"));
NodeList tr_tagsList = tablesList.extractAllNodesThatMatch(new TagNameFilter("TR"), true);
NodeList td_tagsList = tr_tagsList.extractAllNodesThatMatch(new TagNameFilter("TD"), true);
System.out.println("found "+tr_tagsList.size() +" tr tags");
System.out.println("found "+td_tagsList.size() +" td tags");
Please suggest any changes if needed. Any help is appreciated, and thanks in advance.
How is it "not giving the exact result as i wanted"?
What are you getting that isn't right?
If you want each row individually you will need to extract the TD tags only from each node:
... for each tr tag in the tr_tagsList...
tr.getChildren().extractAllNodesThatMatch(new TagNameFilter("TD"), true);
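Once the TD nodes are taken per row, each row can be turned into one object. Below is a minimal sketch of that step in plain Java, assuming the cell texts of a row have already been read into a list of strings (for example by calling toPlainTextString() on each TD node). The names RowMapper, Employee and fromCells are hypothetical and are not part of HTML Parser. All fields are kept as strings because some cells in the sample are empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowMapper {

    /** Hypothetical value object for one table row (Id, Name, Age, Sex, Salary). */
    public static class Employee {
        public final String id;
        public final String name;
        public final String age;
        public final String sex;
        public final String salary;

        public Employee(String id, String name, String age, String sex, String salary) {
            this.id = id;
            this.name = name;
            this.age = age;
            this.sex = sex;
            this.salary = salary;
        }

        @Override
        public String toString() {
            return "(" + id + " " + name + " " + age + " " + sex + " " + salary + ")";
        }
    }

    /** Builds an Employee from the cell texts of one row; missing cells become "". */
    public static Employee fromCells(List<String> cells) {
        String[] f = new String[5];
        for (int i = 0; i < f.length; i++) {
            f[i] = i < cells.size() ? cells.get(i).trim() : "";
        }
        return new Employee(f[0], f[1], f[2], f[3], f[4]);
    }

    public static void main(String[] args) {
        // Stand-in for the text of each TD in each data row (header row skipped).
        List<List<String>> rows = new ArrayList<List<String>>();
        rows.add(Arrays.asList("1", "one", "20", "Male", "2000"));
        rows.add(Arrays.asList("2", "Two", "21", "Female", ""));

        for (List<String> row : rows) {
            System.out.println(fromCells(row));
        }
    }
}
```

In the real loop, the inner list would be filled from the TD nodes of each TR instead of the hard-coded values used here.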
Thanks for the help, Derrick.
You understood what I intended: I want each row to be extracted as you described.
That problem is solved, but now I am stuck on a new one: the TDs that have no value aren't giving me anything, not even a null value or a space. They seem to be objects of some kind. I trimmed the value but still get nothing. Can you suggest anything for this? My code is like this:
for (int i = 0; i < tr_tagsList.size(); i++)
{
    NodeList nlist = tr_tagsList.elementAt(i).getChildren().extractAllNodesThatMatch(new TagNameFilter("TD"), true);
    for (int j = 0; j < nlist.size(); j++)
    {
        String str = nlist.elementAt(j).toPlainTextString().toString();
        str.trim(); // result is discarded: trim() returns a new String and leaves str unchanged
        if (str == null) // never true here: a null would already have thrown on toString() above
        {
            System.out.println("Hi");
        }
        if (str != null)
        {
            System.out.println("Hello");
        }
    }
}
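One possible way to detect the empty cells is sketched below in plain Java. It assumes that an empty <TD></TD> yields an empty string rather than null from toPlainTextString(); that is an assumption about HTML Parser's behavior and is not confirmed anywhere in this thread. Two points are illustrated: the result of trim() has to be assigned, because strings are immutable, and an empty cell is tested with isEmpty() rather than a null comparison. EmptyCellCheck and cellOrDefault are hypothetical names.

```java
public class EmptyCellCheck {

    /** Hypothetical helper: returns the trimmed cell text, or fallback when the cell is null or blank. */
    public static String cellOrDefault(String raw, String fallback) {
        if (raw == null) {
            return fallback;
        }
        // trim() returns a new String; raw itself is left unchanged.
        String trimmed = raw.trim();
        return trimmed.isEmpty() ? fallback : trimmed;
    }

    public static void main(String[] args) {
        // Stand-ins for the cell texts of one row with two cells that have no value.
        String[] cells = { "4", "Four", "23", "", "  " };
        for (String cell : cells) {
            System.out.println(cellOrDefault(cell, "N/A"));
        }
    }
}
```

Inside the inner loop above, the call would take the text of each TD node in place of the hard-coded array, and the fallback could just as well be an empty string if that suits the row object better.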