I am new to both Java and htmlparser. I need help parsing an HTML page so that I can extract text from SPECIFIC columns within a table. I need something that will take HTML like the following:
<tr>
<td><b>First Column: </b>
<ul>
<li><a href="...">First</a></li>
<li><a href="...">Second</a></li>
</ul>
</td>
<td><b>Second Column: </b>
<ul>
<li><a href="...">extra1</a></li>
</ul>
</td>
<td><b>Third Column: </b>
<ul>
<li><a href="...">Extra2</a></li>
</ul>
</td>
</tr>
...and produce only the text from the lists in the columns whose TD tags contain the text "First Column" and "Third Column":
First
Second
Extra2
How can this be achieved?
Thank you.
You can get all tables using a filter:
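import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.tags.TableRow;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;

NodeList list = parser.extractAllNodesThatMatch (new TagNameFilter ("TABLE"));
for (int i = 0; i < list.size (); i++)
{
    TableTag table = (TableTag)list.elementAt (i);
    // work with each table here
}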
Once you have the table tag, you can get at the data by rows:
TableRow[] rows = table.getRows ();
for (int i = 0; i < rows.length; i++)
{
    TableColumn[] columns = rows[i].getColumns ();
    // loop over this row's columns, not over the rows
    for (int j = 0; j < columns.length; j++)
        System.out.println (columns[j].toPlainTextString ());
}
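To keep only the "First Column" and "Third Column" cells from the question, one option is to inspect the plain text of each column before printing it. The following is a minimal sketch in plain Java, not part of htmlparser: the class name ColumnText and the method itemsIfWanted are hypothetical. It assumes the label is the first text in the cell and that each list item ends up on its own line in the string returned by toPlainTextString (), which depends on how the source HTML is laid out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of htmlparser: it works on the String
// returned by TableColumn.toPlainTextString ().
final class ColumnText
{
    // Returns the non-blank lines that follow the label when columnText starts
    // with one of wantedLabels; returns an empty list for any other column.
    static List<String> itemsIfWanted (String columnText, List<String> wantedLabels)
    {
        List<String> items = new ArrayList<String> ();
        String text = columnText.trim ();
        for (String label : wantedLabels)
        {
            if (!text.startsWith (label))
                continue;
            // Drop the label itself, then keep each remaining non-blank line.
            String rest = text.substring (label.length ());
            for (String line : rest.split ("\\r?\\n"))
            {
                String item = line.trim ();
                // Skip blank lines and the colon left over from "First Column: ".
                if (!item.isEmpty () && !item.equals (":"))
                    items.add (item);
            }
            break;
        }
        return items;
    }

    public static void main (String[] args)
    {
        List<String> wanted = Arrays.asList ("First Column", "Third Column");
        // Stand-in for columns[j].toPlainTextString () on the first cell of the question.
        String cell = "First Column: \n\nFirst\nSecond\n\n";
        for (String item : itemsIfWanted (cell, wanted))
            System.out.println (item);
    }
}
```

In the inner loop of the reply, the println call could then be replaced by a loop over ColumnText.itemsIfWanted (columns[j].toPlainTextString (), wanted), printing each returned item.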