[Htmlparser-announce] Integration Release 1.3-20030420 is out

Hi Folks,
    This week's release is out. From the change log:

Integration Build 1.3 - 20030420
--------------------------------
[1] Fixed bug #722046 StringExtractor.extractStrings misses most of the text,
    change to use a StringBean to dig into tables.
[2] add checking in Translate to eliminate bug #722835
    StringIndexOutOfBoundsException
[3] added line-break condition in assertXmlEquals
[4] added fit testing framework
[5] added parent association for each node
[6] added digupStringNode() and findPositionOf(Node) to CompositeTag
[7] Fixed bug 723835 in LinkExtractor

This release adds some powerful searching capability.
From any node, you can find the parent composite tag and navigate through the
entire HTML structure. This is useful in scenarios like:

Search for data that lies close to a certain piece of text.

e.g. ... <table>
                <tr>
                    <td>
                        <b>Name:</b><i>John Doe</i>
                    </td>
                </tr>
            </table>

We can extract John Doe by using our knowledge of its expected position.
If we assume that the contents are inside a table tag, here's what a program
could look like:

parser.registerScanners();
Node[] nodes = parser.extractAllNodesThatAre(TableTag.class);
// Let's assume our data is in the second table
TableTag table = (TableTag) nodes[1];

// Find the string nodes that match "Name".
StringNode[] stringNodes = table.digupStringNode("Name");

// We assume that the first node that matched is the one we want, and
// navigate to its parent. findPositionOf() was added to CompositeTag,
// so the parent is held as a CompositeTag.
CompositeTag parentOfName = (CompositeTag) stringNodes[0].getParent();

// From the parent, we find out the position of "Name"
int posOfName = parentOfName.findPositionOf(stringNodes[0]);

// It's easy now to navigate to John Doe, as we know it is 3 positions away
Node expectedName = parentOfName.childAt(posOfName + 3);

This can be useful for writing tests for your pages or extracting
position-based info. New possibilities open up for semantic searches.
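
To make the "3 positions away" idea concrete without needing the parser on the
classpath, here is a small self-contained sketch. It does not use the htmlparser
classes: SketchNode and SketchComposite are hypothetical stand-ins, and their
methods only borrow the names digupStringNode, findPositionOf and childAt from
the change log above. It also assumes that <b> and <i> are not composite tags,
so their text and end tags end up as flat siblings under the <td>.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionSearchSketch {
    // Hypothetical stand-in for a parsed node: some text plus a parent link.
    static class SketchNode {
        final String text;
        final boolean isText;      // true for string nodes, false for tags
        SketchComposite parent;    // set when the node is added to a composite
        SketchNode(String text, boolean isText) {
            this.text = text;
            this.isText = isText;
        }
    }

    // Hypothetical stand-in for a composite tag that owns child nodes.
    static class SketchComposite extends SketchNode {
        final List<SketchNode> children = new ArrayList<>();
        SketchComposite(String name) { super(name, false); }

        SketchComposite add(SketchNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        // Index of a direct child (identity match), or -1 if it is not one.
        int findPositionOf(SketchNode child) { return children.indexOf(child); }

        SketchNode childAt(int index) { return children.get(index); }

        // Depth-first search for string nodes whose text contains the needle.
        List<SketchNode> digupStringNode(String needle) {
            List<SketchNode> found = new ArrayList<>();
            for (SketchNode child : children) {
                if (child instanceof SketchComposite) {
                    found.addAll(((SketchComposite) child).digupStringNode(needle));
                } else if (child.isText && child.text.contains(needle)) {
                    found.add(child);
                }
            }
            return found;
        }
    }

    static SketchNode tag(String name) { return new SketchNode(name, false); }
    static SketchNode text(String value) { return new SketchNode(value, true); }

    // Builds the table from the example, with the assumed flat layout in <td>.
    static SketchComposite sampleTable() {
        SketchComposite td = new SketchComposite("td")
            .add(tag("b")).add(text("Name:")).add(tag("/b"))
            .add(tag("i")).add(text("John Doe")).add(tag("/i"));
        SketchComposite tr = new SketchComposite("tr").add(td);
        return new SketchComposite("table").add(tr);
    }

    // Finds the first string node matching label, then returns the text of the
    // sibling `offset` positions after it, or null if there is no such sibling.
    static String valueAfterLabel(SketchComposite root, String label, int offset) {
        List<SketchNode> matches = root.digupStringNode(label);
        if (matches.isEmpty()) {
            return null;
        }
        SketchNode labelNode = matches.get(0);
        SketchComposite parent = labelNode.parent;
        int target = parent.findPositionOf(labelNode) + offset;
        if (target < 0 || target >= parent.children.size()) {
            return null;
        }
        return parent.childAt(target).text;
    }

    public static void main(String[] args) {
        System.out.println(valueAfterLabel(sampleTable(), "Name", 3));
    }
}
```

The offset is only as reliable as the assumed layout: if the page changes, or
if the inline tags are parsed as composites, the sibling count changes too, so
checking bounds (as valueAfterLabel does) is worth the extra lines.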

Regards,
Somik