[Htmlparser-developer] Design Modifications to HTMLNode API

Hi Folks,
    Following some nice suggestions from Sam Joseph, I have just
completed some design modifications to the basic HTMLNode API.
    The modifications are:
[1] HTMLNode is no longer an interface, but an abstract class. There
were two reasons for this change. Firstly, I couldn't think of a scenario
where an object would be an HTML tag AND something else. Secondly, I
wanted to enforce the implementation of toString(), which implementers
of an interface can skip (as Object already has a default toString()).

[2] abstract toString() method - subclasses have to implement this.
[3] print() and print(PrintWriter) - final methods. They call
toString(), and print to standard output and to the print writer
respectively.
[4] toPlainTextString() - this method provides a plain-text
representation of a tag, if there is such a representation. If not, an
empty string is returned. This has implications - our program to extract
all strings from an HTML page is simplified to:

HTMLNode node;
for (Enumeration e = parser.elements(); e.hasMoreElements();) {
    node = (HTMLNode)e.nextElement();
    // or whatever processing you want to do with the string
    System.out.println(node.toPlainTextString());
}

[5] toRawString() - this method provides the complete HTML element (a
reconstruction), thus allowing ripping programs to be really simple. So
if you want to rip an HTML page to your local hard disk, your program
would look like:

HTMLNode node;
PrintWriter pw = new PrintWriter(new FileWriter("..."));
for (Enumeration e = parser.elements(); e.hasMoreElements();) {
    node = (HTMLNode)e.nextElement();
    // write the reconstructed HTML for each node
    pw.println(node.toRawString());
}
pw.close();

[6] Lots of bug fixes done - HTMLImageScanner had a bug,
HTMLStyleScanner also had one - all caught with more testcases.

We have 100 testcases as of now, all of them passing.
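
For illustration, points [1] to [5] above could be sketched roughly as
follows. This is not the actual parser source: SketchNode, SketchLinkTag
and SketchImageTag are hypothetical names, and making toPlainTextString()
return an empty string by default is an assumption about how the "no
representation" case might be handled.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for the abstract HTMLNode described above.
abstract class SketchNode {
    // Redeclaring toString() as abstract means Object's default no longer
    // satisfies the compiler; every concrete subclass must supply its own.
    @Override
    public abstract String toString();

    // Assumption: nodes without a text representation return an empty string.
    public String toPlainTextString() {
        return "";
    }

    // Reconstruction of the complete HTML element.
    public abstract String toRawString();

    // Final, so subclasses cannot change how printing works.
    public final void print() {
        System.out.println(toString());
    }

    public final void print(PrintWriter pw) {
        pw.println(toString());
    }
}

// Hypothetical tag with a text representation.
class SketchLinkTag extends SketchNode {
    private final String href;
    private final String text;

    SketchLinkTag(String href, String text) {
        this.href = href;
        this.text = text;
    }

    @Override
    public String toString() {
        return "Link to " + href + " with text " + text;
    }

    @Override
    public String toPlainTextString() {
        return text;
    }

    @Override
    public String toRawString() {
        return "<a href=\"" + href + "\">" + text + "</a>";
    }
}

// Hypothetical tag without a text representation.
class SketchImageTag extends SketchNode {
    private final String src;

    SketchImageTag(String src) {
        this.src = src;
    }

    @Override
    public String toString() {
        return "Image at " + src;
    }

    @Override
    public String toRawString() {
        return "<img src=\"" + src + "\">";
    }
}

class SketchNodeDemo {
    public static void main(String[] args) {
        SketchNode[] nodes = {
            new SketchLinkTag("http://htmlparser.sourceforge.net", "HTML Parser"),
            new SketchImageTag("logo.gif")
        };
        StringWriter raw = new StringWriter();
        PrintWriter pw = new PrintWriter(raw);
        for (SketchNode node : nodes) {
            node.print();                       // uses the subclass toString()
            pw.println(node.toRawString());     // same shape as the ripping loop
        }
        pw.flush();
        System.out.print(raw);
    }
}
```

If a subclass leaves out toString(), the sketch fails to compile, which
is the enforcement an interface alone would not give.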

To-do list for Release 1.2
------------------------------------
[1] Integration of Raghavender Srimantula's contribution -
HTMLFrameScanner and HTMLFormScanner, into the parser. This will be
integrated as soon as I get the testcases from Raghav.
[2] Adding an HTML Ripping program in the parserApplications package.
[3] Improving the Robot Crawler (??)
[4] Bug fixes to any bugs that get reported in this period.

You can check out the latest code from CVS. Or you can go to
http://htmlparser.sourceforge.net and click on the download link, and
choose htmlparser1_2_20020507.zip

Feedback is welcome.

Regards,
Somik