[Htmlparser-developer] Design Modifications to HTMLNode API
From: Somik R. <so...@ya...> - 2002-05-07 06:25:34
Hi Folks,

Following some nice suggestions from Sam Joseph, I have just completed some design modifications to the basic HTMLNode API. The modifications are:

[1] HTMLNode is no longer an interface, but an abstract class. There were two reasons for this change. Firstly, I couldn't think of a scenario where an object would be an HTML tag AND something else. Secondly, I wanted to enforce the implementation of toString(), which is usually not done if you implement from the interface (as Object has a default toString()).

[2] abstract toString() method - children have to implement this.

[3] print() and print(PrintWriter) - final methods. They make a call to toString(), and print to standard output and the print writer respectively.

[4] toPlainTextString() - this method provides a string representation of a tag, if there is such a representation. If not, a blank string is returned. This has implications - our program to extract all strings from an HTML page is simplified to:

    HTMLNode node;
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        // or whatever processing you want to do with the string
        System.out.println(node.toPlainTextString());
    }

[5] toRawString() - this method provides the complete HTML element (a reconstruction), thus allowing ripping programs to be really simple. So if you want to rip the HTML page to your local hard disk, your program would look like:

    HTMLNode node;
    PrintWriter pw = new PrintWriter(new FileWriter("..."));
    for (Enumeration e = parser.elements(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        pw.println(node.toRawString());
    }
    pw.close();

[6] Lots of bug fixes done - HTMLImageScanner had a bug, HTMLStyleScanner also had one - all caught with more testcases.

We have 100 testcases as of now, all of them passing.
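To make the shape of the new API concrete, here is a minimal sketch of what items [1] to [5] could look like. It is not the actual parser source: the names SketchNode, SketchText, SketchTag and NodeSketchDemo are hypothetical stand-ins, the "Text: " / "Tag: " output of toString() is invented for illustration, and the use of println inside print() is an assumption.

```java
import java.io.PrintWriter;
import java.util.Enumeration;
import java.util.Vector;

// Hypothetical stand-in for HTMLNode: an abstract class, not an interface.
abstract class SketchNode {
    // Redeclaring toString() as abstract forces every subclass to implement
    // it instead of silently inheriting Object.toString().
    @Override
    public abstract String toString();

    // Text content of the node, or "" if it has none.
    public abstract String toPlainTextString();

    // Reconstruction of the HTML for this node.
    public abstract String toRawString();

    // final: subclasses cannot change how printing works.
    public final void print() {
        System.out.println(toString()); // println is an assumption
    }

    public final void print(PrintWriter pw) {
        pw.println(toString());
    }
}

// Hypothetical node holding plain text between tags.
final class SketchText extends SketchNode {
    private final String text;
    SketchText(String text) { this.text = text; }
    @Override public String toString() { return "Text: " + text; }
    @Override public String toPlainTextString() { return text; }
    @Override public String toRawString() { return text; }
}

// Hypothetical node holding a tag; it has no plain-text representation.
final class SketchTag extends SketchNode {
    private final String contents; // e.g. "b" or "/b"
    SketchTag(String contents) { this.contents = contents; }
    @Override public String toString() { return "Tag: " + contents; }
    @Override public String toPlainTextString() { return ""; }
    @Override public String toRawString() { return "<" + contents + ">"; }
}

class NodeSketchDemo {
    // Concatenates the plain text of every node, as in the extraction loop.
    static String plainText(Enumeration<SketchNode> e) {
        StringBuilder sb = new StringBuilder();
        while (e.hasMoreElements()) {
            sb.append(e.nextElement().toPlainTextString());
        }
        return sb.toString();
    }

    // Concatenates the raw HTML of every node, as in the ripping loop.
    static String rawHtml(Enumeration<SketchNode> e) {
        StringBuilder sb = new StringBuilder();
        while (e.hasMoreElements()) {
            sb.append(e.nextElement().toRawString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Vector<SketchNode> nodes = new Vector<>();
        nodes.add(new SketchTag("b"));
        nodes.add(new SketchText("Hello"));
        nodes.add(new SketchTag("/b"));
        System.out.println(plainText(nodes.elements())); // Hello
        System.out.println(rawHtml(nodes.elements()));   // <b>Hello</b>
    }
}
```

Because tags return an empty string from toPlainTextString(), the extraction loop needs no instanceof checks; the same loop with toRawString() gives back the markup.
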
To-do list for Release 1.2
--------------------------

[1] Integration of Raghavender Srimantula's contribution - HTMLFrameScanner and HTMLFormScanner - into the parser. This will be integrated as soon as I get the testcases from Raghav.

[2] Adding an HTML ripping program in the parserApplications package.

[3] Improving the Robot Crawler (??)

[4] Bug fixes to any bugs that get reported in this period.

You can check out the latest code from CVS. Or you can go to http://htmlparser.sourceforge.net and click on the download link, and choose htmlparser1_2_20020507.zip

Feedback is welcome.

Regards,
Somik