[Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API

Hi everybody!

I am the new kid on the developer block: I joined the HTMLParser
project just last week. As my first deed, I would like to propose some
changes to the API of the main HTMLParser class. Since these changes are
quite far-reaching in my opinion, I kindly ask for your comments on
this proposal.

First of all, the current status quo of the HTMLParser is as follows:
You have to create a new HTMLParser each time you want to parse a new
HTML source, be it a file, a URL, etc. Then you register the scanners,
and then you retrieve the HTMLNodes by calling the elements() method.
If you want to parse another document, the whole procedure starts from
the beginning.

According to my idea, you would do the following:
First, you create one HTMLParser object by calling the empty constructor:
- HTMLParser()
(This single HTMLParser object can be reused for consecutive parsing
runs.)

Second, you register the scanners the same way as it is done now, by
calling registerScanners().
Third, you can add one or more feedbacks (instead of only one, as right
now) by calling
addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
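
A minimal sketch of how registering several feedbacks might look. The
names FeedbackSketch and FeedbackParserSketch and the info() and
reportInfo() methods are hypothetical stand-ins, not the existing
HTMLParserFeedback interface or HTMLParser class, and the sketch uses
generics for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HTMLParserFeedback; the real interface may differ.
interface FeedbackSketch {
    void info(String message);
}

// Hypothetical stand-in for the parser, showing only the feedback handling.
class FeedbackParserSketch {
    private final List<FeedbackSketch> feedbacks = new ArrayList<>();

    // Mirrors the proposed addHTMLParserFeedback(HTMLParserFeedback).
    void addHTMLParserFeedback(FeedbackSketch feedback) {
        feedbacks.add(feedback);
    }

    // Forwards one message to every registered feedback, in registration order.
    void reportInfo(String message) {
        for (FeedbackSketch feedback : feedbacks) {
            feedback.info(message);
        }
    }
}
```

Keeping the feedbacks in a list means that, for example, a console
logger and a GUI status bar could both be attached to the same parser.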

Then you would use one of the following parse methods:
- void parse(java.lang.String string)
- void parse(java.io.File file)
- void parse(java.io.InputStream inputStream)
- void parse(java.io.Reader reader)
- void parse(java.net.URL url)
- void parse(java.net.URI uri) (but this would require JDK 1.4, so
better leave this out for now)
(Remark: I know there already is a method parse(java.lang.String string)
in the HTMLParser class where the parameter is the name of a filter.
Question: Is this method used a lot, or at all? Can it be renamed or
dropped and its functionality reimplemented in another way?)
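
The overloads above could be sketched roughly as follows, with every
variant funnelling into parse(Reader). ReusableParserSketch is a
hypothetical class, not the real HTMLParser. It is assumed here that
parse(String) receives the HTML text itself, and plain Strings stand in
for HTMLNode objects:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed reusable parser; not the real HTMLParser.
class ReusableParserSketch {
    private List<String> results = new ArrayList<>();

    // Proposed parse(String); treating the argument as HTML text is an assumption.
    void parse(String html) {
        parse(new StringReader(html));
    }

    // Proposed parse(Reader); the File, InputStream and URL variants could
    // wrap their source in a Reader and delegate here.
    void parse(Reader reader) {
        results = new ArrayList<>(); // start fresh so the object can be reused
        try {
            BufferedReader in = new BufferedReader(reader);
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    results.add(trimmed); // placeholder for real node creation
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Proposed getResultList(); Strings stand in for HTMLNode objects.
    List<String> getResultList() {
        return results;
    }
}
```

With this shape, one parser object can be used for several documents in
a row: each parse() call replaces the previous result list.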

Finally, you would get the results with:
- java.util.List getResultList(), which returns a List containing
HTMLNode objects
Simply returning a List is good in my opinion, since this integrates the
HTMLParser nicely into the standard Java collections framework. It also
makes the API future-proof for the later use of Generics in Java 1.5.

Retrieving results through getResultXXX() methods would also make it
easy to add further result retrieval methods, e.g.
- org.htmlparser.util.HTMLEnumeration getResultHTMLEnumeration() or
- org.htmlparser.util.HTMLTree getResultHTMLTree(), which would return
a (yet to be programmed) HTMLTree (similar to a W3C Document)
etc.

Turning all of the above into real code can be done in two distinct,
consecutive steps:
- First step: Add the methods to the existing HTMLParser class and fit
them in by changing the rest of the class only minimally and (most
importantly) only internally. This could be done fairly quickly.
- Second step: Refactor the HTMLParser, but keep the existing interfaces
to the outside world (e.g. the existing constructors) and deprecate
them.
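
As a rough illustration of the second step, an existing constructor
could stay in place but be marked as deprecated. MigratingParserSketch
and currentSource() are hypothetical names, and the @Deprecated
annotation assumes Java 1.5 or later (on older JDKs only the Javadoc
@deprecated tag is available):

```java
// Hypothetical sketch of the second step; names and bodies are illustrative only.
class MigratingParserSketch {
    private String source;

    // New style: empty constructor, the source is supplied later to parse().
    MigratingParserSketch() {
    }

    /**
     * Old style: the source is fixed at construction time.
     *
     * @deprecated use the empty constructor and parse(String) instead
     */
    @Deprecated
    MigratingParserSketch(String source) {
        this();
        this.source = source;
    }

    // New style entry point; here it only records the source.
    void parse(String newSource) {
        this.source = newSource;
    }

    // Helper for this sketch only, to show which source is active.
    String currentSource() {
        return source;
    }
}
```

Existing callers would keep compiling (with a deprecation warning),
while new code moves to the empty constructor.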

Bye and thanks in advance for your comments,
Holger

--------------------------------------------------------
Holger Stenzhorn
Software Engineer

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 302-5100
Fax: +49 (681) 302-5109
ho...@xt...
www.xtramind.com
--------------------------------------------------------