[Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API
From: Holger S. <Hol...@xt...> - 2003-01-07 13:12:53
Hi everybody!

I am the new kid on the developer block because I joined the HTMLParser project just last week. As my first deed I would like to propose some changes to the API of the main HTMLParser class. Since these changes are quite far-reaching in my opinion, I kindly ask you for comments on these propositions.

First of all, the current status quo of the HTMLParser is: you have to create a new HTMLParser each time you want to parse from a new HTML source, be it a file, a URL, etc. Then you register the scanners. Then you retrieve the HTMLNodes by calling the elements() method. If you want to parse another document, the whole procedure starts from the beginning.

According to my idea you would do the following:

First, you create one HTMLParser object by calling the empty constructor:
- HTMLParser()
(This single HTMLParser object can be reused in consecutive parsing actions.)

Second, you register the scanners the same way as it is done now, by calling registerScanners().

Third, you can add one or more feedbacks (instead of only one, as right now) by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).

Then you would use one of the following parse methods:
- void parse(java.lang.String string)
- void parse(java.io.File file)
- void parse(java.io.InputStream inputStream)
- void parse(java.io.Reader reader)
- void parse(java.net.URL url)
- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)

(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot, or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
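To make the steps up to this point concrete, here is a minimal sketch of how a reusable parser with several feedbacks and overloaded parse methods might look from the caller's side. It is only an illustration under stated assumptions, not the actual HTMLParser code: SketchParser, Feedback and addFeedback are hypothetical stand-ins for HTMLParser, HTMLParserFeedback and addHTMLParserFeedback, plain Strings stand in for HTMLNode objects, parse(String) is assumed to take the HTML text itself, the tokenizer is a toy, and scanner registration is left out. The getResultList() accessor is the one proposed in the next paragraph.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo driver; not part of HTMLParser.
class ProposedApiSketch {
    public static void main(String[] args) {
        SketchParser parser = new SketchParser();            // created once
        parser.addFeedback(msg -> System.out.println("log: " + msg));
        parser.addFeedback(msg -> { /* a second, silent feedback */ });

        parser.parse("<p>first</p>");                         // first document
        System.out.println(parser.getResultList());

        parser.parse(new StringReader("<b>second</b>"));      // same object, reused
        System.out.println(parser.getResultList());
    }
}

// Hypothetical stand-in for HTMLParserFeedback.
interface Feedback {
    void info(String message);
}

// Hypothetical stand-in for the proposed reusable HTMLParser.
class SketchParser {
    private final List<Feedback> feedbacks = new ArrayList<>();
    private List<String> results = new ArrayList<>();

    // More than one feedback may be registered.
    void addFeedback(Feedback feedback) {
        feedbacks.add(feedback);
    }

    // Assumption: the String is the HTML text itself.
    void parse(String html) {
        parse(new StringReader(html));
    }

    // Every parse call replaces the results of the previous one.
    void parse(Reader reader) {
        StringBuilder text = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        results = tokenize(text.toString());
        for (Feedback feedback : feedbacks) {
            feedback.info("parsed " + results.size() + " nodes");
        }
    }

    // Read-only view of the nodes found by the most recent parse.
    List<String> getResultList() {
        return Collections.unmodifiableList(results);
    }

    // Toy tokenizer: splits the input into tags and text runs.
    private static List<String> tokenize(String html) {
        List<String> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                end = (close == -1) ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', i);
                end = (open == -1) ? html.length() : open;
            }
            nodes.add(html.substring(i, end));
            i = end;
        }
        return nodes;
    }
}
```

In this sketch each parse call simply overwrites the previous results, so one parser object serves many documents; whether the real class should reset its state this way is part of what is open for discussion.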
Finally, you would get the results with:
- java.util.List getResultList() that returns a List containing HTMLNode objects

Returning simply a List is good in my opinion, since this integrates the HTMLParser nicely into the standard Java collections framework. It also makes the API future-proof for the later use of Generics, which are planned for Java 1.5.

Retrieving results with getResultXXX() methods would also make it easy to add more and different result retriever methods later, e.g.
- org.htmlparser.util.HTMLEnumeration getResultHTMLEnumeration() or
- org.htmlparser.util.HTMLTree getResultHTMLTree() that would retrieve a (yet to be written) HTMLTree (similar to a W3C Document)
etc.

Turning all of the above into real code can be done in two distinct, consecutive steps:
- First step: Add the methods to the existing HTMLParser class and fit them into the class by changing the rest of the class only minimally and (most importantly) only internally. This could be done fairly quickly.
- Second step: Refactor the HTMLParser, but keep the existing interfaces to the outside world (e.g. the existing constructors) and deprecate them.

Bye and thanks in advance for your comments,
Holger

--------------------------------------------------------
Holger Stenzhorn
Software Engineer
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 302-5100
Fax: +49 (681) 302-5109
ho...@xt...
www.xtramind.com
--------------------------------------------------------