Thread: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API
Brought to you by:
derrickoswald
From: Holger S. <Hol...@xt...> - 2003-01-07 13:12:53
Hi everybody!

I am the new kid on the developer block: I joined HTMLParser just last week. As my first deed I would like to propose some changes to the API of the main HTMLParser class. Since these changes are fairly far-reaching in my opinion, I kindly ask for your comments on them.

First of all, the current status quo of the HTMLParser is this: you have to create a new HTMLParser each time you want to parse a new HTML source, be it a file, a URL, etc. Then you register the scanners, and then you retrieve the HTMLNodes by calling the elements() method. If you want to parse another document, the whole procedure starts from the beginning.

According to my idea you would do the following:

First, you create one HTMLParser object by calling the empty constructor:
- HTMLParser()
(This single HTMLParser object can be reused in consecutive parsing actions.)

Second, you register the scanners the same way as it is done now, by calling registerScanners().

Third, you can add one or more feedbacks (instead of only one, as right now) by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).

Then you would use one of the following parse methods:
- void parse(java.lang.String string)
- void parse(java.io.File file)
- void parse(java.io.InputStream inputStream)
- void parse(java.io.Reader reader)
- void parse(java.net.URL url)
- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)

(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot, or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
Finally, you would get the results with:
- java.util.List getResultList(), which returns a List containing HTMLNode objects

Returning simply a List is good in my opinion, since this integrates the HTMLParser nicely into the standard Java collections framework. It also makes the API future-proof for the later use of the Generics coming in Java 1.5.

Retrieving results through getResultXXX() methods would also make it easy to add further result retriever methods, e.g.
- org.htmlparser.util.HTMLEnumeration getResultHTMLEnumeration() or
- org.htmlparser.util.HTMLTree getResultHTMLTree(), which would retrieve a (yet to be programmed) HTMLTree (similar to a w3c Document)
etc.

Turning all of the above into real code can be done in two distinct, consecutive steps:
- First step: Add the methods to the existing HTMLParser class and fit them in by changing the rest of the class only minimally and (most importantly) only internally. This could be done fairly quickly.
- Second step: Refactor the HTMLParser, but keep the existing interfaces to the outside world (e.g. the existing constructors) and deprecate them.

Bye and thanks in advance for your comments,
Holger

--------------------------------------------------------
Holger Stenzhorn
Software Engineer
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 302-5100
Fax: +49 (681) 302-5109
ho...@xt...
www.xtramind.com
--------------------------------------------------------
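For illustration, the proposed flow (one reusable parser object, several feedbacks, overloaded parse methods, results returned as a java.util.List) could look roughly like the sketch below. Every name in it (ReusableParserSketch, SketchFeedback, ReusableParserDemo) is a hypothetical stand-in and not the real HTMLParser API. Nodes are plain Strings instead of HTMLNode objects, the tokenizing is deliberately naive, and the code uses generics and lambdas, which the thread only anticipates.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical callback, loosely modelled on the HTMLParserFeedback idea.
interface SketchFeedback {
    void info(String message);
}

// Hypothetical stand-in for the proposed reusable parser (not the real HTMLParser).
class ReusableParserSketch {
    private final List<SketchFeedback> feedbacks = new ArrayList<>();
    private final List<String> results = new ArrayList<>();

    // Several feedbacks may be registered instead of just one.
    void addFeedback(SketchFeedback feedback) {
        feedbacks.add(feedback);
    }

    // parse(String) wraps the text in a Reader and delegates.
    void parse(String html) {
        parse(new StringReader(html));
    }

    // Every overload ends up here, so per-run state is reset in one place.
    void parse(Reader reader) {
        results.clear();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                char ch = (char) c;
                if (ch == '<') {
                    flush(current);       // close any pending text node
                    current.append(ch);   // start collecting a tag
                } else if (ch == '>') {
                    current.append(ch);
                    flush(current);       // close the tag node
                } else {
                    current.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current);
        for (SketchFeedback feedback : feedbacks) {
            feedback.info("parsed " + results.size() + " nodes");
        }
    }

    private void flush(StringBuilder current) {
        if (current.length() > 0) {
            results.add(current.toString());
            current.setLength(0);
        }
    }

    // Results are handed back as a plain java.util.List (a read-only copy).
    List<String> getResultList() {
        return Collections.unmodifiableList(new ArrayList<>(results));
    }
}

class ReusableParserDemo {
    public static void main(String[] args) {
        ReusableParserSketch parser = new ReusableParserSketch();
        parser.addFeedback(msg -> System.out.println("feedback A: " + msg));
        parser.addFeedback(msg -> System.out.println("feedback B: " + msg));

        parser.parse("<b>hi</b>");                           // first document
        System.out.println(parser.getResultList());

        parser.parse(new StringReader("<p>a</p><p>b</p>"));  // same object, second document
        System.out.println(parser.getResultList());
    }
}
```

The point of the sketch is only that the parser object and its registered feedbacks survive across calls, while the result list is rebuilt for each source.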
From: Derrick O. <Der...@ro...> - 2003-01-08 01:58:32
Holger Stenzhorn wrote:

<snip>

>According to my idea you would have do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing actions.)

I believe you can do this now (see my recent submission 'Beanize the parser', described below).

>Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).

The feedback object was under consideration for replacement by the generic logging facade provided by Jakarta, http://jakarta.apache.org/commons/logging.html which does allow for multiple 'loggers'.

>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)

The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and setReader(), provide the facility you want, so I would suggest using this same 'bean' pattern instead of the misnomer parse(), because nothing is really parsed until later. Following this naming convention, the existing setURL(), which handles file names as well as URLs, should probably be broken up into two methods, setFileName() and setURLString(), but it's very handy to have a single method that understands both for command line interpretation. Resist the temptation to overload it [as in setURL(URL url)], or you'll break a very useful bean pattern.
I might suggest the current setURL() be renamed to setSource(). The parse(String) method you mention presumably takes HTML text and wraps it in a reader like HTMLParserTestCase.createParser() does. This should be called setHTML(). So we have:

  setSource("http://..." or "/usr/local")
  setURLString("http://...")
  setFileName("/usr/local/...")
  setHTML("<html><head>...")
  setFile(new File("/usr/local/.."))
  setInputStream(new BufferedInputStream())
  setReader(new FileReader("/usr/local/.."))
  setURL(new URL("http://..."))
  setConnection(url.openConnection())

I would suggest that all of these channel through a common initialization method, to avoid repeating the same code over and over and to ensure that everything necessary is reset correctly. For reuse, all of these methods would need to set the resourceLocn field somehow, so that a stale source is not reported in warning messages; a setResourceLocation() that just sets the field is probably needed. Most of them would also need to set the encoding in order to convert raw bytes into characters correctly. Since setEncoding() resets the current reader or connection to handle a charset directive in the HTML header, a setCharset() method that just sets the character encoding is probably needed (or vice versa). That would mean the typical reuse would then be:

  parser = new HTMLParser();
  parser.registerScanners();
  while (<more>)
  {
      parser.setResourceLocation("<where>");
      parser.setCharset("<encoding>");
      parser.setXXXX(<whatever>);
      enumeration = parser.getResultHTMLEnumeration();
  }

However, since the HTMLParser object is fairly lightweight, it may be better to just create another one whenever it's needed, and if you're really concerned about memory churn, just move the scanners into place:

  parser = new HTMLParser();
  parser.registerScanners();
  scanners = parser.getScanners();
  while (<more>)
  {
      parser = new HTMLParser();
      parser.setScanners(scanners);
      ...
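As a rough illustration of the "channel through a common initialization method" idea, here is a minimal sketch in which every setter funnels into one private init() that swaps the source and resets per-source state. The class SourceBeanSketch, its fields, the placeholder location strings and the ISO-8859-1 default are assumptions made for this example only; it is not the real HTMLParser and it covers only a few of the setters listed above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical bean; setter names mirror the discussion, the rest is invented.
class SourceBeanSketch {
    private Reader reader;
    private String resourceLocation = "";
    private Charset charset = Charset.forName("ISO-8859-1"); // assumed default
    private int position; // example of per-source state that must be reset

    // Only records the encoding; it does not touch the current source.
    void setCharset(String name) {
        charset = Charset.forName(name);
    }

    void setHTML(String html) {
        init(new StringReader(html), "<string>");
    }

    void setReader(Reader newReader) {
        init(newReader, "<reader>");
    }

    void setInputStream(InputStream in) {
        init(new InputStreamReader(in, charset), "<stream>");
    }

    void setFile(File file) {
        try {
            init(new InputStreamReader(new FileInputStream(file), charset), file.getPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Single place where a new source is installed and stale state is cleared.
    private void init(Reader newReader, String location) {
        reader = newReader;
        resourceLocation = location;
        position = 0;
    }

    String getResourceLocation() {
        return resourceLocation;
    }

    int getPosition() {
        return position;
    }

    // Nothing is read until this is called; the setters only configure the bean.
    int read() {
        if (reader == null) {
            throw new IllegalStateException("no source set");
        }
        try {
            int c = reader.read();
            if (c != -1) {
                position++;
            }
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, calling setHTML("ab"), reading one character, and then calling setReader(...) leaves the bean with a position of 0 and a fresh resource location, which is the kind of stale-state reset the common method is meant to guarantee.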
From: Somik R. <so...@ya...> - 2003-01-08 05:44:53
Hi Holger,

>Finally you would get the results with:
>- java.util.List getResultList() that returns a List containing HTMLNode objects
>Returning simply a List is good in my opinion since this integrates the HTMLParser nicely into the standard Java collections framework. It also makes it future save for the later applicability of Generics found in Java 1.5.

This is a good suggestion. But the drawback of this approach is that we have to keep casting to the objects we want. I feel there is a significant performance improvement to be had by creating our own "list" object. You will find HTMLVector already in the source, but not yet integrated with the code (that requires a bit of work), which addresses this issue. Would you like to take that up?

>The solution for retrieving results with getResultXXX() methods would also allow to simply add some more and different result retriever methods, e.g.
>- org.htmlparser.util.HTMLEnumeration getResultHTMLEnumeration() or
>- org.htmlparser.util.HTMLTree getResultHTMLTree() that would retrieve an (to be programmed) HTMLTree (similar to a w3c Document)

Getting the results from the parser is a very important area, and we've been adding some visitors which we've found very useful. HTMLTree sounds really interesting. It would be nice if you could also check out the existing visitors.

Regards,
Somik

********************************************
Somik Raha
Extreme Programmer and Coach
Industrial Logic, Inc.
so...@in...
http://industriallogic.com
Voice : 510-540-8336
Fax : 510-540-8936
********************************************
Periodic reassessment means looking at things which are taken for granted, things which seem beyond doubt. Periodic reassessment means challenging all assumptions. It is not a matter of reassessing something because there is a need to reassess it; there may be no need at all. It is a matter of reassessing something simply because it is there and has not been assessed for a long time. It is a deliberate and quite unjustified attempt to look at things in a new way.
--- Edward De Bono in Lateral Thinking, Chapter 5, The Use of Lateral Thinking
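To make the casting point concrete, the sketch below contrasts an untyped list, where every read needs a cast, with a small purpose-built typed container. SketchNode and SketchNodeList are hypothetical names invented for this example; they do not reproduce HTMLNode or the HTMLVector class mentioned above, and the sketch makes no claim about performance. A generic List<SketchNode>, as anticipated for Java 1.5, removes the casts from calling code as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical node type standing in for HTMLNode.
class SketchNode {
    private final String text;

    SketchNode(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

// Hypothetical typed container: callers get SketchNode back without casting.
class SketchNodeList {
    private SketchNode[] nodes = new SketchNode[4];
    private int size;

    void add(SketchNode node) {
        if (size == nodes.length) {
            nodes = Arrays.copyOf(nodes, size * 2); // grow when full
        }
        nodes[size++] = node;
    }

    SketchNode elementAt(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return nodes[index];
    }

    int size() {
        return size;
    }
}

class CastingDemo {
    public static void main(String[] args) {
        // Untyped list, mimicking pre-generics usage: each read needs a cast.
        List<Object> untyped = new ArrayList<>();
        untyped.add(new SketchNode("from untyped list"));
        SketchNode a = (SketchNode) untyped.get(0);

        // Purpose-built typed list: no cast at the call site.
        SketchNodeList typed = new SketchNodeList();
        typed.add(new SketchNode("from typed list"));
        SketchNode b = typed.elementAt(0);

        // Generic list: also no cast at the call site.
        List<SketchNode> generic = new ArrayList<>();
        generic.add(new SketchNode("from generic list"));
        SketchNode c = generic.get(0);

        System.out.println(a.getText());
        System.out.println(b.getText());
        System.out.println(c.getText());
    }
}
```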