Re: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API
From: Derrick O. <Der...@ro...> - 2003-01-08 01:58:32
Holger Stenzhorn wrote:

<snip>

>According to my idea you would have to do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing actions.)
>

I believe you can do this now (see my recent submission 'Beanize the parser', described below).

>Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
>

The feedback object was under consideration for replacement by the generic logging facade provided by Jakarta, http://jakarta.apache.org/commons/logging.html, which does allow for multiple 'loggers'.

>
>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
>

The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and setReader(), provide the facility you want, so I would suggest using this same 'bean' pattern instead of the misnomer parse(), because nothing is actually parsed until later.

Following this naming convention, the existing setURL(), which handles file names as well as URLs, should probably be broken up into two methods, setFileName() and setURLString(), but it's very handy to have a single method that understands both for command line interpretation. Resist the temptation to overload it [as in setURL(URL url)], or you'll break a very useful bean pattern.
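As an illustration of the single string setter that understands both URLs and file names, here is a minimal sketch of how such a method might tell the two apart. The class SourceKind and its classify() method are hypothetical names invented for this example and are not part of HTMLParser; the only assumption is that anything java.net.URL refuses to parse is treated as a file name.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper, not part of HTMLParser: classifies a command-line source string. */
final class SourceKind {
    enum Kind { REMOTE_URL, LOCAL_FILE }

    /**
     * Returns REMOTE_URL when the string parses as a URL with a protocol
     * known to the JVM, otherwise falls back to LOCAL_FILE.
     */
    static Kind classify(String source) {
        try {
            new URL(source); // throws if there is no (known) protocol
            return Kind.REMOTE_URL;
        } catch (MalformedURLException e) {
            return Kind.LOCAL_FILE;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("http://example.com/index.html"));
        System.out.println(classify("/usr/local/index.html"));
    }
}
```

A setter built on this would stay a plain String property, which is what keeps it usable as a bean property and from the command line.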
I might suggest the current setURL() be renamed to setSource(). The parse(String) method you mention presumably takes HTML text and wraps it in a reader like HTMLParserTestCase.createParser() does. This should be called setHTML(). So we have:

    setSource("http://..." or "/usr/local")
    setURLString("http://...")
    setFileName("/usr/local/...")
    setHTML("<html><head>...")
    setFile(new File("/usr/local/.."))
    setInputStream(new BufferedInputStream(...))
    setReader(new FileReader("/usr/local/.."))
    setURL(new URL("http://..."))
    setConnection(url.openConnection())

I would suggest that all these channel through a common initialization method, to avoid repeating the same code over and over and to ensure that everything that needs resetting is reset correctly.

For reuse, all of these methods would need to set the resourceLocn field somehow, so that a stale source is not used in warning messages; a setResourceLocation() that just sets the field is probably needed. And most would need to set the encoding in order to correctly convert raw bytes into characters. Since setEncoding() resets the current reader or connection to handle a charset directive in the HTML header, a setCharset() method that just sets the character encoding is probably needed (or vice versa).

That would mean the typical reuse would then be:

    parser = new HTMLParser();
    parser.registerScanners();
    while (<more>)
    {
        parser.setResourceLocation("<where>");
        parser.setCharset("<encoding>");
        parser.setXXXX(<whatever>);
        enumeration = parser.getResultHTMLEnumeration();
    }

However, since the HTMLParser object is fairly lightweight, it may be better to just create another one whenever it's needed, and if you're really concerned about memory churn, just move the scanners into place:

    parser = new HTMLParser();
    parser.registerScanners();
    scanners = parser.getScanners();
    while (<more>)
    {
        parser = new HTMLParser();
        parser.setScanners(scanners);
        ...
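To make the 'common initialization method' idea concrete, here is a minimal sketch of a bean whose source setters all funnel through one private init(). ReusableSourceBean is a hypothetical stand-in, not the real HTMLParser class. The default charset, the placeholder location labels, and the choice to let setResourceLocation() override the label after a source setter has run are all assumptions made for the example.

```java
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical stand-in for a reusable parser bean; not the real HTMLParser. */
class ReusableSourceBean {
    private Reader reader;
    private String resourceLocation = "<none>";
    private String charset = "ISO-8859-1"; // assumed default, for illustration only

    /** Single place where per-source state is replaced, shared by every source setter. */
    private void init(Reader newReader, String location) {
        this.reader = newReader;
        this.resourceLocation = location; // overwrites any stale location from the last source
    }

    void setHTML(String html) { init(new StringReader(html), "<in-memory HTML>"); }

    void setReader(Reader newReader) { init(newReader, "<reader>"); }

    /** Only records the label used in messages; call it after a source setter in this sketch. */
    void setResourceLocation(String location) { this.resourceLocation = location; }

    /** Only records the encoding; does not reopen or reset the current source. */
    void setCharset(String charset) { this.charset = charset; }

    String getCharset() { return charset; }

    Reader getReader() { return reader; }

    /** Formats a warning prefixed with the current resource location. */
    String warning(String message) { return resourceLocation + ": " + message; }

    public static void main(String[] args) {
        ReusableSourceBean bean = new ReusableSourceBean();
        bean.setHTML("<html></html>");
        bean.setResourceLocation("first.html");
        System.out.println(bean.warning("unclosed tag"));
        bean.setHTML("<p>"); // reuse: the old location is replaced by init()
        System.out.println(bean.warning("unclosed tag"));
    }
}
```

Because every source setter goes through init(), a second setHTML() call cannot leave the previous location behind in warning messages, which is the stale-source problem described above.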