AW: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API
From: Holger S. <Hol...@xt...> - 2003-01-08 10:36:47
Hi!

First of all: thanks for all your comments!
Second, my comments on your comments :-)

- Logging: I have been using Jakarta Log4J and also Commons Logging for some time now, and my experience with them has been very good so far. They are easy and intuitive to use and also quite powerful. But the point Claude makes in his mail about depending on other projects is also true, so his proposal of a feedback utility class is good in my view and would provide a nice facade to the outside world. Question: Java 1.4, as you all know, actually provides a built-in logging facility. HTMLParser is also targeted at Java versions 1.2 and 1.3, so using this built-in facility is ruled out, right?

- Naming Convention: I actually wrote the same thing about get/setURL last week to Somik. I would expect the getURL() method to return a URL object just as the standard Java classes do (e.g. java.net.URI, java.net.HttpURLConnection, ...). So either split up the functions as you propose, or change the function altogether to let it return a URL object that can encapsulate both a filename and a URL string (and check that one for correctness directly when generating the object).

- Bean Pattern and Parse Methods: I actually thought of using that pattern too, since I use it a lot in other code. The reason why I propose the parse(XXX) methods is conformity: all standard XML parsers like javax.xml.parsers.DocumentBuilder/SAXParser or org.jdom.input.DOMBuilder/SAXBuilder use the same or very similar API usage patterns. In this way, users that deploy our HTMLParser and some XML parser in their work (like I do, for example) would have a very homogeneous way of accessing the APIs. What is also important to note here: the parse method would only be a facade for the users of the HTMLParser. Internally I would also apply the bean pattern that you propose. So I think there would not be much code duplication at all, if any.
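To make the "parse() as a facade over the bean setters" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: ParseFacadeSketch is a hypothetical stand-in and not the real HTMLParser class, parse(String) is assumed to take HTML text, and "parsing" is reduced to draining the reader so that the delegation stays visible.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration only; NOT the real HTMLParser class.
class ParseFacadeSketch {
    private Reader reader;                 // bean property: where the input comes from
    private String charset = "ISO-8859-1"; // bean property; the default is an assumption
    private String result;                 // filled in by the most recent run

    // Bean-style setters: they only record configuration, nothing is read yet.
    public void setCharset(String charset) { this.charset = charset; }
    public String getCharset() { return charset; }
    public void setReader(Reader reader) { this.reader = reader; this.result = null; }
    public void setHTML(String html) { setReader(new StringReader(html)); }

    // Facade methods: each one configures the source through a setter, then runs.
    public void parse(String html) { setHTML(html); run(); }
    public void parse(Reader reader) { setReader(reader); run(); }

    // Placeholder for real parsing: simply drains the reader into a string.
    private void run() {
        StringBuilder text = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        result = text.toString();
    }

    public String getResult() { return result; }
}
```

With this shape, the only code that exists twice is the one-line delegation in each parse() overload; the setters remain usable on their own for bean-style tools.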
Well, if I look at your code snippet, there is not much difference from my API proposal; actually only one line would change:

    parser = new HTMLParser();
    parser.registerScanners();
    while (<more>)
    {
        parser.setCharset("<encoding>");
        parser.setXXXX(<whatever>);
        parser.parse(<what>);
        enumeration = parser.getResultHTMLEnumeration();
    }

One more addition to the above: just planting the parse() methods into the HTMLParser code as it is right now would indeed be a misnomer. That is why I think a refactoring should take place. This refactoring would be a good thing to do anyway, whether you add the parse() methods or not.

- HTMLVector and Visitors (to Somik): I have already taken a brief look at them. I will dig deeper as soon as possible. Perhaps I can readily trash some of my ideas once I have looked more carefully at that stuff. :-)

But still: what do you think about that?

Holger

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Wednesday, 8 January 2003 03:03
To: htm...@li...
Subject: Re: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API

Holger Stenzhorn wrote:

<snip>

>According to my idea you would have to do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing actions.)
>

I believe you can do this now (see my recent submission 'Beanize the parser', described below).

>Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
>

The feedback object was under consideration for replacement by the generic logging facade provided by Jakarta, http://jakarta.apache.org/commons/logging.html which does allow for multiple 'loggers'.
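The "one or more feedbacks" idea can be sketched without any external logging dependency. The names below (SketchFeedback, FeedbackHub, CollectingFeedback, addFeedback) are hypothetical and chosen only for this illustration; they are not the HTMLParserFeedback interface from the code base, nor the Commons Logging API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical feedback interface; the real HTMLParserFeedback may look different.
interface SketchFeedback {
    void info(String message);
    void warning(String message);
}

// Fans every message out to all registered feedback objects.
class FeedbackHub implements SketchFeedback {
    private final List<SketchFeedback> sinks = new ArrayList<>();

    public void addFeedback(SketchFeedback feedback) {
        sinks.add(feedback);
    }

    @Override
    public void info(String message) {
        for (SketchFeedback sink : sinks) {
            sink.info(message);
        }
    }

    @Override
    public void warning(String message) {
        for (SketchFeedback sink : sinks) {
            sink.warning(message);
        }
    }
}

// Example sink that simply remembers what it was told.
class CollectingFeedback implements SketchFeedback {
    final List<String> lines = new ArrayList<>();

    @Override
    public void info(String message) {
        lines.add("INFO " + message);
    }

    @Override
    public void warning(String message) {
        lines.add("WARN " + message);
    }
}
```

A parser holding a single FeedbackHub would only ever talk to one object, while callers stay free to register as many sinks as they like, including one that forwards to a logging library.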
>
>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
>

The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and setReader(), provide the facility you want, so I would suggest using this same 'bean' pattern instead of the misnomer parse(), because it really isn't parsed till later.

Following this naming convention, the existing setURL(), which handles file names as well as URLs, should probably be broken up into two methods, setFileName() and setURLString(), but it's very handy to have a single method that understands both for command line interpretation. Resist the temptation to overload it [as in setURL(URL url)], or you'll break a very useful bean pattern. I might suggest the current setURL() be renamed to setSource().

The parse(String) method you mention presumably takes HTML text and wraps it in a reader like HTMLParserTestCase.createParser() does. This should be called setHTML().

So we have:

    setSource("http://..." or "/usr/local")
    setURLString("http://...")
    setFileName("/usr/local/...")
    setHTML("<html><head>...")
    setFile(new File("/usr/local/.."))
    setInputStream(new BufferedInputStream())
    setReader(new FileReader("/usr/local/.."))
    setURL(new URL("http://..."))
    setConnection(url.openConnection())

I would suggest that all these channel through a common initialization method to avoid repeating the same code over and over and to ensure that all necessary things are reset correctly.

For reuse, all of these methods would need to set the field resourceLocn somehow so that a stale source is not used in warning messages, so a setResourceLocation() that just sets the field is probably needed. And most would need to set the encoding in order to correctly convert raw bytes into characters. Since setEncoding() resets the current reader or connection to handle a charset directive in the HTML header, a setCharset() method that just sets the character encoding is probably needed (or vice versa). The typical reuse would then be:

    parser = new HTMLParser();
    parser.registerScanners();
    while (<more>)
    {
        parser.setResourceLocation("<where>");
        parser.setCharset("<encoding>");
        parser.setXXXX(<whatever>);
        enumeration = parser.getResultHTMLEnumeration();
    }

However, since the HTMLParser object is fairly lightweight, it may be better to just create another one whenever it's needed, and if you're really concerned about memory churn, just move the scanners into place:

    parser = new HTMLParser();
    parser.registerScanners();
    scanners = parser.getScanners();
    while (<more>)
    {
        parser = new HTMLParser();
        parser.setScanners(scanners);
        ...
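The suggestion that every setXXX() method should channel through one common initialization method might look roughly like the following. This is only a sketch under assumptions: SourceSketch, init(), charsRead and the "<string>"/"<reader>" location labels are made up for illustration, and only the field name resourceLocn is taken from the discussion above.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration only; NOT the real HTMLParser class.
class SourceSketch {
    private Reader reader;
    private String resourceLocn; // used in warning messages
    private int charsRead;       // example of per-source state that must be reset

    // Single place where a new source is installed and stale state is cleared.
    private void init(Reader newReader, String location) {
        this.reader = newReader;
        this.resourceLocn = location;
        this.charsRead = 0;
    }

    // Every public setter only adapts its argument and delegates to init().
    public void setHTML(String html) { init(new StringReader(html), "<string>"); }
    public void setReader(Reader r) { init(r, "<reader>"); }
    public void setFile(File file) {
        try {
            init(new FileReader(file), file.getPath());
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one character from the current source, or -1 at end of input.
    public int read() {
        try {
            int c = reader.read();
            if (c != -1) {
                charsRead++;
            }
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int getCharsRead() { return charsRead; }

    // Warning text always names the current source, never a stale one.
    public String warning(String message) { return resourceLocn + ": " + message; }
}
```

Because the reset logic lives only in init(), adding another setter (for a URL or a connection, say) cannot forget to clear the location or the counters.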