AW: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API

Hi!

First of all: Thanks for all your comments!

Second, my comments to your comments :-)

- Logging: I have been using Jakarta Log4J and also Commons Logging for
some time now, and my experience with both has been very good so far.
They are easy and intuitive to use and also quite powerful. But the point
Claude makes in his mail about depending on other projects is also valid,
so his proposal of a feedback utility class is good in my view and would
provide a nice facade to the outside world. Question: Java 1.4, as you
all know, provides a built-in logging facility. HTMLParser also targets
Java 1.2 and 1.3, so using this built-in facility is ruled out, right?

- Naming Convention: I actually wrote the same thing about get/setURL to
Somik last week. I would expect the getURL() method to return a URL
object, just as the standard Java classes do (e.g.
java.net.URLConnection and its subclass java.net.HttpURLConnection). So
either split up the functions as you propose, or change the function
altogether so that it returns a URL object that can encapsulate both a
filename and a URL string (and that is checked for correctness directly
when the object is created).

- Bean Pattern and Parse Methods: I actually thought of using that
pattern too, since I use it a lot in other code. The reason why I
propose the parse(XXX) methods is conformity: standard XML parsers such
as javax.xml.parsers.DocumentBuilder/SAXParser or
org.jdom.input.DOMBuilder/SAXBuilder use the same or very similar API
usage patterns. That way, users who deploy our HTMLParser together with
some XML parser in their work (as I do, for example) would have a very
homogeneous way of accessing the APIs. What is also important to note
here: the parse method would only be a facade for the users of the
HTMLParser. Internally I would also apply the bean pattern that you
propose, so I think there would be little code duplication, if any.
Looking at your code snippet, there is not much difference from my API
proposal; only one line would change:

parser = new HTMLParser();
parser.registerScanners();
while (<more>)
{
    parser.setCharset("<encoding>");
    parser.setXXXX(<whatever>);
    parser.parse(<what>);
    enumeration = parser.getResultHTMLEnumeration();
}

One more addition to the above: just planting the parse() methods into
the HTMLParser code as it is right now would indeed be a misnomer. That
is why I think a refactoring should take place. This refactoring would
be a good thing to do anyway, whether or not you add the parse() methods.

- HTMLVector and Visitors (to Somik): I have already taken a brief look
at them. I will dig deeper as soon as possible. Perhaps I can readily
trash some of my ideas once I have looked at that stuff more carefully. :-)
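
To make the "parse() as a facade over the bean pattern" point above a bit
more concrete, here is a minimal sketch. It is not the real HTMLParser
code: the class name FacadeParserSketch, the parsed flag and the run()
placeholder are all made up for illustration, and it only assumes a
setReader() bean property like the one discussed in this thread.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for the parser; NOT the real HTMLParser class.
class FacadeParserSketch {
    private Reader reader;   // bean property: where the HTML comes from
    private boolean parsed;  // true once the (stubbed) parse step has run

    // Bean-style setter: only records the source, nothing is parsed yet.
    void setReader(Reader reader) {
        this.reader = reader;
        this.parsed = false;
    }

    Reader getReader() { return reader; }

    boolean isParsed() { return parsed; }

    // Facade: sets the bean property and then triggers the parse step.
    void parse(Reader reader) {
        setReader(reader);
        run();
    }

    // Convenience overload that wraps HTML text in a reader.
    void parse(String html) {
        parse(new StringReader(html));
    }

    // Placeholder for the real parsing work (assumption, not library code).
    private void run() {
        if (reader == null) {
            throw new IllegalStateException("no source set");
        }
        parsed = true;
    }
}
```

With this shape, callers who prefer the XML-parser style call parse(...),
while the setters stay available for bean-style use, and both paths share
the same internal state.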

But still: what do you think about that?

Holger

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Wednesday, 8 January 2003 03:03
To: htm...@li...
Subject: Re: [Htmlparser-developer] Request for comments: Proposal for
changes in HTMLParser API

Holger Stenzhorn wrote:

<snip>

>According to my idea you would have to do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing
>actions.)
>
I believe you can do this now (see my recent submission 'Beanize the
parser', described below).

>Third, you can add one or more (instead of only one as right now)
>feedbacks by calling
>addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
>
The feedback object was under consideration for replacement by the
generic logging facade provided by Jakarta,
http://jakarta.apache.org/commons/logging.html, which does allow for
multiple 'loggers'.
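
For comparison, supporting several feedback objects without an external
dependency could look roughly like the sketch below. The names
FeedbackSketch, FeedbackHubSketch and CollectingFeedbackSketch are
invented here and are not the existing HTMLParserFeedback API; the code
also uses current Java syntax (generics) for brevity, so it would need
adapting for Java 1.2/1.3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical feedback contract; the real HTMLParserFeedback may differ.
interface FeedbackSketch {
    void info(String message);
    void warning(String message);
    void error(String message, Exception cause);
}

// Fans every message out to all registered feedback objects.
class FeedbackHubSketch implements FeedbackSketch {
    private final List<FeedbackSketch> listeners = new ArrayList<>();

    void addFeedback(FeedbackSketch feedback) {
        if (feedback != null) {
            listeners.add(feedback);
        }
    }

    public void info(String message) {
        for (FeedbackSketch f : listeners) f.info(message);
    }

    public void warning(String message) {
        for (FeedbackSketch f : listeners) f.warning(message);
    }

    public void error(String message, Exception cause) {
        for (FeedbackSketch f : listeners) f.error(message, cause);
    }
}

// Simple listener that stores messages in memory, e.g. for tests.
class CollectingFeedbackSketch implements FeedbackSketch {
    final List<String> lines = new ArrayList<>();

    public void info(String message) { lines.add("INFO " + message); }
    public void warning(String message) { lines.add("WARN " + message); }
    public void error(String message, Exception cause) { lines.add("ERROR " + message); }
}
```

The parser would then hold a single hub and never need to know how many
listeners are attached; a listener that forwards to Log4J or Commons
Logging could be added by whoever wants that dependency.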

>
>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so
>better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string)
>in the HTMLParser class where the parameter is the name of a filter.
>Question: Is this function used a lot or at all? Can it be renamed or
>dropped and its functionality reimplemented in another way?)
>
The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and
setReader(), provide the facility you want, so I would suggest using
this same 'bean' pattern instead of the misnomer parse(), because it
really isn't parsed till later.

Following this naming convention, the existing setURL() which handles
file names as well as URLs should probably be broken up into two
methods, setFileName() and setURLString(), but it's very handy to have a
single method that understands both for command line interpretation.
Resist the temptation to overload it [as in setURL(URL url)], or you'll
break a very useful bean pattern. I might suggest the current setURL()
be renamed to setSource().
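
As a rough illustration of a single method that understands both file
names and URL strings, the sketch below first tries the string as a URL
and falls back to treating it as a file name. SourceResolverSketch and
resolve() are made-up names, this is not how the current setURL() is
implemented, and File.toURI() needs Java 1.4 or later.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper; not part of the HTMLParser code base.
class SourceResolverSketch {

    // Interprets the argument as a URL if it parses as one,
    // otherwise as a (possibly relative) file name.
    static URL resolve(String source) {
        try {
            // Works for strings with a known protocol, e.g. "http://...".
            return new URL(source);
        } catch (MalformedURLException notAUrl) {
            try {
                // No usable protocol: treat the string as a file name.
                return new File(source).toURI().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException(
                        "cannot interpret source: " + source, e);
            }
        }
    }
}
```

A setSource(String) could delegate to something like this, while
setFileName() and setURLString() would each take only one of the two
branches.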

The parse(String) method you mention presumably takes HTML text and
wraps it in a reader like HTMLParserTestCase.createParser() does. This
should be called setHTML().

So we have:
    setSource("http://..." or "/usr/local")
    setURLString("http://...")
    setFileName("/usr/local/...")
    setHTML("<html><head>...")
    setFile(new File("/usr/local/.."))
    setInputStream(new BufferedInputStream(...))
    setReader(new FileReader("/usr/local/.."))
    setURL(new URL("http://..."))
    setConnection(url.openConnection())

I would suggest that all these channel through a common initialization
method to avoid repeating the same code over and over and to ensure
correctly resetting all necessary things.
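
A minimal sketch of that funnelling, with invented names
(ResettableParserSketch, initialize(), the nodesRead counter and the
location labels are all assumptions, not existing HTMLParser members):

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for the parser; NOT the real HTMLParser class.
class ResettableParserSketch {
    private Reader reader;
    private String resourceLocation; // label used in warning messages
    private int nodesRead;           // stand-in for per-document parse state

    void setHTML(String html) {
        initialize(new StringReader(html), "<in-memory HTML>");
    }

    void setReader(Reader reader) {
        initialize(reader, "<reader>");
    }

    // Single place where a new source is installed and old state is reset,
    // so no setter can forget part of the reset.
    private void initialize(Reader newReader, String location) {
        if (newReader == null) {
            throw new IllegalArgumentException("source must not be null");
        }
        reader = newReader;
        resourceLocation = location;
        nodesRead = 0;
    }

    // Simulates progress through a document (placeholder for real parsing).
    void noteNodeRead() { nodesRead++; }

    int getNodesRead() { return nodesRead; }

    String getResourceLocation() { return resourceLocation; }
}
```

Every additional setter (file, URL, connection, ...) would only have to
produce a reader and a location label and hand both to initialize().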

For reuse, all of these methods would need to set the field resourceLocn
somehow so that a stale source is not used in warning messages, so a
setResourceLocation() that just sets the field is probably needed. And
most would need to set the encoding in order to correctly convert raw
bytes into characters. Since setEncoding() resets the current reader or
connection to handle a charset directive in the HTML header, a
setCharset() method that just sets the character encoding is probably
needed (or vice versa). That would mean the typical reuse would then
be:

parser = new HTMLParser();
parser.registerScanners();
while (<more>)
{
    parser.setResourceLocation("<where>");
    parser.setCharset("<encoding>");
    parser.setXXXX(<whatever>);
    enumeration = parser.getResultHTMLEnumeration();
}

However, since the HTMLParser object is fairly lightweight, it may be
better to just create another one whenever it's needed, and if you're
really concerned about memory churn, just move the scanners into place:

parser = new HTMLParser();
parser.registerScanners();
scanners = parser.getScanners();
while (<more>)
{
    parser = new HTMLParser();
    parser.setScanners(scanners);
    ...

_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer