Re: [Htmlparser-developer] Request for comments: Proposal for changes in HTMLParser API


Holger Stenzhorn wrote:

<snip>

>According to my idea you would have to do the following:
>First you create one HTMLParser object by calling the empty constructor:
>- HTMLParser()
>(This single HTMLParser object can be reused in consecutive parsing actions.)
>  
>
I believe you can do this now (see my recent submission 'Beanize the 
parser', described below).

>Third, you can add one or more (instead of only one as right now) feedbacks by calling addHTMLParserFeedback(HTMLParserFeedback htmlParserFeedback).
>
The feedback object was under consideration for replacement by the 
generic logging facade provided by Jakarta, 
http://jakarta.apache.org/commons/logging.html which does allow for 
multiple 'loggers'.

>
>Then you would use one of the following parse methods:
>- void parse(java.lang.String string)
>- void parse(java.io.File file)
>- void parse(java.io.InputStream inputStream)
>- void parse(java.io.Reader reader)
>- void parse(java.net.URL url)
>- void parse(java.net.URI uri) (but this would require JDK 1.4, so better leave this out for now)
>(Remark: I know there already is a method parse(java.lang.String string) in the HTMLParser class where the parameter is the name of a filter. Question: Is this function used a lot or at all? Can it be renamed or dropped and its functionality reimplemented in another way?)
>
The HTMLParser setXXX() methods, i.e. setURL(), setConnection() and 
setReader(), provide the facility you want, so I would suggest using 
this same 'bean' pattern instead of the misnomer parse(), because it 
really isn't parsed till later.

Following this naming convention, the existing setURL() which handles 
file names as well as URLs should probably be broken up into two 
methods, setFileName() and setURLString(), but it's very handy to have a 
single method that understands both for command line interpretation. 
Resist the temptation to overload it [as in setURL(URL url)], or you'll 
break a very useful bean pattern. I might suggest the current setURL() 
be renamed to setSource().

The parse(String) method you mention presumably takes HTML text and 
wraps it in a reader like HTMLParserTestCase.createParser() does.  This 
should be called setHTML().
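
A minimal sketch of what I mean by setHTML(), assuming it does nothing 
more than wrap the text in a reader. The class name HtmlSourceSketch, 
its field and the readAll() helper are made up for illustration and are 
not part of the current HTMLParser:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the parser; only the source-setting part is sketched.
class HtmlSourceSketch {
    private Reader reader;

    // Proposed setHTML(): wrap the literal HTML text in a Reader. Nothing is parsed here.
    void setHTML(String html) {
        this.reader = new StringReader(html);
    }

    Reader getReader() {
        return reader;
    }

    // Demo helper (an assumption, not an existing API): drain a reader into a String
    // to show what a later parse step would see.
    static String readAll(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HtmlSourceSketch parser = new HtmlSourceSketch();
        parser.setHTML("<html><head></head></html>");
        System.out.println(readAll(parser.getReader()));
    }
}
```

The point is only that the setter stores a source; reading and parsing 
happen later, which is why the name parse() would be misleading.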

So we have:
    setSource("http://..." or "/usr/local")
    setURLString("http://...")
    setFileName("/usr/local/...")
    setHTML("<html><head>...")
    setFile(new File("/usr/local/.."))
    setInputStream(new BufferedInputStream(new FileInputStream("/usr/local/..")))
    setReader(new FileReader("/usr/local/.."))
    setURL(new URL("http://..."))
    setConnection(url.openConnection())

I would suggest that all these channel through a common initialization 
method to avoid repeating the same code over and over and to ensure 
correctly resetting all necessary things.
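
As a rough sketch of the common initialization idea, assuming each 
setter only converts its argument and then delegates: the class 
CommonInitSketch, the initSource() method, the location strings and the 
reset counter are all hypothetical names for illustration, not existing 
HTMLParser code.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch, not the real HTMLParser: every setter funnels into one reset method.
class CommonInitSketch {
    private Reader reader;
    private String resourceLocn;
    private int resetCount; // demo only: counts how often per-source state was reset

    // Single place that swaps in a new source and replaces state tied to the old one.
    private void initSource(Reader newReader, String location) {
        this.reader = newReader;
        this.resourceLocn = location;
        this.resetCount++;
    }

    // Each public setter adapts its argument to a Reader, then delegates.
    void setHTML(String html) {
        initSource(new StringReader(html), "(in-memory HTML)");
    }

    void setReader(Reader r) {
        initSource(r, "(reader)");
    }

    String getResourceLocn() { return resourceLocn; }
    Reader getReader() { return reader; }
    int getResetCount() { return resetCount; }

    public static void main(String[] args) {
        CommonInitSketch parser = new CommonInitSketch();
        parser.setHTML("<html></html>");
        parser.setReader(new StringReader("<p>second source</p>"));
        System.out.println(parser.getResourceLocn() + ", resets: " + parser.getResetCount());
    }
}
```

With this shape, adding another setXXX() means writing one conversion 
and one call, and nothing left over from the previous source can leak 
into the next one.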

For reuse, all of these methods would need to set the resourceLocn field 
somehow, so that a stale source is not named in warning messages. A 
setResourceLocation() that just sets the field is probably needed. Most 
would also need to set the encoding in order to correctly convert raw 
bytes into characters.  Since setEncoding() resets the current reader or 
connection to handle a charset directive in the HTML header, a 
setCharset() method that just sets the character encoding is probably 
needed (or vice versa). The typical reuse would then be:

parser = new HTMLParser();
parser.registerScanners();
while (<more>)
{
    parser.setResourceLocation("<where>");
    parser.setCharset("<encoding>");
    parser.setXXXX(<whatever>);
    enumeration = parser.getResultHTMLEnumeration();
}

However, since the HTMLParser object is fairly lightweight, it may be 
better to just create a new one whenever it's needed. If you're 
really concerned about memory churn, just move the scanners into place:

parser = new HTMLParser();
parser.registerScanners();
scanners = parser.getScanners();
while (<more>)
{
    parser = new HTMLParser();
    parser.setScanners(scanners);
    ...