Re: [Htmlparser-developer] toPlainTextString() feedback requested

Hi Derrick,
    You've raised some interesting points.

    Looking at it from an end-user perspective, before the visitor was
introduced, the code for collecting text from a page would be:

StringBuffer textCollectionBuffer = new StringBuffer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
}
String result = textCollectionBuffer.toString();

After putting in the visitor, the code looks like:

Renderer plainTextRenderer = new PlainTextRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(plainTextRenderer);
}
String result = plainTextRenderer.getResult();

IMO, from a user perspective, there's almost no extra complexity. You would
have collected the results into a string buffer anyway; the renderer now does
that for you.
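
To make the mechanism concrete, here is a minimal sketch of how such a
renderer could be put together. The types below (Node, TextNode, TagNode, and
the renderText/renderTag methods) are hypothetical stand-ins made up for
illustration, not the actual parser classes; only the renderWith/getResult
shape is taken from the loop above.

```java
// Hypothetical stand-ins for illustration; not the real parser classes.
interface Renderer {
    void renderText(TextNode node);
    void renderTag(TagNode node);
    String getResult();
}

interface Node {
    // Each node calls back the renderer method matching its own type.
    void renderWith(Renderer renderer);
}

final class TextNode implements Node {
    private final String text;
    TextNode(String text) { this.text = text; }
    String getText() { return text; }
    public void renderWith(Renderer renderer) { renderer.renderText(this); }
}

final class TagNode implements Node {
    private final String name;
    TagNode(String name) { this.name = name; }
    String toHTML() { return "<" + name + ">"; }
    public void renderWith(Renderer renderer) { renderer.renderTag(this); }
}

final class PlainTextRenderer implements Renderer {
    private final StringBuilder buffer = new StringBuilder();
    public void renderText(TextNode node) { buffer.append(node.getText()); }
    public void renderTag(TagNode node) { /* tags contribute no plain text */ }
    public String getResult() { return buffer.toString(); }
}

final class PlainTextDemo {
    public static void main(String[] args) {
        Node[] nodes = {
            new TagNode("b"), new TextNode("Hello"),
            new TagNode("/b"), new TextNode(" world")
        };
        Renderer renderer = new PlainTextRenderer();
        for (Node node : nodes) {
            node.renderWith(renderer);
        }
        System.out.println(renderer.getResult()); // prints: Hello world
    }
}
```

The client loop stays the same whichever renderer is plugged in; only the
object that collects the result changes.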

If it were just for the plain text renderer, this refactoring would not be
justified. However, the code for getting HTML out would look like:

Renderer htmlRenderer = new HTMLRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(htmlRenderer);
}
String result = htmlRenderer.getResult();

And now, when ripping web pages, we would like to rewrite the links and images
to point to a base location on the local disk. We could have a modifying HTML
renderer, like this:

Renderer modifyingLinksAndImagesRenderer =
    new ModifyingLinksAndImagesRenderer();
...

Any change to the rendering mechanism is now easily handled, and we don't
have to modify the parser itself for every new user requirement.
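
As a rough illustration of the ripper case, the sketch below rewrites link
targets to a local base directory by overriding a single method of an HTML
renderer. Everything in it is an assumption for the sake of the example: the
LinkTag/TextNode types, the renderLink/renderText methods, the
ModifyingLinksRenderer name (links only, images left out for brevity), and
the rule of keeping just the file name of the original URL.

```java
// Hypothetical stand-ins for illustration; not the real parser classes.
interface Renderer {
    void renderText(TextNode node);
    void renderLink(LinkTag link);
    String getResult();
}

interface Node {
    void renderWith(Renderer renderer);
}

final class TextNode implements Node {
    private final String text;
    TextNode(String text) { this.text = text; }
    String getText() { return text; }
    public void renderWith(Renderer renderer) { renderer.renderText(this); }
}

final class LinkTag implements Node {
    private final String href;
    private final String label;
    LinkTag(String href, String label) { this.href = href; this.label = label; }
    String getHref() { return href; }
    // The tag still owns its HTML; a caller may only choose the href to emit.
    String toHTML(String hrefToEmit) {
        return "<a href=\"" + hrefToEmit + "\">" + label + "</a>";
    }
    public void renderWith(Renderer renderer) { renderer.renderLink(this); }
}

class HTMLRenderer implements Renderer {
    protected final StringBuilder buffer = new StringBuilder();
    public void renderText(TextNode node) { buffer.append(node.getText()); }
    public void renderLink(LinkTag link) { buffer.append(link.toHTML(link.getHref())); }
    public String getResult() { return buffer.toString(); }
}

// Assumed rewrite rule: keep only the file name and prefix it with localBase.
final class ModifyingLinksRenderer extends HTMLRenderer {
    private final String localBase;
    ModifyingLinksRenderer(String localBase) { this.localBase = localBase; }
    @Override
    public void renderLink(LinkTag link) {
        String href = link.getHref();
        String fileName = href.substring(href.lastIndexOf('/') + 1);
        buffer.append(link.toHTML(localBase + fileName));
    }
}

final class RipperDemo {
    public static void main(String[] args) {
        Node[] nodes = {
            new TextNode("See "),
            new LinkTag("http://example.com/docs/page.html", "the docs")
        };
        Renderer renderer = new ModifyingLinksRenderer("file:///tmp/site/");
        for (Node node : nodes) {
            node.renderWith(renderer);
        }
        // prints: See <a href="file:///tmp/site/page.html">the docs</a>
        System.out.println(renderer.getResult());
    }
}
```

Note that neither the node classes nor the client loop had to change to get
the new behaviour.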

> It breaks encapsulation.  It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

In light of the above, this is not a bad thing. At the same time, putting in
the visitor has caused (and will continue to cause) refactoring of the tag
classes, so that most of the logic stays encapsulated in the tags and the
visitor only makes calls to the tag's concrete methods.
For example, the visitor still does NOT know how to render a tag's HTML. Such
logic continues to be encapsulated within the tag. The visitor only deals with
a high-level abstraction.

> It makes the system brittle.  The visitor class has a method defined for
> every subclass of visited object.  Adding, removing or refactoring any
> visited class breaks the visitors.

A valid point indeed. To counter, in our implementation there is a lot of
duplication, and quite a few classes really don't need their own toHTML()
method. The parent class, HTMLTag, should take care of single tags. Tags
with children always have duplicated code to recurse through their
children. We can refactor that functionality out, so that tags with
children conform to a recursion interface. Using such an interface, you won't
have recursion loops in each and every node class; you can handle the
recursion uniformly outside. So the visitor does not need to take care of
each and every subclass. It will take care of a few special ones only, while
visiting the rest as a simple HTMLTag.
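
A minimal sketch of that idea, under assumptions: the names NodeVisitor,
CompositeNode, CompositeTag, Traversal and TagNameCollector are invented for
illustration, and HTMLTag/LinkTag here are simplified stand-ins rather than
the real classes. Special visit methods fall back to the generic one by
default, and the recursion lives in one place outside the node classes.

```java
// Hypothetical stand-ins for illustration; not the real parser classes.
abstract class NodeVisitor {
    // Generic fallback for any tag that needs no special treatment.
    void visitTag(HTMLTag tag) { }
    // Special cases delegate to the fallback unless a visitor overrides them.
    void visitLinkTag(LinkTag tag) { visitTag(tag); }
}

class HTMLTag {
    private final String name;
    HTMLTag(String name) { this.name = name; }
    String getName() { return name; }
    void accept(NodeVisitor visitor) { visitor.visitTag(this); }
}

class LinkTag extends HTMLTag {
    LinkTag() { super("a"); }
    @Override
    void accept(NodeVisitor visitor) { visitor.visitLinkTag(this); }
}

// The "recursion interface": tags with children merely expose them.
interface CompositeNode {
    HTMLTag[] getChildren();
}

class CompositeTag extends HTMLTag implements CompositeNode {
    private final HTMLTag[] children;
    CompositeTag(String name, HTMLTag... children) {
        super(name);
        this.children = children;
    }
    public HTMLTag[] getChildren() { return children; }
}

// The single place where recursion through children happens.
final class Traversal {
    static void walk(HTMLTag tag, NodeVisitor visitor) {
        tag.accept(visitor);
        if (tag instanceof CompositeNode) {
            for (HTMLTag child : ((CompositeNode) tag).getChildren()) {
                walk(child, visitor);
            }
        }
    }
}

// A visitor that special-cases links and treats everything else generically.
final class TagNameCollector extends NodeVisitor {
    final StringBuilder names = new StringBuilder();
    @Override
    void visitTag(HTMLTag tag) { names.append(tag.getName()).append(' '); }
    @Override
    void visitLinkTag(LinkTag tag) { names.append("LINK "); }
}

final class TraversalDemo {
    public static void main(String[] args) {
        HTMLTag tree = new CompositeTag("table",
            new CompositeTag("tr", new LinkTag(), new HTMLTag("img")),
            new HTMLTag("br"));
        TagNameCollector collector = new TagNameCollector();
        Traversal.walk(tree, collector);
        System.out.println(collector.names.toString().trim()); // prints: table tr LINK img br
    }
}
```

With this shape, adding a new plain tag subclass does not touch any visitor;
only a tag that needs special handling adds a visit method, and existing
visitors keep working through the fallback.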

> It's obscure.  Looking at the visit loop tells you nothing of what is
> being performed.  The only indication of the task might be the name of
> the visitor class.

That is again not a bad thing, since intention-revealing names reduce the need
for excessive documentation. You should be able to look at a class name and
have a fairly good idea of what it does. A look at the API of the class
would confirm your guess.

At the same time, looking at the visit loop above, you'd see that it looks
intuitive enough - the loop is not in the visitor, it is still in the
client code.

> It's redundant.  In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

This is the point we're really thinking about. I've personally been thinking
about this for some months, since we did have realistic requirements to
modify links and images in a ripper in a uniform manner. In fact,
the current version of the parser already has a visitor. The Sample Programs
make no mention of the word "visitor", to avoid controversy and confusion
:), but if you look at
http://htmlparser.sourceforge.net/samples/ripper.html - this is what's
happening. I must add that although this visitor does provide a solution, I
wasn't very happy with it. We've been able to refactor this out, and provide
a uniform mechanism now.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change)

I think we're speaking the same language here - this is also our vision. The
code you see on the samples page is an example of that (although not a very
good one). We're working along those lines, and the first step was to
introduce such levels of abstraction.

Regards,
Somik