Re: [Htmlparser-developer] toPlainTextString() feedback requested
From: Somik R. <so...@ya...> - 2002-12-24 18:48:57
Hi Derrick,

You've raised some interesting points.

Looking at it from an end-user perspective, prior to using the visitor, the code for collecting text from a page would be:

    StringBuffer textCollectionBuffer = new StringBuffer();
    for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
        textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
    }
    String result = textCollectionBuffer.toString();

After putting in the visitor, the code looks like:

    Renderer plainTextRenderer = new PlainTextRenderer();
    for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
        e.nextHTMLNode().renderWith(plainTextRenderer);
    }
    String result = plainTextRenderer.getResult();

IMO, from a user perspective, there's almost no extra complexity. You'd collect the results into a string buffer anyway; the renderer now does that for you.

If it were just for the plain text renderer, we would not be justified in this refactoring. However, the code for getting HTML out would look like:

    Renderer htmlRenderer = new HTMLRenderer();
    for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
        e.nextHTMLNode().renderWith(htmlRenderer);
    }
    String result = htmlRenderer.getResult();

And now, when ripping webpages, we would like to rewrite the links and images to point at a base location on the local disk. We could have a modifying HTML renderer, like this:

    Renderer modifyingLinksAndImagesRenderer = new ModifyingLinksAndImagesRenderer();
    ...

Any modification in the mechanism of rendering is now easily handled, and we don't have to modify the parser per se for every new user requirement.

> It breaks encapsulation. It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

In light of the above, this is not a bad thing.
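To make the renderer idea above a little more concrete, here is a minimal, self-contained sketch. Only the names Renderer, PlainTextRenderer, renderWith() and getResult() come from the snippets above. Everything else (Node, TextNode, LinkTag, the visitText/visitLink methods, ModifyingLinksRenderer and its base-location argument) is a hypothetical stand-in invented for illustration and is not the parser's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the parser's node type (not the real HTMLParser API).
interface Node {
    void renderWith(Renderer renderer);
}

// The renderer (visitor) only talks to the nodes through their public methods.
interface Renderer {
    void visitText(TextNode text);   // assumed method name
    void visitLink(LinkTag link);    // assumed method name
    String getResult();
}

final class TextNode implements Node {
    private final String text;
    TextNode(String text) { this.text = text; }
    String getText() { return text; }
    public void renderWith(Renderer renderer) { renderer.visitText(this); }
}

final class LinkTag implements Node {
    private final String href;
    private final String label;
    LinkTag(String href, String label) { this.href = href; this.label = label; }
    String getHref() { return href; }
    String getLabel() { return label; }
    // The tag, not the renderer, still knows how its HTML is written out.
    String toHTML(String hrefToUse) {
        return "<a href=\"" + hrefToUse + "\">" + label + "</a>";
    }
    public void renderWith(Renderer renderer) { renderer.visitLink(this); }
}

// Collects plain text; a link contributes only its label.
final class PlainTextRenderer implements Renderer {
    private final StringBuilder out = new StringBuilder();
    public void visitText(TextNode text) { out.append(text.getText()); }
    public void visitLink(LinkTag link) { out.append(link.getLabel()); }
    public String getResult() { return out.toString(); }
}

// Emits HTML, prefixing every link with a local base location (hypothetical).
final class ModifyingLinksRenderer implements Renderer {
    private final String base;
    private final StringBuilder out = new StringBuilder();
    ModifyingLinksRenderer(String base) { this.base = base; }
    public void visitText(TextNode text) { out.append(text.getText()); }
    public void visitLink(LinkTag link) { out.append(link.toHTML(base + link.getHref())); }
    public String getResult() { return out.toString(); }
}

public class RendererSketch {
    // The loop stays in client code; the renderer just accumulates the result.
    static String render(List<Node> nodes, Renderer renderer) {
        for (Node node : nodes) {
            node.renderWith(renderer);
        }
        return renderer.getResult();
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.<Node>asList(
                new TextNode("See "), new LinkTag("docs/index.html", "the docs"));
        System.out.println(render(nodes, new PlainTextRenderer()));
        System.out.println(render(nodes, new ModifyingLinksRenderer("file:///tmp/rip/")));
    }
}
```

In this sketch, swapping the renderer changes the output without touching the node classes, and the link's HTML is still produced by the tag itself.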
At the same time, putting in the visitor has caused/will cause refactoring of the tag classes so that most of the logic is still encapsulated in the tags, and the visitor only makes calls to the tag's concrete methods. E.g., the visitor still does NOT know how to render a tag's HTML. Such logic continues to be encapsulated within the tag. The visitor only deals with a high-level abstraction.

> It makes the system brittle. The visitor class has a method defined for
> every subclass of visited object. Adding, removing or refactoring any
> visited class breaks the visitors.

A valid point indeed. To counter: in our implementation there is a lot of duplication, and quite a few classes really don't need their own toHTML() method. The parent class, HTMLTag, should take care of single tags. Those tags with children always have duplicated code to recurse through their children. We can refactor such functionality out, so that tags with children conform to a recursion interface. Using such an interface, you won't have the recursion loops in each and every node class, but can handle them uniformly outside. So the visitor does not need to take care of each and every subclass. It will only take care of a few special ones, while visiting the rest as a simple HTMLTag.

> It's obscure. Looking at the visit loop tells you nothing of what is
> being performed. The only indication of the task might be the name of
> the visitor class.

That is again not a bad thing, for well-intentioned naming reduces the need for excessive documentation. You should be able to look at a class name and have a fairly good idea of what it does. A look at the API of the class would confirm your guess. At the same time, looking at the visit loop above, you'd see that it looks intuitive enough - the loop is not in the visitor, it is still in the client code.

> It's redundant.
> In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

This is the point we're really thinking about. I've personally been thinking about this for some months, for we did have realistic requirements of being able to modify links and images in a ripper, in a uniform manner. In fact, the current version of the parser already has a visitor. The Sample Programs make no mention of the word "visitor" to avoid controversy and confusion :), but if you look at http://htmlparser.sourceforge.net/samples/ripper.html - this is what's happening. I must add that although this visitor does provide a solution, I wasn't very happy with it. We've been able to refactor this out, and provide a uniform mechanism now.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change)

I think we're talking the same language here - this is also our vision. The code you see on the samples page is an example of that (although not a very good one). We're working along those lines, and the first step was to introduce such levels of abstraction.

Regards,
Somik