Re: [Htmlparser-developer] toPlainTextString() feedback requested
Brought to you by:
derrickoswald
From: Derrick O. <Der...@ro...> - 2002-12-24 20:00:21
|
Somik Raha wrote: >Hi Derrick, > You've raised some interesting points.. > > Looking at it from an end-user perspective, prior to using the visitor, >the code for collecting text from a page would be : > >StringBuffer textCollectionBuffer = new StringBuffer(); >for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { > textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString()); >} >String result = textCollectionBuffer.toString(); > >After putting in the visitor, the code looks like : > >Renderer plainTextRenderer = new PlainTextRenderer(); >for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { > e.nextHTMLNode().renderWith(plainTextRenderer); >} >String result = plainTextRenderer.getResult(); > >IMO, from a user perspective, there's almost no extra complexity. You'd >anyway collect results into a string buffer, while the renderer now does >that for you. > >If it was just for the plain text renderer, we would not be justified in >this refactoring. However, the code for getting html out would look like : >Renderer htmlRenderer = new HTMLRenderer(); >for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { > e.nextHTMLNode().renderWith(htmlRenderer); >} >String result = htmlRenderer.getResult(); > >And now, when ripping webpages, we would like to modify the links and images >to a base location in the local disk. We could have modifying html renderer, >like this: > >Renderer modifyingLinksAndImagesRenderer = new >ModifyingLinksAndImagesRenderer(); >... > >Any modification in the mechanism of rendering is now easily handled, and we >don't have to modify the parser per se, for every new user requirement. > The first two examples gain you nothing from using a visitor pattern. The plainTextRenderer and HTMLRenderer call the toPlainTextString() and toHTML() methods because only the tag knows how to render itself into text or HTML (see your comment below). Your last example indicates the visitor class knows which parameters have links and which parameters have images for each tag and I would suggest this knowledge should be in the tag class itself. If the base class (HTMLTag I presume) had an abstract method signature 'toHTML(URL baseURL)', then code within the class would adjust relative link and image references based on it's value, and the current behaviour of 'toHRML()' would just call toHTML(null). The code is not duplicated, only refactored to handle a base URL. > > > >>It breaks encapsulation. It presumes that an exterior class (the >>visitor) knows how to do something to a class better than the class >>itself does. >> >> > >In light of the above, this is not a bad thing. At the same time, putting in >the visitor has caused/will cause refactoring of the tag classes so that >most of the logic is still encapsulated in the tags, and the visitor only >makes calls to the tag's concrete methods. >e.g. The visitor still does NOT know how to render a tag's html. Such logic >continues to be encapsulated within the tag. The visitor only deals with a >high-level abstraction. > So why bother with the extra overhead of the visitor class? Logically, either the visitor has code that differs from the tag code or it doesn't. If it doesn't, then cut the Gordian knot and use the simplest mechanism and forget the visitor. If it does differ, the code within the visitor either uses the methods of the visitee only or it has some new methods with logic and data that are external to the visitee. In the former case, that code that only uses methods on the visitee could be a method on the vistee's superclass. If it has new methods, either the visitors extra logic and data are specific to each vistee or they are of a generic nature. If they are specific to the visitee, then they should be within the visitee class and not in a catch-all visitor class, otherwise if they are of a general nature they might as well be moved to the superclass (or a utility class if they don't belong there). There is really no need for a visitor pattern. > > > >>It makes the system brittle. The visitor class has a method defined for >>every subclass of visited object. Adding, removing or refactoring any >>visited class breaks the visitors. >> >> > >A valid point indeed. To counter, in our imlpementation, there is a lot of >duplication, and quite a few classes really don't need their own toHTML() >method. The parent class - HTMLTag, should take care of single tags. Those >tags with children always have duplicated code to recurse through their >children. We can refactor such functionality out, so that, tags with >children conform to a recursion interface. Using such an interface, you wont >have the recursion loops in each and every node class, but you could handle >it uniformly outside. So, the visitor does not need to take care of each and >every subclass. It will only take care of some special ones only, while >visiting the rest as a simple HTMLTag. > > This The duplication should be refactored to superclasses. You may need is subclass of HTMLTag that has a 'container' interface, but why not put that interface on HTMLTag itself and have a default implementation that is an empty container. Then FORM and FRAMESET and other container tags derive from an abstract class (HTMLContainerTag ?) that derives from HTMLTag but overrides this behaviour and provides real recursion support. > > >>It's obscure. Looking at the visit loop tells you nothing of what is >>being perfomed. The only indication of the task might be the name of >>the visitor class. >> >> > >That is again not a bad thing - for well-intentioned naming reduces the need >for excessive documentation. You should be able to look at a class name and >have a fairly good idea of what it does. A look at the API of the class >would confirm your guess. > >At the same time, looking at the visit loop above, you'd see that it looks >intuitive enough - the loops is not in the visitor, it is still in the >client code. > > But the StringBuffer is now hidden and it's not obvious that a getResult() method call returns it's contents, so you have an indirection to look up the visitor class. > > >>It's redundant. In every example I've seen (I've never seen it used in >>real applications), the result could have been coded in a simpler >>fashion with correct class design. >> >> > >This is the point we're really thinking about. I've personally been thinking >on this for some months - for we did have realistic requirements of being >able to modify links and images in a ripper, in a uniform manner. In fact, >the current version of the parser already has a visitor. The Sample Programs >makes not mention of the word "visitor" to avoid controversy and confusion >:), but if you look at >http://htmlparser.sourceforge.net/samples/ripper.html - this is what's >happening. I must add that although this visitor does provide a solution, I >wasn't very happy with it. We've been able to refactor this out, and provide >a uniform mechanism now. > > That's not really a visitor example, that's a callback, which is obviated if the toHTML(URL) signature is used. > > >>I would only use the visitor pattern if either the set of public methods >>on the visited nodes completely specify their possible behavior and >>allowed the client to fully manipulate them at the highest level of >>abstraction (i.e. everything you can do to the visited classes is >>already coded and you don't want them to change) >> >> > >I think we're talking the same language here - this is also our vision. The >code you see on the samples page is an example of that (although not a very >good one). We're working on those lines, and the first step was to introduce >such levels of abstraction. > >Regards, >Somik > > > >------------------------------------------------------- >This sf.net email is sponsored by:ThinkGeek >Welcome to geek heaven. >http://thinkgeek.com/sf >_______________________________________________ >Htmlparser-developer mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > > |