RE: [Htmlparser-developer] toPlainTextString() feedback requested


Hi,

I agree with Sam when he says "if it's not broken, don't fix it".
But at the same time I see the need to make code better, more readable
and simpler. However, as suggested below, it seems that the Visitor
pattern is going to make things harder to understand. Like Sam, I too
had a problem with HTMLParser initially. It may be down to the
documentation, but this whole tag-scanner thing was extremely confusing
in the beginning (though I have since come to like scanners so much that
I have even written a few myself without any problem). Hence I think
that even if the Visitor pattern is used, the user must have some
simple, easy-to-use methods to get his work done rather than having to
understand the Visitor pattern first.

Just like Sam, this is my two cents.

Cheers,
Dhaval

-----Original Message-----
From: gaijin [mailto:ga...@yh...]
Sent: Tuesday, December 24, 2002 12:21 PM
To: htmlparser-developer
Cc: gaijin; htmlparser-user
Subject: Re: [Htmlparser-developer] toPlainTextString() feedback requested

Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods?  Remember the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>>
>There's a good deal of duplicate code in the way the two toHTML methods
>and the toPlainTextString method do their work.  The central theme is
>information accumulation/alteration.  That involves outputting tag and
>node results and recursing through tags.  The refactoring to Visitor
>allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it
>  easier for clients to get the data they need
>
>Visitor takes some getting used to.  I rarely use the pattern. In this
>case, IMO, it was a good fit.
>
The Visitor pattern sounds interesting, and I look forward to hearing 
more about it.  However, duplicated code itself is not IMO necessarily 
an evil. It all depends on whether one thinks that the duplicated 
components are going to diverge in functionality in the future.  If you 
are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring 
comes from the fact that while the htmlparser is a great piece of 
software, the javadocs and other documentation could use some attention.

For example, I can't find any explanation in the javadocs or otherwise 
of how the filters are supposed to work with the different scanners, or 
what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add 
"before you start fixing it, make sure your documentation is up to date".

Using the Visitor pattern may make it easier for clients to get the data
they need, but given that the htmlparser is "working" (well, it works for
me), I would say that the more urgent issue here is making sure all the
documentation is up to date.  I have a lot of positive things to say
about htmlparser, so don't take it the wrong way when I say that the
biggest problem I've had in using it in the last few weeks is inadequate
javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>>    
>>
>
>Duplication removal is reason #1. 
>
As I mention above, one should be careful about removing duplication 
for its own sake.

> Removal of hard-coded logic is reason #2.
>
This is a good reason.  However, I get the feeling that introducing 
these Visitor classes will make the system conceptually more difficult 
to use rather than easier.  I would feel better if the current setup 
were more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving 
methods like toPlainTextString() etc. in place, but just making them 
shortcut implementations that delegate to the appropriate visitor. 
This will give people who have yet to grasp the Visitor pattern 
something to work with.
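
To make that suggestion concrete, here is a minimal sketch in Java of 
what I have in mind.  Every name in it (VisitorSketch, Node, TextNode, 
TagNode, NodeVisitor, PlainTextVisitor) is made up for illustration; I 
am not claiming this is how htmlparser is or will be structured.  The 
only point is that toPlainTextString() can survive as a one-call 
shortcut on top of a visitor.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {

    /** Hypothetical visitor interface; NOT the actual htmlparser API. */
    interface NodeVisitor {
        void visitText(TextNode text);
        void visitTag(TagNode tag);
    }

    /** Hypothetical base class for parsed nodes. */
    static abstract class Node {
        /** The single "raw" entry point that every visitor goes through. */
        abstract void accept(NodeVisitor visitor);

        /** Convenience shortcut: callers never need to see the visitor. */
        String toPlainTextString() {
            PlainTextVisitor visitor = new PlainTextVisitor();
            accept(visitor);
            return visitor.getText();
        }
    }

    /** A run of plain text. */
    static final class TextNode extends Node {
        final String text;

        TextNode(String text) {
            this.text = text;
        }

        @Override
        void accept(NodeVisitor visitor) {
            visitor.visitText(this);
        }
    }

    /** A tag that may contain child nodes. */
    static final class TagNode extends Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        TagNode(String name, Node... children) {
            this.name = name;
            for (Node child : children) {
                this.children.add(child);
            }
        }

        @Override
        void accept(NodeVisitor visitor) {
            visitor.visitTag(this);
            // The recursion through child nodes lives in exactly one place.
            for (Node child : children) {
                child.accept(visitor);
            }
        }
    }

    /** Accumulates the text content and ignores the tags themselves. */
    static final class PlainTextVisitor implements NodeVisitor {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void visitText(TextNode text) {
            out.append(text.text);
        }

        @Override
        public void visitTag(TagNode tag) {
            // Tags contribute nothing to the plain-text output.
        }

        String getText() {
            return out.toString();
        }
    }

    public static void main(String[] args) {
        Node doc = new TagNode("p",
                new TextNode("Hello, "),
                new TagNode("b", new TextNode("world")));
        System.out.println(doc.toPlainTextString()); // prints "Hello, world"
    }
}
```

A newcomer only ever calls doc.toPlainTextString(), while someone who 
needs different output writes their own NodeVisitor and passes it to 
accept().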

If you are keen to see lots of people using htmlparser, I think that you
don't want people to have to come to terms with too many new concepts at
once.  You say yourself that the Visitor pattern takes some getting used
to.  I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the
>interfaces of tag and node classes with various data
>accumulation/alteration methods when one method and a variety of
>concrete Visitors can do the job with much less code.
>
Well, I would agree if you could guarantee that there will be no
divergence whatsoever in how the different methods will be used.  If you
can create a flexible enough implementation of the Visitor pattern, then
I guess that will support any possible divergence in the separate
methods. However, I think there is a reason to have a fatter interface,
in that convenience methods lower the barrier to entry for new users.

Perhaps ideally one has a well implemented Visitor pattern that supports
raw access through a single method, plus a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of
different operations, but I would feel much happier if the htmlparser
had a complete javadoc and documentation review before any refactoring
took place.  People are trying to use the existing system and having
trouble not because of a lack of refactoring, but because of a lack of
well described methods.  Well, I say people; I mean me. I don't know if
anyone else feels the same.  Maybe it's just me :-)

Just my two cents.

CHEERS> SAM
