Re: [Htmlparser-developer] toPlainTextString() feedback requested

Sam wrote:
> The Visitor pattern sounds interesting, and I look forward to hearing
> more about it.  However, duplicated code itself is not IMO necessarily
> an evil. It all depends on whether one thinks that the duplicated
> components are going to diverge in functionality in the future.  If you
> are sure they are not, then fine, refactor away.

At the moment, Somik and I speculate that 40% to 50% of the current html
parser code base is pure fat -- unnecessary code that only serves to bloat
the code base.  Duplicate code, in subtle or not-so-subtle versions, is
mostly responsible for the bloat.  In my experience in industry, I'd say
that 95% of the time, duplicate code is bad.

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
>  For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The html parser is in sore need of refactoring, Sam.  In fact, sometimes code
becomes unnecessarily complex, in which case people *really* need
documentation.  I confront such a problem by first asking if the code could
be simplified so we didn't need so much documentation.   In the event that
we do need docs, it is best to write executable documentation.  Are you
familiar with executable documentation?

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".

Software becomes brittle and bloated under the philosophy of "if it ain't
broke, don't fix it."   We have clients with code bases that are 2 million
lines of "working" spaghetti code.  They need lots of help to learn to do
continuous refactoring.

> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date.   I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

I'm not content to live with the world as it is, Sam.  If something isn't
easy, it's wrong.  The best software in the world is easy to use, it's
self-explanatory.  We should always strive for that.  In the meantime, if
you have tasks to complete and don't know how to do them because of lack of
docs, I'd suggest you ask questions here.  The best result will be
executable documentation for the html parser.

> >>Is there some efficiency reason why you want to refactor these methods
> >>or is it just for neatness?
> >>
> >>
> >
> >Duplication removal is reason #1.
> >
> As I mention above.  One should be careful of duplication removal for
> the sake of it.

And as I mentioned above, I completely disagree with you.  I wonder, how
much refactoring have you done in your career given your philosophy of "if it
ain't broke, don't fix it?"   Have you read Martin Fowler's landmark book,
Refactoring?   If not, I'd suggest you study it thoroughly - you'll be a
better programmer for it.

> > Removal of hard-coded logic is reason #2.
> >
> This is a good reason.  However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier.  I would feel better if the current set up
> was more fully documented before more complexity was added.

As I also said above, if it ain't simple, it's wrong.  Our changes will not
add complexity - that would be foolish.

> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
>  This will allow people who have yet to grasp the Visitor  pattern
> something to work with.

People can always call deprecated methods.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once.  You say yourself that the Visitor pattern takes some getting used
> to.  I think the whole scanner concept takes some getting used to ....

Our Visitor implementation is so trivial that folks won't even know they're
using the pattern.
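
To make that concrete, here is a minimal sketch of the shape being discussed.
Every class and method name below is hypothetical and chosen only for
illustration; none of it is the htmlparser API, and only the name
toPlainTextString() comes from this thread.  It assumes a node tree where each
node exposes one accept() method, with concrete visitors doing the data
gathering, and a convenience method that simply delegates to a visitor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the htmlparser API.
interface NodeVisitor {
    void visitText(TextNode text);
    void visitTag(TagNode tag);
}

abstract class Node {
    // The one method every node exposes for gathering or altering data.
    abstract void accept(NodeVisitor visitor);

    // Convenience shortcut: callers never have to see a visitor.
    String toPlainTextString() {
        PlainTextVisitor visitor = new PlainTextVisitor();
        accept(visitor);
        return visitor.result();
    }
}

class TextNode extends Node {
    final String text;

    TextNode(String text) { this.text = text; }

    @Override
    void accept(NodeVisitor visitor) { visitor.visitText(this); }
}

class TagNode extends Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    TagNode(String name) { this.name = name; }

    // Returns this tag so that children can be added in a chain.
    TagNode add(Node child) {
        children.add(child);
        return this;
    }

    @Override
    void accept(NodeVisitor visitor) {
        visitor.visitTag(this);
        for (Node child : children) {
            child.accept(visitor);
        }
    }
}

// Accumulates the text of every TextNode in document order.
class PlainTextVisitor implements NodeVisitor {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void visitText(TextNode text) { out.append(text.text); }

    @Override
    public void visitTag(TagNode tag) { /* tags contribute no text */ }

    String result() { return out.toString(); }
}

// A second operation added without touching the node classes.
class TagCountVisitor implements NodeVisitor {
    private int count;

    @Override
    public void visitText(TextNode text) { /* not a tag */ }

    @Override
    public void visitTag(TagNode tag) { count++; }

    int count() { return count; }
}
```

In this sketch a caller who only wants text calls toPlainTextString() and never
sees the pattern, while a new operation such as counting tags is one small
class rather than another method on every node type.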

> >Simplicity is reason #3: there is little reason to fatten the interfaces of
> >tag and node classes with various data accumulation/alteration methods when
> >one method and a variety of concrete Visitors can do the job with much less
> >code.
> >
> well I would agree if you could guarantee that there will be no
> divergence whatsoever in how the different methods will be used.  If you
> can create a flexible enough implementation of the Visitor pattern then
> I guess that will support any possible divergence in the separate
> methods. However, I think there is a reason to have a fatter interface,
> in that convenience methods lower the barrier to entry for new users.

Wouldn't it be marvelous if the barrier to entry was low and our code wasn't
bloatware?

> Perhaps ideally one has a well implemented Visitor pattern that supports
> a raw method access, and a number of convenience methods?

The changes will make the code easier to use - if they don't, the really
cool thing about software is that it is soft - we can change it.

> A well implemented Visitor pattern will, I assume, support all sorts of
> different operations, but I would feel much happier if the htmlparser
> had a complete javadoc and documentation review before any refactoring
> took place.  People are trying to use the existing system and having
> trouble not because of the lack of refactoring, but a lack of well
> described methods.  Well I say people, I mean me, I don't know if anyone
> else feels the same.  Maybe it's just me :-)

Executable documentation is a term we use for Customer Tests.   A Customer
Test shows how a body of code gets used to perform real tasks.   You have
real-world tasks to perform.  We can perform those tasks via automated
tests.   Then, when we refactor, we must make sure our executable
documentation is up to date (i.e. it passes its tests).   We find this ideal
for making sure that documentation reflects what the code is actually doing.
This also leaves room for a document that gives some graphical
representation of how things work, which must be maintained by someone, lest
it get out of date.  We just minimize the need for such documents by keeping
our code simple and small and surrounding it with easy to understand
executable documentation.
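
As a rough illustration of what such a Customer Test can look like, here is a
minimal sketch in plain Java with no test framework.  TinyTextExtractor is a
hypothetical stand-in for the code under test, not an htmlparser class, and it
assumes well-formed tags with no '<' or '>' inside the text itself.  Each check
names a real-world task and shows the call that performs it.

```java
// Hypothetical stand-in for the code under test; not the htmlparser API.
class TinyTextExtractor {
    // Copies every character that sits outside a <...> marker.
    static String plainText(String html) {
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                insideTag = true;
            } else if (c == '>') {
                insideTag = false;
            } else if (!insideTag) {
                out.append(c);
            }
        }
        return out.toString();
    }
}

// Each check reads as documentation: the task, the call, the expected result.
class PlainTextCustomerTest {
    static void check(String task, String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError(
                task + ": expected \"" + expected + "\" but got \"" + actual + "\"");
        }
    }

    // Throws AssertionError on the first task that no longer works.
    static void runAll() {
        check("extract the text of a paragraph with bold markup",
              "Hello, world",
              TinyTextExtractor.plainText("<p>Hello, <b>world</b></p>"));
        check("text without markup comes back unchanged",
              "plain",
              TinyTextExtractor.plainText("plain"));
    }
}
```

If a refactoring changes how the task is performed, runAll() fails until the
test is updated, so this kind of documentation cannot silently drift away from
what the code actually does.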

best regards
jk