Thread: [Htmlparser-developer] toPlainTextString() feedback requested

[Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 02:20:37

Hi Folks,
  We are performing a series of refactorings, which
will involve merging of the functionality of
toPlainTextString() and toHTML(HTMLRenderer), and
perhaps, toHTML() itself.

  We'd like to know user stories of how you've been
using toPlainTextString().

Do you enumerate through the nodes and :
[1] collect all the text data using
toPlainTextString() into a buffer? 

[2] make calls to toPlainTextString() but don't
collect the data ? 

[3] collect all text data using toPlainTextString()
into a buffer, but only after having processed the
string in some way ? 

Another question - if we remove toPlainTextString()
[and provide you an alternative], will it hurt, or
would you prefer a deprecation step (in 1.3) before
removal (in 1.4) ?

It would be good for us to know more details about
your applications, so that we can perform our design
modifications correctly. Feel free to tell us your
stories in detail. (Remember to click Reply All so the
replies go to both lists).

Regards,
Somik
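
The three usage patterns asked about above can be sketched roughly as follows. This is only an illustration under stated assumptions: PlainTextNode and SimpleTextNode are hypothetical stand-ins, not htmlparser classes, and only the method name toPlainTextString() is taken from the thread.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: PlainTextNode and SimpleTextNode are hypothetical stand-ins,
// not htmlparser classes. Only the name toPlainTextString() is from the thread.
class PlainTextUsageSketch {

    interface PlainTextNode {
        String toPlainTextString();
    }

    static class SimpleTextNode implements PlainTextNode {
        private final String text;
        SimpleTextNode(String text) { this.text = text; }
        public String toPlainTextString() { return text; }
    }

    // [1] Collect all text data into a buffer, unmodified.
    static String collect(List<PlainTextNode> nodes) {
        StringBuilder buffer = new StringBuilder();
        for (PlainTextNode node : nodes) {
            buffer.append(node.toPlainTextString());
        }
        return buffer.toString();
    }

    // [2] Call toPlainTextString() without collecting, e.g. for debug output.
    static void debugPrint(List<PlainTextNode> nodes) {
        for (PlainTextNode node : nodes) {
            System.err.println("node text: " + node.toPlainTextString());
        }
    }

    // [3] Process each string (here: trim it) before collecting it.
    static String collectProcessed(List<PlainTextNode> nodes) {
        StringBuilder buffer = new StringBuilder();
        for (PlainTextNode node : nodes) {
            String processed = node.toPlainTextString().trim();
            if (!processed.isEmpty()) {
                buffer.append(processed).append('\n');
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        List<PlainTextNode> nodes = Arrays.asList(
                new SimpleTextNode("  Hello "),
                new SimpleTextNode("world  "));
        System.out.println(collect(nodes));
        debugPrint(nodes);
        System.out.print(collectProcessed(nodes));
    }
}
```

Pattern [3] is the one most affected by any merge of the methods, since the caller needs access to each string before it reaches the buffer.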


Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-24 02:39:57

Hi Somik,

I tend to use toPlainTextString() in all three ways described below.  I 
use the first to generate a plain text overview, the second for 
debugging statements, and the final one for when I want to add my own 
additional formatting to the final text.

Could you explain why you want to refactor these methods?  Remember the 
danger of premature refactoring ... you lose flexibility that then has 
to be re-added later on, making more work in the long run.

Is there some efficiency reason why you want to refactor these methods 
or is it just for neatness?

Either way, I would certainly recommend a deprecation step.

CHEERS> SAM

Somik Raha wrote:

>Hi Folks,
>  We are performing a series of refactorings, which
>will involve merging of the functionality of
>toPlainTextString() and toHTML(HTMLRenderer), and
>perhaps, toHTML() itself..
>
>  We'd like to know user stories of how you've been
>using toPlainTextString().
>
>Do you enumerate through the nodes and :
>[1] collect all the text data using
>toPlainTextString() into a buffer? 
>
>[2] make calls to toPlainTextString() but don't
>collect the data ? 
>
>[3] collect all text data using toPlainTextString()
>into a buffer, but only after having processed the
>string in some way ? 
>
>Another question - if we remove toPlainTextString()
>[and provide you an alternative], will it hurt, or
>would you prefer a deprecation step (in 1.3) before
>removal (in 1.4) ?
>
>It would be good for us to know more details about
>your applications, so that we can perform our design
>modifications correctly. Feel free to tell us your
>stories in detail. (Remember to click Reply All so the
>replies go to both lists).
>
>Regards,
>Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 03:37:22

Hi Sam,

> Could you explain why you want to refactor these
> methods?  Remember the 
> danger of premature refactoring ... you lose
> flexibility that then has 
> to be re-added later on, making more work in the
> long run.

There is so much duplication in the parser that it
seems only Merciless_Refactoring can help clean up
the code.
 
> Is there some efficiency reason why you want to
> refactor these methods 
> or is it just for neatness?

Readability and efficiency are important factors - far
more important is to have a simple way of getting data
from the parser. The idea is to pass in your own
visitors into the parser, which could either extract
text, or html. If you need to modify them, then you
can modify these visitors to customize behaviour. The
weight of the parser would reduce, code duplication
would not be there, and it would be easier to
customize.

> Either way, I would certainly recommend a
> deprecation step.

Will keep that in mind.

Cheers,
Somik


Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Joshua K. <jo...@in...> - 2002-12-24 05:40:22

> Could you explain why you want to refactor these methods?  Remember the
> danger of premature refactoring ... you lose flexibility that then has
> to be re-added later on, making more work in the long run.

Hi Sam,

There's a good deal of duplicate code in the way the two toHTML methods and the
toPlainTextString method do their work.  The central theme is information
accumulation/alteration.  That involves outputting tag and node results and
recursing through tags.  The refactoring to Visitor allows us to

* remove many lines of duplicate code, spread across many classes
* remove hard-coded accumulation/alteration logic, thereby making it easier
for clients to get the data they need

Visitor takes some getting used to.  I rarely use the pattern. In this case,
IMO, it was a good fit.

> Is there some efficiency reason why you want to refactor these methods
> or is it just for neatness?

Duplication removal is reason #1.  Removal of hard-coded logic is reason #2.
Simplicity is reason #3: there is little reason to fatten the interfaces of
tag and node classes with various data accumulation/alteration methods when
one method and a variety of concrete Visitors can do the job with much less
code.

best regards
jk
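
The "one method and a variety of concrete Visitors" idea described above can be sketched as follows. This is a minimal illustration, not the htmlparser implementation: NodeVisitor, Node, TextNode, TagNode, and both visitor classes are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: every type here is hypothetical and does not reproduce the
// htmlparser API. It illustrates one accept() method plus concrete visitors.
class VisitorSketch {

    interface NodeVisitor {
        void visitText(TextNode node);
        void visitTagStart(TagNode tag);
        void visitTagEnd(TagNode tag);
    }

    // The single method each node exposes instead of several output methods.
    interface Node {
        void accept(NodeVisitor visitor);
    }

    static class TextNode implements Node {
        final String text;
        TextNode(String text) { this.text = text; }
        public void accept(NodeVisitor visitor) { visitor.visitText(this); }
    }

    static class TagNode implements Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        TagNode(String name, Node... children) {
            this.name = name;
            for (Node child : children) this.children.add(child);
        }
        // Recursion through child nodes is written once, here.
        public void accept(NodeVisitor visitor) {
            visitor.visitTagStart(this);
            for (Node child : children) child.accept(visitor);
            visitor.visitTagEnd(this);
        }
    }

    // Accumulates text only and ignores tags.
    static class TextExtractingVisitor implements NodeVisitor {
        private final StringBuilder out = new StringBuilder();
        public void visitText(TextNode node) { out.append(node.text); }
        public void visitTagStart(TagNode tag) { }
        public void visitTagEnd(TagNode tag) { }
        String result() { return out.toString(); }
    }

    // Accumulates markup and text.
    static class HtmlRenderingVisitor implements NodeVisitor {
        private final StringBuilder out = new StringBuilder();
        public void visitText(TextNode node) { out.append(node.text); }
        public void visitTagStart(TagNode tag) { out.append('<').append(tag.name).append('>'); }
        public void visitTagEnd(TagNode tag) { out.append("</").append(tag.name).append('>'); }
        String result() { return out.toString(); }
    }

    public static void main(String[] args) {
        Node tree = new TagNode("p", new TextNode("Hello "),
                new TagNode("b", new TextNode("world")));

        TextExtractingVisitor text = new TextExtractingVisitor();
        tree.accept(text);
        System.out.println(text.result());   // prints: Hello world

        HtmlRenderingVisitor html = new HtmlRenderingVisitor();
        tree.accept(html);
        System.out.println(html.result());   // prints: <p>Hello <b>world</b></p>
    }
}
```

In this sketch the accumulation logic lives entirely in the visitors, so a client wanting different output writes a new visitor rather than asking for another method on every node class.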

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-24 06:35:40

Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods?  Remember the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>>    
>>
>There's a good deal of duplicate code in the way the two toHTML methods and the
>toPlainTextString method do their work.  The central theme is information
>accumulation/alteration.  That involves outputting tag and node results and
>recursing through tags.  The refactoring to Visitor allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it easier
>for clients to get the data they need
>
>Visitor takes some getting used to.  I rarely use the pattern. In this case,
>IMO, it was a good fit.
>
The Visitor pattern sounds interesting, and I look forward to hearing 
more about it.  However, duplicated code itself is not IMO necessarily 
an evil. It all depends on whether one thinks that the duplicated 
components are going to diverge in functionality in the future.  If you 
are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring 
comes from the fact that while the htmlparser is a great piece of 
software, the javadocs and other documentation could use some attention. 
 For example, I can't find any explanation in the javadocs or otherwise 
of how the filters are supposed to work with the different scanners, or 
what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add 
"before you start fixing it, make sure your documentation is up to date".

Using the Visitor pattern may make it easier for clients to get the data 
they need, but given that the htmlparser is "working" (well it works for 
me), I would say that the more urgent issue here is making sure all the 
documentation is up to date.   I have a lot of positive things to say 
about htmlparser, so don't take it the wrong way, when I say that the 
biggest problem I've had in using it in the last few weeks is inadequate 
javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>>    
>>
>
>Duplication removal is reason #1. 
>
As I mention above.  One should be careful of duplication removal for 
the sake of it.

> Removal of hard-coded logic is reason #2.
>
This is a good reason.  However I get the feeling that introduction of 
these Visitor classes will make the system conceptually more difficult 
to use rather than easier.  I would feel better if the current set up 
was more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving 
methods like toPlainTextString() etc in place, but just making them 
short cut implementations to certain kinds of visitor-using methods. 
 This will allow people who have yet to grasp the Visitor  pattern 
something to work with.

If you are keen to see lots of people using htmlparser, I think that you 
don't want people to have to come to terms with too many new concepts at 
once.  You say yourself that the Visitor pattern takes some getting used 
to.  I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the interfaces of
>tag and node classes with various data accumulation/alteration methods when
>one method and a variety of concrete Visitors can do the job with much less
>code.
>
well I would agree if you could guarantee that there will be no 
divergence whatsoever in how the different methods will be used.  If you 
can create a flexible enough implementation of the Visitor pattern then 
I guess that will support any possible divergence in the separate 
methods. However, I think there is a reason to have a fatter interface, 
in that convenience methods lower the barrier to entry for new users.  

Perhaps ideally one has a well implemented Visitor pattern that supports 
a raw method access, and a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of 
different operations, but I would feel much happier if the htmlparser 
had a complete javadoc and documentation review before any refactoring 
took place.  People are trying to use the existing system and having 
trouble not because of the lack of refactoring, but a lack of well 
described methods.  Well I say people, I mean me, I don't know if anyone 
else feels the same.  Maybe it's just me :-)

Just my two cents.

CHEERS> SAM

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 07:01:48

Hi Sam,

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
>  For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The last few weeks, I've been only adding docs...
The Sample Programs are new - you will find an example of using filters here:
http://htmlparser.sourceforge.net/samples/linksEmbedded.html

From the javadoc :
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/HTMLNode.html
(check collectInto)

I guess more javadoc should be written about this, but I didn't bother to
write it because the filters were mostly used for demonstration from the
command line. When you type java -jar htmlparser.jar, you would see a help
menu of each filter.

It's only with the introduction of Collection Parameter that we've been using
filter strings for actual collection of data, and that has been documented.
If it's not important, it's not documented :).

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".
> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date.   I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

Sam - feel free to post as often as you like on the list. I'd be glad to
help you out. The lack of adequate documentation is of course my
responsibility and in light of my explanation above, if you can suggest
other areas of inadequate docs, I'll look into it.

> >Duplication removal is reason #1.
> >
> As I mention above.  One should be careful of duplication removal for
> the sake of it.
>
> > Removal of hard-coded logic is reason #2.
> >
> This is a good reason.  However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier.  I would feel better if the current set up
> was more fully documented before more complexity was added.
>
> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
>  This will allow people who have yet to grasp the Visitor  pattern
> something to work with.

Don't worry, we'll leave toPlainTextString() alone - in fact, we've already
begun doing the short cut implementations. :)
The purpose of my initial mail was not to alarm you, but to know more about
realistic "customer" stories.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once.  You say yourself that the Visitor pattern takes some getting used
> to.  I think the whole scanner concept takes some getting used to ....
>

By the way, have you gone through the Sample Programs? I should've thought that
this new addition would make life very simple. If not, we'd probably need
more docs.

Regards,
Somik
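
The "short cut implementations" mentioned above could look something like the following sketch, in which a convenience method is kept but delegates to a visitor. All types here (TextVisitor, TextCollector, Node, TextNode, CompositeNode) are hypothetical and chosen for illustration; the thread does not show the actual htmlparser code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical types, not htmlparser's actual classes.
class ShortcutSketch {

    interface TextVisitor {
        void visitText(String text);
    }

    static class TextCollector implements TextVisitor {
        private final StringBuilder out = new StringBuilder();
        public void visitText(String text) { out.append(text); }
        String result() { return out.toString(); }
    }

    static abstract class Node {
        // General-purpose entry point for any visitor.
        abstract void accept(TextVisitor visitor);

        // Convenience method kept for existing callers: a thin wrapper
        // that creates a visitor, runs it, and returns its result.
        String toPlainTextString() {
            TextCollector collector = new TextCollector();
            accept(collector);
            return collector.result();
        }
    }

    static class TextNode extends Node {
        private final String text;
        TextNode(String text) { this.text = text; }
        void accept(TextVisitor visitor) { visitor.visitText(text); }
    }

    static class CompositeNode extends Node {
        private final List<Node> children = new ArrayList<>();
        CompositeNode(Node... children) {
            for (Node child : children) this.children.add(child);
        }
        void accept(TextVisitor visitor) {
            for (Node child : children) child.accept(visitor);
        }
    }

    public static void main(String[] args) {
        Node page = new CompositeNode(new TextNode("Hello "), new TextNode("world"));
        // Callers who never heard of visitors keep using the old method name.
        System.out.println(page.toPlainTextString());   // prints: Hello world
    }
}
```

With this shape, existing callers are unaffected, while callers who need custom accumulation pass their own visitor to accept().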

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Joshua K. <jo...@in...> - 2002-12-27 18:33:44

Sam wrote:
> The Visitor pattern sounds interesting, and I look forward to hearing
> more about it.  However, duplicated code itself is not IMO necessarily
> an evil. It all depends on whether one thinks that the duplicated
> components are going to diverge in functionality in the future.  If you
> are sure they are not, then fine, refactor away.

At the moment, Somik and I speculate that 40% to 50% of the current html
parser code base is pure fat -- unnecessary code that only serves to bloat
the code base.  Duplicate code, in subtle or not-so-subtle versions, is
mostly responsible for the bloat.  In my experience in industry, I'd say
that 95% of the time, duplicate code is bad.

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
>  For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The html parser is in sore need of refactoring, Sam.  In fact, sometimes code
becomes unnecessarily complex, in which case people *really* need
documentation.  I confront such a problem by first asking if the code could
be simplified so we didn't need so much documentation.   In the event that
we do need docs, it is best to write executable documentation.  Are you
familiar with executable documentation?

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".

Software becomes brittle and bloated under the philosophy of "if it ain't
broke, don't fix it."   We have clients with code bases that are 2 million
lines of "working" spaghetti code.  They need lots of help to learn to do
continuous refactoring.

> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date.   I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

I'm not content to live with the world as it is, Sam.  If something isn't
easy, it's wrong.  The best software in the world is easy to use, it's
self-explanatory.  We should always strive for that.  In the meantime, if
you have tasks to complete and don't know how to do them because of lack of
docs, I'd suggest you ask questions here.  The best result will be
executable documentation for the html parser.

> >>Is there some efficiency reason why you want to refactor these methods
> >>or is it just for neatness?
> >>
> >>
> >
> >Duplication removal is reason #1.
> >
> As I mention above.  One should be careful of duplication removal for
> the sake of it.

And as I mentioned above, I completely disagree with you.  I wonder, how
much refactoring have you done in your career given your philosophy of "if it
ain't broke, don't fix it?"   Have you read Martin Fowler's landmark book,
Refactoring?   If not, I'd suggest you study it thoroughly - you'll be a
better programmer for it.

> > Removal of hard-coded logic is reason #2.
> >
> This is a good reason.  However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier.  I would feel better if the current set up
> was more fully documented before more complexity was added.

As I also said above, if it ain't simple, it's wrong.  Our changes will not
add complexity - that would be foolish.

> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
>  This will allow people who have yet to grasp the Visitor  pattern
> something to work with.

People can always call deprecated methods.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once.  You say yourself that the Visitor pattern takes some getting used
> to.  I think the whole scanner concept takes some getting used to ....

Our Visitor implementation is so trivial that folks won't even know they're
using the pattern.

> >Simplicity is reason #3: there is little reason to fatten the interfaces of
> >tag and node classes with various data accumulation/alteration methods when
> >one method and a variety of concrete Visitors can do the job with much less
> >code.
> >
> well I would agree if you could guarantee that there will be no
> divergence whatsoever in how the different methods will be used.  If you
> can create a flexible enough implementation of the Visitor pattern then
> I guess that will support any possible divergence in the separate
> methods. However, I think there is a reason to have a fatter interface,
> in that convenience methods lower the barrier to entry for new users.

Wouldn't it be marvelous if the barrier to entry was low and our code wasn't
bloatware?

> Perhaps ideally one has a well implemented Visitor pattern that supports
> a raw method access, and a number of convenience methods?

The changes will make the code easier to use - if they don't, the really
cool thing about software is that it is soft - we can change it.

> A well implemented Visitor pattern will, I assume, support all sorts of
> different operations, but I would feel much happier if the htmlparser
> had a complete javadoc and documentation review before any refactoring
> took place.  People are trying to use the existing system and having
> trouble not because of the lack of refactoring, but a lack of well
> described methods.  Well I say people, I mean me, I don't know if anyone
> else feels the same.  Maybe it's just me :-)

Executable documentation is a term we use for Customer Tests.   A Customer
Test shows how a body of code gets used to perform real tasks.   You have
real-world tasks to perform.  We can perform those tasks via automated
tests.   Then, when we refactor, we must make sure our executable
documentation is up to date (i.e. it passes its tests).   We find this ideal
for making sure that documentation reflects what the code is actually doing.
This also leaves room for a document that gives some graphical
representation of how things work, which must be maintained by someone, lest
it get out of date.  We just minimize the need for such documents by keeping
our code simple and small and surrounding it with easy to understand
executable documentation.

best regards
jk
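
A Customer Test in the sense described above might look like the following sketch. The assumptions are significant: stripTags is a toy stand-in for the parser, not htmlparser code, and the check helper replaces a test framework only so that the example is self-contained.

```java
// Sketch only: stripTags is a toy stand-in for the parser, and check()
// replaces a test framework so the example stays self-contained.
class CustomerTestSketch {

    // Toy implementation under test: drops anything between '<' and '>'.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                insideTag = true;
            } else if (c == '>') {
                insideTag = false;
            } else if (!insideTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    static void check(String task, String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError(task + ": expected [" + expected
                    + "] but got [" + actual + "]");
        }
    }

    // Each test is named after a real user task and shows how to perform it.
    static void testUserCanExtractTextFromAParagraph() {
        check("paragraph text", "Hello world",
                stripTags("<p>Hello <b>world</b></p>"));
    }

    static void testUserCanExtractLinkText() {
        check("link text", "home page",
                stripTags("<a href=\"index.html\">home page</a>"));
    }

    public static void main(String[] args) {
        testUserCanExtractTextFromAParagraph();
        testUserCanExtractLinkText();
        System.out.println("All customer tests passed");
    }
}
```

If a refactoring changes behaviour, a test like this fails, which is what keeps this kind of documentation from drifting out of date.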

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 04:02:55

Joshua Kerievsky wrote:

>Sam wrote:
>  
>
>>As I mention above.  One should be careful of duplication removal for
>>the sake of it.
>>    
>>
>
>And as I mentioned above, I completely disagree with you.  I wonder, how
>much refactoring have you done in your career given your philosophy of "if it
>ain't broke, don't fix it?"   Have you read Martin Fowler's landmark book,
>Refactoring?   If not, I'd suggest you study it thoroughly - you'll be a
>better programmer for it.
>
I must say that I have yet to be convinced of the value of continuous 
refactoring.  I think you can have too much of a good thing.  I refactor 
a lot in my code.  I am constantly looking at how to design things more 
efficiently, how to implement a number of different pieces of code in a 
smaller, tidier way, and also I am well aware that too much refactoring 
can be dangerous.  In your haste to refactor everything, consider that 
this is only the latest in a long line of computing fads, and none of 
them is a cure-all.  Refactoring is just another technique, like OO, or 
unit testing.  It will only help you if you use it in moderation.  Once you 
internalise this, you'll be a better programmer. Trust me.

In my 15 years of coding experience I have seen projects collapse both 
from too much refactoring and too little.  To believe that you will 
improve your project purely as a consequence of refactoring is just 
plain wrong in my experience.  I am not saying that you shouldn't 
refactor in this case, I'm just saying that I think the effort spent on 
refactoring would be better spent on first improving the documentation. 
 Somik replied to my issues about documentation saying that he had been 
working on it and asking me to point out other places where the 
documentation could be improved.  I don't have time right now to go 
through and make those points, but as a *user* of htmlparser I would 
appreciate it if you would leave off any refactoring until you have gone 
through the entire code base and expanded every javadoc comment so that 
it really described what the method was doing in a manner understandable 
to someone unfamiliar with the code base.  I think that if you direct 
your effort here first before refactoring you will have a greater 
increase in users in the long run.

Don't get me wrong, I am not against refactoring, and I use the term "if 
it ain't broke don't fix it" partly in jest.  My main point was about 
"documenting before fixing".  I don't believe that any refactoring that 
you do will replace the need for good documentation, for the code base 
that *is already deployed and in use*.

I am asking you as a *user* of htmlparser to focus on the documentation 
(and release a version including that latest documentation) before 
moving on with your refactoring.  Remember it is dangerous for a 
software project to ignore the needs of its users.

CHEERS> SAM

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 04:23:04

Hi,

I realise that unless I get specific on the issue of documentation I am 
unlikely to be taken seriously.

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/HTMLTag.html

is from the javadocs. Here are some of its javadoc comments:

Hashtable getParsed()
          "Gets the parsed"
static java.util.Vector getStrictTags()
          "Gets the strictTags"
java.lang.String getTagLine()
          "Returns the tagLine"
java.lang.String getTagName()
          ""

Personally I would like to see a good couple of lines of text for each 
of these.  Some are more important than others.  I mean getTagName might 
seem self explanatory, but I would like to see more, e.g.

Hashtable getParsed()
          "Gets a hashtable of attribute-value pairs from the html tag,
          e.g. type-->hidden, name-->address, value-->default"
static java.util.Vector getStrictTags()
          "<not sure what strict tags are, would be nice if there was a
          description>"
java.lang.String getTagLine()
          "Returns the line from the document that this tag was found on?"
java.lang.String getTagName()
          "get the name of the tag itself, e.g. BODY, TR, TABLE etc."

Personally I found the lack of javadocs on getParsed() particularly 
frustrating.  I had to go into the code in some depth to find out what 
it was, and that it was the thing I needed.  Naturally the beauty of 
open source is that I can actually do that.  But I would have preferred 
to have discovered the information I needed from the javadocs.  And I 
would prefer as a *user* that you would expend your efforts updating the 
documentation on the existing code base rather than refactoring.  

CHEERS> SAM
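
In source form, the fuller comments requested above might read like the following sketch. ExampleTag is a hypothetical class, not htmlparser's HTMLTag, and the documented behaviour follows the guesses in the message above rather than anything verified against the library. getStrictTags() and getTagLine() are left out because the message itself is unsure what they return.

```java
import java.util.Hashtable;

// Sketch only: ExampleTag is a hypothetical class, not htmlparser's HTMLTag.
// The documented behaviour follows the guesses in the message above and
// has not been checked against the library.
class JavadocSketch {

    static class ExampleTag {
        private final String tagName;
        private final Hashtable<String, String> attributes;

        ExampleTag(String tagName, Hashtable<String, String> attributes) {
            this.tagName = tagName;
            this.attributes = attributes;
        }

        /**
         * Returns the attribute-value pairs parsed from this tag.
         * For <code>&lt;input type="hidden" name="address"&gt;</code> the
         * table maps <code>type</code> to <code>hidden</code> and
         * <code>name</code> to <code>address</code>.
         *
         * @return a table keyed by attribute name, as passed to the constructor
         */
        Hashtable<String, String> getParsed() {
            return attributes;
        }

        /**
         * Returns the name of the tag itself, for example BODY, TR or TABLE.
         *
         * @return the tag name as passed to the constructor
         */
        String getTagName() {
            return tagName;
        }
    }

    public static void main(String[] args) {
        Hashtable<String, String> attributes = new Hashtable<>();
        attributes.put("type", "hidden");
        attributes.put("name", "address");
        ExampleTag tag = new ExampleTag("INPUT", attributes);
        System.out.println(tag.getTagName() + " type=" + tag.getParsed().get("type"));
    }
}
```

Each comment states what is returned and gives one concrete example, which is the level of detail the message asks for.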

Sam Joseph wrote:

> Somik replied to my issues about documentation saying that he had been 
> working on it and asking me to point out other places where the 
> documentation could be improved.  I don't have time right now to go 
> through and make those points,