Thread: [Htmlparser-developer] toPlainTextString() feedback requested
From: Somik R. <so...@ya...> - 2002-12-24 02:20:37
Hi Folks,

We are performing a series of refactorings, which will involve merging the functionality of toPlainTextString() and toHTML(HTMLRenderer), and perhaps toHTML() itself.

We'd like to know user stories of how you've been using toPlainTextString(). Do you enumerate through the nodes and:

[1] collect all the text data using toPlainTextString() into a buffer?
[2] make calls to toPlainTextString() but don't collect the data?
[3] collect all text data using toPlainTextString() into a buffer, but only after having processed the string in some way?

Another question: if we remove toPlainTextString() [and provide you an alternative], will it hurt, or would you prefer a deprecation step (in 1.3) before removal (in 1.4)?

It would be good for us to know more details about your applications, so that we can perform our design modifications correctly. Feel free to tell us your stories in detail. (Remember to click Reply All so the replies go to both lists.)

Regards,
Somik
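For readers following along, the three usage patterns asked about above might look roughly like the sketch below. PlainTextNode and UsagePatterns are made-up stand-ins (only the method name toPlainTextString() comes from the thread), and a plain List replaces whatever enumeration the parser itself provides.

```java
import java.util.List;

// Hypothetical stand-in for a parsed node; only the method name
// toPlainTextString() is taken from the thread.
interface PlainTextNode {
    String toPlainTextString();
}

class UsagePatterns {
    // [1] Collect all text into a buffer, unchanged.
    static String collect(List<PlainTextNode> nodes) {
        StringBuilder buffer = new StringBuilder();
        for (PlainTextNode node : nodes) {
            buffer.append(node.toPlainTextString());
        }
        return buffer.toString();
    }

    // [2] Call toPlainTextString() without collecting, e.g. for debug output.
    static void debugPrint(List<PlainTextNode> nodes) {
        for (PlainTextNode node : nodes) {
            System.err.println("node text: " + node.toPlainTextString());
        }
    }

    // [3] Process each string before collecting it
    // (here: trim it, skip empty results, and put each on its own line).
    static String collectProcessed(List<PlainTextNode> nodes) {
        StringBuilder buffer = new StringBuilder();
        for (PlainTextNode node : nodes) {
            String text = node.toPlainTextString().trim();
            if (!text.isEmpty()) {
                buffer.append(text).append('\n');
            }
        }
        return buffer.toString();
    }
}
```

The replies below suggest that all three patterns were in real use, which is part of why a deprecation step is requested rather than outright removal.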
From: Sam J. <ga...@yh...> - 2002-12-24 02:39:57
Hi Somik,

I tend to use toPlainTextString() in all three of the ways you describe. I use the first to generate a plain text overview, the second for debugging statements, and the final one for when I want to add my own additional formatting to the final text.

Could you explain why you want to refactor these methods? Remember the danger of premature refactoring ... you lose flexibility that then has to be re-added later on, making more work in the long run.

Is there some efficiency reason why you want to refactor these methods, or is it just for neatness?

Either way, I would certainly recommend a deprecation step.

CHEERS> SAM
From: Somik R. <so...@ya...> - 2002-12-24 03:37:22
Hi Sam,

> Could you explain why you want to refactor these
> methods? Remember the danger of premature refactoring ... you lose
> flexibility that then has to be re-added later on, making more work
> in the long run.

There seems to be so much duplication in the parser that it seems only Merciless_Refactoring can help clean the code.

> Is there some efficiency reason why you want to refactor these methods
> or is it just for neatness?

Readability and efficiency are important factors, but far more important is to have a simple way of getting data from the parser. The idea is to pass your own visitors into the parser, which could extract either text or HTML. If you need different behaviour, you can modify these visitors to customize it. The weight of the parser would be reduced, the code duplication would go away, and it would be easier to customize.

> Either way, I would certainly recommend a deprecation step.

Will keep that in mind.

Cheers,
Somik
From: Joshua K. <jo...@in...> - 2002-12-24 05:40:22
> Could you explain why you want to refactor these methods? Remember the
> danger of premature refactoring ... you lose flexibility that then has
> to be re-added later on, making more work in the long run.

Hi Sam,

There's a good deal of duplicate code in the way the two toHTML methods and the toPlainTextString method do their work. The central theme is information accumulation/alteration. That involves outputting tag and node results and recursing through tags. The refactoring to Visitor allows us to

* remove many lines of duplicate code, spread across many classes
* remove hard-coded accumulation/alteration logic, thereby making it easier for clients to get the data they need

Visitor takes some getting used to. I rarely use the pattern. In this case, IMO, it was a good fit.

> Is there some efficiency reason why you want to refactor these methods
> or is it just for neatness?

Duplication removal is reason #1. Removal of hard-coded logic is reason #2. Simplicity is reason #3: there is little reason to fatten the interfaces of tag and node classes with various data accumulation/alteration methods when one method and a variety of concrete Visitors can do the job with much less code.

best regards
jk
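To make the "one method and a variety of concrete Visitors" idea concrete, here is a minimal sketch of the pattern. Every type name in it (NodeVisitor, Node, TextNode, TagNode, TextExtractor, HtmlWriter) is invented for illustration; the thread does not show the actual htmlparser design, so this should not be read as its API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; not the htmlparser API.
interface NodeVisitor {
    void visitText(String text);
    void visitTagStart(String name);
    void visitTagEnd(String name);
}

abstract class Node {
    // The single traversal method every node exposes.
    abstract void accept(NodeVisitor visitor);
}

class TextNode extends Node {
    private final String text;
    TextNode(String text) { this.text = text; }
    void accept(NodeVisitor visitor) { visitor.visitText(text); }
}

class TagNode extends Node {
    private final String name;
    private final List<Node> children = new ArrayList<>();
    TagNode(String name, Node... kids) {
        this.name = name;
        for (Node kid : kids) children.add(kid);
    }
    void accept(NodeVisitor visitor) {
        visitor.visitTagStart(name);
        // The recursion through child nodes is written once, here.
        for (Node child : children) child.accept(visitor);
        visitor.visitTagEnd(name);
    }
}

// Accumulates text only, ignoring tags.
class TextExtractor implements NodeVisitor {
    final StringBuilder out = new StringBuilder();
    public void visitText(String text) { out.append(text); }
    public void visitTagStart(String name) { }
    public void visitTagEnd(String name) { }
}

// Accumulates text and re-emits the surrounding tags.
class HtmlWriter implements NodeVisitor {
    final StringBuilder out = new StringBuilder();
    public void visitText(String text) { out.append(text); }
    public void visitTagStart(String name) { out.append('<').append(name).append('>'); }
    public void visitTagEnd(String name) { out.append("</").append(name).append('>'); }
}
```

In this sketch the traversal lives only in TagNode.accept, while the accumulation logic lives in the visitors, so a client wanting different output writes a new visitor instead of a new method on every node class.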
From: Sam J. <ga...@yh...> - 2002-12-24 06:35:40
Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods? Remember the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>
>There's a good deal of duplicate code in the way the two toHTML methods and the
>toPlainTextString method do their work. The central theme is information
>accumulation/alteration. That involves outputting tag and node results and
>recursing through tags. The refactoring to Visitor allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it easier
>for clients to get the data they need
>
>Visitor takes some getting used to. I rarely use the pattern. In this case,
>IMO, it was a good fit.

The Visitor pattern sounds interesting, and I look forward to hearing more about it. However, duplicated code itself is not IMO necessarily an evil. It all depends on whether one thinks that the duplicated components are going to diverge in functionality in the future. If you are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring comes from the fact that while the htmlparser is a great piece of software, the javadocs and other documentation could use some attention. For example, I can't find any explanation in the javadocs or otherwise of how the filters are supposed to work with the different scanners, or what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add "before you start fixing it, make sure your documentation is up to date". Using the Visitor pattern may make it easier for clients to get the data they need, but given that the htmlparser is "working" (well it works for me), I would say that the more urgent issue here is making sure all the documentation is up to date. I have a lot of positive things to say about htmlparser, so don't take it the wrong way, when I say that the biggest problem I've had in using it in the last few weeks is inadequate javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>
>Duplication removal is reason #1.

As I mention above. One should be careful of duplication removal for the sake of it.

> Removal of hard-coded logic is reason #2.

This is a good reason. However I get the feeling that introduction of these Visitor classes will make the system conceptually more difficult to use rather than easier. I would feel better if the current set up was more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving methods like toPlainTextString() etc in place, but just making them short cut implementations to certain kinds of visitor-using methods. This will allow people who have yet to grasp the Visitor pattern something to work with.

If you are keen to see lots of people using htmlparser, I think that you don't want people to have to come to terms with too many new concepts at once. You say yourself that the Visitor pattern takes some getting used to. I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the interfaces of
>tag and node classes with various data accumulation/alteration methods when
>one method and a variety of concrete Visitors can do the job with much less
>code.

well I would agree if you could guarantee that there will be no divergence whatsoever in how the different methods will be used. If you can create a flexible enough implementation of the Visitor pattern then I guess that will support any possible divergence in the separate methods. However, I think there is a reason to have a fatter interface, in that convenience methods lower the barrier to entry for new users. Perhaps ideally one has a well implemented Visitor pattern that supports a raw method access, and a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of different operations, but I would feel much happier if the htmlparser had a complete javadoc and documentation review before any refactoring took place. People are trying to use the existing system and having trouble not because of the lack of refactoring, but a lack of well described methods. Well I say people, I mean me, I don't know if anyone else feels the same. Maybe it's just me :-)

Just my two cents.

CHEERS> SAM
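The compromise Sam proposes, raw visitor access plus convenience methods that are thin shortcuts over it, could be sketched as follows. The names TextVisitor, ParsedNode, Leaf and Composite are assumptions made up for this example; only toPlainTextString() is a method name from the thread, and its body here is a guess at what a "short cut implementation" might look like.

```java
// Hypothetical sketch: none of these types are taken from htmlparser.
interface TextVisitor {
    void visitText(String text);
}

abstract class ParsedNode {
    // Raw access, for clients who are comfortable writing visitors.
    abstract void accept(TextVisitor visitor);

    // Convenience shortcut: delegates to a small visitor internally,
    // so callers never need to know the pattern is there.
    String toPlainTextString() {
        StringBuilder buffer = new StringBuilder();
        accept(text -> buffer.append(text));
        return buffer.toString();
    }
}

class Leaf extends ParsedNode {
    private final String text;
    Leaf(String text) { this.text = text; }
    void accept(TextVisitor visitor) { visitor.visitText(text); }
}

class Composite extends ParsedNode {
    private final ParsedNode[] children;
    Composite(ParsedNode... children) { this.children = children; }
    void accept(TextVisitor visitor) {
        // Forward the visitor to each child in document order.
        for (ParsedNode child : children) child.accept(visitor);
    }
}
```

With this shape, existing callers keep calling toPlainTextString() unchanged, while the text-collecting logic exists in one place rather than in every node class.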
From: Somik R. <so...@ya...> - 2002-12-24 07:01:48
Hi Sam,

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
> For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

For the last few weeks, I've been only adding docs... The Sample Programs are new - you will find an example of using filters here:
http://htmlparser.sourceforge.net/samples/linksEmbedded.html

From the javadoc:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/HTMLNode.html
(check collectInto)

I guess more javadoc should be written about this, but I didn't bother to write it because the filters were mostly used for demonstration from the command line. When you type java -jar htmlparser.jar, you would see a help menu listing each filter. It's only with the introduction of Collection Parameter that we've been using filter strings for actual collection of data, and that has been documented. If it's not important, it's not documented :).

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".
> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date. I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

Sam - feel free to post as often as you like on the list. I'd be glad to help you out. The lack of adequate documentation is of course my responsibility and, in light of my explanation above, if you can suggest other areas of inadequate docs, I'll look into it.

> >Duplication removal is reason #1.
>
> As I mention above. One should be careful of duplication removal for
> the sake of it.
>
> > Removal of hard-coded logic is reason #2.
>
> This is a good reason. However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier. I would feel better if the current set up
> was more fully documented before more complexity was added.
>
> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
> This will allow people who have yet to grasp the Visitor pattern
> something to work with.

Don't worry, we'll leave toPlainTextString() alone - in fact, we've already begun doing the short cut implementations. :) The purpose of my initial mail was not to alarm you, but to know more about realistic "customer" stories.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once. You say yourself that the Visitor pattern takes some getting used
> to. I think the whole scanner concept takes some getting used to ....

By the way, have you gone through the Sample Programs? I would have thought that this new addition would make life very simple. If not, we'd probably need more docs.

Regards,
Somik
From: Joshua K. <jo...@in...> - 2002-12-27 18:33:44
Sam wrote:

> The Visitor pattern sounds interesting, and I look forward to hearing
> more about it. However, duplicated code itself is not IMO necessarily
> an evil. It all depends on whether one thinks that the duplicated
> components are going to diverge in functionality in the future. If you
> are sure they are not, then fine, refactor away.

At the moment, Somik and I speculate that 40% to 50% of the current html parser code base is pure fat -- unnecessary code that only serves to bloat the code base. Duplicate code, in subtle or not-so-subtle versions, is mostly responsible for the bloat. In my experience in industry, I'd say that 95% of the time, duplicate code is bad.

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
> For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The html parser is in sore need of refactoring, Sam. In fact, sometimes code becomes unnecessarily complex, in which case people *really* need documentation. I confront such a problem by first asking if the code could be simplified so we didn't need so much documentation. In the event that we do need docs, it is best to write executable documentation. Are you familiar with executable documentation?

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".

Software becomes brittle and bloated under the philosophy of "if it ain't broke, don't fix it." We have clients with code bases that are 2 million lines of "working" spaghetti code. They need lots of help to learn to do continuous refactoring.

> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date. I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

I'm not content to live with the world as it is, Sam. If something isn't easy, it's wrong. The best software in the world is easy to use; it's self-explanatory. We should always strive for that. In the meantime, if you have tasks to complete and don't know how to do them because of lack of docs, I'd suggest you ask questions here. The best result will be executable documentation for the html parser.

> >>Is there some efficiency reason why you want to refactor these methods
> >>or is it just for neatness?
> >
> >Duplication removal is reason #1.
>
> As I mention above. One should be careful of duplication removal for
> the sake of it.

And as I mentioned above, I completely disagree with you. I wonder, how much refactoring have you done in your career given your philosophy of "if it ain't broke, don't fix it?" Have you read Martin Fowler's landmark book, Refactoring? If not, I'd suggest you study it thoroughly - you'll be a better programmer for it.

> > Removal of hard-coded logic is reason #2.
>
> This is a good reason. However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier. I would feel better if the current set up
> was more fully documented before more complexity was added.

As I also said above, if it ain't simple, it's wrong. Our changes will not add complexity - that would be foolish.

> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
> This will allow people who have yet to grasp the Visitor pattern
> something to work with.

People can always call deprecated methods.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once. You say yourself that the Visitor pattern takes some getting used
> to. I think the whole scanner concept takes some getting used to ....

Our Visitor implementation is so trivial that folks won't even know they're using the pattern.

> >Simplicity is reason #3: there is little reason to fatten the interfaces of
> >tag and node classes with various data accumulation/alteration methods when
> >one method and a variety of concrete Visitors can do the job with much less
> >code.
>
> well I would agree if you could guarantee that there will be no
> divergence whatsoever in how the different methods will be used. If you
> can create a flexible enough implementation of the Visitor pattern then
> I guess that will support any possible divergence in the separate
> methods. However, I think there is a reason to have a fatter interface,
> in that convenience methods lower the barrier to entry for new users.

Wouldn't it be marvelous if the barrier to entry was low and our code wasn't bloatware?

> Perhaps ideally one has a well implemented Visitor pattern that supports
> a raw method access, and a number of convenience methods?

The changes will make the code easier to use - if they don't, the really cool thing about software is that it is soft - we can change it.

> A well implemented Visitor pattern will, I assume, support all sorts of
> different operations, but I would feel much happier if the htmlparser
> had a complete javadoc and documentation review before any refactoring
> took place. People are trying to use the existing system and having
> trouble not because of the lack of refactoring, but a lack of well
> described methods. Well I say people, I mean me, I don't know if anyone
> else feels the same. Maybe it's just me :-)

Executable documentation is a term we use for Customer Tests. A Customer Test shows how a body of code gets used to perform real tasks. You have real-world tasks to perform. We can perform those tasks via automated tests. Then, when we refactor, we must make sure our executable documentation is up to date (i.e. it passes its tests). We find this ideal for making sure that documentation reflects what the code is actually doing. This also leaves room for a document that gives some graphical representation of how things work, which must be maintained by someone, lest it get out of date. We just minimize the need for such documents by keeping our code simple and small and surrounding it with easy to understand executable documentation.

best regards
jk
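As a rough illustration of what "executable documentation" could mean in practice, the sketch below writes one user task as a test that either passes or fails. It is only a sketch under loud assumptions: TextExtraction.plainText is a naive stand-in invented here so the example is self-contained (it is not how htmlparser works), and a real Customer Test would call the parser itself, most likely from a test framework rather than a main method.

```java
// Stand-in for the library under test. This naive tag stripping is an
// assumption made for the example, not the parser's real behaviour.
class TextExtraction {
    static String plainText(String html) {
        // Remove anything that looks like a tag: '<', then non-'>' chars, then '>'.
        return html.replaceAll("<[^>]*>", "");
    }
}

// A Customer Test: it documents the task "give me the readable text of a
// page" by performing it and checking the result.
class ExtractPlainTextStory {
    static void userCanCollectAllTextFromAPage() {
        String page = "<html><body><p>Hello <b>world</b></p></body></html>";
        String text = TextExtraction.plainText(page);
        if (!"Hello world".equals(text)) {
            throw new AssertionError("expected 'Hello world' but got '" + text + "'");
        }
    }

    public static void main(String[] args) {
        userCanCollectAllTextFromAPage();
        System.out.println("story passes");
    }
}
```

The point of the technique is that such a test keeps describing the supported usage after a refactoring, because it fails as soon as the described behaviour changes.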
From: Sam J. <ga...@yh...> - 2002-12-28 04:02:55
Joshua Kerievsky wrote:

>Sam wrote:
>
>>As I mention above. One should be careful of duplication removal for
>>the sake of it.
>
>And as I mentioned above, I completely disagree with you. I wonder, how
>much refactoring have you done in your career given your philosophy of "if it
>ain't broke, don't fix it?" Have you read Martin Fowler's landmark book,
>Refactoring? If not, I'd suggest you study it thoroughly - you'll be a
>better programmer for it.

I must say that I have yet to be convinced of the value of continuous refactoring. I think you can have too much of a good thing. I refactor a lot in my code. I am constantly looking at how to design things more efficiently and how to implement a number of different pieces of code in a smaller, tidier way, but I am also well aware that too much refactoring can be dangerous.

In your haste to refactor everything, consider that this is only the latest in a long line of computing fads, and none of them is a cure-all. Refactoring is just another technique, like OO, or unit testing. It will only help you if you use it in moderation. Once you internalise this, you'll be a better programmer. Trust me.

In my 15 years of coding experience I have seen projects collapse both from too much refactoring and too little. To believe that you will improve your project purely as a consequence of refactoring is just plain wrong in my experience.

I am not saying that you shouldn't refactor in this case, I'm just saying that I think the effort spent on refactoring would be better spent on first improving the documentation.

Somik replied to my issues about documentation saying that he had been working on it and asking me to point out other places where the documentation could be improved. I don't have time right now to go through and make those points, but as a *user* of htmlparser I would appreciate it if you would leave off any refactoring until you have gone through the entire code base and expanded every javadoc comment so that it really described what the method was doing in a manner understandable to someone unfamiliar with the code base. I think that if you direct your effort here first before refactoring you will have a greater increase in users in the long run.

Don't get me wrong, I am not against refactoring, and I use the term "if it ain't broke don't fix it" partly in jest. My main point was about "documenting before fixing". I don't believe that any refactoring that you do will replace the need for good documentation, for the code base that *is already deployed and in use*.

I am asking you as a *user* of htmlparser to focus on the documentation (and release a version including that latest documentation) before moving on with your refactoring. Remember it is dangerous for a software project to ignore the needs of its users.

CHEERS> SAM
From: Sam J. <ga...@yh...> - 2002-12-28 04:23:04
Hi,

I realise that unless I get specific on the issue of documentation I am unlikely to be taken seriously.

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/HTMLTag.html

is from the javadocs. Here are some of its javadoc comments:

Hashtable getParsed()
    "Gets the parsed"
static java.util.Vector getStrictTags()
    "Gets the strictTags"
java.lang.String getTagLine()
    "Returns the tagLine"
java.lang.String getTagName()
    ""

Personally I would like to see a good couple of lines of text for each of these. Some are more important than others. I mean getTagName might seem self explanatory, but I would like to see more, e.g.

Hashtable getParsed()
    "Gets a hashtable of attribute-value pairs from the html tag, e.g. type-->hidden, name-->address, value-->default"
static java.util.Vector getStrictTags()
    "<not sure what strict tags are, would be nice if there was a description>"
java.lang.String getTagLine()
    "Returns the line from the document that this tag was found on?"
java.lang.String getTagName()
    "get the name of the tag itself, e.g. BODY, TR, TABLE etc."

Personally I found the lack of javadocs on getParsed() particularly frustrating. I had to go into the code in some depth to find out what it was, and that it was the thing I needed. Naturally the beauty of open source is that I can actually do that. But I would have preferred to have discovered the information I needed from the javadocs. And I would prefer as a *user* that you would expend your efforts updating the documentation on the existing code base rather than refactoring.

CHEERS> SAM
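Written out as actual doc comments, the fuller descriptions Sam asks for might look like the sketch below. DocumentedTag is a made-up stand-in class, not the real HTMLTag, and the comment wording follows Sam's own reading of getParsed() and getTagName() rather than anything verified against the library; the two methods he was unsure about (getStrictTags, getTagLine) are left out for that reason.

```java
import java.util.Hashtable;

/**
 * Hypothetical stand-in for a tag class, used only to illustrate
 * the level of detail requested for the doc comments.
 */
class DocumentedTag {
    private final String tagName;
    private final Hashtable<String, String> parsed = new Hashtable<>();

    DocumentedTag(String tagName) {
        this.tagName = tagName;
    }

    /** Records one attribute of this tag (helper invented for the example). */
    void setAttribute(String name, String value) {
        parsed.put(name, value);
    }

    /**
     * Gets the attribute-value pairs found in this tag.
     * For a tag such as {@code <INPUT type="hidden" name="address" value="default">}
     * the table maps {@code type} to {@code hidden}, {@code name} to
     * {@code address} and {@code value} to {@code default}.
     *
     * @return a table keyed by attribute name
     */
    Hashtable<String, String> getParsed() {
        return parsed;
    }

    /**
     * Gets the name of the tag itself, e.g. {@code BODY}, {@code TR}
     * or {@code TABLE}.
     *
     * @return the tag name, without angle brackets or attributes
     */
    String getTagName() {
        return tagName;
    }
}
```

A comment of this kind answers the question Sam had to read the source for: what the returned Hashtable actually contains.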