Re: [Htmlparser-developer] toPlainTextString() feedback requested

Hi Somik

Somik Raha wrote:

>>If we want to call changing method names and adding documentation
>>refactoring, then I guess I am a huge fan of refactoring :-)
>>    
>>
>
>Changing method names is definitely refactoring. The idea is to use really
>good names that make the javadoc redundant. If you can explain what javadoc
>would achieve for a method like getAttributes() - I'd be happy to add
>javadoc for that.
>
I don't think good method names make javadoc redundant, and I don't 
think they can.  For example Hashtable getAttributes()  tells me very 
little. What is an attribute? What are the classes in the hashtable? 
Give me three examples of what an attribute is and then you might have a 
good javadoc, e.g.

[Gets a hashtable of attribute-value pairs from the html tag, e.g. 
type-->hidden, name-->address, value-->default would come from the tag 
<input type="hidden" name="address" value="default">.  Attributes and 
values are all String classes. ]

Now I think that the method getAttributes with the above javadoc would 
be much easier to use and understand for somebody trying to use 
htmlparser for the first or even the third time, no?  I think you need 
to be careful about what is obvious for you, who have designed the 
system, and what is obvious for a novice user.  
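
To make that concrete, here is a rough sketch of how the documented method 
might look.  The class name DocumentedTag and the setAttribute helper are 
made up for illustration and are not the real htmlparser source; the typed 
Hashtable is also just my assumption about how to spell out "all Strings".

```java
import java.util.Hashtable;

// Hypothetical stand-in for a tag class; NOT the real htmlparser code.
class DocumentedTag {
    private final Hashtable<String, String> attributes = new Hashtable<>();

    // Illustrative helper so the sketch can be exercised on its own.
    DocumentedTag setAttribute(String name, String value) {
        attributes.put(name, value);
        return this;
    }

    /**
     * Gets a hashtable of attribute-value pairs from the html tag.
     * For example, the tag
     * {@code <input type="hidden" name="address" value="default">}
     * yields the pairs type/hidden, name/address and value/default.
     * Attribute names and values are all Strings.
     *
     * @return the attribute names mapped to their values
     */
    Hashtable<String, String> getAttributes() {
        return attributes;
    }
}
```

The name getAttributes() stays exactly as it is; the comment is what 
answers "what is an attribute?" and "what is in the hashtable?".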

>>However the other part of what I was
>>saying is that I would like to see the documentation additions before we
>>change method names.  I mean you can deprecate away so as not to break
>>my existing code ... I still think you should release a version with a
>>full javadoc "refactoring" before you release anything with refactored
>>code.  What's the harm in focusing on the documentation first?
>>    
>>
>
>Simply because once the name is changed to a better one, docs are no longer
>necessary. 
>
See my point above.  I think you need to make more than a name change to make 
the api novice-user friendly.  And my comments about the need for more 
detailed javadocs go right through the whole project, not just the few 
examples I've given.  

>> I mean we're on version 1.2 now right?  What about a version 1.2.1 that
>>included the documentation fixes, before moving on to a 1.3-beta that
>>included the newly refactored code that you're so keen to start work on.
>> What would be the downside of proceeding in this fashion?
>>    
>>
>
>Duplicate work. Maintaining parallel branches of code is the last thing we
>need to do, considering limited resources. I don't see a problem in going
>straight with 1.3-beta as we've got all our tests in place, and we'll also
>use deprecated methods. That should work for you, right ?
>
Sure, having deprecated methods works for me, but it's not just about 
me.  I mean you've already released 1.2.  People are already using it, 
building code around it.  You have to maintain it anyway, or rather 
users will continue asking you questions about it.  I'm suggesting that 
a 1.2.1 improved documentation release would mean less work in having to 
maintain things because the users could work from the javadocs.  I would 
also think that going through and writing thorough javadocs for your 
existing version would put you in a much better position to write a good 
1.3-beta.  Limited resources is an important concern, and I think you'll 
conserve resources by more thorough documentation first and re-writing 
your code second.  

As long as you have deprecated methods then my own code base will be 
fine, so it makes little difference to me in the short term which way 
you go.  However I think in the long term, a 1.2.1 documentation release 
would improve the quality of the project overall, and set a good 
precedent for high quality javadocs in subsequent releases.  I get the 
sense that you are really keen to plow into re-writing all the code.  
What's the hurry?
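
For what it's worth, this is the kind of deprecation I have in mind.  It is 
only a sketch: RenamedTag, fetchAttrs and the one-argument constructor are 
invented names, not anything in htmlparser.  The old method stays, forwards 
to the new one, and its javadoc tells the user where to go.

```java
import java.util.Hashtable;

// Hypothetical class and method names; only the deprecation pattern matters.
class RenamedTag {
    private final Hashtable<String, String> attributes;

    RenamedTag(Hashtable<String, String> attributes) {
        this.attributes = attributes;
    }

    /** New, better-named method: attribute names mapped to their values. */
    Hashtable<String, String> getAttributes() {
        return attributes;
    }

    /**
     * Old name, kept so code written against the earlier release still compiles.
     *
     * @deprecated use {@link #getAttributes()} instead.
     */
    @Deprecated
    Hashtable<String, String> fetchAttrs() {
        // Forward to the new method so both names always behave the same.
        return getAttributes();
    }
}
```

Existing callers keep working and get a compiler warning that points them 
at the new name, which is all I need from my side.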

>>This sounds like an interesting possibility, but I still need to
>>understand how the current parser handles all the existing messy tags.
>>I mean when you started talking through all the messy html examples I
>>thought you were going to be showing me something that didn't work and
>>that you wanted some machine learning/AI to fix, but you ended up saying
>>that all the examples could be handled by the existing parser.  If that
>>is so, why do you want to change it, and are there examples of messy
>>code that can't be handled by the parser?  If there aren't and it's all
>>working fine, what advantage do we gain by replacing the existing core
>>correction logic with some hypothetical AI alternative?
>>    
>>
>
>The main problem is that the current logic is complicated, and uses
>hard-coded logic. Such a system is hard to maintain or change. I was thinking
>along the lines of a rule-based system, where the rule-base could be dynamic.
>But, I will write a separate mail on this issue as soon as I get some time.
>
Ah, I think I understand.  What you want is some framework for dealing 
with new types of mess as they come up.  A system that can have more 
rules added to it without too much work, or even a framework for trying to 
guess which rules to apply in which document ...
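
Just to check I've understood, is it something along these lines?  This is 
purely my guess at the shape of it: CorrectionRule, RuleBase and 
MissingCloseBracketRule are names I have invented, and the single rule is a 
toy example rather than anything the current parser does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a dynamic rule base; none of these types exist in htmlparser.
interface CorrectionRule {
    /** True if this rule knows how to repair the given tag text. */
    boolean applies(String tagText);

    /** Returns the repaired tag text. */
    String correct(String tagText);
}

class RuleBase {
    private final List<CorrectionRule> rules = new ArrayList<>();

    /** New kinds of mess are handled by adding a rule, not by editing the parser core. */
    void addRule(CorrectionRule rule) {
        rules.add(rule);
    }

    /** Runs every applicable rule in the order the rules were added. */
    String correct(String tagText) {
        String result = tagText;
        for (CorrectionRule rule : rules) {
            if (rule.applies(result)) {
                result = rule.correct(result);
            }
        }
        return result;
    }
}

// Toy rule: append the closing '>' to a tag that is missing it.
class MissingCloseBracketRule implements CorrectionRule {
    public boolean applies(String tagText) {
        return tagText.startsWith("<") && !tagText.endsWith(">");
    }

    public String correct(String tagText) {
        return tagText + ">";
    }
}
```

Guessing which rules to apply to which document would then be a matter of 
choosing what goes into the RuleBase, which is where any learning could 
plug in later.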

>>>Seventh, it might be good to have a Wiki for the parser - as so many open
>>>source projects do. That way, the entire burden of documentation is not on
>>>any one person. We should be looking at options of having our own Wiki so we
>>>can add content easily and collaboratively.
>>>
>>>      
>>>
>>You are most welcome to use the wiki I have set up in the short term:
>>
>>http://www.neurogrid.net/devwiki/wikipages/HtmlParser
>>
>>It uses an open source java wiki (devwiki), which is not as fully
>>functional as it might be, but I've got it set up so everything gets
>>backed up to CVS and all changes get sent to the neurogrid-cvs mailing
>>list.  If nothing else it might serve as an example wiki environment.
>>    
>>
>
>Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with
>current website) as in logos, page color, etc.?
>
I guess so.  I think that could be arranged.  Send me a copy of the logo 
etc., and I'll see what I can do, although it might take a couple of 
weeks for me to find time to do it.

CHEERS> SAM