Re: [Htmlparser-developer] toPlainTextString() feedback requested
From: Sam J. <ga...@yh...> - 2002-12-28 09:25:17
Hi Somik

Somik Raha wrote:
> Second, I am planning to simplify the design of the scanners. As Dhaval
> mentioned earlier, and as Sam mentioned, it wasn't easy to write a new
> scanner. I agree that the javadoc of getParsed() should be better than what
> it is. However, I think the problem lies with the name itself. In the sense,
> if we rename "parsed" to "parameterTable" or "attributeTable" - and
> getParsed() becomes getAttributeTable() - the javadoc would become
> redundant. You could say that this is a "documentation" change, or you could
> call this a "refactoring". Either way, it is developer and user-friendly. So
> I think Sam's suggestions and our current direction are one and the same.

I think that just changing the name of the method is not quite enough. Changing the name is certainly a good step, but javadoc comments that actually show an example of what is being returned make a huge usability difference, particularly when the thing being returned is something as complex as a hashtable. If we want to call changing method names and adding documentation "refactoring", then I guess I am a huge fan of refactoring :-)

However, the other part of what I was saying is that I would like to see the documentation additions before we change method names. You can deprecate the old names so as not to break my existing code, but I still think you should release a version with a full javadoc "refactoring" before you release anything with refactored code. What's the harm in focusing on the documentation first? We're on version 1.2 now, right? What about a version 1.2.1 that includes the documentation fixes, before moving on to a 1.3-beta that includes the newly refactored code you're so keen to start work on? What would be the downside of proceeding in this fashion?

> Sixth, we have a very serious possibility of using AI in making tag
> recognitions.
> I have to find the time to write about the current recognition
> mechanism, and pass that on to Sam Joseph. Sam has considerable experience
> in artificial intelligence, and it will be good for the project to have his
> expertise in its core correction logic.

This sounds like an interesting possibility, but I still need to understand how the current parser handles all the existing messy tags. When you started talking through all the messy HTML examples, I thought you were going to show me something that didn't work and that you wanted some machine learning/AI to fix, but you ended up saying that all the examples could be handled by the existing parser. If that is so, why do you want to change it? Are there examples of messy code that can't be handled by the parser? If there aren't, and it's all working fine, what advantage do we gain by replacing the existing core correction logic with some hypothetical AI alternative?

> Seventh, it might be good to have a Wiki for the parser - as so many open
> source projects do. That way, the entire burden of documentation is not on
> any one person. We should be looking at options of having our own Wiki so we
> can add content easily and collaboratively.

You are most welcome to use the wiki I have set up, in the short term:

http://www.neurogrid.net/devwiki/wikipages/HtmlParser

It uses an open source Java wiki (devwiki), which is not as fully functional as it might be, but I've got it set up so that everything gets backed up to CVS and all changes get sent to the neurogrid-cvs mailing list. If nothing else, it might serve as an example wiki environment.

CHEERS>
SAM
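P.S. To make concrete the kind of javadoc example I'm asking for: something like the sketch below, showing what the returned hashtable actually contains for a sample tag. The key/value conventions here are just my guesses - the getAttributeTable() name and the reserved tag-name key are hypothetical, not the parser's actual contract.

```java
import java.util.Hashtable;

// Hypothetical sketch of the table a renamed getAttributeTable() might
// return for the tag <A HREF="http://example.com" TARGET="_blank">.
// The "TAGNAME" key and the upper-cased attribute names are assumptions
// for illustration only, not the parser's documented behaviour.
public class AttributeTableExample {

    public static Hashtable attributeTableForLink() {
        Hashtable table = new Hashtable();
        table.put("TAGNAME", "A");                // assumed: tag name stored under a reserved key
        table.put("HREF", "http://example.com");  // attribute name -> attribute value
        table.put("TARGET", "_blank");
        return table;
    }

    public static void main(String[] args) {
        Hashtable table = attributeTableForLink();
        System.out.println(table.get("HREF"));
        System.out.println(table.get("TARGET"));
    }
}
```

An example like this in the javadoc would tell a scanner author at a glance which keys to expect, which is exactly the information that is hard to discover from the method name alone.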