Re: [Htmlparser-developer] toPlainTextString() feedback requested


Hi Sam,

> If we want to call
> changing method names and adding documentation, refactoring then I guess
> I am a huge fan of refactoring :-)

Changing method names is definitely refactoring. The idea is to use really
good names that make the javadoc redundant. If you can explain what javadoc
would achieve for a method like getAttributes() - I'd be happy to add
javadoc for that.

> However the other part of what I was
> saying is that I would like to see the documentation additions before we
> change method names.  I mean you can deprecate away so as not to break
> my existing code ... I still think you should release a version with a
> full javadoc "refactoring" before you release anything with refactored
> code.  What's the harm in focusing on the documentation first?

Simply because once the name is changed to a better one, docs are no longer
necessary. But I understand your predicament. We could have a deprecated
method name for important methods.
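To make the deprecation idea concrete, here is a minimal sketch, not actual parser code. The class name TagSketch and the old method name fetchAttributeTable() are hypothetical and exist only for illustration; getAttributes() is the better name mentioned above. It uses present-day Java syntax (generics, @Deprecated).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a tag class; not the parser's real Tag.
class TagSketch {
    private final Map<String, String> attributes = new HashMap<>();

    void setAttribute(String name, String value) {
        attributes.put(name, value);
    }

    // New, self-explanatory name.
    Map<String, String> getAttributes() {
        return attributes;
    }

    /**
     * Hypothetical old name, kept so that existing callers still compile.
     *
     * @deprecated use {@link #getAttributes()} instead.
     */
    @Deprecated
    Map<String, String> fetchAttributeTable() {
        // Delegate to the new method so there is only one implementation.
        return getAttributes();
    }
}
```

Existing code that calls the old name keeps working and gets a compiler warning pointing at the new name, so the old method can be dropped in a later release.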

>  I mean we're on version 1.2 now right?  What about a version 1.2.1 that
> included the documentation fixes, before moving on to a 1.3-beta that
> included the newly refactored code that you're so keen to start work on.
>  What would be the downside of proceeding in this fashion?

Duplicate work. Maintaining parallel branches of code is the last thing we
need to do, considering limited resources. I don't see a problem in going
straight with 1.3-beta as we've got all our tests in place, and we'll also
use deprecated methods. That should work for you, right?

> >Sixth, we have a very serious possibility of using AI in making tag
> >recognitions. I have to find the time to write about the current
> >recognition mechanism, and pass that on to Sam Joseph. Sam has
> >considerable experience in artificial intelligence, and it will be good
> >for the project to have his expertise in its core correction logic.
> >
> This sounds like an interesting possibility, but I still need to
> understand how the current parser handles all the existing messy tags.
> I mean when you started talking through all the messy html examples I
> thought you were going to be showing me something that didn't work and
> that you wanted some machine learning/AI to fix, but you ended up saying
> that all the examples could be handled by the existing parser.  If that
> is so, why do you want to change it, and are there examples of messy
> code that can't be handled by the parser?  If there aren't and it's all
> working fine, what advantage do we gain by replacing the existing core
> correction logic with some hypothetical AI alternative?

The main problem is that the current logic is complicated and hard-coded.
Such a system is hard to maintain and change. I was thinking along the
lines of a rule-based system, where the rule base could be dynamic. But I
will write a separate mail on this issue as soon as I get some time.
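
As a rough illustration of what a dynamic rule base could look like, here is a minimal sketch. It is not the current recognition mechanism and not a finished design; the names CorrectionRule and RuleBase, and the example rule itself, are assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical rule: a name, a condition, and a fix to apply when it holds.
final class CorrectionRule {
    final String name;
    final Predicate<String> applies;
    final UnaryOperator<String> fix;

    CorrectionRule(String name, Predicate<String> applies, UnaryOperator<String> fix) {
        this.name = name;
        this.applies = applies;
        this.fix = fix;
    }
}

// Hypothetical rule base: rules can be added or removed at runtime.
final class RuleBase {
    private final List<CorrectionRule> rules = new ArrayList<>();

    void add(CorrectionRule rule) {
        rules.add(rule);
    }

    boolean remove(String name) {
        return rules.removeIf(r -> r.name.equals(name));
    }

    // Runs every matching rule, in insertion order, over the raw tag text.
    String correct(String rawTag) {
        String result = rawTag;
        for (CorrectionRule rule : rules) {
            if (rule.applies.test(result)) {
                result = rule.fix.apply(result);
            }
        }
        return result;
    }
}
```

For example, a rule that closes an unterminated tag could be registered as
new CorrectionRule("close-bracket", t -> t.startsWith("<") && !t.endsWith(">"), t -> t + ">")
and later removed by name. The correction behaviour then lives in data that can change, instead of in hard-coded branches.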

> >Seventh, it might be good to have a Wiki for the parser - as so many open
> >source projects do. That way, the entire burden of documentation is not
> >on any one person. We should be looking at options of having our own Wiki
> >so we can add content easily and collaboratively.
> >
> You are most welcome to use the wiki I have set up in the short term:
>
> http://www.neurogrid.net/devwiki/wikipages/HtmlParser
>
> It uses an open source java wiki (devwiki), which is not as fully
> functional as it might be, but I've got it set up so everything gets
> backed up to CVS and all changes get sent to the neurogrid-cvs mailing
> list.  If nothing else it might serve as an example wiki environment.

Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with
the current website), as in logos, page color, etc.?

Cheers,
Somik