Re: [Htmlparser-developer] toPlainTextString() feedback requested

Hi Claude,
>I think it may well be that the group's size is hitting critical mass and
>that minimal formalities are now required...

Good to hear from you. I quite agree that this project seems to be reaching
critical mass. We need to share a plan of action that we can all work on in
parallel.

First, I've noticed that we're not consistent in our coding convention. In
the past, if I saw a class that did not follow the convention, I'd go ahead
and fix it. However, as so many people are now working in parallel, it's
important that we all share the same coding convention. The one I follow is
the standard Java coding convention. In addition, I particularly don't like
names with single-character prefixes - e.g. I'd prefer "tag" to "pTag". That
is the convention followed in most of the parser. The coding convention is
of course open for discussion if anyone feels that we should accommodate any
particular practice.

Second, I am planning to simplify the design of the scanners. As Dhaval
mentioned earlier, and as Sam has also mentioned, it wasn't easy to write a
new scanner. I agree that the javadoc of getParsed() should be better than
it is. However, I think the problem lies with the name itself. If we rename
"parsed" to "parameterTable" or "attributeTable", so that getParsed()
becomes getAttributeTable(), the javadoc would become almost redundant. You
could say that this is a "documentation" change, or you could call it a
"refactoring". Either way, it is friendly to both developers and users. So
I think Sam's suggestions and our current direction are one and the same.
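
To illustrate the rename, here is a rough sketch only. SketchTag and
setAttribute() are made-up names for this example and are not the parser's
actual classes or methods. It assumes the old getter is kept for a while as
a deprecated delegate so that existing callers keep working:

```java
import java.util.Hashtable;

// Hypothetical, simplified stand-in for a tag class (not the real HTMLTag).
class SketchTag {
    // Formerly named "parsed": maps attribute names to their values.
    private final Hashtable<String, String> attributeTable = new Hashtable<>();

    // Hypothetical helper so the sketch has something to store.
    void setAttribute(String name, String value) {
        attributeTable.put(name, value);
    }

    /** Returns the table of attribute names and values for this tag. */
    Hashtable<String, String> getAttributeTable() {
        return attributeTable;
    }

    /** @deprecated use {@link #getAttributeTable()} instead. */
    @Deprecated
    Hashtable<String, String> getParsed() {
        // Delegate to the new name so both getters return the same table.
        return getAttributeTable();
    }
}
```

With the new name, a call such as tag.getAttributeTable().get("href") reads
clearly without anyone having to look up the javadoc.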

Third, there's way too much duplicate code in the tags. Almost every tag
runs through its attributes to render itself as HTML, whereas this could be
done once in the superclass, HTMLTag. This is just one example. I've not
been focusing much on reducing duplication, and we can now see serious code
smells. However, I think it is important to be able to quantify what we hope
to achieve by undertaking several rounds of refactoring. As we keep
refactoring, I will keep tabs on the size of the parser (in lines of code).
If there's any other metric that people think is important, we could
include that as well. It will be interesting to see how much code we can
take out.
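
As a rough sketch of the kind of duplication removal meant here: the class
names below (BaseTagSketch, LinkTagSketch, ImageTagSketch) and the method
names are invented for illustration, and the real HTMLTag has a different and
larger API. The attribute-rendering loop is written once in the base class
and inherited by every tag:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class standing in for HTMLTag.
class BaseTagSketch {
    private final String name;
    // LinkedHashMap keeps attributes in insertion order, so output is predictable.
    private final Map<String, String> attributes = new LinkedHashMap<>();

    BaseTagSketch(String name) {
        this.name = name;
    }

    void setAttribute(String key, String value) {
        attributes.put(key, value);
    }

    // Written once here; subclasses inherit it instead of repeating the loop.
    String toHTML() {
        StringBuilder html = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> attribute : attributes.entrySet()) {
            html.append(' ').append(attribute.getKey())
                .append("=\"").append(attribute.getValue()).append('"');
        }
        return html.append('>').toString();
    }
}

// Hypothetical subclasses: they need no rendering code of their own.
class LinkTagSketch extends BaseTagSketch {
    LinkTagSketch() {
        super("A");
    }
}

class ImageTagSketch extends BaseTagSketch {
    ImageTagSketch() {
        super("IMG");
    }
}
```

A tag that really does need special rendering could still override toHTML(),
but the common case would cost no lines in the subclass, which is where the
drop in lines of code should show up.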

Fourth, this is one project where change is welcome, all the time, simply
because it has a massive suite of tests which represents most of the user
requirements that we've received to date. The real value in the parser is
not really its production code but its solid test suite. If there are
interesting design changes that will make the current system simpler, they
should be tried out without fear, for if we break anything, we will know
because of the automated testing. It is therefore important that the test
suite is regarded as just as important as the production code (if not more
so), and that steps are taken to improve its design and make it very easy
to add new tests all the time.

Fifth, as the project grows, new features should keep coming in all the
time. Derrick Oswald has been working away at so many cool new features,
and he's been writing tests for them all. This is a much-appreciated
activity and must continue. I'd be happy to not think about new features and
focus completely on solidifying the existing system, and let Derrick focus
on adding new stuff for v1.3.

Sixth, there is a very serious possibility of using AI for tag recognition.
I have to find the time to write up the current recognition mechanism and
pass that on to Sam Joseph. Sam has considerable experience in artificial
intelligence, and it will be good for the project to have his expertise in
its core correction logic.

Seventh, it might be good to have a Wiki for the parser, as so many open
source projects do. That way, the entire burden of documentation is not on
any one person. We should look at options for hosting our own Wiki so we
can add content easily and collaboratively.

Please feel free to add to or modify this list if I've missed anything.

Regards,
Somik