Re: [Htmlparser-developer] toPlainTextString() feedback requested

Hi Claude,

> If I may... Watch out for feature creep! Adding AI sounds like a panacea
> but it's not. I have considerable experience with AI myself and it rarely
> meets expectations unless you can clearly define the objectives up front. If
> you plan to train classifiers, you'll need to collect a fairly large,
> relevant training set. Even then, this may not solve as many problems as it
> may cause and you need to make a good decision based on fact.
>
> Some of your messages suggest you are easily taken by whiz-bang
> technologies and cool new features (which I suffer from myself and need to
> balance). You need to consider these features especially carefully and
> realize that this bias can be detrimental under some circumstances.
> Certainly, the project needs to be fun, but don't get too caught up in
> feature envy. Make sure you keep the coupling loose through interfaces,
> especially if you add AI components, and please allow users to decide which
> features they want by keeping the architecture as pluggable as possible.
> This will benefit both your user and development communities.

I agree. As I mentioned, I'm keeping the AI stuff at low priority for now
and focusing only on refactoring for the moment. However, I am hoping that
we can do some evaluation and decide whether we should go for it in v1.4.
The real problem is that we keep finding real-life dirty HTML that just does
not follow any rules. Just a while back, I found a bug in HTMLStringNode.

If we have HTML like:

<script language="javascript">
var lower = '<%=lowerValue%>';
</script>

the StringNode does not realize that the JSP tag within quotes should be
treated as part of the string node rather than as a tag. I found this
problem while refactoring today. After modifying the parsing automaton to
include a PARSE_IGNORE_STATE for HTMLStringNode, I was able to handle this
case. But thanks to the tests, I found that another test was now failing,
in which we had something like:

<A href="somelink.html">Kaarle's homepage</A>

As you can see, the single apostrophe caused the damage, since it looked
like the start of a quoted section. I had to code in extra logic to verify
that the character following the quote opens a tag before the string node
automaton moves into the ignoring state.
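
To make the rule concrete, here is a rough sketch of the idea. It is not the
actual HTMLStringNode code: the class, method and state names are made up
for illustration, and I am assuming that "opening tag" simply means the next
character is '<'.

```java
class StringScanSketch {
    // Hypothetical states; the names are not taken from the htmlparser source.
    private enum State { TEXT, IGNORE }

    /**
     * Returns the index at which the text run beginning at 'start' ends:
     * the first '<' seen outside the ignoring state, or the input length.
     */
    static int endOfText(String input, int start) {
        State state = State.TEXT;
        char quote = 0; // the quote character that opened the ignored section
        for (int i = start; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == State.TEXT) {
                boolean isQuote = (c == '\'' || c == '"');
                boolean tagFollows =
                    i + 1 < input.length() && input.charAt(i + 1) == '<';
                if (isQuote && tagFollows) {
                    // A quote directly followed by '<': ignore tags until
                    // the matching quote closes the section.
                    state = State.IGNORE;
                    quote = c;
                } else if (c == '<') {
                    return i; // a real tag starts here
                }
            } else if (c == quote) {
                state = State.TEXT; // closing quote, resume normal scanning
            }
        }
        return input.length();
    }
}
```

With this rule the quoted JSP expression stays inside the text run, while the
apostrophe in "Kaarle's" is left alone because an 's' follows it, not a '<'.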

The good news is that all tests are now passing. But I am not really sure
that this approach of modifying the parsing automaton's logic for each new
case is all that great. If cases such as this can be handled outside the
parser, perhaps with regular expressions, we will no longer need to modify
the parser code each time. However, there would probably be a performance
hit compared to the system that we have now, so we have to look at it
really carefully before we go for it.
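
As one possible shape for such an external rule (only a sketch, nothing that
exists in htmlparser today; the class and method names are hypothetical), a
regular expression could mark the spans that the parser should treat as
plain text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedJspRule {
    // A quote, then a JSP tag (<% ... %>), then the same quote again.
    private static final Pattern QUOTED_JSP =
        Pattern.compile("(['\"])<%.*?%>\\1", Pattern.DOTALL);

    /** Returns {start, end} offsets of quoted JSP tags found in the input. */
    static List<int[]> protectedSpans(String html) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = QUOTED_JSP.matcher(html);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }
}
```

A new dirty-HTML case would then mean adding a pattern rather than touching
the automaton. The cost of the extra pass over the input is exactly what we
would need to measure.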

We've been going vertical all this while, so I am just trying to go lateral
to see where that might take us, just to generate more possibilities. I'm
thinking that we could try a separate implementation: writing the parsing
logic from scratch and using the existing test mechanism to verify it. This
could be a separate module in the htmlparser project, for experimentation
only, to evaluate feasibility.

Regards,
Somik