Re: [Htmlparser-developer] toPlainTextString() feedback requested
From: Somik R. <so...@ya...> - 2002-12-29 00:11:42
Hi Claude,

> If I may... Watch out for feature creep! Adding AI sounds like a panacea but it's not. I have considerable experience with AI myself and it rarely meets expectations unless you can clearly define the objectives up front. If you plan to train classifiers, you'll need to collect a fairly large, relevant training set. Even then, this may not solve as many problems as it may cause and you need to make a good decision based on fact.
>
> Some of your messages suggest you are easily taken by whiz-bang technologies and cool new features (which I suffer from myself and need to balance). You need to consider these features especially carefully and realize that this bias can be detrimental under some circumstances. Certainly, the project needs to be fun, but don't get too caught up in feature envy. Make sure you keep the coupling loose through interfaces, especially if you add AI components, and please allow users to decide which features they want by keeping the architecture as pluggable as possible. This will benefit both your user and development communities.

I agree with your opinion - I'm keeping the AI stuff a low priority for now, as I mentioned - focusing only on refactoring for the moment. However, I am hoping that we can do some evaluation, and decide if we should go for it in v1.4.

The real problem is that we keep finding real-life dirty HTML that just does not follow any rules. A little while back, I found a bug in HTMLStringNode. If we have HTML like:

    <script language="javascript">
    var lower = '<%=lowerValue%>';
    </script>

the StringNode does not realize that the JSP tag within quotes should be handled as part of the string node, not as a tag. I found this problem while refactoring today. After modifying the parsing automaton to include a PARSE_IGNORE_STATE for HTMLStringNode, I was able to handle this case.
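For illustration, here is a minimal Java sketch of what a quote-triggered ignoring state could look like. It is not the actual HTMLStringNode code: the class name `StringNodeSketch`, the method `readStringNode`, and the `PARSE_TEXT_STATE` constant are hypothetical, and only the name `PARSE_IGNORE_STATE` comes from the description above. It implements just the simple rule (any quote enters the ignoring state) and makes no further checks on what follows the quote.

```java
public class StringNodeSketch {
    // Hypothetical state constants; only the name PARSE_IGNORE_STATE
    // appears in the post, the rest is assumed for this sketch.
    private static final int PARSE_TEXT_STATE = 0;
    private static final int PARSE_IGNORE_STATE = 1;

    /**
     * Reads characters from {@code start} until a '<' that is not inside
     * a quoted run, and returns them as the string node's text.
     */
    public static String readStringNode(String input, int start) {
        int state = PARSE_TEXT_STATE;
        char quote = 0; // the quote character that opened the ignored run
        StringBuilder text = new StringBuilder();
        for (int i = start; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (state == PARSE_IGNORE_STATE) {
                // Inside quotes '<' is plain text; only the matching
                // quote character returns to the normal text state.
                if (ch == quote) {
                    state = PARSE_TEXT_STATE;
                }
            } else if (ch == '\'' || ch == '"') {
                state = PARSE_IGNORE_STATE;
                quote = ch;
            } else if (ch == '<') {
                break; // a tag starts here, so the string node ends
            }
            text.append(ch);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The script body from the example above, starting after the opening tag.
        String body = " var lower = '<%=lowerValue%>'; </script>";
        System.out.println("[" + readStringNode(body, 0) + "]");
    }
}
```

In this sketch the quoted `<%=lowerValue%>` stays inside the string node, and scanning stops only at the `<` of the closing `</script>` tag.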
But thanks to the tests, I found that we had another failing test, in which we had something like:

    <A href="somelink.html">Kaarle's homepage</A>

As you can see, the single apostrophe caused the damage. I had to add extra logic: the string node automaton now moves into the ignoring state only if the character following the quote opens a tag. The good news is that all tests are passing.

But I am not really sure that this philosophy of modifying the parsing automaton logic is all that great. If cases such as this can be coded outside the parser - perhaps with regular expressions - we will no longer need to modify the parser code each time. However, there would probably be a performance hit compared to the system that we have now, and we've got to look at it really carefully before we go for it.

We've been going vertical all this while, so I am just trying to go lateral to see where that might take us - just to generate more possibilities. I'm thinking that we could try a separate implementation - writing the parsing logic from scratch, and using the existing test mechanism to verify it. This could be a separate module in the htmlparser project, for experimentation only - to evaluate feasibility.

Regards,
Somik