htmlparser-developer Mailing List for HTML Parser (Page 20)

Brought to you by: derrickoswald

htmlparser-developer — The developer mailing list of the htmlparser project

You can subscribe to this list here.

2001	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct (4)	Nov (1)	Dec (4)
2002	Jan (12)	Feb	Mar (7)	Apr (27)	May (14)	Jun (16)	Jul (27)	Aug (74)	Sep (1)	Oct (23)	Nov (12)	Dec (119)
2003	Jan (31)	Feb (23)	Mar (28)	Apr (59)	May (119)	Jun (10)	Jul (3)	Aug (17)	Sep (8)	Oct (38)	Nov (6)	Dec (1)
2004	Jan (4)	Feb (4)	Mar (1)	Apr (2)	May	Jun (7)	Jul (6)	Aug (1)	Sep	Oct	Nov	Dec
2005	Jan	Feb (1)	Mar	Apr (8)	May	Jun	Jul	Aug (2)	Sep (10)	Oct (4)	Nov (15)	Dec
2006	Jan	Feb (1)	Mar	Apr (4)	May (11)	Jun	Jul	Aug	Sep (2)	Oct	Nov	Dec
2007	Jan (3)	Feb (2)	Mar	Apr (2)	May	Jun	Jul (1)	Aug	Sep	Oct	Nov	Dec
2008	Jan	Feb (1)	Mar	Apr	May	Jun	Jul	Aug	Sep (5)	Oct (1)	Nov	Dec
2009	Jan	Feb (1)	Mar	Apr (2)	May	Jun (4)	Jul	Aug (1)	Sep	Oct	Nov	Dec (2)
2010	Jan (1)	Feb	Mar	Apr (8)	May	Jun	Jul	Aug	Sep (6)	Oct	Nov (1)	Dec
2011	Jan	Feb	Mar	Apr	May (3)	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2012	Jan	Feb	Mar	Apr	May (1)	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2014	Jan	Feb	Mar	Apr	May (1)	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2015	Jan	Feb	Mar	Apr (1)	May	Jun (1)	Jul	Aug	Sep	Oct	Nov (2)	Dec (1)
2016	Jan	Feb	Mar	Apr	May	Jun	Jul (2)	Aug	Sep	Oct	Nov (2)	Dec (2)

Flat | Threaded

<< < 1 .. 18 19 20 21 22 .. 33 > >> (Page 20 of 33)

[Htmlparser-developer] Re: test directory

From: Somik R. <so...@ya...> - 2003-01-02 00:14:50

Hi Derrick
    Sorry for the delay in replying.
    I've opened up the htmlparser project - any developer can add stuff into
the site using secure copy tools (scp/filezilla).
    Feel free to add documentation or anything else.

Regards,
Somik
----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: "Somik Raha" <so...@ya...>
Sent: Monday, December 30, 2002 1:47 PM
Subject: test directory


> Somik,
>
> I was wondering if you could open up the directory permissions so others
> could put files in /home/groups/h/ht/htmlparser/htdocs/test too.
> Something like:
>     chmod  g+w  /home/groups/h/ht/htmlparser/htdocs/test
> should allow people in the htmlparser group write access to it.
>
> Derrick

Re: [Htmlparser-developer] way ta go!

From: Kaarle K. <kaa...@ik...> - 2002-12-31 22:01:56

At 10:34 31.12.2002 -0800, you wrote:
>Great! That hasn't happened in quite some time.
>Happy New Year to all!

Happy new year to you!
The new year 2003 started one minute ago here.
For you Somik it would have begun quite some hours ago if
you had stayed in Japan.

regards
Kaarle

>Regards,
>Somik
>--- Derrick Oswald <Der...@ro...> wrote:
> > HTML Parser made the top fifty most active projects
> > last week at
> > position 47 (99.526 percentile):
> >
> > https://sourceforge.net/top/mostactive.php?type=week
> >
> >
> >
> >
>-------------------------------------------------------
> > This sf.net email is sponsored by:ThinkGeek
> > Welcome to geek heaven.
> > http://thinkgeek.com/sf
> > _______________________________________________
> > Htmlparser-developer mailing list
> > Htm...@li...
> >
>https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>
>
>__________________________________________________
>Do you Yahoo!?
>Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
>http://mailplus.yahoo.com
>
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Htmlparser-developer mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

Re: [Htmlparser-developer] way ta go!

From: Somik R. <so...@ya...> - 2002-12-31 18:34:17

Great! That hasn't happened in quite some time. 
Happy New Year to all!

Regards,
Somik
--- Derrick Oswald <Der...@ro...> wrote:
> HTML Parser made the top fifty most active projects
> last week at 
> position 47 (99.526 percentile):
> 
> https://sourceforge.net/top/mostactive.php?type=week
> 
> 
> 
>
-------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
>
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer


__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com

[Htmlparser-developer] way ta go!

From: Derrick O. <Der...@ro...> - 2002-12-31 15:55:18

HTML Parser made the top fifty most active projects last week at 
position 47 (99.526 percentile):

https://sourceforge.net/top/mostactive.php?type=week

[Htmlparser-developer] swingui

From: Derrick O. <Der...@ro...> - 2002-12-30 19:46:53

I've dropped changes to use the junit swing runner by default.
I hope that's OK.
If you want the old behaviour (awt runner) use:
    java  org.htmlparser.tests.AllTests  -awt

Re: [Htmlparser-developer] AI - to be or not to be

From: Somik R. <so...@ya...> - 2002-12-30 06:14:51

Sam Joseph wrote:
> I would have thought the solution to this would be to define under which
> circumstances a pair of apostrophes will indicate text.  In the first
> two examples you are inside an ANCHOR tag, and in the third you are
> inside a SCRIPT tag.  It seems to me that you should only be using
> apostrophe's to indicate text strings inside a SCRIPT tag, no?  If
> that's true then set your parsing behaviour differently depending on the
> tag type.

You are right. The script scanner could set some static variable in the
HTMLStringNode - which tells it to move into the ignore states if it
encounters an apostrophe, and flag it off when its done. I'll probably do
that for now..

Regards,
Somik
----- Original Message -----
From: "Sam Joseph" <ga...@yh...>
To: <htm...@li...>
Sent: Sunday, December 29, 2002 10:09 PM
Subject: Re: [Htmlparser-developer] AI - to be or not to be


> Hi Somik,
>
> I would have thought the solution to this would be to define under which
> circumstances a pair of apostrophes will indicate text.  In the first
> two examples you are inside an ANCHOR tag, and in the third you are
> inside a SCRIPT tag.  It seems to me that you should only be using
> apostrophe's to indicate text strings inside a SCRIPT tag, no?  If
> that's true then set your parsing behaviour differently depending on the
> tag type.
>
> CHEERS> SAM
>
> Somik Raha wrote:
>
> >Hi Folks,
> >    Derrick Oswald just checked in a test case that fails..
> >Here's a link tag :
> >
> ><a href="http://cbc.ca/artsCanada/stories/greatnorth271202"
> >class="lgblacku">Vancouver schools plan 'Great Northern Way'</a>
> >
> >We are in a quandary now. When we have cases like :
> >
> ><a href="something.html">Kaarle's Page</a>
> >
> >we should accept the apostrophe without doing anything special.
> >
> >When we get links like,
> ><script>
> >    var code = '<sometag>';
> ></script>
> >
> >We should not take the tag symbols after code seriously, as they are part
of
> >the string.
> >
> >Handling the last two cases causes a conflict with the first case -bcos
the
> >last case is handled by checking if there's a < after ' - and this causes
> >the first case to go into an ignoring mode.
> >
> >How do we handle this problem ? Do we write smart code to handle this
> >particular situation ? From human experience, even if we've not
encountered
> >these cases, we know how to differentiate between a string node and a
tag.
> >Can AI help us here ? Also, pls feel free to suggest any straightforward
> >solutions as well.
> >
> >Regards
> >Somik
> >
> >
> >
> >
> >-------------------------------------------------------
> >This sf.net email is sponsored by:ThinkGeek
> >Welcome to geek heaven.
> >http://thinkgeek.com/sf
> >_______________________________________________
> >Htmlparser-developer mailing list
> >Htm...@li...
> >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
> >
> >
> >
> >
>
>
>
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

Re: [Htmlparser-developer] AI - to be or not to be

From: Sam J. <ga...@yh...> - 2002-12-30 05:54:14

Hi Somik,

I would have thought the solution to this would be to define under which 
circumstances a pair of apostrophes will indicate text.  In the first 
two examples you are inside an ANCHOR tag, and in the third you are 
inside a SCRIPT tag.  It seems to me that you should only be using 
apostrophe's to indicate text strings inside a SCRIPT tag, no?  If 
that's true then set your parsing behaviour differently depending on the 
tag type.

CHEERS> SAM

Somik Raha wrote:

>Hi Folks,
>    Derrick Oswald just checked in a test case that fails..
>Here's a link tag :
>
><a href="http://cbc.ca/artsCanada/stories/greatnorth271202"
>class="lgblacku">Vancouver schools plan 'Great Northern Way'</a>
>
>We are in a quandary now. When we have cases like :
>
><a href="something.html">Kaarle's Page</a>
>
>we should accept the apostrophe without doing anything special.
>
>When we get links like,
><script>
>    var code = '<sometag>';
></script>
>
>We should not take the tag symbols after code seriously, as they are part of
>the string.
>
>Handling the last two cases causes a conflict with the first case -bcos the
>last case is handled by checking if there's a < after ' - and this causes
>the first case to go into an ignoring mode.
>
>How do we handle this problem ? Do we write smart code to handle this
>particular situation ? From human experience, even if we've not encountered
>these cases, we know how to differentiate between a string node and a tag.
>Can AI help us here ? Also, pls feel free to suggest any straightforward
>solutions as well.
>
>Regards
>Somik
>
>
>
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Htmlparser-developer mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>
>
>  
>

[Htmlparser-developer] AI - to be or not to be

From: Somik R. <so...@ya...> - 2002-12-30 05:19:57

Hi Folks,
    Derrick Oswald just checked in a test case that fails..
Here's a link tag :

<a href="http://cbc.ca/artsCanada/stories/greatnorth271202"
class="lgblacku">Vancouver schools plan 'Great Northern Way'</a>

We are in a quandary now. When we have cases like :

<a href="something.html">Kaarle's Page</a>

we should accept the apostrophe without doing anything special.

When we get links like,
<script>
    var code = '<sometag>';
</script>

We should not take the tag symbols after code seriously, as they are part of
the string.

Handling the last two cases causes a conflict with the first case -bcos the
last case is handled by checking if there's a < after ' - and this causes
the first case to go into an ignoring mode.

How do we handle this problem ? Do we write smart code to handle this
particular situation ? From human experience, even if we've not encountered
these cases, we know how to differentiate between a string node and a tag.
Can AI help us here ? Also, pls feel free to suggest any straightforward
solutions as well.

Regards
Somik

Re: [Htmlparser-developer] Charset tests failing

From: Derrick O. <Der...@ro...> - 2002-12-29 15:50:41

Somik,

Finally reproduced it by backing down my JVM to 1.2.
Code fix and test modifications dropped.
Sorry about that agent 99.

Derrick

Somik Raha wrote:

> Hi Derrick,
>     Are you sure ? I have the latest version here too - but the two 
> tests are consistently failing.
> Here's what I see :
> Charset should be Shift_JIS but was ISO-8859-1
>  
> for both failures. 
>  
> Regards,
> Somik

[Htmlparser-developer] Integration Release 1.3 - 20021228 is out

From: Somik R. <so...@ya...> - 2002-12-29 08:09:57

Hi Folks,
    The integration release for this week is out.
    You can download it from http://htmlparser.sourceforge.net

Integration build 1.3 - 20021228
--------------------------------
[1] Added URLConnection constructors to HTMLParser
[2] Honour charset parameter on HTTP header and in HTML meta tag
[3] Following tags now inherit from HTMLCompositeTag
 (i) HTMLFormTag
 (ii) HTMLLinkTag
 (iii) HTMLSelectTag
 (iv) HTMLFrameSetTag
 (v) HTMLTitleTag
 (vi) HTMLTextAreaTag
 (vii) HTMLStyleTag
 (viii) HTMLScriptTag
 (ix) HTMLAppletTag

[4] Performed Refactoring "Introduce Parameter Object" on HTMLTag,
HTMLCompositeTag, HTMLLinkTag, HTMLFormTag
[5] Refactored HTMLFormTag, pulling up the search methods into
HTMLCompositeTag
[6] Added HTMLVector, which can return HTMLSimpleEnumeration - a
no-exception flavor of HTMLEnumeration
[7] Refactored HTMLEnumeration - created new interface -
HTMLPeekingEnumeration

Notes : HTMLVector is not yet integrated with the tags. That should happen
in the next release.

Regards,
Somik

Re: [Htmlparser-developer] Charset tests failing

From: Somik R. <so...@ya...> - 2002-12-29 01:21:20

Hi Derrick,
    Are you sure ? I have the latest version here too - but the two =
tests are consistently failing.=20
Here's what I see :
Charset should be Shift_JIS but was ISO-8859-1

for both failures. =20

Regards,
Somik
  ----- Original Message -----=20
  From: Derrick Oswald=20
  To: htm...@li...=20
  Sent: Saturday, December 28, 2002 5:17 PM
  Subject: Re: [Htmlparser-developer] Charset tests failing


  Somik,

  Sorry, just got back from the pilgrimage to Bethlehem (Toronto).

  All 268 tests are running OK against the version I took just now.
  Perhaps www.ibm.co.jp or www.sony.co.jp weren't available for you when =
you tried it. Try again, please.

  The tests can't be all in our repository, specifically the =
testHTTPCharset relies on an HTTP header charset parameter being set (by =
the HTTP server) to something other than ISO-8859-1 which the =
sourceforge people are unlikely to set up for us. We could put the =
testHTMLCharset under the sourceforge domain though. Will get to it as =
time permits.

  Derrick

  Somik Raha wrote:

    Hi Derrick,
        I was working on the latest parser, and just found that the =
charset tests are failing (I didnt modify anything in HTMLParser).
        Could you check what might have gone wrong ? Also, I was =
wondering if it might not be better to have the charset pages on our =
domain (http://htmlparser.sourceforge.net) - we could put up encoded =
pages and they'd always be there. =20

    Regards
    Somik
       =20

Re: [Htmlparser-developer] Charset tests failing

From: Derrick O. <Der...@ro...> - 2002-12-29 01:14:58

Somik,

Sorry, just got back from the pilgrimage to Bethlehem (Toronto).

All 268 tests are running OK against the version I took just now.
Perhaps www.ibm.co.jp or www.sony.co.jp weren't available for you when 
you tried it. Try again, please.

The tests can't be all in our repository, specifically the 
testHTTPCharset relies on an HTTP header charset parameter being set (by 
the HTTP server) to something other than ISO-8859-1 which the 
sourceforge people are unlikely to set up for us. We could put the 
testHTMLCharset under the sourceforge domain though. Will get to it as 
time permits.

Derrick

Somik Raha wrote:

> Hi Derrick,
>     I was working on the latest parser, and just found that the 
> charset tests are failing (I didnt modify anything in HTMLParser).
>     Could you check what might have gone wrong ? Also, I was wondering 
> if it might not be better to have the charset pages on our domain 
> (http://htmlparser.sourceforge.net) - we could put up encoded pages 
> and they'd always be there.  
>  
> Regards
> Somik
>

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-29 00:11:42

Hi Claude,

> If I may... Watch out for feature creep! Adding AI sounds like a panacea
but it's not. I have considerable experience with AI myself and it rarely
meets expectations unless you can clearly define the objectives up front. If
you plan to train classifers, you'll need to collect a fairly large,
relevant training set. Even then, this may not solve as many problems as it
may cause and you need to make a good decision based on fact.
>
> Some of your messages suggest you are easily taken by wiz-bang
technologies and cool new features (which I suffer from myself and need to
balance). You need to consider these features especially carefully and
realize that this bias can be detrimental under some circumstances.
Certainly, the project needs to be fun, but don't get too caught up in
feature envy. Make sure you keep the coupling loose through interfaces,
especially if you add AI components, and please allow users to decide which
features they want by keeping the architecture as pluggable as possible.
This will benefit both your user and development communities.

I agree with your opinion - I'm keeping the AI stuff a low priority for now,
as I mentioned - focussing only on refactoring for the moment. However, I am
hoping that we can do some evaluation, and decide if we should go for it in
v1.4. The real problem is, we keep finding real-life dirty html that just
does not follow any rules. Just sometime back, I found a bug in
HTMLStringNode..

If we have html like :

<script language="javascript">
var lower = '<%=lowerValue%>';
</script>

the StringNode does not realize that the jsp tag within quotes is not to be
handled as a tag but a string node. I found this problem while refactoring
today. Upon modifying the parsing automata and including a
PARSE_IGNORE_STATE for HTMLStringNode, I was able to handle this case. But
thanks to the tests, I found that we had another failing test, wherein we
had something like :

<A href="somelink.html">Kaarle's homepage</A>

As you can see, the single apostrophe caused the damage. Now, I had to code
in extra logic to also verify that the next character must be an opening tag
for the string node automaton to move into ignoring state.

Good news is that all tests are passing. But I am not really sure that this
philosophy of modifying the parsing automata logic is all that great. If
cases such as this can be coded outside the parser - perhaps with regular
expressions, we will no longer need to modify the code of the parser each
time. However, there'd probably be a performance hit as compared to the
system that we have now, and we've got to look at it really carefully before
we go for it.

We've been going vertical all this while, so I am just trying to go lateral
to see where that might take us - just to generate more possibilities. I'm
thinking that we could try a seperate implementation - writing the parsing
logic from scratch, and using the existing test mechanism to verify. This
could be a seperate module in the htmlparser project, for experimentation
only - to evaluate feasibility.

Regards,
Somik

[Htmlparser-developer] Result of first round of refactoring

From: Somik R. <so...@ya...> - 2002-12-28 23:55:11

Hi Folks,
    I'm done removing all duplication of toHTML(). Now, most tags either =
inherit from HTMLCompositeTag or use the standard toHTML() in HTMLTag. =
This was only the first step. But here are some interesting results :

[1] Code size of the package org.htmlparser.tags reduced from 1218 to =
1082 lines (not counting tags and empty lines). A new class has been =
added - org.htmlparser.tests.codeMetrics.LineCounter. It can recurse =
through java files in a given directory and get the total count.

[2] Due to the refactoring, I uncovered a bug in HTMLStringNode, in the =
handling of single quotes.

[3] There were a lot of complaints earlier (if I remember right, mostly =
from Dhaval) about toHTML() putting in its own line breaks. That =
shouldn't happen anymore. Any toHTML() bugs will be much easier to fix =
as the code is in one place, and is uniform.

[4] We have the added benefit of getting textarea contents in =
toPlainTextString() automatically, as it now inherits from a composite =
tag.

    We should be able to have a release tomorrow. I am currently waiting =
for Derrick to fix the two tests that are failing. Derrick --> if you =
dont find the time this week, I can comment out the two failing methods =
in this week's version.

Regards,
Somik

RE: [Htmlparser-developer] toPlainTextString() feedback requested

From: Claude D. <CD...@ar...> - 2002-12-28 23:47:45

SWYgSSBtYXkuLi4gV2F0Y2ggb3V0IGZvciBmZWF0dXJlIGNyZWVwISBBZGRpbmcgQUkgc291bmRz
IGxpa2UgYSBwYW5hY2VhIGJ1dCBpdCdzIG5vdC4gSSBoYXZlIGNvbnNpZGVyYWJsZSBleHBlcmll
bmNlIHdpdGggQUkgbXlzZWxmIGFuZCBpdCByYXJlbHkgbWVldHMgZXhwZWN0YXRpb25zIHVubGVz
cyB5b3UgY2FuIGNsZWFybHkgZGVmaW5lIHRoZSBvYmplY3RpdmVzIHVwIGZyb250LiBJZiB5b3Ug
cGxhbiB0byB0cmFpbiBjbGFzc2lmZXJzLCB5b3UnbGwgbmVlZCB0byBjb2xsZWN0IGEgZmFpcmx5
IGxhcmdlLCByZWxldmFudCB0cmFpbmluZyBzZXQuIEV2ZW4gdGhlbiwgdGhpcyBtYXkgbm90IHNv
bHZlIGFzIG1hbnkgcHJvYmxlbXMgYXMgaXQgbWF5IGNhdXNlIGFuZCB5b3UgbmVlZCB0byBtYWtl
IGEgZ29vZCBkZWNpc2lvbiBiYXNlZCBvbiBmYWN0LiANCiANClNvbWUgb2YgeW91ciBtZXNzYWdl
cyBzdWdnZXN0IHlvdSBhcmUgZWFzaWx5IHRha2VuIGJ5IHdpei1iYW5nIHRlY2hub2xvZ2llcyBh
bmQgY29vbCBuZXcgZmVhdHVyZXMgKHdoaWNoIEkgc3VmZmVyIGZyb20gbXlzZWxmIGFuZCBuZWVk
IHRvIGJhbGFuY2UpLiBZb3UgbmVlZCB0byBjb25zaWRlciB0aGVzZSBmZWF0dXJlcyBlc3BlY2lh
bGx5IGNhcmVmdWxseSBhbmQgcmVhbGl6ZSB0aGF0IHRoaXMgYmlhcyBjYW4gYmUgZGV0cmltZW50
YWwgdW5kZXIgc29tZSBjaXJjdW1zdGFuY2VzLiBDZXJ0YWlubHksIHRoZSBwcm9qZWN0IG5lZWRz
IHRvIGJlIGZ1biwgYnV0IGRvbid0IGdldCB0b28gY2F1Z2h0IHVwIGluIGZlYXR1cmUgZW52eS4g
TWFrZSBzdXJlIHlvdSBrZWVwIHRoZSBjb3VwbGluZyBsb29zZSB0aHJvdWdoIGludGVyZmFjZXMs
IGVzcGVjaWFsbHkgaWYgeW91IGFkZCBBSSBjb21wb25lbnRzLCBhbmQgcGxlYXNlIGFsbG93IHVz
ZXJzIHRvIGRlY2lkZSB3aGljaCBmZWF0dXJlcyB0aGV5IHdhbnQgYnkga2VlcGluZyB0aGUgYXJj
aGl0ZWN0dXJlIGFzIHBsdWdnYWJsZSBhcyBwb3NzaWJsZS4gVGhpcyB3aWxsIGJlbmVmaXQgYm90
aCB5b3VyIHVzZXIgYW5kIGRldmVsb3BtZW50IGNvbW11bml0aWVzLg0KIA0KLS0tLS1PcmlnaW5h
bCBNZXNzYWdlLS0tLS0gDQpGcm9tOiBTb21payBSYWhhIFttYWlsdG86c29taWtAeWFob28uY29t
XSANClNlbnQ6IFNhdCAxMi8yOC8yMDAyIDEyOjIwIEFNIA0KVG86IGh0bWxwYXJzZXItZGV2ZWxv
cGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiANClN1YmplY3Q6IFJlOiBbSHRtbHBhcnNl
ci1kZXZlbG9wZXJdIHRvUGxhaW5UZXh0U3RyaW5nKCkgZmVlZGJhY2sgcmVxdWVzdGVkDQoNCg0K
DQoJSGkgQ2xhdWRlLA0KCT5JIHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mg
c2l6ZSBpcyBoaXR0aW5nIGNyaXRpY2FsIG1hc3MgYW5kDQoJdGhhdCBtaW5pbWFsIGZvcm1hbGl0
aWVzIGFyZSBub3cgcmVxdWlyZWQuLi4NCgkNCglHb29kIHRvIGhlYXIgZnJvbSB5b3UuIEkgcXVp
dGUgYWdyZWUgdGhhdCB0aGlzIHByb2plY3Qgc2VlbXMgdG8gYmUgcmVhY2hpbmcNCgljcml0aWNh
bCBtYXNzLiBXZSBuZWVkIHRvIHNoYXJlIGEgcGxhbiBvZiBhY3Rpb24gdGhhdCB3ZSBjb3VsZCBh
bGwgd29yayBvbg0KCXBhcmFsbGVsbHkuDQoJDQoJRmlyc3QsIEkndmUgYmVlbiBub3RpY2luZyB0
aGF0IHdlJ3JlIG5vdCBjb25zaXN0ZW50IG9uIHRoZSBjb2RpbmcNCgljb252ZW50aW9uLiBJbiB0
aGUgcGFzdCwgaWYgSSBzZWUgYSBjbGFzcyB0aGF0IGRvZXMgbm90IGZvbGxvdyB0aGUNCgljb252
ZW50aW9uLCBJJ2QgZ28gYWhlYWQgYW5kIGZpeCBpdC4gSG93ZXZlciwgYXMgc28gbWFueSBwZW9w
bGUgYXJlIG5vdw0KCXdvcmtpbmcgcGFyYWxsZWxseSwgaXRzIGltcG9ydGFudCB0aGF0IHdlIGFs
bCBzaGFyZSB0aGUgc2FtZSBjb2RpbmcNCgljb252ZW50aW9uLiBUaGUgb25lIEkgZm9sbG93IGlz
IHRoZSBzdGFuZGFyZCBKYXZhIGNvZGluZyBjb252ZW50aW9uLiBJbg0KCWFkZGl0aW9uLCBJIHBh
cnRpY3VsYXJseSBkb24ndCBsaWtlIHVzaW5nIG5hbWVzIHdpdGggc2luZ2xlIGNoYXJhY3RlcnMg
aW4NCgl0aGVtIC0gaS5lLiBJJ2QgcHJlZmVyICJ0YWciIHRvICJwVGFnIi4gVGhhdCBpcyB0aGUg
Y29udmVudGlvbiBmb2xsb3dlZCBpbg0KCW1vc3Qgb2YgdGhlIHBhcnNlci4gVGhlIGNvZGluZyBj
b252ZW50aW9uIGlzIG9mIGNvdXJzZSBvcGVuIGZvciBkaXNjdXNzaW9uDQoJaWYgYW55b25lIGZl
ZWxzIHRoYXQgd2Ugc2hvdWxkIGFjY29tb2RhdGUgYW55IHBhcnRpY3VsYXIgcHJhY3RpY2UuDQoJ
DQoJU2Vjb25kLCBJIGFtIHBsYW5uaW5nIHRvIHNpbXBsaWZ5IHRoZSBkZXNpZ24gb2YgdGhlIHNj
YW5uZXJzLiBBcyBEaGF2YWwNCgltZW50aW9uZWQgZWFybGllciwgYW5kIGFzIFNhbSBtZW50aW9u
ZWQsIGl0IHdhc24ndCBlYXN5IHRvIHdyaXRlIGEgbmV3DQoJc2Nhbm5lci4gSSBhZ3JlZSB0aGF0
IHRoZSBqYXZhZG9jIG9mIGdldFBhcnNlZCgpIHNob3VsZCBiZSBiZXR0ZXIgdGhhbiB3aGF0DQoJ
aXQgaXMuIEhvd2V2ZXIsIEkgdGhpbmsgdGhlIHByb2JsZW0gbGllcyB3aXRoIHRoZSBuYW1lIGl0
c2VsZi4gSW4gdGhlIHNlbnNlLA0KCWlmIHdlIHJlbmFtZSAicGFyc2VkIiB0byAicGFyYW1ldGVy
VGFibGUiIG9yICJhdHRyaWJ1dGVUYWJsZSIgLSBhbmQNCglnZXRQYXJzZWQoKSBiZWNvbWVzLCBn
ZXRBdHRyaWJ1dGVUYWJsZSgpIC0gdGhlIGphdmFkb2Mgd291bGQgYmVjb21lDQoJcmVkdW5kYW50
LiBZb3UgY291bGQgc2F5IHRoYXQgdGhpcyBpcyBhICJkb2N1bWVudGF0aW9uIiBjaGFuZ2UsIG9y
IHlvdSBjb3VsZA0KCWNhbGwgdGhpcyBhICJyZWZhY3RvcmluZyIuIEVpdGhlciB3YXksIGl0IGlz
IGRldmVsb3BlciBhbmQgdXNlci1mcmllbmRseS4gU28NCglJIHRoaW5rIFNhbSdzIHN1Z2dlc3Rp
b25zIGFuZCBvdXIgY3VycmVudCBkaXJlY3Rpb24gYXJlIG9uZSBhbmQgdGhlIHNhbWUuDQoJDQoJ
VGhpcmQsIHRoZXJlJ3Mgd2F5IHRvbyBtdWNoIGR1cGxpY2F0ZSBjb2RlIGluIHRoZSB0YWdzLiBB
bG1vc3QgZXZlcnkgdGFnDQoJcnVucyB0aHJ1IGl0cyBhdHRyaWJ1dGVzIHRvIHJlbmRlciBpdCBh
cyBodG1sIC0gd2hlcmVhcyB0aGlzIGNhbiBiZSBkb25lDQoJZnJvbSB0aGUgc3VwZXJjbGFzcyAt
IEhUTUxUYWcuIFRoaXMgaXMganVzdCBvbmUgZXhhbXBsZS4gSSd2ZSBub3QgYmVlbg0KCWZvY3Vz
c2luZyBtdWNoIG9uIHJlZHVjaW5nIGR1cGxpY2F0aW9uIC0gYW5kIHdlIGNhbiBzZWUgc2VyaW91
cyBjb2RlIHNtZWxscy4NCglIb3dldmVyLCBJIHRoaW5rIGl0IGlzIGltcG9ydGFudCB0byBiZSBh
YmxlIHRvIHF1YW50aWZ5IHdoYXQgd2UgaG9wZSB0bw0KCWFjaGlldmUgYnkgdW5kZXJ0YWtpbmcg
c2V2ZXJhbCByb3VuZHMgb2YgcmVmYWN0b3JpbmcuIEFzIHdlIGtlZXANCglyZWZhY3RvcmluZywg
SSB3aWxsIGtlZXAgYSB0YWIgb24gdGhlIHNpemUgb2YgdGhlIHBhcnNlciAodXNpbmcgTGluZXMg
b2YNCglDb2RlKS4gSWYgdGhlcmUncyBhbnkgb3RoZXIgbWV0cmljIHRoYXQgcGVvcGxlIHRoaW5r
IGlzIGltcG9ydGFudCwgd2UgY291bGQNCglpbmNsdWRlIHRoYXQgYXMgd2VsbC4gSXQgd2lsbCBi
ZSBpbnRlcmVzdGluZyB0byBzZWUgaG93IG11Y2ggY29kZSB3ZSBjYW4NCgl0YWtlIG91dC4NCgkN
CglGb3VydGgsIHRoaXMgaXMgb25lIHByb2plY3Qgd2hlcmUgY2hhbmdlIGlzIHdlbGNvbWUgLSBh
bGwgdGhlIHRpbWUuIFNpbXBseQ0KCWJjb3MgaXQgaGFzIGEgbWFzc2l2ZSBzdWl0ZSBvZiB0ZXN0
cyB3aGljaCByZXByZXNlbnRzIG1vc3Qgb2YgdGhlIHVzZXINCglyZXF1aXJlbWVudHMgdGhhdCB3
ZSd2ZSByZWNlaXZlZCB0aWxsIGRhdGUuIFRoZSByZWFsLXZhbHVlIGluIHRoZSBwYXJzZXIgaXMN
Cglub3QgcmVhbGx5IGl0cyBwcm9kdWN0aW9uIGNvZGUgLSBidXQgaXRzIHNvbGlkIHRlc3Qgc3Vp
dGUuIElmIHRoZXJlIGFyZQ0KCWludGVyZXN0aW5nIGRlc2lnbiBjaGFuZ2VzIHRoYXQgd2lsbCBt
YWtlIHRoZSBjdXJyZW50IHN5c3RlbSBzaW1wbGVyLCB0aGV5DQoJc2hvdWxkIGJlIHRyaWVkIG91
dCB3aXRob3V0IGZlYXIgLSBmb3IgaWYgd2UgYnJlYWsgYW55dGhpbmcsIHdlIHdpbGwga25vdw0K
CWJjb3Mgb2YgdGhlIGF1dG9tYXRlZCB0ZXN0aW5nLiBJdCBpcyBpbXBvcnRhbnQgdGhhdCB0aGUg
dGVzdGluZyBzdWl0ZSBpcw0KCXRoZXJlZm9yZSByZWdhcmRlZCBqdXN0IGFzIGltcG9ydGFudCAo
aWYgbm90IG1vcmUpIGFzIHRoZSBwcm9kdWN0aW9uIGNvZGUsDQoJYW5kIHN0ZXBzIHRha2VuIHRv
IGltcHJvdmUgaXRzIGRlc2lnbiwgYW5kIG1ha2UgaXQgdmVyeSBlYXN5IHRvIGFkZCBuZXcNCgl0
ZXN0cyBhbGwgdGhlIHRpbWUuDQoJDQoJRmlmdGgsIGFzIHRoZSBwcm9qZWN0IGdyb3dzIC0gbmV3
IGZlYXR1cmVzIHNob3VsZCBrZWVwIGNvbWluZyBpbiBhbGwgdGhlDQoJdGltZS4gRGVycmljayBP
c3dhbGQgaGFzIGJlZW4gd29ya2luZyBhd2F5IGF0IHNvIG1hbnkgY29vbCBuZXcgZmVhdHVyZXMg
LQ0KCWFuZCBoZSdzIGJlZW4gd3JpdGluZyB0ZXN0cyBmb3IgdGhlbSBhbGwgLSB0aGlzIGlzIGEg
bXVjaC1hcHByZWNpYXRlZA0KCWFjdGl2aXR5IGFuZCBtdXN0IGNvbnRpbnVlLiBJJ2QgYmUgaGFw
cHkgdG8gbm90IHRoaW5rIGFib3V0IG5ldyBmZWF0dXJlcyBhbmQNCgljb21wbGV0ZWx5IGZvY3Vz
IG9uIHNvbGlkaWZ5aW5nIHRoZSBleGlzdGluZyBzeXN0ZW0sIGFuZCBsZXQgRGVycmljayBmb2N1
cw0KCW9uIGFkZGluZyBuZXcgc3R1ZmYgLSBmb3IgdjEuMy4NCgkNCglTaXh0aCwgd2UgaGF2ZSBh
IHZlcnkgc2VyaW91cyBwb3NzaWJpbGl0eSBvZiB1c2luZyBBSSBpbiBtYWtpbmcgdGFnDQoJcmVj
b2duaXRpb25zLiBJIGhhdmUgdG8gZmluZCB0aGUgdGltZSB0byB3cml0ZSBhYm91dCB0aGUgY3Vy
cmVudCByZWNvZ25pdGlvbg0KCW1lY2hhbmlzbSwgYW5kIHBhc3MgdGhhdCBvbiB0byBTYW0gSm9z
ZXBoLiBTYW0gaGFzIGNvbnNpZGVyYWJsZSBleHBlcmllbmNlDQoJaW4gYXJ0aWZpY2lhbCBpbnRl
bGxpZ2VuY2UsIGFuZCBpdCB3aWxsIGJlIGdvb2QgZm9yIHRoZSBwcm9qZWN0IHRvIGhhdmUgaGlz
DQoJZXhwZXJ0aXNlIGluIGl0cyBjb3JlIGNvcnJlY3Rpb24gbG9naWMuDQoJDQoJU2V2ZW50aCwg
aXQgbWlnaHQgYmUgZ29vZCB0byBoYXZlIGEgV2lraSBmb3IgdGhlIHBhcnNlciAtIGFzIHNvIG1h
bnkgb3Blbg0KCXNvdXJjZSBwcm9qZWN0cyBkby4gVGhhdCB3YXksIHRoZSBlbnRpcmUgYnVyZGVu
IG9mIGRvY3VtZW50YXRpb24gaXMgbm90IG9uDQoJYW55IG9uZSBwZXJzb24uIFdlIHNob3VsZCBi
ZSBsb29raW5nIGF0IG9wdGlvbnMgb2YgaGF2aW5nIG91ciBvd24gV2lraSBzbyB3ZQ0KCWNhbiBh
ZGQgY29udGVudCBlYXNpbHkgYW5kIGNvbGxhYm9yYXRpdmVseS4NCgkNCglQbHMgZmVlbCBmcmVl
IHRvIGFkZC9tb2RpZnkgdGhpcyBsaXN0IGlmIEkndmUgbWlzc2VkIG91dCBvbiBhbnl0aGluZy4N
CgkNCglSZWdhcmRzLA0KCVNvbWlrDQoJDQoJDQoJDQoJLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRoaXMgc2YubmV0IGVtYWlsIGlzIHNw
b25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdlZWsgaGVhdmVuLg0KCWh0dHA6Ly90
aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f
X19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWlsaW5nIGxpc3QNCglIdG1scGFyc2Vy
LWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglodHRwczovL2xpc3RzLnNvdXJjZWZv
cmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRldmVsb3Blcg0KCQ0KDQo=

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 10:23:17

Hi Somik

Somik Raha wrote:

>>If we want to call
>>changing method names and adding documentation, refactoring then I guess
>>I am a huge fan of refactoring :-)
>>    
>>
>
>Changing method names is definitely refactoring. The idea is to use really
>good names that make the javadoc redundant. If you can explain what javadoc
>would achieve for a method like getAttributes() - I'd be happy to add
>javadoc for that.
>
I don't think good method names make javadoc redundant, and I don't 
think they can.  For example Hashtable getAttributes()  tells me very 
little. What is an attribute? What are the classes in the hashtable? 
Give me three examples of what an attribute is and then you might have a 
good javadoc, e.g.

[Gets a hashtable of attribute-value pairs from the html tag, e.g. 
type-->hidden, name-->address, value-->default would come from the tag 
<input type="hidden" name="address" value="default">.  Attributes and 
values are all String classes. ]

Now I think that the method getAttributes with the above javadoc would 
be much easier to use and understand for somebody trying to use 
htmlparser for the first or even the third time, no?  I think you need 
to be careful about what is obvious for you, who have designed the 
system, and what is obvious for a novice user.  

>>However the other part of what I was
>>saying is that I would like to see the documentation additions before we
>>change method names.  I mean you can deprecate away so as not to break
>>my existing code ... I still think you should release a version with a
>>full javadoc "refactoring" before you release anything with refactored
>>code.  What's the harm in focusing on the documentation first?
>>    
>>
>
>Simply bcos once the name is changed to a better one, docs are no longer
>necessary. 
>
See my point above.  I think you make more than a name change to make 
the api novice-user friendly.  And my comments about the need for more 
detailed javadocs go right through the whole project, not just the few 
examples I've given.  

>> I mean we're on version 1.2 now right?  What about a version 1.2.1 that
>>included the documentation fixes, before moving on to a 1.3-beta that
>>included the newly refactored code that you're so keen to start work on.
>> What would be the downside of proceeding in this fashion?
>>    
>>
>
>Duplicate work. Maintaining parallel branches of code is the last thing we
>need to do, considering limited resources. I don't see a problem in going
>straight with 1.3-beta as we've got all our tests in place, and we'll also
>use deprecated methods. That should work for you, right ?
>
Sure, having deprecated methods works for me, but it's not just about 
me.  I mean you've already released 1.2  People are already using it, 
building code around it.  You have to maintain it anyway, or rather 
users will continue asking you questions about it.  I'm suggesting that 
a 1.2.1 improved documentation release would mean less work in having to 
maintain things because the users could work from the javadocs.  I would 
also think that going through and writing thorough javadocs for your 
existing version would put you in a much better position to write a good 
1.3-beta.  Limited resources is an important concern, and I think you'll 
conserve resources by more thorough documentation first and re-writing 
your code second.  

As long as you have deprecated methods then my own code base will be 
fine, so it makes little difference to me in the short term which way 
you go.  However I think in the long term, a 1.2.1 documentation release 
would improve the quality of the project overall, and set a good 
precedent for high quality javadocs in subsequent releases.  I get the 
sense that you are really keen to plow into re-writing all the code, 
what's the hurry?

>>This sounds like an interesting possibility, but I still need to
>>understand how the current parser handles all the existing messy tags.
>>I mean when you started talking through all the messy html examples I
>>though you were going to be showing me something that didn't work and
>>that you wanted some machine learning/AI to fix, but you ended up saying
>>that all the examples could be handled by the exisiting parser.  If that
>>is so, why do you want to change it, and are there examples of messy
>>code that can't be handled by the parser?  If there aren't and its all
>>working fine, what advantage do we gain by replacing the existing core
>>correction logic with some hypothetical AI alternative?
>>    
>>
>
>The main problem is that the current logic is complicated, and uses
>hard-coded logic. Such a system is hard to maintain, change. I was thinking
>on the lines of a rule-based system, where the rule-base could be dynamic.
>But, I will write a seperate mail on this issue as soon as I get some time.
>
Ah, I think I understand.  What you want is some framework for dealing 
with new types of mess as they come up.  A system that can have more 
rules added to without too much work, or even a framework for trying to 
guess which rules to apply in which document ...

>>>Seventh, it might be good to have a Wiki for the parser - as so many open
>>>source projects do. That way, the entire burden of documentation is not
>>>      
>>>
>on
>  
>
>>>any one person. We should be looking at options of having our own Wiki so
>>>      
>>>
>we
>  
>
>>>can add content easily and collaboratively.
>>>
>>>      
>>>
>>You are most welcome to use the wiki I have set up in the short term:
>>
>>http://www.neurogrid.net/devwiki/wikipages/HtmlParser
>>
>>It uses an open source java wiki (devwiki), which is not as fully
>>functional as it might be, but I've got it set up so everything gets
>>backed up to CVS and all changes get sent to the neurogrid-cvs mailing
>>list.  If nothing else it might serve as an example wiki environment.
>>    
>>
>
>Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with
>current website) as in logos, page color, etc..
>
I guess so.  I think that could be arranged.  Send me a copy of the logo 
etc., and I'll see what I can do, although it might take a couple of 
weeks for me to find time to do it.

CHEERS> SAM

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-28 09:39:39

Hi Sam,

> If we want to call
> changing method names and adding documentation, refactoring then I guess
> I am a huge fan of refactoring :-)

Changing method names is definitely refactoring. The idea is to use really
good names that make the javadoc redundant. If you can explain what javadoc
would achieve for a method like getAttributes() - I'd be happy to add
javadoc for that.

> However the other part of what I was
> saying is that I would like to see the documentation additions before we
> change method names.  I mean you can deprecate away so as not to break
> my existing code ... I still think you should release a version with a
> full javadoc "refactoring" before you release anything with refactored
> code.  What's the harm in focusing on the documentation first?

Simply bcos once the name is changed to a better one, docs are no longer
necessary. But I understand your predicament. We could have a deprecated
method name for important methods.

>  I mean we're on version 1.2 now right?  What about a version 1.2.1 that
> included the documentation fixes, before moving on to a 1.3-beta that
> included the newly refactored code that you're so keen to start work on.
>  What would be the downside of proceeding in this fashion?

Duplicate work. Maintaining parallel branches of code is the last thing we
need to do, considering limited resources. I don't see a problem in going
straight with 1.3-beta as we've got all our tests in place, and we'll also
use deprecated methods. That should work for you, right ?

> >Sixth, we have a very serious possibility of using AI in making tag
> >recognitions. I have to find the time to write about the current
recognition
> >mechanism, and pass that on to Sam Joseph. Sam has considerable
experience
> >in artificial intelligence, and it will be good for the project to have
his
> >expertise in its core correction logic.
> >
> This sounds like an interesting possibility, but I still need to
> understand how the current parser handles all the existing messy tags.
> I mean when you started talking through all the messy html examples I
> though you were going to be showing me something that didn't work and
> that you wanted some machine learning/AI to fix, but you ended up saying
> that all the examples could be handled by the exisiting parser.  If that
> is so, why do you want to change it, and are there examples of messy
> code that can't be handled by the parser?  If there aren't and its all
> working fine, what advantage do we gain by replacing the existing core
> correction logic with some hypothetical AI alternative?

The main problem is that the current logic is complicated, and uses
hard-coded logic. Such a system is hard to maintain, change. I was thinking
on the lines of a rule-based system, where the rule-base could be dynamic.
But, I will write a seperate mail on this issue as soon as I get some time.

> >Seventh, it might be good to have a Wiki for the parser - as so many open
> >source projects do. That way, the entire burden of documentation is not
on
> >any one person. We should be looking at options of having our own Wiki so
we
> >can add content easily and collaboratively.
> >
> You are most welcome to use the wiki I have set up in the short term:
>
> http://www.neurogrid.net/devwiki/wikipages/HtmlParser
>
> It uses an open source java wiki (devwiki), which is not as fully
> functional as it might be, but I've got it set up so everything gets
> backed up to CVS and all changes get sent to the neurogrid-cvs mailing
> list.  If nothing else it might serve as an example wiki environment.

Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with
current website) as in logos, page color, etc..

Cheers,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 09:25:17

Hi Somik

Somik Raha wrote:

>Second, I am planning to simplify the design of the scanners. As Dhaval
>mentioned earlier, and as Sam mentioned, it wasn't easy to write a new
>scanner. I agree that the javadoc of getParsed() should be better than what
>it is. However, I think the problem lies with the name itself. In the sense,
>if we rename "parsed" to "parameterTable" or "attributeTable" - and
>getParsed() becomes, getAttributeTable() - the javadoc would become
>redundant. You could say that this is a "documentation" change, or you could
>call this a "refactoring". Either way, it is developer and user-friendly. So
>I think Sam's suggestions and our current direction are one and the same.
>  
>
I think that just changing the name of the method is not quite enough.   
Certainly changing the name of the method is a good step, but I think 
javadoc comments that actually show an example of what is being returned 
make a huge usability difference, particularly when the thing being 
returned is something as complex as a hashtable.  If we want to call 
changing method names and adding documentation, refactoring then I guess 
I am a huge fan of refactoring :-)  However the other part of what I was 
saying is that I would like to see the documentation additions before we 
change method names.  I mean you can deprecate away so as not to break 
my existing code ... I still think you should release a version with a 
full javadoc "refactoring" before you release anything with refactored 
code.  What's the harm in focusing on the documentation first?

 I mean we're on version 1.2 now right?  What about a version 1.2.1 that 
included the documentation fixes, before moving on to a 1.3-beta that 
included the newly refactored code that you're so keen to start work on. 
 What would be the downside of proceeding in this fashion?

>Sixth, we have a very serious possibility of using AI in making tag
>recognitions. I have to find the time to write about the current recognition
>mechanism, and pass that on to Sam Joseph. Sam has considerable experience
>in artificial intelligence, and it will be good for the project to have his
>expertise in its core correction logic.
>
This sounds like an interesting possibility, but I still need to 
understand how the current parser handles all the existing messy tags.   
I mean when you started talking through all the messy html examples I 
though you were going to be showing me something that didn't work and 
that you wanted some machine learning/AI to fix, but you ended up saying 
that all the examples could be handled by the exisiting parser.  If that 
is so, why do you want to change it, and are there examples of messy 
code that can't be handled by the parser?  If there aren't and its all 
working fine, what advantage do we gain by replacing the existing core 
correction logic with some hypothetical AI alternative?

>Seventh, it might be good to have a Wiki for the parser - as so many open
>source projects do. That way, the entire burden of documentation is not on
>any one person. We should be looking at options of having our own Wiki so we
>can add content easily and collaboratively.
>
You are most welcome to use the wiki I have set up in the short term:

http://www.neurogrid.net/devwiki/wikipages/HtmlParser

It uses an open source java wiki (devwiki), which is not as fully 
functional as it might be, but I've got it set up so everything gets 
backed up to CVS and all changes get sent to the neurogrid-cvs mailing 
list.  If nothing else it might serve as an example wiki environment.

CHEERS> SAM

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-28 08:19:10

Hi Claude,
>I think it may well be that the group's size is hitting critical mass and
that minimal formalities are now required...

Good to hear from you. I quite agree that this project seems to be reaching
critical mass. We need to share a plan of action that we could all work on
parallelly.

First, I've been noticing that we're not consistent on the coding
convention. In the past, if I see a class that does not follow the
convention, I'd go ahead and fix it. However, as so many people are now
working parallelly, its important that we all share the same coding
convention. The one I follow is the standard Java coding convention. In
addition, I particularly don't like using names with single characters in
them - i.e. I'd prefer "tag" to "pTag". That is the convention followed in
most of the parser. The coding convention is of course open for discussion
if anyone feels that we should accomodate any particular practice.

Second, I am planning to simplify the design of the scanners. As Dhaval
mentioned earlier, and as Sam mentioned, it wasn't easy to write a new
scanner. I agree that the javadoc of getParsed() should be better than what
it is. However, I think the problem lies with the name itself. In the sense,
if we rename "parsed" to "parameterTable" or "attributeTable" - and
getParsed() becomes, getAttributeTable() - the javadoc would become
redundant. You could say that this is a "documentation" change, or you could
call this a "refactoring". Either way, it is developer and user-friendly. So
I think Sam's suggestions and our current direction are one and the same.

Third, there's way too much duplicate code in the tags. Almost every tag
runs thru its attributes to render it as html - whereas this can be done
from the superclass - HTMLTag. This is just one example. I've not been
focussing much on reducing duplication - and we can see serious code smells.
However, I think it is important to be able to quantify what we hope to
achieve by undertaking several rounds of refactoring. As we keep
refactoring, I will keep a tab on the size of the parser (using Lines of
Code). If there's any other metric that people think is important, we could
include that as well. It will be interesting to see how much code we can
take out.

Fourth, this is one project where change is welcome - all the time. Simply
bcos it has a massive suite of tests which represents most of the user
requirements that we've received till date. The real-value in the parser is
not really its production code - but its solid test suite. If there are
interesting design changes that will make the current system simpler, they
should be tried out without fear - for if we break anything, we will know
bcos of the automated testing. It is important that the testing suite is
therefore regarded just as important (if not more) as the production code,
and steps taken to improve its design, and make it very easy to add new
tests all the time.

Fifth, as the project grows - new features should keep coming in all the
time. Derrick Oswald has been working away at so many cool new features -
and he's been writing tests for them all - this is a much-appreciated
activity and must continue. I'd be happy to not think about new features and
completely focus on solidifying the existing system, and let Derrick focus
on adding new stuff - for v1.3.

Sixth, we have a very serious possibility of using AI in making tag
recognitions. I have to find the time to write about the current recognition
mechanism, and pass that on to Sam Joseph. Sam has considerable experience
in artificial intelligence, and it will be good for the project to have his
expertise in its core correction logic.

Seventh, it might be good to have a Wiki for the parser - as so many open
source projects do. That way, the entire burden of documentation is not on
any one person. We should be looking at options of having our own Wiki so we
can add content easily and collaboratively.

Pls feel free to add/modify this list if I've missed out on anything.

Regards,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 04:23:04

Hi,

I realise that unless I get specific on the issue of documentation I am 
unlikely to be taken seriously

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/HTMLTag.html

is from the java docs  - here are some javadoc comments

Hashtable getParsed()
          "Gets the parsed"
static java.util.Vector getStrictTags()
          "Gets the strictTags"
 java.lang.String getTagLine()
          "Returns the tagLine"
 java.lang.String getTagName()
            ""

Personally I would like to see a good couple of lines of text for each 
of these.  Some are more important than others.  I mean getTagName might 
seem self explanatory, but I would like to see more, e.g.

Hashtable getParsed()
          "Gets a hashtable of attribute-value pairs from the html tag, 
e.g. type-->hidden, name-->address, value-->default"
static java.util.Vector getStrictTags()
          "<not sure what strict tags are, would be nice if there was a 
description>"
 java.lang.String getTagLine()
          "Returns the line from the document that this tag was found on?"
 java.lang.String getTagName()
            "get the name of the tag itself, e.g. BODY, TR, TABLE etc."

Personally I found the lack of javadocs on getParsed() particularly 
frustrating.  I had to go into the code in some depth to find out what 
it was, and that it was the thing I needed.  Naturally the beauty of 
open source is that I can actually do that.  But I would have preferred 
to have discovered the information I needed from the javadocs.  And I 
would prefer as a *user* that you would expend your efforts updating the 
documentation on the existing code base rather than refactoring.  

CHEERS> SAM

Sam Joseph wrote:

> Somik replied to my issues about documentation saying that he had been 
> working on it and asking me to point out other places where the 
> documentation could be improved.  I don't have time right now to go 
> through and make those points,

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 04:02:55

Joshua Kerievsky wrote:

>Sam wrote:
>  
>
>>As I mention above.  One should be careful of duplication removal for
>>the sake of it.
>>    
>>
>
>And as I mentioned above, I completely disagree with you.  I wonder, how
>much refactoring have you done in your career given you philosophy of "if it
>ain't broke, don't fix it?"   Have you read Martin Fowler's landmark book,
>Refactoring?   If not, I'd suggest you study it thoroughly - you'll be a
>better programmer for it.
>
I must say that I have yet to be convinced of the value of continous 
refactoring.  I think you can have too much of a good thing.  I refactor 
a lot in my code.  I am constantly looking at how to design things more 
efficiently, how to implement a number of different pieces of code in a 
smaller, tidier way, and also I am well aware that too much refactoring 
can be dangerous.  In your haste to refactor everything, consider that 
this is only the latest in a long line of computing fads, and none of 
them is a cure-all.  Refactoring is just another technique, like OO, or 
unit testing.  It will only help you if use it in moderation.  Once you 
internalise this, you'll be a better programmer. Trust me.

In my 15 years of coding experience I have seen projects collapse both 
from too much refactoring and too little.  To believe that you will 
improve your project purely as a consequence of refactoring is just 
plain wrong in my experience.  I am not saying that you shouldn't 
refactor in this case, I'm just saying that I think the effort spent on 
refactoring would be better spent on first improving the documentation. 
 Somik replied to my issues about documentation saying that he had been 
working on it and asking me to point out other places where the 
documentation could be improved.  I don't have time right now to go 
through and make those points, but as a *user* of htmlparser I would 
appreciate it if you would leave off any refactoring until you have gone 
through the entire code base and expanded every javadoc comment so that 
it really described what the method was doing in a manner understandable 
to someone unfamiliar with the code base.  I think that if you direct 
your effort here first before refactoring you will have a greater 
increase in users in the long run.

Don't get me wrong, I am not against refactoring, and I use the term "if 
it ain't broke don't fix it" partly in jest.  My main point was about 
"documenting before fixing".  I don't believe that any refactoring that 
you do will replace the need for good documentation, for the code base 
that *is already deployed and in use*.

I am asking you as a *user* of htmlparser to focus on the documentation 
(and release a version including that latest documentation) before 
moving on with your refactoring.  Remember it is dangerous for a 
software project to ignore the needs of its users.

CHEERS> SAM

RE: [Htmlparser-developer] toPlainTextString() feedback requested

From: Claude D. <CD...@ar...> - 2002-12-28 03:35:04

SSB0cmllZCB0byByZWZyYWluIGZyb20ganVtcGluZyBpbiwgYnV0IGhlcmUgYXJlIGEgY291cGxl
IG9mIGNlbnRzIHdvcnRoIG9mIG9waW5pb24sIGp1c3QgZm9yIHRoZSBob2xsaWRheXMgOy0pLg0K
IA0KSW4gbXkgdmlldywgZGlzdGlsbGluZyBzb2Z0d2FyZSBkb3duIHRvIGVzc2VudGlhbHMgaXMg
cGFydCBvZiBpbXByb3ZpbmcgZGVzaWduLiBJdCBjdXRzIGRvd24gb24gY29kZSB0aGF0IHdvdWxk
IGxhdGVyIG5lZWQgdG8gYmUgbWFpbnRhaW5lZC4gUmVkdW5kYW5jeSBpcyBleHBlbnNpdmUuIE1v
cmUgY29kZSBuZWVkcyB0byBjaGFuZ2UgZXZlcnkgdGltZSBhIGJ1ZyBpcyBmaXhlZCwgcG90ZW50
aWFsbHkgbWlzc2luZyBvbmUgb2YgdGhlIHZhcmlhbnRzIGFuZCBpbnRyb2R1Y2luZyBkaWZmaWN1
bHQgdG8gZGlhZ25vc2Ugc2l0dWF0aW9ucy4gTGVzcyBjb2RlIGlzIGJldHRlci4gQXMgZGVzaWdu
cyBpbXByb3ZlLCB0aGUgY29kZSBiYXNlIHdpbGwgdHlwaWNhbGx5IHNocmluay4gUmVmYWN0b3Jp
bmcgYW5kIGFwcGx5aW5nIGVmZmVjdGl2ZSBwYXR0ZXJucyBpcyBrZXkgdG8gbWFraW5nIHRoaXMg
cG9zc2libGUuDQogDQpUaGF0IGJlaW5nIHNhaWQsIEkgdGhpbmsgZ29vZCBzb2Z0d2FyZSBkZXNp
Z24gaGluZ2VzIG9uIHdlbGwgZGVmaW5lZCB1c2UgY2FzZXMuIElmIGNoYW5nZXMgbmVlZCB0byBi
ZSBtYWRlIHRvIG1lZXQgdGhlIHJlcXVpcmVtZW50cywgdGhhdCdzIHdoYXQgbmVlZHMgdG8gaGFw
cGVuLiBSZWZhY3RvcmluZywgZGlzdGlsbGluZywgc2ltcGxpZnlpbmcsIHJlbW92aW5nIGRlYWQg
Y29kZSBhbmQgZG9jdW1lbnRhdGlvbiBhcmUgYWxsIHBhcnQgb2YgZXZlcnkgcmVsZWFzZS4gUmVm
YWN0b3JpbmcgZm9yIHRoZSBzYWtlIG9mIHJlZmFjdG9yaW5nIHNob3VsZCBiZSBhdm9pZGVkLCBh
cyB3ZWxsIGFzIGZvcmdpbmcgYWhlYWQgd2l0aG91dCBhIHdlbGwtY29uc2lkZXJlZCBhbmQgcmV2
aWV3ZWQgcGxhbi4gSWYgdGhlIHBsYW4gc2hvd3MgdGhhdCByZWZhY3RvcmluZyB3aWxsIGltcHJv
dmUgZGVzaWduLCBkZWxpdmVyIGJldHRlciBzb2Z0d2FyZSwgbWVldCByZXF1aXJlbWVudHMgYW5k
IHNhdGlzZnkgdGFyZ2V0IHVzZSBjYXNlcywgdGhlbiBpdCBpcyBwcm9iYWJseSBhIGdvb2QgcGxh
biBhbmQgc2hvdWxkIGJlIGltcGxlbWVudGVkLCBzbyBsb25nIGFzIHRoZXJlJ3MgZW5vdWdoIGFn
cmVlbWVudCB0byB2YWxpZGF0ZSB0aGUgb2JqZWN0aXZlcyBhbmQgc3RyYXRlZ3kuDQogDQpJIHRo
aW5rIG1vc3Qgb2YgdGhpcyBkaXNjdXNzaW9uIGNlbnRlcnMgb24gYSBsYWNrIG9mIGRvY3VtZW50
YXRpb24sIGJ1dCBub3QgdXNlciBkb2N1bWVudGF0aW9uLiBUaGUgZ3JvdXAgaXMgbGFja2luZyBp
biBwcm9jZXNzLiBJZiBhIHBsYW4gaXMgd29ydGggcGVyc3VpbmcsIGl0J3Mgd29ydGggd3JpdGlu
ZyB1cCBhIHN1bW1hcnksIHdpdGggc3VmZmljaWVudCBqdXN0aWZpY2F0aW9uIHRvIHNhdGlmeSB0
aGUgZ3JvdXAuIElmIHRoYXQgcGxhbiBpcyB0aGVuIHN1YmplY3RlZCB0byByZXZpZXcsIEkgZXhw
ZWN0IGl0IGNhbiBiZSB2YWxpZGF0ZWQgcXVpY2tseSBhbmQgZXZlcnlvbmUgY2FuIGZlZWwgY29u
ZmlkZW50IHRoYXQgcHJvcG9zZWQgY2hhbmdlcyB3aWxsIGltcHJvdmUgdGhlIGRlc2lnbiBhbmQg
bm90IHNpbXBseSBpbnRyb2R1Y2UgbW9yZSB3b3JrIG9yIGNoYW5nZSBmb3IgdGhlIHNha2Ugb2Yg
Y2hhbmdlLg0KIA0KSW4gcHJpbmNpcGxlLCBJIGhhdmVudCByZWFkIGFueXRoaW5nIG9uIHRoZSBs
aXN0IHRoYXQgbWFkZSBtZSBuZXJ2b3VzLiBTaWxlbmNlIHNob3VsZCBiZSB0YWtlbiBhcyBpbXBs
aWNpdCBzdXBwb3J0LiBJZiBhIGRvY3VtZW50IGlzIGNpcmN1bGF0ZWQgYW5kIG5vIG9iamVjdGlv
bnMgYXJlIG5vdGVkLCB0aGVyZSBpcyBubyByZWFzb24gdG8gd2FpdCBpbmRlZmluaXRlbHksIHRo
b3VnaCBhIHJlYXNvbmFibGUgcmV2aWV3IHByb2Nlc3Mgc2hvdWxkIHByb2JhYmx5IGJlIGFwcGxp
ZWQuIENvbnRyaWJ1dG9ycyBhcmUgZG9pbmcgdGhlIHdvcmsgYW5kIGhhdmUgYSByaWdodCB0byBk
aXJlY3QgdGhlaXIgb3duIGVmZm9ydHMuIEluIG15IGV4cGVyaWVuY2UsIGhvd2V2ZXIsIGV2ZW4g
c2hvcnQgcHJvY2VzcyBkb2N1bWVudHMgY2FuIG1ha2UgYSB3b3JsZCBvZiBkaWZmZXJlbmNlLiBJ
IHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mgc2l6ZSBpcyBoaXR0aW5nIGNy
aXRpY2FsIG1hc3MgYW5kIHRoYXQgbWluaW1hbCBmb3JtYWxpdGllcyBhcmUgbm93IHJlcXVpcmVk
Li4uDQogDQpIYXBweSBOZXcgWWVhciBFdmVyeW9uZSENCiANCi0tLS0tT3JpZ2luYWwgTWVzc2Fn
ZS0tLS0tIA0KRnJvbTogSm9zaHVhIEtlcmlldnNreSBbbWFpbHRvOmpvc2h1YUBpbmR1c3RyaWFs
bG9naWMuY29tXSANClNlbnQ6IEZyaSAxMi8yNy8yMDAyIDEwOjMzIEFNIA0KVG86IGh0bWxwYXJz
ZXItZGV2ZWxvcGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiBodG1scGFyc2VyLXVzZXJA
bGlzdHMuc291cmNlZm9yZ2UubmV0IA0KU3ViamVjdDogUmU6IFtIdG1scGFyc2VyLWRldmVsb3Bl
cl0gdG9QbGFpblRleHRTdHJpbmcoKSBmZWVkYmFjayByZXF1ZXN0ZWQNCg0KDQoNCglTYW0gd3Jv
dGU6DQoJPiBUaGUgVmlzaXRvciBwYXR0ZXJuIHNvdW5kcyBpbnRlcmVzdGluZywgYW5kIEkgbG9v
ayBmb3J3YXJkIHRvIGhlYXJpbmcNCgk+IG1vcmUgYWJvdXQgaXQuICBIb3dldmVyLCBkdXBsaWNh
dGVkIGNvZGUgaXRzZWxmIGlzIG5vdCBJTU8gbmVjZXNzYXJpbHkNCgk+IGFuIGV2aWwuIEl0IGFs
bCBkZXBlbmRzIG9uIHdoZXRoZXIgb25lIHRoaW5rcyB0aGF0IHRoZSBkdXBsaWNhdGVkDQoJPiBj
b21wb25lbnRzIGFyZSBnb2luZyB0byBkaXZlcmdlIGluIGZ1bmN0aW9uYWxpdHkgaW4gdGhlIGZ1
dHVyZS4gIElmIHlvdQ0KCT4gYXJlIHN1cmUgdGhleSBhcmUgbm90LCB0aGVuIGZpbmUsIHJlZmFj
dG9yIGF3YXkuDQoJDQoJQXQgdGhlIG1vbWVudCwgU29taWsgYW5kIEkgc3BlY3VsYXRlIHRoYXQg
NDAlIHRvIDUwJSBvZiB0aGUgY3VycmVudCBodG1sDQoJcGFyc2VyIGNvZGUgYmFzZSBpcyBwdXJl
IGZhdCAtLSB1bm5lY2Vzc2FyeSBjb2RlIHRoYXQgb25seSBzZXJ2ZXMgdG8gYmxvYXQNCgl0aGUg
Y29kZSBiYXNlLiAgRHVwbGljYXRlIGNvZGUsIGluIHN1YnRsZSBvciBub3Qtc28tc3VidGxlIHZl
cnNpb25zLCBpcw0KCW1vc3RseSByZXNwb25zaWJsZSBmb3IgdGhlIGJsb2F0LiAgSW4gbXkgZXhw
ZXJpZW5jZSBpbiBpbmR1c3RyeSwgSSdkIHNheQ0KCXRoYXQgOTUlIG9mIHRoZSB0aW1lLCBkdXBs
aWNhdGUgY29kZSBpcyBiYWQuDQoJDQoJPiBJIGd1ZXNzIG15IHN1cnByaXNlIGF0IHlvdXIgKG9y
IHBlcmhhcHMgU29taWsncykgZm9jdXMgb24gcmVmYWN0b3JpbmcNCgk+IGNvbWVzIGZyb20gdGhl
IGZhY3QgdGhhdCB3aGlsZSB0aGUgaHRtbHBhcnNlciBpcyBhIGdyZWF0IHBpZWNlIG9mDQoJPiBz
b2Z0d2FyZSwgdGhlIGphdmFkb2NzIGFuZCBvdGhlciBkb2N1bWVudGF0aW9uIGNvdWxkIHVzZSBz
b21lIGF0dGVudGlvbi4NCgk+ICBGb3IgZXhhbXBsZSwgSSBjYW4ndCBmaW5kIGFueSBleHBsYW5h
dGlvbiBpbiB0aGUgamF2YWRvY3Mgb3Igb3RoZXJ3aXNlDQoJPiBvZiBob3cgdGhlIGZpbHRlcnMg
YXJlIHN1cHBvc2VkIHRvIHdvcmsgd2l0aCB0aGUgZGlmZmVyZW50IHNjYW5uZXJzLCBvcg0KCT4g
d2hhdCB2YWx1ZXMgdGhleSBhcmUgYWxsb3dlZCB0byB0YWtlLg0KCQ0KCVRoZSBodG1sIHBhcnNl
ciBpcyBpbiBzb3JlIG5lZWQgb2YgcmVmYWN0b3JpbmcgU2FtLiAgSW4gZmFjdCwgc29tZXRpbWVz
IGNvZGUNCgliZWNvbWVzIHVubmVjZXNzYXJpbHkgY29tcGxleCwgaW4gd2hpY2ggY2FzZSBwZW9w
bGUgKnJlYWxseSogbmVlZA0KCWRvY3VtZW50YXRpb24uICBJIGNvbmZyb250IHN1Y2ggYSBwcm9i
bGVtIGJ5IGZpcnN0IGFza2luZyBpZiB0aGUgY29kZSBjb3VsZA0KCWJlIHNpbXBsaWZpZWQgc28g
d2UgZGlkbid0IG5lZWQgc28gbXVjaCBkb2N1bWVudGF0aW9uLiAgIEluIHRoZSBldmVudCB0aGF0
DQoJd2UgZG8gbmVlZCBkb2NzLCBpdCBpcyBiZXN0IHRvIHdyaXRlIGV4ZWN1dGFibGUgZG9jdW1l
bnRhdGlvbi4gIEFyZSB5b3UNCglmYW1pbGlhciB3aXRoIGV4ZWN1dGFibGUgZG9jdW1lbnRhdGlv
bj8NCgkNCgk+IEkgZ2VuZXJhbGx5IHdvcmsgdG8gImlmIGl0J3Mgbm90IGJyb2tlbiBkb24ndCBm
aXggaXQiLCBidXQgSSBvZnRlbiBhZGQNCgk+ICJiZWZvcmUgeW91IHN0YXJ0IGZpeGluZyBpdCwg
bWFrZSBzdXJlIHlvdXIgZG9jdW1lbnRhdGlvbiBpcyB1cCB0byBkYXRlIi4NCgkNCglTb2Z0d2Fy
ZSBiZWNvbWVzIGJyaXR0bGUgYW5kIGJsb2F0ZWQgdW5kZXIgdGhlIHBoaWxvc29waHkgb2YgImlm
IGl0IGFpbid0DQoJYnJva2UsIGRvbid0IGZpeCBpdC4iICAgV2UgaGF2ZSBjbGllbnRzIHdpdGgg
Y29kZSBiYXNlcyB0aGF0IGFyZSAyIG1pbGxpb25zDQoJbGluZXMgb2YgIndvcmtpbmciIHNwZWdo
ZXR0aSBjb2RlLiAgVGhleSBuZWVkIGxvdHMgb2YgaGVscCB0byBsZWFybiB0byBkbw0KCWNvbnRp
bnVvdXMgcmVmYWN0b3JpbmcuDQoJDQoJPiBVc2luZyB0aGUgVmlzaXRvciBwYXR0ZXJuIG1heSBt
YWtlIGl0IGVhc2llciBmb3IgY2xpZW50cyB0byBnZXQgdGhlIGRhdGENCgk+IHRoZXkgbmVlZCwg
YnV0IGdpdmVuIHRoYXQgdGhlIGh0bWxwYXJzZXIgaXMgIndvcmtpbmciICh3ZWxsIGl0IHdvcmtz
IGZvcg0KCT4gbWUpLCBJIHdvdWxkIHNheSB0aGF0IHRoZSBtb3JlIHVyZ2VudCBpc3N1ZSBoZXJl
IGlzIG1ha2luZyBzdXJlIGFsbCB0aGUNCgk+IGRvY3VtZW50YXRpb24gaXMgdXAgdG8gZGF0ZS4g
ICBJIGhhdmUgYSBsb3Qgb2YgcG9zaXRpdmUgdGhpbmdzIHRvIHNheQ0KCT4gYWJvdXQgaHRtbHBh
cnNlciwgc28gZG9uJ3QgdGFrZSBpdCB0aGUgd3Jvbmcgd2F5LCB3aGVuIEkgc2F5IHRoYXQgdGhl
DQoJPiBiaWdnZXN0IHByb2JsZW0gSSd2ZSBoYWQgaW4gdXNpbmcgaXQgaW4gdGhlIGxhc3QgZmV3
IHdlZWtzIGlzIGluYWRlcXVhdGUNCgk+IGphdmFkb2NzLg0KCQ0KCUknbSBub3QgY29udGVudCB0
byBsaXZlIHdpdGggdGhlIHdvcmxkIGFzIGl0IGlzIFNhbS4gIElmIHNvbWV0aGluZyBpc24ndA0K
CWVhc3ksIGl0J3Mgd3JvbmcuICBUaGUgYmVzdCBzb2Z0d2FyZSBpbiB0aGUgd29ybGQgaXMgZWFz
eSB0byB1c2UsIGl0J3MNCglzZWxmLWV4cGxhbmF0b3J5LiAgV2Ugc2hvdWxkIGFsd2F5cyBzdHJp
dmUgZm9yIHRoYXQuICBJbiB0aGUgbWVhbnRpbWUsIGlmDQoJeW91IGhhdmUgdGFza3MgdG8gY29t
cGxldGUgYW5kIGRvbid0IGtub3cgaG93IHRvIGRvIHRoZW0gYmVjYXVzZSBvZiBsYWNrIG9mDQoJ
ZG9jcywgSSdkIHN1Z2dlc3QgeW91IGFzayBxdWVzdGlvbnMgaGVyZS4gIFRoZSBiZXN0IHJlc3Vs
dCB3aWxsIGJlDQoJZXhlY3V0YWJsZSBkb2N1bWVudGF0aW9uIGZvciB0aGUgaHRtbCBwYXJzZXIu
DQoJDQoJPiA+PklzIHRoZXJlIHNvbWUgZWZmaWNpZW5jeSByZWFzb24gd2h5IHlvdSB3YW50IHRv
IHJlZmFjdG9yIHRoZXNlIG1ldGhvZHMNCgk+ID4+b3IgaXMgaXQganVzdCBmb3IgbmVhdG5lc3M/
DQoJPiA+Pg0KCT4gPj4NCgk+ID4NCgk+ID5EdXBsaWNhdGlvbiByZW1vdmFsIGlzIHJlYXNvbiAj
MS4NCgk+ID4NCgk+IEFzIEkgbWVudGlvbiBhYm92ZS4gIE9uZSBzaG91bGQgYmUgY2FyZWZ1bCBv
ZiBkdXBsaWNhdGlvbiByZW1vdmFsIGZvcg0KCT4gdGhlIHNha2Ugb2YgaXQuDQoJDQoJQW5kIGFz
IEkgbWVudGlvbmVkIGFib3ZlLCBJIGNvbXBsZXRlbHkgZGlzYWdyZWUgd2l0aCB5b3UuICBJIHdv
bmRlciwgaG93DQoJbXVjaCByZWZhY3RvcmluZyBoYXZlIHlvdSBkb25lIGluIHlvdXIgY2FyZWVy
IGdpdmVuIHlvdSBwaGlsb3NvcGh5IG9mICJpZiBpdA0KCWFpbid0IGJyb2tlLCBkb24ndCBmaXgg
aXQ/IiAgIEhhdmUgeW91IHJlYWQgTWFydGluIEZvd2xlcidzIGxhbmRtYXJrIGJvb2ssDQoJUmVm
YWN0b3Jpbmc/ICAgSWYgbm90LCBJJ2Qgc3VnZ2VzdCB5b3Ugc3R1ZHkgaXQgdGhvcm91Z2hseSAt
IHlvdSdsbCBiZSBhDQoJYmV0dGVyIHByb2dyYW1tZXIgZm9yIGl0Lg0KCQ0KCT4gPiBSZW1vdmFs
IG9mIGhhcmQtY29kZWQgbG9naWMgaXMgcmVhc29uICMyLg0KCT4gPg0KCT4gVGhpcyBpcyBhIGdv
b2QgcmVhc29uLiAgSG93ZXZlciBJIGdldCB0aGUgZmVlbGluZyB0aGF0IGludHJvZHVjdGlvbiBv
Zg0KCT4gdGhlc2UgVmlzaXRvciBjbGFzc2VzIHdpbGwgbWFrZSB0aGUgc3lzdGVtIGNvbmNlcHR1
YWxseSBtb3JlIGRpZmZpY3VsdA0KCT4gdG8gdXNlIHJhdGhlciB0aGFuIGVhc2llci4gIEkgd291
bGQgZmVlbCBiZXR0ZXIgaWYgdGhlIGN1cnJlbnQgc2V0IHVwDQoJPiB3YXMgbW9yZSBmdWxseSBk
b2N1bWVudGVkIGJlZm9yZSBtb3JlIGNvbXBsZXhpdHkgd2FzIGFkZGVkLg0KCQ0KCUFzIEkgYWxz
byBzYWlkIGFib3ZlLCBpZiBpdCBhaW4ndCBzaW1wbGUsIGl0J3Mgd3JvbmcuICBPdXIgY2hhbmdl
cyB3aWxsIG5vdA0KCWFkZCBjb21wbGV4aXR5IC0gdGhhdCB3b3VsZCBiZSBmb29saXNoLg0KCQ0K
CT4gQW5kIGV2ZW4gaWYgdGhlIFZpc2l0b3IgcGF0dGVybiBpcyB1c2VkLCBJIHdvdWxkIHJlY29t
bWVuZCBsZWF2aW5nDQoJPiBtZXRob2RzIGxpa2UgdG9QbGFpblRleHRTdHJpbmcoKSBldGMgaW4g
cGxhY2UsIGJ1dCBqdXN0IG1ha2luZyB0aGVtDQoJPiBzaG9ydCBjdXQgaW1wbGVtZW50YXRpb25z
IHRvIGNlcnRhaW4ga2luZHMgb2YgdmlzaXRvci11c2luZyBtZXRob2RzLg0KCT4gIFRoaXMgd2ls
bCBhbGxvdyBwZW9wbGUgd2hvIGhhdmUgeWV0IHRvIGdyYXNwIHRoZSBWaXNpdG9yICBwYXR0ZXJu
DQoJPiBzb21ldGhpbmcgdG8gd29yayB3aXRoLg0KCQ0KCVBlb3BsZSBjYW4gYWx3YXlzIGNhbGwg
ZGVwcmVjYXRlZCBtZXRob2RzLg0KCQ0KCT4gSWYgeW91IGFyZSBrZWVuIHRvIHNlZSBsb3RzIG9m
IHBlb3BsZSB1c2luZyBodG1scGFyc2VyLCBJIHRoaW5rIHRoYXQgeW91DQoJPiBkb24ndCB3YW50
IHBlb3BsZSB0byBoYXZlIHRvIGNvbWUgdG8gdGVybXMgd2l0aCB0b28gbWFueSBuZXcgY29uY2Vw
dHMgYXQNCgk+IG9uY2UuICBZb3Ugc2F5IHlvdXJzZWxmIHRoYXQgdGhlIFZpc2l0b3IgcGF0dGVy
biB0YWtlcyBzb21lIGdldHRpbmcgdXNlZA0KCT4gdG8uICBJIHRoaW5rIHRoZSB3aG9sZSBzY2Fu
bmVyIGNvbmNlcHQgdGFrZXMgc29tZSBnZXR0aW5nIHVzZWQgdG8gLi4uLg0KCQ0KCU91ciBWaXNp
dG9yIGltcGxlbWVudGF0aW9uIGlzIHNvIHRyaXZpYWwgdGhhdCBmb2xrcyB3b24ndCBldmVuIGtu
b3cgdGhleSdyZQ0KCXVzaW5nIHRoZSBwYXR0ZXJuLg0KCQ0KCT4gPlNpbXBsaWNpdHkgaXMgcmVh
c29uICMzOiB0aGVyZSBpcyBsaXR0bGUgcmVhc29uIHRvIGZhdHRlbiB0aGUgaW50ZXJmYWNlcw0K
CW9mDQoJPiA+dGFnIGFuZCBub2RlIGNsYXNzZXMgd2l0aCB2YXJpb3VzIGRhdGEgYWNjdW11bGF0
aW9uL2FsdGVyYXRpb24gbWV0aG9kcw0KCXdoZW4NCgk+ID5vbmUgbWV0aG9kIGFuZCBhIHZhcmll
dHkgb2YgY29uY3JldGUgVmlzaXRvcnMgY2FuIGRvIHRoZSBqb2Igd2l0aCBtdWNoDQoJbGVzcw0K
CT4gPmNvZGUuDQoJPiA+DQoJPiB3ZWxsIEkgd291bGQgYWdyZWUgaWYgeW91IGNvdWxkIGd1YXJh
bnRlZSB0aGF0IHRoZXJlIHdpbGwgYmUgbm8NCgk+IGRpdmVyZ2VuY2Ugd2hhdHNvZXZlciBpbiBo
b3cgdGhlIGRpZmZlcmVudCBtZXRob2RzIHdpbGwgYmUgdXNlZC4gIElmIHlvdQ0KCT4gY2FuIGNy
ZWF0ZSBhIGZsZXhpYmxlIGVub3VnaCBpbXBsZW1lbnRhdGlvbiBvZiB0aGUgVmlzaXRvciBwYXR0
ZXJuIHRoZW4NCgk+IEkgZ3Vlc3MgdGhhdCB3aWxsIHN1cHBvcnQgYW55IHBvc3NpYmxlIGRpdmVy
Z2VuY2UgaW4gdGhlIHNlcGFyYXRlDQoJPiBtZXRob2RzLiBIb3dldmVyLCBJIHRoaW5rIHRoZXJl
IGlzIGEgcmVhc29uIHRvIGhhdmUgYSBmYXR0ZXIgaW50ZXJmYWNlLA0KCT4gaW4gdGhhdCBjb252
ZW5pZW5jZSBtZXRob2RzIGxvd2VyIHRoZSBiYXJyaWVyIHRvIGVudHJ5IGZvciBuZXcgdXNlcnMu
DQoJDQoJV291bGRuJ3QgaXQgYmUgbWFydmVsb3VzIGlmIHRoZSBiYXJyaWVyIHRvIGVudHJ5IHdh
cyBsb3cgYW5kIG91ciBjb2RlIHdhc24ndA0KCWJsb2F0d2FyZT8NCgkNCgk+IFBlcmhhcHMgaWRl
YWxseSBvbmUgaGFzIGEgd2VsbCBpbXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gdGhhdCBzdXBw
b3J0cw0KCT4gYSByYXcgbWV0aG9kIGFjY2VzcywgYW5kIGEgbnVtYmVyIG9mIGNvbnZlbmllbmNl
IG1ldGhvZHM/DQoJDQoJVGhlIGNoYW5nZXMgd2lsbCBtYWtlIHRoZSBjb2RlIGVhc2llciB0byB1
c2UgLSBpZiB0aGV5IGRvbid0LCB0aGV5IHJlYWxseQ0KCWNvb2wgdGhpbmcgYWJvdXQgc29mdHdh
cmUgaXMgdGhhdCBpdCBpcyBzb2Z0IC0gd2UgY2FuIGNoYW5nZSBpdC4NCgkNCgk+IEEgd2VsbCBp
bXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gd2lsbCwgSSBhc3N1bWUsIHN1cHBvcnQgYWxsIHNv
cnRzIG9mDQoJPiBkaWZmZXJlbnQgb3BlcmF0aW9ucywgYnV0IEkgd291bGQgZmVlbCBtdWNoIGhh
cHBpZXIgaWYgdGhlIGh0bWxwYXJzZXINCgk+IGhhZCBhIGNvbXBsZXRlIGphdmFkb2MgYW5kIGRv
Y3VtZW50YXRpb24gcmV2aWV3IGJlZm9yZSBhbnkgcmVmYWN0b3JpbmcNCgk+IHRvb2sgcGxhY2Uu
ICBQZW9wbGUgYXJlIHRyeWluZyB0byB1c2UgdGhlIGV4aXN0aW5nIHN5c3RlbSBhbmQgaGF2aW5n
DQoJPiB0cm91YmxlIG5vdCBiZWNhdXNlIG9mIHRoZSBsYWNrIG9mIHJlZmFjdG9yaW5nLCBidXQg
YSBsYWNrIG9mIHdlbGwNCgk+IGRlc2NyaWJlZCBtZXRob2RzLiAgV2VsbCBJIHNheSBwZW9wbGUs
IEkgbWVhbiBtZSwgSSBkb24ndCBrbm93IGlmIGFueW9uZQ0KCT4gZWxzZSBmZWVscyB0aGUgc2Ft
ZS4gIE1heWJlIGl0J3MganVzdCBtZSA6LSkNCgkNCglFeGVjdXRhYmxlIGRvY3VtZW50YXRpb24g
aXMgYSB0ZXJtIHdlIHVzZSBmb3IgQ3VzdG9tdGVyIFRlc3RzLiAgIEEgQ3VzdG9tZXINCglUZXN0
cyBzaG93cyBob3cgYSBib2R5IG9mIGNvZGUgZ2V0cyB1c2VkIHRvIHBlcmZvcm0gcmVhbCB0YXNr
cy4gICBZb3UgaGF2ZQ0KCXJlYWwtd29ybGQgdGFza3MgdG8gcGVyZm9ybS4gIFdlIGNhbiBwZXJm
b3JtIHRob3NlIHRhc2tzIHZpYSBhdXRvbWF0ZWQNCgl0ZXN0cy4gICBUaGVuLCB3aGVuIHdlIHJl
ZmFjdG9yLCB3ZSBtdXN0IG1ha2Ugc3VyZSBvdXIgZXhlY3V0YWJsZQ0KCWRvY3VtZW50YXRpb24g
aXMgdXAgdG8gZGF0ZSAoaS5lLiBpdCBwYXNzZXMgaXRzIHRlc3RzKS4gICBXZSBmaW5kIHRoaXMg
aWRlYWwNCglmb3IgbWFraW5nIHN1cmUgdGhhdCBkb2N1bWVudGF0aW9uIHJlZmxlY3RzIHdoYXQg
dGhlIGNvZGUgaXMgYWN0dWFsbHkgZG9pbmcuDQoJVGhpcyBhbHNvIGxlYXZlcyByb29tIGZvciBh
IGRvY3VtZW50IHRoYXQgZ2l2ZXMgc29tZSBncmFwaGljYWwNCglyZXByZXNlbnRhdGlvbiBvZiBo
b3cgdGhpbmdzIHdvcmssIHdoaWNoIG11c3QgYmUgbWFpbnRhaW5lZCBieSBzb21lb25lLCBsZXN0
DQoJaXQgZ2V0IG91dCBvZiBkYXRlLiAgV2UganVzdCBtaW5pbWl6ZSB0aGUgbmVlZCBmb3Igc3Vj
aCBkb2N1bWVudHMgYnkga2VlcGluZw0KCW91ciBjb2RlIHNpbXBsZSBhbmQgc21hbGwgYW5kIHN1
cnJvdW5kaW5nIGl0IHdpdGggZWFzeSB0byB1bmRlcnN0YW5kDQoJZXhlY3V0YWJsZSBkb2N1bWVu
dGF0aW9uLg0KCQ0KCWJlc3QgcmVnYXJkcw0KCWprDQoJDQoJDQoJDQoJDQoJDQoJDQoJDQoJLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRo
aXMgc2YubmV0IGVtYWlsIGlzIHNwb25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdl
ZWsgaGVhdmVuLg0KCWh0dHA6Ly90aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19f
X19fX19fX19fX19fX19fX19fX19fX19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWls
aW5nIGxpc3QNCglIdG1scGFyc2VyLWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglo
dHRwczovL2xpc3RzLnNvdXJjZWZvcmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRl
dmVsb3Blcg0KCQ0KDQo=

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Joshua K. <jo...@in...> - 2002-12-27 18:52:41

> IMHO the visitor pattern is not a good one. Here are the reasons.

The Visitor pattern is a fine pattern Derrick.  Ralph Johnson once said,
"Most of the time you don't need a Visitor, but when you need a Visitor, you
really need a Visitor."

> It breaks encapsulation.  It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

A good many of the Design Patterns "break encapsulation."   I guess those
aren't good patterns, huh?   Is there an OO police agency to report
instances of breaking encapsulation?   ;-)

> It makes the system brittle.  The visitor class has a method defined for
> every subclass of visited object.  Adding, removing or refactoring any
> visited class breaks the visitors.

Hard-coded, duplicated logic in nodes makes class hierarchies brittle.

We've found that the html parser code doesn't need special visit methods for
each tag type, which means it's easy to implement Visitor and it won't be
hard to accomodate visiting new tags in the future.

BTW, do you envision needing to add many new tag types to the html parser?

> It's obscure.  Looking at the visit loop tells you nothing of what is
> being perfomed.  The only indication of the task might be the name of
> the visitor class.

If you call a method that yields the plain text string for some html, and
that code happens to use a Visitor, do you care?   That is, a client won't
even know they're using the pattern.

> It's redundant.  In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

I agree that it's always best to seek out the simple design.  It takes a
committment to refactoring to find simplicity.  Presently, Somik and I are
removing duplication to get to a simpler design.  You have our promise that
no Visitor will be added if it complicates the design.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change), or the visitor pattern
> had already been coded into the visited nodes and for some reason they
> could not be modified further (i.e. the code was available in binary
> form only).  I don't think either of these applies here.

Visitor is a great pattern when you need various operations to be performed
over an object structure and you don't want to hard-code those operations
into the objects themselves.  So far, it seems to me you have just that need
in html parser.  I could be wrong.  We'll have to see how the code feels
with a Visitor implementation and make decisions based on real code.  CVS
allows us to go back to any previous implementation we like better.

best regards,
jk

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Joshua K. <jo...@in...> - 2002-12-27 18:33:44

Sam wrote:
> The Visitor pattern sounds interesting, and I look forward to hearing
> more about it.  However, duplicated code itself is not IMO necessarily
> an evil. It all depends on whether one thinks that the duplicated
> components are going to diverge in functionality in the future.  If you
> are sure they are not, then fine, refactor away.

At the moment, Somik and I speculate that 40% to 50% of the current html
parser code base is pure fat -- unnecessary code that only serves to bloat
the code base.  Duplicate code, in subtle or not-so-subtle versions, is
mostly responsible for the bloat.  In my experience in industry, I'd say
that 95% of the time, duplicate code is bad.

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
>  For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The html parser is in sore need of refactoring Sam.  In fact, sometimes code
becomes unnecessarily complex, in which case people *really* need
documentation.  I confront such a problem by first asking if the code could
be simplified so we didn't need so much documentation.   In the event that
we do need docs, it is best to write executable documentation.  Are you
familiar with executable documentation?

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".

Software becomes brittle and bloated under the philosophy of "if it ain't
broke, don't fix it."   We have clients with code bases that are 2 millions
lines of "working" speghetti code.  They need lots of help to learn to do
continuous refactoring.

> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date.   I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

I'm not content to live with the world as it is Sam.  If something isn't
easy, it's wrong.  The best software in the world is easy to use, it's
self-explanatory.  We should always strive for that.  In the meantime, if
you have tasks to complete and don't know how to do them because of lack of
docs, I'd suggest you ask questions here.  The best result will be
executable documentation for the html parser.

> >>Is there some efficiency reason why you want to refactor these methods
> >>or is it just for neatness?
> >>
> >>
> >
> >Duplication removal is reason #1.
> >
> As I mention above.  One should be careful of duplication removal for
> the sake of it.

And as I mentioned above, I completely disagree with you.  I wonder, how
much refactoring have you done in your career given you philosophy of "if it
ain't broke, don't fix it?"   Have you read Martin Fowler's landmark book,
Refactoring?   If not, I'd suggest you study it thoroughly - you'll be a
better programmer for it.

> > Removal of hard-coded logic is reason #2.
> >
> This is a good reason.  However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier.  I would feel better if the current set up
> was more fully documented before more complexity was added.

As I also said above, if it ain't simple, it's wrong.  Our changes will not
add complexity - that would be foolish.

> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
>  This will allow people who have yet to grasp the Visitor  pattern
> something to work with.

People can always call deprecated methods.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once.  You say yourself that the Visitor pattern takes some getting used
> to.  I think the whole scanner concept takes some getting used to ....

Our Visitor implementation is so trivial that folks won't even know they're
using the pattern.

> >Simplicity is reason #3: there is little reason to fatten the interfaces
of
> >tag and node classes with various data accumulation/alteration methods
when
> >one method and a variety of concrete Visitors can do the job with much
less
> >code.
> >
> well I would agree if you could guarantee that there will be no
> divergence whatsoever in how the different methods will be used.  If you
> can create a flexible enough implementation of the Visitor pattern then
> I guess that will support any possible divergence in the separate
> methods. However, I think there is a reason to have a fatter interface,
> in that convenience methods lower the barrier to entry for new users.

Wouldn't it be marvelous if the barrier to entry was low and our code wasn't
bloatware?

> Perhaps ideally one has a well implemented Visitor pattern that supports
> a raw method access, and a number of convenience methods?

The changes will make the code easier to use - if they don't, they really
cool thing about software is that it is soft - we can change it.

> A well implemented Visitor pattern will, I assume, support all sorts of
> different operations, but I would feel much happier if the htmlparser
> had a complete javadoc and documentation review before any refactoring
> took place.  People are trying to use the existing system and having
> trouble not because of the lack of refactoring, but a lack of well
> described methods.  Well I say people, I mean me, I don't know if anyone
> else feels the same.  Maybe it's just me :-)

Executable documentation is a term we use for Customter Tests.   A Customer
Tests shows how a body of code gets used to perform real tasks.   You have
real-world tasks to perform.  We can perform those tasks via automated
tests.   Then, when we refactor, we must make sure our executable
documentation is up to date (i.e. it passes its tests).   We find this ideal
for making sure that documentation reflects what the code is actually doing.
This also leaves room for a document that gives some graphical
representation of how things work, which must be maintained by someone, lest
it get out of date.  We just minimize the need for such documents by keeping
our code simple and small and surrounding it with easy to understand
executable documentation.

best regards
jk

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-27 07:53:19

Hi Dhaval, Sam,
> I agree with Sam when he says that "don't fix it when its not broken".

I quite disagree with both of u on this philosophy. Looking at the code with
Joshua has made me realize its actually ghastly. A serious code cleanup is
needed, there is just too much duplication! Most of the scanner code is
unreadable, while the tag constructors are a nightmare. I dont think I will
be at peace till a couple of rounds of refactoring has been completed.

I do not think the panic on the visitor is warranted. Like I said before,
the current access methods will continue to be present. However, having a
visitor will make life simpler, as in  - there's so much code now that uses
the same loop over and over again. We can replace code like :

Vector links = new Vector();
for (Enumeration e = parser.elements();e.hasMoreElements();) {
    HTMLNode node = (HTMLNode)e.nextElement();
    if (node instanceof HTMLLinkTag) {
        links.add(node);
    }
}

with :

HTMLLinkVisitor linkVisitor = new HTMLLinkVisitor();
collectNodesWith(linkVisitor);
Vector links = linkVisitor.getResult();

This looks so much more readable and simple. Of course, you could still use
the old way - its just that you would have the option of making life easier.

In the latest round of refactorings, we've put in a HTMLCompositeTag, from
which the link, form and select tags inherit. The toPlainTextString() and
toHTML() methods are now in the parent class. All tests passing (except the
charset tests).

Regards,
Somik

14 messages has been excluded from this view by a project administrator.

Flat | Threaded

<< < 1 .. 18 19 20 21 22 .. 33 > >> (Page 20 of 33)