Thread: Re: [Htmlparser-developer] toPlainTextString() feedback requested

Brought to you by: derrickoswald

htmlparser-developer

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Derrick O. <Der...@ro...> - 2002-12-24 17:47:30

IMHO the visitor pattern is not a good one. Here are the reasons.

It breaks encapsulation.  It presumes that an exterior class (the 
visitor) knows how to do something to a class better than the class 
itself does.

It makes the system brittle.  The visitor class has a method defined for 
every subclass of visited object.  Adding, removing or refactoring any 
visited class breaks the visitors.

It's obscure.  Looking at the visit loop tells you nothing of what is 
being perfomed.  The only indication of the task might be the name of 
the visitor class.

It's redundant.  In every example I've seen (I've never seen it used in 
real applications), the result could have been coded in a simpler 
fashion with correct class design.

I would only use the visitor pattern if either the set of public methods 
on the visited nodes completely specify their possible behavior and 
allowed the client to fully manipulate them at the highest level of 
abstraction (i.e. everything you can do to the visited classes is 
already coded and you don't want them to change), or the visitor pattern 
had already been coded into the visited nodes and for some reason they 
could not be modified further (i.e. the code was available in binary 
form only).  I don't think either of these applies here.

my $0.02

Derrick

RE: [Htmlparser-developer] toPlainTextString() feedback requested

From: <dha...@or...> - 2002-12-26 09:32:50

Attachments: BDY.RTF

Hi,

I agree with Sam when he says that "don't fix it when its not broken".
But at the same time I see the need to make code better, more readable
and simpler. However as suggested below it seems that the Visitor
pattern is going to make things difficult to understand. And I too like
Sam had a problem with HTMLParser initially. It may be with the
documentation but this whole tag-scanner things was extremely confusing
in the beginning(though subsequently I have loved them so much that I
have even written a few myself without any problem). Hence I think that
even if the visitor pattern is used, the user must have some simple
easy-to-use methods to get his work done rather than try to understand
the Visitor pattern.

Just like Same, this was my two cents.

Cheers,
Dhaval

-----Original Message-----
From: gaijin [mailto:ga...@yh...]
Sent: Tuesday, December 24, 2002 12:21 PM
To: htmlparser-developer
Cc: gaijin; htmlparser-user
Subject: Re: [Htmlparser-developer] toPlainTextString() feedback
requested

Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods?  Remember
the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>>    
>>
>There's a good deal of duplicate code in way the two toHTML methods and
the
>toPlainTextString method do their work.  The central theme is
information
>accumulation/alteration.  That involves outputing tag and node results
and
>recusing through tags.  The refactoring to Visitor allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it
easier
>for clients to get the data they need
>
>Visitor takes some getting used to.  I rarely use the pattern. In this
case,
>IMO, it was a good fit.
>
The Visitor pattern sounds interesting, and I look forward to hearing 
more about it.  However, duplicated code itself is not IMO necessarily 
an evil. It all depends on whether one thinks that the duplicated 
components are going to diverge in functionality in the future.  If you 
are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring 
comes from the fact that while the htmlparser is a great piece of 
software, the javadocs and other documentation could use some attention.

 For example, I can't find any explanation in the javadocs or otherwise 
of how the filters are supposed to work with the different scanners, or 
what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add 
"before you start fixing it, make sure your documentation is up to
date".

Using the Visitor pattern may make it easier for clients to get the data

they need, but given that the htmlparser is "working" (well it works for

me), I would say that the more urgent issue here is making sure all the 
documentation is up to date.   I have a lot of positive things to say 
about htmlparser, so don't take it the wrong way, when I say that the 
biggest problem I've had in using it in the last few weeks is inadequate

javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>>    
>>
>
>Duplication removal is reason #1. 
>
As I mention above.  One should be careful of duplication removal for 
the sake of it.

> Removal of hard-coded logic is reason #2.
>
This is a good reason.  However I get the feeling that introduction of 
these Visitor classes will make the system conceptually more difficult 
to use rather than easier.  I would feel better if the current set up 
was more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving 
methods like toPlainTextString() etc in place, but just making them 
short cut implementations to certain kinds of visitor-using methods. 
 This will allow people who have yet to grasp the Visitor  pattern 
something to work with.

If you are keen to see lots of people using htmlparser, I think that you

don't want people to have to come to terms with too many new concepts at

once.  You say yourself that the Visitor pattern takes some getting used

to.  I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the
interfaces of
>tag and node classes with various data accumulation/alteration methods
when
>one method and a variety of concrete Visitors can do the job with much
less
>code.
>
well I would agree if you could guarantee that there will be no 
divergence whatsoever in how the different methods will be used.  If you

can create a flexible enough implementation of the Visitor pattern then 
I guess that will support any possible divergence in the separate 
methods. However, I think there is a reason to have a fatter interface, 
in that convenience methods lower the barrier to entry for new users.  

Perhaps ideally one has a well implemented Visitor pattern that supports

a raw method access, and a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of 
different operations, but I would feel much happier if the htmlparser 
had a complete javadoc and documentation review before any refactoring 
took place.  People are trying to use the existing system and having 
trouble not because of the lack of refactoring, but a lack of well 
described methods.  Well I say people, I mean me, I don't know if anyone

else feels the same.  Maybe it's just me :-)

Just my two cents.

CHEERS> SAM

-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-27 07:53:19

Hi Dhaval, Sam,
> I agree with Sam when he says that "don't fix it when its not broken".

I quite disagree with both of u on this philosophy. Looking at the code with
Joshua has made me realize its actually ghastly. A serious code cleanup is
needed, there is just too much duplication! Most of the scanner code is
unreadable, while the tag constructors are a nightmare. I dont think I will
be at peace till a couple of rounds of refactoring has been completed.

I do not think the panic on the visitor is warranted. Like I said before,
the current access methods will continue to be present. However, having a
visitor will make life simpler, as in  - there's so much code now that uses
the same loop over and over again. We can replace code like :

Vector links = new Vector();
for (Enumeration e = parser.elements();e.hasMoreElements();) {
    HTMLNode node = (HTMLNode)e.nextElement();
    if (node instanceof HTMLLinkTag) {
        links.add(node);
    }
}

with :

HTMLLinkVisitor linkVisitor = new HTMLLinkVisitor();
collectNodesWith(linkVisitor);
Vector links = linkVisitor.getResult();

This looks so much more readable and simple. Of course, you could still use
the old way - its just that you would have the option of making life easier.

In the latest round of refactorings, we've put in a HTMLCompositeTag, from
which the link, form and select tags inherit. The toPlainTextString() and
toHTML() methods are now in the parent class. All tests passing (except the
charset tests).

Regards,
Somik

RE: [Htmlparser-developer] toPlainTextString() feedback requested

From: Claude D. <CD...@ar...> - 2002-12-28 03:35:04

SSB0cmllZCB0byByZWZyYWluIGZyb20ganVtcGluZyBpbiwgYnV0IGhlcmUgYXJlIGEgY291cGxl
IG9mIGNlbnRzIHdvcnRoIG9mIG9waW5pb24sIGp1c3QgZm9yIHRoZSBob2xsaWRheXMgOy0pLg0K
IA0KSW4gbXkgdmlldywgZGlzdGlsbGluZyBzb2Z0d2FyZSBkb3duIHRvIGVzc2VudGlhbHMgaXMg
cGFydCBvZiBpbXByb3ZpbmcgZGVzaWduLiBJdCBjdXRzIGRvd24gb24gY29kZSB0aGF0IHdvdWxk
IGxhdGVyIG5lZWQgdG8gYmUgbWFpbnRhaW5lZC4gUmVkdW5kYW5jeSBpcyBleHBlbnNpdmUuIE1v
cmUgY29kZSBuZWVkcyB0byBjaGFuZ2UgZXZlcnkgdGltZSBhIGJ1ZyBpcyBmaXhlZCwgcG90ZW50
aWFsbHkgbWlzc2luZyBvbmUgb2YgdGhlIHZhcmlhbnRzIGFuZCBpbnRyb2R1Y2luZyBkaWZmaWN1
bHQgdG8gZGlhZ25vc2Ugc2l0dWF0aW9ucy4gTGVzcyBjb2RlIGlzIGJldHRlci4gQXMgZGVzaWdu
cyBpbXByb3ZlLCB0aGUgY29kZSBiYXNlIHdpbGwgdHlwaWNhbGx5IHNocmluay4gUmVmYWN0b3Jp
bmcgYW5kIGFwcGx5aW5nIGVmZmVjdGl2ZSBwYXR0ZXJucyBpcyBrZXkgdG8gbWFraW5nIHRoaXMg
cG9zc2libGUuDQogDQpUaGF0IGJlaW5nIHNhaWQsIEkgdGhpbmsgZ29vZCBzb2Z0d2FyZSBkZXNp
Z24gaGluZ2VzIG9uIHdlbGwgZGVmaW5lZCB1c2UgY2FzZXMuIElmIGNoYW5nZXMgbmVlZCB0byBi
ZSBtYWRlIHRvIG1lZXQgdGhlIHJlcXVpcmVtZW50cywgdGhhdCdzIHdoYXQgbmVlZHMgdG8gaGFw
cGVuLiBSZWZhY3RvcmluZywgZGlzdGlsbGluZywgc2ltcGxpZnlpbmcsIHJlbW92aW5nIGRlYWQg
Y29kZSBhbmQgZG9jdW1lbnRhdGlvbiBhcmUgYWxsIHBhcnQgb2YgZXZlcnkgcmVsZWFzZS4gUmVm
YWN0b3JpbmcgZm9yIHRoZSBzYWtlIG9mIHJlZmFjdG9yaW5nIHNob3VsZCBiZSBhdm9pZGVkLCBh
cyB3ZWxsIGFzIGZvcmdpbmcgYWhlYWQgd2l0aG91dCBhIHdlbGwtY29uc2lkZXJlZCBhbmQgcmV2
aWV3ZWQgcGxhbi4gSWYgdGhlIHBsYW4gc2hvd3MgdGhhdCByZWZhY3RvcmluZyB3aWxsIGltcHJv
dmUgZGVzaWduLCBkZWxpdmVyIGJldHRlciBzb2Z0d2FyZSwgbWVldCByZXF1aXJlbWVudHMgYW5k
IHNhdGlzZnkgdGFyZ2V0IHVzZSBjYXNlcywgdGhlbiBpdCBpcyBwcm9iYWJseSBhIGdvb2QgcGxh
biBhbmQgc2hvdWxkIGJlIGltcGxlbWVudGVkLCBzbyBsb25nIGFzIHRoZXJlJ3MgZW5vdWdoIGFn
cmVlbWVudCB0byB2YWxpZGF0ZSB0aGUgb2JqZWN0aXZlcyBhbmQgc3RyYXRlZ3kuDQogDQpJIHRo
aW5rIG1vc3Qgb2YgdGhpcyBkaXNjdXNzaW9uIGNlbnRlcnMgb24gYSBsYWNrIG9mIGRvY3VtZW50
YXRpb24sIGJ1dCBub3QgdXNlciBkb2N1bWVudGF0aW9uLiBUaGUgZ3JvdXAgaXMgbGFja2luZyBp
biBwcm9jZXNzLiBJZiBhIHBsYW4gaXMgd29ydGggcGVyc3VpbmcsIGl0J3Mgd29ydGggd3JpdGlu
ZyB1cCBhIHN1bW1hcnksIHdpdGggc3VmZmljaWVudCBqdXN0aWZpY2F0aW9uIHRvIHNhdGlmeSB0
aGUgZ3JvdXAuIElmIHRoYXQgcGxhbiBpcyB0aGVuIHN1YmplY3RlZCB0byByZXZpZXcsIEkgZXhw
ZWN0IGl0IGNhbiBiZSB2YWxpZGF0ZWQgcXVpY2tseSBhbmQgZXZlcnlvbmUgY2FuIGZlZWwgY29u
ZmlkZW50IHRoYXQgcHJvcG9zZWQgY2hhbmdlcyB3aWxsIGltcHJvdmUgdGhlIGRlc2lnbiBhbmQg
bm90IHNpbXBseSBpbnRyb2R1Y2UgbW9yZSB3b3JrIG9yIGNoYW5nZSBmb3IgdGhlIHNha2Ugb2Yg
Y2hhbmdlLg0KIA0KSW4gcHJpbmNpcGxlLCBJIGhhdmVudCByZWFkIGFueXRoaW5nIG9uIHRoZSBs
aXN0IHRoYXQgbWFkZSBtZSBuZXJ2b3VzLiBTaWxlbmNlIHNob3VsZCBiZSB0YWtlbiBhcyBpbXBs
aWNpdCBzdXBwb3J0LiBJZiBhIGRvY3VtZW50IGlzIGNpcmN1bGF0ZWQgYW5kIG5vIG9iamVjdGlv
bnMgYXJlIG5vdGVkLCB0aGVyZSBpcyBubyByZWFzb24gdG8gd2FpdCBpbmRlZmluaXRlbHksIHRo
b3VnaCBhIHJlYXNvbmFibGUgcmV2aWV3IHByb2Nlc3Mgc2hvdWxkIHByb2JhYmx5IGJlIGFwcGxp
ZWQuIENvbnRyaWJ1dG9ycyBhcmUgZG9pbmcgdGhlIHdvcmsgYW5kIGhhdmUgYSByaWdodCB0byBk
aXJlY3QgdGhlaXIgb3duIGVmZm9ydHMuIEluIG15IGV4cGVyaWVuY2UsIGhvd2V2ZXIsIGV2ZW4g
c2hvcnQgcHJvY2VzcyBkb2N1bWVudHMgY2FuIG1ha2UgYSB3b3JsZCBvZiBkaWZmZXJlbmNlLiBJ
IHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mgc2l6ZSBpcyBoaXR0aW5nIGNy
aXRpY2FsIG1hc3MgYW5kIHRoYXQgbWluaW1hbCBmb3JtYWxpdGllcyBhcmUgbm93IHJlcXVpcmVk
Li4uDQogDQpIYXBweSBOZXcgWWVhciBFdmVyeW9uZSENCiANCi0tLS0tT3JpZ2luYWwgTWVzc2Fn
ZS0tLS0tIA0KRnJvbTogSm9zaHVhIEtlcmlldnNreSBbbWFpbHRvOmpvc2h1YUBpbmR1c3RyaWFs
bG9naWMuY29tXSANClNlbnQ6IEZyaSAxMi8yNy8yMDAyIDEwOjMzIEFNIA0KVG86IGh0bWxwYXJz
ZXItZGV2ZWxvcGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiBodG1scGFyc2VyLXVzZXJA
bGlzdHMuc291cmNlZm9yZ2UubmV0IA0KU3ViamVjdDogUmU6IFtIdG1scGFyc2VyLWRldmVsb3Bl
cl0gdG9QbGFpblRleHRTdHJpbmcoKSBmZWVkYmFjayByZXF1ZXN0ZWQNCg0KDQoNCglTYW0gd3Jv
dGU6DQoJPiBUaGUgVmlzaXRvciBwYXR0ZXJuIHNvdW5kcyBpbnRlcmVzdGluZywgYW5kIEkgbG9v
ayBmb3J3YXJkIHRvIGhlYXJpbmcNCgk+IG1vcmUgYWJvdXQgaXQuICBIb3dldmVyLCBkdXBsaWNh
dGVkIGNvZGUgaXRzZWxmIGlzIG5vdCBJTU8gbmVjZXNzYXJpbHkNCgk+IGFuIGV2aWwuIEl0IGFs
bCBkZXBlbmRzIG9uIHdoZXRoZXIgb25lIHRoaW5rcyB0aGF0IHRoZSBkdXBsaWNhdGVkDQoJPiBj
b21wb25lbnRzIGFyZSBnb2luZyB0byBkaXZlcmdlIGluIGZ1bmN0aW9uYWxpdHkgaW4gdGhlIGZ1
dHVyZS4gIElmIHlvdQ0KCT4gYXJlIHN1cmUgdGhleSBhcmUgbm90LCB0aGVuIGZpbmUsIHJlZmFj
dG9yIGF3YXkuDQoJDQoJQXQgdGhlIG1vbWVudCwgU29taWsgYW5kIEkgc3BlY3VsYXRlIHRoYXQg
NDAlIHRvIDUwJSBvZiB0aGUgY3VycmVudCBodG1sDQoJcGFyc2VyIGNvZGUgYmFzZSBpcyBwdXJl
IGZhdCAtLSB1bm5lY2Vzc2FyeSBjb2RlIHRoYXQgb25seSBzZXJ2ZXMgdG8gYmxvYXQNCgl0aGUg
Y29kZSBiYXNlLiAgRHVwbGljYXRlIGNvZGUsIGluIHN1YnRsZSBvciBub3Qtc28tc3VidGxlIHZl
cnNpb25zLCBpcw0KCW1vc3RseSByZXNwb25zaWJsZSBmb3IgdGhlIGJsb2F0LiAgSW4gbXkgZXhw
ZXJpZW5jZSBpbiBpbmR1c3RyeSwgSSdkIHNheQ0KCXRoYXQgOTUlIG9mIHRoZSB0aW1lLCBkdXBs
aWNhdGUgY29kZSBpcyBiYWQuDQoJDQoJPiBJIGd1ZXNzIG15IHN1cnByaXNlIGF0IHlvdXIgKG9y
IHBlcmhhcHMgU29taWsncykgZm9jdXMgb24gcmVmYWN0b3JpbmcNCgk+IGNvbWVzIGZyb20gdGhl
IGZhY3QgdGhhdCB3aGlsZSB0aGUgaHRtbHBhcnNlciBpcyBhIGdyZWF0IHBpZWNlIG9mDQoJPiBz
b2Z0d2FyZSwgdGhlIGphdmFkb2NzIGFuZCBvdGhlciBkb2N1bWVudGF0aW9uIGNvdWxkIHVzZSBz
b21lIGF0dGVudGlvbi4NCgk+ICBGb3IgZXhhbXBsZSwgSSBjYW4ndCBmaW5kIGFueSBleHBsYW5h
dGlvbiBpbiB0aGUgamF2YWRvY3Mgb3Igb3RoZXJ3aXNlDQoJPiBvZiBob3cgdGhlIGZpbHRlcnMg
YXJlIHN1cHBvc2VkIHRvIHdvcmsgd2l0aCB0aGUgZGlmZmVyZW50IHNjYW5uZXJzLCBvcg0KCT4g
d2hhdCB2YWx1ZXMgdGhleSBhcmUgYWxsb3dlZCB0byB0YWtlLg0KCQ0KCVRoZSBodG1sIHBhcnNl
ciBpcyBpbiBzb3JlIG5lZWQgb2YgcmVmYWN0b3JpbmcgU2FtLiAgSW4gZmFjdCwgc29tZXRpbWVz
IGNvZGUNCgliZWNvbWVzIHVubmVjZXNzYXJpbHkgY29tcGxleCwgaW4gd2hpY2ggY2FzZSBwZW9w
bGUgKnJlYWxseSogbmVlZA0KCWRvY3VtZW50YXRpb24uICBJIGNvbmZyb250IHN1Y2ggYSBwcm9i
bGVtIGJ5IGZpcnN0IGFza2luZyBpZiB0aGUgY29kZSBjb3VsZA0KCWJlIHNpbXBsaWZpZWQgc28g
d2UgZGlkbid0IG5lZWQgc28gbXVjaCBkb2N1bWVudGF0aW9uLiAgIEluIHRoZSBldmVudCB0aGF0
DQoJd2UgZG8gbmVlZCBkb2NzLCBpdCBpcyBiZXN0IHRvIHdyaXRlIGV4ZWN1dGFibGUgZG9jdW1l
bnRhdGlvbi4gIEFyZSB5b3UNCglmYW1pbGlhciB3aXRoIGV4ZWN1dGFibGUgZG9jdW1lbnRhdGlv
bj8NCgkNCgk+IEkgZ2VuZXJhbGx5IHdvcmsgdG8gImlmIGl0J3Mgbm90IGJyb2tlbiBkb24ndCBm
aXggaXQiLCBidXQgSSBvZnRlbiBhZGQNCgk+ICJiZWZvcmUgeW91IHN0YXJ0IGZpeGluZyBpdCwg
bWFrZSBzdXJlIHlvdXIgZG9jdW1lbnRhdGlvbiBpcyB1cCB0byBkYXRlIi4NCgkNCglTb2Z0d2Fy
ZSBiZWNvbWVzIGJyaXR0bGUgYW5kIGJsb2F0ZWQgdW5kZXIgdGhlIHBoaWxvc29waHkgb2YgImlm
IGl0IGFpbid0DQoJYnJva2UsIGRvbid0IGZpeCBpdC4iICAgV2UgaGF2ZSBjbGllbnRzIHdpdGgg
Y29kZSBiYXNlcyB0aGF0IGFyZSAyIG1pbGxpb25zDQoJbGluZXMgb2YgIndvcmtpbmciIHNwZWdo
ZXR0aSBjb2RlLiAgVGhleSBuZWVkIGxvdHMgb2YgaGVscCB0byBsZWFybiB0byBkbw0KCWNvbnRp
bnVvdXMgcmVmYWN0b3JpbmcuDQoJDQoJPiBVc2luZyB0aGUgVmlzaXRvciBwYXR0ZXJuIG1heSBt
YWtlIGl0IGVhc2llciBmb3IgY2xpZW50cyB0byBnZXQgdGhlIGRhdGENCgk+IHRoZXkgbmVlZCwg
YnV0IGdpdmVuIHRoYXQgdGhlIGh0bWxwYXJzZXIgaXMgIndvcmtpbmciICh3ZWxsIGl0IHdvcmtz
IGZvcg0KCT4gbWUpLCBJIHdvdWxkIHNheSB0aGF0IHRoZSBtb3JlIHVyZ2VudCBpc3N1ZSBoZXJl
IGlzIG1ha2luZyBzdXJlIGFsbCB0aGUNCgk+IGRvY3VtZW50YXRpb24gaXMgdXAgdG8gZGF0ZS4g
ICBJIGhhdmUgYSBsb3Qgb2YgcG9zaXRpdmUgdGhpbmdzIHRvIHNheQ0KCT4gYWJvdXQgaHRtbHBh
cnNlciwgc28gZG9uJ3QgdGFrZSBpdCB0aGUgd3Jvbmcgd2F5LCB3aGVuIEkgc2F5IHRoYXQgdGhl
DQoJPiBiaWdnZXN0IHByb2JsZW0gSSd2ZSBoYWQgaW4gdXNpbmcgaXQgaW4gdGhlIGxhc3QgZmV3
IHdlZWtzIGlzIGluYWRlcXVhdGUNCgk+IGphdmFkb2NzLg0KCQ0KCUknbSBub3QgY29udGVudCB0
byBsaXZlIHdpdGggdGhlIHdvcmxkIGFzIGl0IGlzIFNhbS4gIElmIHNvbWV0aGluZyBpc24ndA0K
CWVhc3ksIGl0J3Mgd3JvbmcuICBUaGUgYmVzdCBzb2Z0d2FyZSBpbiB0aGUgd29ybGQgaXMgZWFz
eSB0byB1c2UsIGl0J3MNCglzZWxmLWV4cGxhbmF0b3J5LiAgV2Ugc2hvdWxkIGFsd2F5cyBzdHJp
dmUgZm9yIHRoYXQuICBJbiB0aGUgbWVhbnRpbWUsIGlmDQoJeW91IGhhdmUgdGFza3MgdG8gY29t
cGxldGUgYW5kIGRvbid0IGtub3cgaG93IHRvIGRvIHRoZW0gYmVjYXVzZSBvZiBsYWNrIG9mDQoJ
ZG9jcywgSSdkIHN1Z2dlc3QgeW91IGFzayBxdWVzdGlvbnMgaGVyZS4gIFRoZSBiZXN0IHJlc3Vs
dCB3aWxsIGJlDQoJZXhlY3V0YWJsZSBkb2N1bWVudGF0aW9uIGZvciB0aGUgaHRtbCBwYXJzZXIu
DQoJDQoJPiA+PklzIHRoZXJlIHNvbWUgZWZmaWNpZW5jeSByZWFzb24gd2h5IHlvdSB3YW50IHRv
IHJlZmFjdG9yIHRoZXNlIG1ldGhvZHMNCgk+ID4+b3IgaXMgaXQganVzdCBmb3IgbmVhdG5lc3M/
DQoJPiA+Pg0KCT4gPj4NCgk+ID4NCgk+ID5EdXBsaWNhdGlvbiByZW1vdmFsIGlzIHJlYXNvbiAj
MS4NCgk+ID4NCgk+IEFzIEkgbWVudGlvbiBhYm92ZS4gIE9uZSBzaG91bGQgYmUgY2FyZWZ1bCBv
ZiBkdXBsaWNhdGlvbiByZW1vdmFsIGZvcg0KCT4gdGhlIHNha2Ugb2YgaXQuDQoJDQoJQW5kIGFz
IEkgbWVudGlvbmVkIGFib3ZlLCBJIGNvbXBsZXRlbHkgZGlzYWdyZWUgd2l0aCB5b3UuICBJIHdv
bmRlciwgaG93DQoJbXVjaCByZWZhY3RvcmluZyBoYXZlIHlvdSBkb25lIGluIHlvdXIgY2FyZWVy
IGdpdmVuIHlvdSBwaGlsb3NvcGh5IG9mICJpZiBpdA0KCWFpbid0IGJyb2tlLCBkb24ndCBmaXgg
aXQ/IiAgIEhhdmUgeW91IHJlYWQgTWFydGluIEZvd2xlcidzIGxhbmRtYXJrIGJvb2ssDQoJUmVm
YWN0b3Jpbmc/ICAgSWYgbm90LCBJJ2Qgc3VnZ2VzdCB5b3Ugc3R1ZHkgaXQgdGhvcm91Z2hseSAt
IHlvdSdsbCBiZSBhDQoJYmV0dGVyIHByb2dyYW1tZXIgZm9yIGl0Lg0KCQ0KCT4gPiBSZW1vdmFs
IG9mIGhhcmQtY29kZWQgbG9naWMgaXMgcmVhc29uICMyLg0KCT4gPg0KCT4gVGhpcyBpcyBhIGdv
b2QgcmVhc29uLiAgSG93ZXZlciBJIGdldCB0aGUgZmVlbGluZyB0aGF0IGludHJvZHVjdGlvbiBv
Zg0KCT4gdGhlc2UgVmlzaXRvciBjbGFzc2VzIHdpbGwgbWFrZSB0aGUgc3lzdGVtIGNvbmNlcHR1
YWxseSBtb3JlIGRpZmZpY3VsdA0KCT4gdG8gdXNlIHJhdGhlciB0aGFuIGVhc2llci4gIEkgd291
bGQgZmVlbCBiZXR0ZXIgaWYgdGhlIGN1cnJlbnQgc2V0IHVwDQoJPiB3YXMgbW9yZSBmdWxseSBk
b2N1bWVudGVkIGJlZm9yZSBtb3JlIGNvbXBsZXhpdHkgd2FzIGFkZGVkLg0KCQ0KCUFzIEkgYWxz
byBzYWlkIGFib3ZlLCBpZiBpdCBhaW4ndCBzaW1wbGUsIGl0J3Mgd3JvbmcuICBPdXIgY2hhbmdl
cyB3aWxsIG5vdA0KCWFkZCBjb21wbGV4aXR5IC0gdGhhdCB3b3VsZCBiZSBmb29saXNoLg0KCQ0K
CT4gQW5kIGV2ZW4gaWYgdGhlIFZpc2l0b3IgcGF0dGVybiBpcyB1c2VkLCBJIHdvdWxkIHJlY29t
bWVuZCBsZWF2aW5nDQoJPiBtZXRob2RzIGxpa2UgdG9QbGFpblRleHRTdHJpbmcoKSBldGMgaW4g
cGxhY2UsIGJ1dCBqdXN0IG1ha2luZyB0aGVtDQoJPiBzaG9ydCBjdXQgaW1wbGVtZW50YXRpb25z
IHRvIGNlcnRhaW4ga2luZHMgb2YgdmlzaXRvci11c2luZyBtZXRob2RzLg0KCT4gIFRoaXMgd2ls
bCBhbGxvdyBwZW9wbGUgd2hvIGhhdmUgeWV0IHRvIGdyYXNwIHRoZSBWaXNpdG9yICBwYXR0ZXJu
DQoJPiBzb21ldGhpbmcgdG8gd29yayB3aXRoLg0KCQ0KCVBlb3BsZSBjYW4gYWx3YXlzIGNhbGwg
ZGVwcmVjYXRlZCBtZXRob2RzLg0KCQ0KCT4gSWYgeW91IGFyZSBrZWVuIHRvIHNlZSBsb3RzIG9m
IHBlb3BsZSB1c2luZyBodG1scGFyc2VyLCBJIHRoaW5rIHRoYXQgeW91DQoJPiBkb24ndCB3YW50
IHBlb3BsZSB0byBoYXZlIHRvIGNvbWUgdG8gdGVybXMgd2l0aCB0b28gbWFueSBuZXcgY29uY2Vw
dHMgYXQNCgk+IG9uY2UuICBZb3Ugc2F5IHlvdXJzZWxmIHRoYXQgdGhlIFZpc2l0b3IgcGF0dGVy
biB0YWtlcyBzb21lIGdldHRpbmcgdXNlZA0KCT4gdG8uICBJIHRoaW5rIHRoZSB3aG9sZSBzY2Fu
bmVyIGNvbmNlcHQgdGFrZXMgc29tZSBnZXR0aW5nIHVzZWQgdG8gLi4uLg0KCQ0KCU91ciBWaXNp
dG9yIGltcGxlbWVudGF0aW9uIGlzIHNvIHRyaXZpYWwgdGhhdCBmb2xrcyB3b24ndCBldmVuIGtu
b3cgdGhleSdyZQ0KCXVzaW5nIHRoZSBwYXR0ZXJuLg0KCQ0KCT4gPlNpbXBsaWNpdHkgaXMgcmVh
c29uICMzOiB0aGVyZSBpcyBsaXR0bGUgcmVhc29uIHRvIGZhdHRlbiB0aGUgaW50ZXJmYWNlcw0K
CW9mDQoJPiA+dGFnIGFuZCBub2RlIGNsYXNzZXMgd2l0aCB2YXJpb3VzIGRhdGEgYWNjdW11bGF0
aW9uL2FsdGVyYXRpb24gbWV0aG9kcw0KCXdoZW4NCgk+ID5vbmUgbWV0aG9kIGFuZCBhIHZhcmll
dHkgb2YgY29uY3JldGUgVmlzaXRvcnMgY2FuIGRvIHRoZSBqb2Igd2l0aCBtdWNoDQoJbGVzcw0K
CT4gPmNvZGUuDQoJPiA+DQoJPiB3ZWxsIEkgd291bGQgYWdyZWUgaWYgeW91IGNvdWxkIGd1YXJh
bnRlZSB0aGF0IHRoZXJlIHdpbGwgYmUgbm8NCgk+IGRpdmVyZ2VuY2Ugd2hhdHNvZXZlciBpbiBo
b3cgdGhlIGRpZmZlcmVudCBtZXRob2RzIHdpbGwgYmUgdXNlZC4gIElmIHlvdQ0KCT4gY2FuIGNy
ZWF0ZSBhIGZsZXhpYmxlIGVub3VnaCBpbXBsZW1lbnRhdGlvbiBvZiB0aGUgVmlzaXRvciBwYXR0
ZXJuIHRoZW4NCgk+IEkgZ3Vlc3MgdGhhdCB3aWxsIHN1cHBvcnQgYW55IHBvc3NpYmxlIGRpdmVy
Z2VuY2UgaW4gdGhlIHNlcGFyYXRlDQoJPiBtZXRob2RzLiBIb3dldmVyLCBJIHRoaW5rIHRoZXJl
IGlzIGEgcmVhc29uIHRvIGhhdmUgYSBmYXR0ZXIgaW50ZXJmYWNlLA0KCT4gaW4gdGhhdCBjb252
ZW5pZW5jZSBtZXRob2RzIGxvd2VyIHRoZSBiYXJyaWVyIHRvIGVudHJ5IGZvciBuZXcgdXNlcnMu
DQoJDQoJV291bGRuJ3QgaXQgYmUgbWFydmVsb3VzIGlmIHRoZSBiYXJyaWVyIHRvIGVudHJ5IHdh
cyBsb3cgYW5kIG91ciBjb2RlIHdhc24ndA0KCWJsb2F0d2FyZT8NCgkNCgk+IFBlcmhhcHMgaWRl
YWxseSBvbmUgaGFzIGEgd2VsbCBpbXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gdGhhdCBzdXBw
b3J0cw0KCT4gYSByYXcgbWV0aG9kIGFjY2VzcywgYW5kIGEgbnVtYmVyIG9mIGNvbnZlbmllbmNl
IG1ldGhvZHM/DQoJDQoJVGhlIGNoYW5nZXMgd2lsbCBtYWtlIHRoZSBjb2RlIGVhc2llciB0byB1
c2UgLSBpZiB0aGV5IGRvbid0LCB0aGV5IHJlYWxseQ0KCWNvb2wgdGhpbmcgYWJvdXQgc29mdHdh
cmUgaXMgdGhhdCBpdCBpcyBzb2Z0IC0gd2UgY2FuIGNoYW5nZSBpdC4NCgkNCgk+IEEgd2VsbCBp
bXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gd2lsbCwgSSBhc3N1bWUsIHN1cHBvcnQgYWxsIHNv
cnRzIG9mDQoJPiBkaWZmZXJlbnQgb3BlcmF0aW9ucywgYnV0IEkgd291bGQgZmVlbCBtdWNoIGhh
cHBpZXIgaWYgdGhlIGh0bWxwYXJzZXINCgk+IGhhZCBhIGNvbXBsZXRlIGphdmFkb2MgYW5kIGRv
Y3VtZW50YXRpb24gcmV2aWV3IGJlZm9yZSBhbnkgcmVmYWN0b3JpbmcNCgk+IHRvb2sgcGxhY2Uu
ICBQZW9wbGUgYXJlIHRyeWluZyB0byB1c2UgdGhlIGV4aXN0aW5nIHN5c3RlbSBhbmQgaGF2aW5n
DQoJPiB0cm91YmxlIG5vdCBiZWNhdXNlIG9mIHRoZSBsYWNrIG9mIHJlZmFjdG9yaW5nLCBidXQg
YSBsYWNrIG9mIHdlbGwNCgk+IGRlc2NyaWJlZCBtZXRob2RzLiAgV2VsbCBJIHNheSBwZW9wbGUs
IEkgbWVhbiBtZSwgSSBkb24ndCBrbm93IGlmIGFueW9uZQ0KCT4gZWxzZSBmZWVscyB0aGUgc2Ft
ZS4gIE1heWJlIGl0J3MganVzdCBtZSA6LSkNCgkNCglFeGVjdXRhYmxlIGRvY3VtZW50YXRpb24g
aXMgYSB0ZXJtIHdlIHVzZSBmb3IgQ3VzdG9tdGVyIFRlc3RzLiAgIEEgQ3VzdG9tZXINCglUZXN0
cyBzaG93cyBob3cgYSBib2R5IG9mIGNvZGUgZ2V0cyB1c2VkIHRvIHBlcmZvcm0gcmVhbCB0YXNr
cy4gICBZb3UgaGF2ZQ0KCXJlYWwtd29ybGQgdGFza3MgdG8gcGVyZm9ybS4gIFdlIGNhbiBwZXJm
b3JtIHRob3NlIHRhc2tzIHZpYSBhdXRvbWF0ZWQNCgl0ZXN0cy4gICBUaGVuLCB3aGVuIHdlIHJl
ZmFjdG9yLCB3ZSBtdXN0IG1ha2Ugc3VyZSBvdXIgZXhlY3V0YWJsZQ0KCWRvY3VtZW50YXRpb24g
aXMgdXAgdG8gZGF0ZSAoaS5lLiBpdCBwYXNzZXMgaXRzIHRlc3RzKS4gICBXZSBmaW5kIHRoaXMg
aWRlYWwNCglmb3IgbWFraW5nIHN1cmUgdGhhdCBkb2N1bWVudGF0aW9uIHJlZmxlY3RzIHdoYXQg
dGhlIGNvZGUgaXMgYWN0dWFsbHkgZG9pbmcuDQoJVGhpcyBhbHNvIGxlYXZlcyByb29tIGZvciBh
IGRvY3VtZW50IHRoYXQgZ2l2ZXMgc29tZSBncmFwaGljYWwNCglyZXByZXNlbnRhdGlvbiBvZiBo
b3cgdGhpbmdzIHdvcmssIHdoaWNoIG11c3QgYmUgbWFpbnRhaW5lZCBieSBzb21lb25lLCBsZXN0
DQoJaXQgZ2V0IG91dCBvZiBkYXRlLiAgV2UganVzdCBtaW5pbWl6ZSB0aGUgbmVlZCBmb3Igc3Vj
aCBkb2N1bWVudHMgYnkga2VlcGluZw0KCW91ciBjb2RlIHNpbXBsZSBhbmQgc21hbGwgYW5kIHN1
cnJvdW5kaW5nIGl0IHdpdGggZWFzeSB0byB1bmRlcnN0YW5kDQoJZXhlY3V0YWJsZSBkb2N1bWVu
dGF0aW9uLg0KCQ0KCWJlc3QgcmVnYXJkcw0KCWprDQoJDQoJDQoJDQoJDQoJDQoJDQoJDQoJLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRo
aXMgc2YubmV0IGVtYWlsIGlzIHNwb25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdl
ZWsgaGVhdmVuLg0KCWh0dHA6Ly90aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19f
X19fX19fX19fX19fX19fX19fX19fX19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWls
aW5nIGxpc3QNCglIdG1scGFyc2VyLWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglo
dHRwczovL2xpc3RzLnNvdXJjZWZvcmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRl
dmVsb3Blcg0KCQ0KDQo=

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-28 08:19:10

Hi Claude,
>I think it may well be that the group's size is hitting critical mass and
that minimal formalities are now required...

Good to hear from you. I quite agree that this project seems to be reaching
critical mass. We need to share a plan of action that we could all work on
parallelly.

First, I've been noticing that we're not consistent on the coding
convention. In the past, if I see a class that does not follow the
convention, I'd go ahead and fix it. However, as so many people are now
working parallelly, its important that we all share the same coding
convention. The one I follow is the standard Java coding convention. In
addition, I particularly don't like using names with single characters in
them - i.e. I'd prefer "tag" to "pTag". That is the convention followed in
most of the parser. The coding convention is of course open for discussion
if anyone feels that we should accomodate any particular practice.

Second, I am planning to simplify the design of the scanners. As Dhaval
mentioned earlier, and as Sam mentioned, it wasn't easy to write a new
scanner. I agree that the javadoc of getParsed() should be better than what
it is. However, I think the problem lies with the name itself. In the sense,
if we rename "parsed" to "parameterTable" or "attributeTable" - and
getParsed() becomes, getAttributeTable() - the javadoc would become
redundant. You could say that this is a "documentation" change, or you could
call this a "refactoring". Either way, it is developer and user-friendly. So
I think Sam's suggestions and our current direction are one and the same.

Third, there's way too much duplicate code in the tags. Almost every tag
runs thru its attributes to render it as html - whereas this can be done
from the superclass - HTMLTag. This is just one example. I've not been
focussing much on reducing duplication - and we can see serious code smells.
However, I think it is important to be able to quantify what we hope to
achieve by undertaking several rounds of refactoring. As we keep
refactoring, I will keep a tab on the size of the parser (using Lines of
Code). If there's any other metric that people think is important, we could
include that as well. It will be interesting to see how much code we can
take out.

Fourth, this is one project where change is welcome - all the time. Simply
bcos it has a massive suite of tests which represents most of the user
requirements that we've received till date. The real-value in the parser is
not really its production code - but its solid test suite. If there are
interesting design changes that will make the current system simpler, they
should be tried out without fear - for if we break anything, we will know
bcos of the automated testing. It is important that the testing suite is
therefore regarded just as important (if not more) as the production code,
and steps taken to improve its design, and make it very easy to add new
tests all the time.

Fifth, as the project grows - new features should keep coming in all the
time. Derrick Oswald has been working away at so many cool new features -
and he's been writing tests for them all - this is a much-appreciated
activity and must continue. I'd be happy to not think about new features and
completely focus on solidifying the existing system, and let Derrick focus
on adding new stuff - for v1.3.

Sixth, we have a very serious possibility of using AI in making tag
recognitions. I have to find the time to write about the current recognition
mechanism, and pass that on to Sam Joseph. Sam has considerable experience
in artificial intelligence, and it will be good for the project to have his
expertise in its core correction logic.

Seventh, it might be good to have a Wiki for the parser - as so many open
source projects do. That way, the entire burden of documentation is not on
any one person. We should be looking at options of having our own Wiki so we
can add content easily and collaboratively.

Pls feel free to add/modify this list if I've missed out on anything.

Regards,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 09:25:17

Hi Somik

Somik Raha wrote:

>Second, I am planning to simplify the design of the scanners. As Dhaval
>mentioned earlier, and as Sam mentioned, it wasn't easy to write a new
>scanner. I agree that the javadoc of getParsed() should be better than what
>it is. However, I think the problem lies with the name itself. In the sense,
>if we rename "parsed" to "parameterTable" or "attributeTable" - and
>getParsed() becomes, getAttributeTable() - the javadoc would become
>redundant. You could say that this is a "documentation" change, or you could
>call this a "refactoring". Either way, it is developer and user-friendly. So
>I think Sam's suggestions and our current direction are one and the same.
>  
>
I think that just changing the name of the method is not quite enough.   
Certainly changing the name of the method is a good step, but I think 
javadoc comments that actually show an example of what is being returned 
make a huge usability difference, particularly when the thing being 
returned is something as complex as a hashtable.  If we want to call 
changing method names and adding documentation, refactoring then I guess 
I am a huge fan of refactoring :-)  However the other part of what I was 
saying is that I would like to see the documentation additions before we 
change method names.  I mean you can deprecate away so as not to break 
my existing code ... I still think you should release a version with a 
full javadoc "refactoring" before you release anything with refactored 
code.  What's the harm in focusing on the documentation first?

 I mean we're on version 1.2 now right?  What about a version 1.2.1 that 
included the documentation fixes, before moving on to a 1.3-beta that 
included the newly refactored code that you're so keen to start work on. 
 What would be the downside of proceeding in this fashion?

>Sixth, we have a very serious possibility of using AI in making tag
>recognitions. I have to find the time to write about the current recognition
>mechanism, and pass that on to Sam Joseph. Sam has considerable experience
>in artificial intelligence, and it will be good for the project to have his
>expertise in its core correction logic.
>
This sounds like an interesting possibility, but I still need to 
understand how the current parser handles all the existing messy tags.   
I mean when you started talking through all the messy html examples I 
though you were going to be showing me something that didn't work and 
that you wanted some machine learning/AI to fix, but you ended up saying 
that all the examples could be handled by the exisiting parser.  If that 
is so, why do you want to change it, and are there examples of messy 
code that can't be handled by the parser?  If there aren't and its all 
working fine, what advantage do we gain by replacing the existing core 
correction logic with some hypothetical AI alternative?

>Seventh, it might be good to have a Wiki for the parser - as so many open
>source projects do. That way, the entire burden of documentation is not on
>any one person. We should be looking at options of having our own Wiki so we
>can add content easily and collaboratively.
>
You are most welcome to use the wiki I have set up in the short term:

http://www.neurogrid.net/devwiki/wikipages/HtmlParser

It uses an open source java wiki (devwiki), which is not as fully 
functional as it might be, but I've got it set up so everything gets 
backed up to CVS and all changes get sent to the neurogrid-cvs mailing 
list.  If nothing else it might serve as an example wiki environment.

CHEERS> SAM

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-28 09:39:39

Hi Sam,

> If we want to call
> changing method names and adding documentation, refactoring then I guess
> I am a huge fan of refactoring :-)

Changing method names is definitely refactoring. The idea is to use really
good names that make the javadoc redundant. If you can explain what javadoc
would achieve for a method like getAttributes() - I'd be happy to add
javadoc for that.

> However the other part of what I was
> saying is that I would like to see the documentation additions before we
> change method names.  I mean you can deprecate away so as not to break
> my existing code ... I still think you should release a version with a
> full javadoc "refactoring" before you release anything with refactored
> code.  What's the harm in focusing on the documentation first?

Simply bcos once the name is changed to a better one, docs are no longer
necessary. But I understand your predicament. We could have a deprecated
method name for important methods.

>  I mean we're on version 1.2 now right?  What about a version 1.2.1 that
> included the documentation fixes, before moving on to a 1.3-beta that
> included the newly refactored code that you're so keen to start work on.
>  What would be the downside of proceeding in this fashion?

Duplicate work. Maintaining parallel branches of code is the last thing we
need to do, considering limited resources. I don't see a problem in going
straight with 1.3-beta as we've got all our tests in place, and we'll also
use deprecated methods. That should work for you, right ?

> >Sixth, we have a very serious possibility of using AI in making tag
> >recognitions. I have to find the time to write about the current
recognition
> >mechanism, and pass that on to Sam Joseph. Sam has considerable
experience
> >in artificial intelligence, and it will be good for the project to have
his
> >expertise in its core correction logic.
> >
> This sounds like an interesting possibility, but I still need to
> understand how the current parser handles all the existing messy tags.
> I mean when you started talking through all the messy html examples I
> though you were going to be showing me something that didn't work and
> that you wanted some machine learning/AI to fix, but you ended up saying
> that all the examples could be handled by the exisiting parser.  If that
> is so, why do you want to change it, and are there examples of messy
> code that can't be handled by the parser?  If there aren't and its all
> working fine, what advantage do we gain by replacing the existing core
> correction logic with some hypothetical AI alternative?

The main problem is that the current logic is complicated, and uses
hard-coded logic. Such a system is hard to maintain, change. I was thinking
on the lines of a rule-based system, where the rule-base could be dynamic.
But, I will write a seperate mail on this issue as soon as I get some time.

> >Seventh, it might be good to have a Wiki for the parser - as so many open
> >source projects do. That way, the entire burden of documentation is not
on
> >any one person. We should be looking at options of having our own Wiki so
we
> >can add content easily and collaboratively.
> >
> You are most welcome to use the wiki I have set up in the short term:
>
> http://www.neurogrid.net/devwiki/wikipages/HtmlParser
>
> It uses an open source java wiki (devwiki), which is not as fully
> functional as it might be, but I've got it set up so everything gets
> backed up to CVS and all changes get sent to the neurogrid-cvs mailing
> list.  If nothing else it might serve as an example wiki environment.

Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with
current website) as in logos, page color, etc..

Cheers,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-28 10:23:17

Hi Somik

Somik Raha wrote:

>>If we want to call
>>changing method names and adding documentation, refactoring then I guess
>>I am a huge fan of refactoring :-)
>>    
>>
>
>Changing method names is definitely refactoring. The idea is to use really
>good names that make the javadoc redundant. If you can explain what javadoc
>would achieve for a method like getAttributes() - I'd be happy to add
>javadoc for that.
>
I don't think good method names make javadoc redundant, and I don't 
think they can.  For example Hashtable getAttributes()  tells me very 
little. What is an attribute? What are the classes in the hashtable? 
Give me three examples of what an attribute is and then you might have a 
good javadoc, e.g.

[Gets a hashtable of attribute-value pairs from the html tag, e.g. 
type-->hidden, name-->address, value-->default would come from the tag 
<input type="hidden" name="address" value="default">.  Attributes and 
values are all String classes. ]

Now I think that the method getAttributes with the above javadoc would 
be much easier to use and understand for somebody trying to use 
htmlparser for the first or even the third time, no?  I think you need 
to be careful about what is obvious for you, who have designed the 
system, and what is obvious for a novice user.  

>>However the other part of what I was
>>saying is that I would like to see the documentation additions before we
>>change method names.  I mean you can deprecate away so as not to break
>>my existing code ... I still think you should release a version with a
>>full javadoc "refactoring" before you release anything with refactored
>>code.  What's the harm in focusing on the documentation first?
>>    
>>
>
>Simply bcos once the name is changed to a better one, docs are no longer
>necessary. 
>
See my point above.  I think you make more than a name change to make 
the api novice-user friendly.  And my comments about the need for more 
detailed javadocs go right through the whole project, not just the few 
examples I've given.  

>> I mean we're on version 1.2 now right?  What about a version 1.2.1 that
>>included the documentation fixes, before moving on to a 1.3-beta that
>>included the newly refactored code that you're so keen to start work on.
>> What would be the downside of proceeding in this fashion?
>>    
>>
>
>Duplicate work. Maintaining parallel branches of code is the last thing we
>need to do, considering limited resources. I don't see a problem in going
>straight with 1.3-beta as we've got all our tests in place, and we'll also
>use deprecated methods. That should work for you, right ?
>
Sure, having deprecated methods works for me, but it's not just about 
me.  I mean you've already released 1.2  People are already using it, 
building code around it.  You have to maintain it anyway, or rather 
users will continue asking you questions about it.  I'm suggesting that 
a 1.2.1 improved documentation release would mean less work in having to 
maintain things because the users could work from the javadocs.  I would 
also think that going through and writing thorough javadocs for your 
existing version would put you in a much better position to write a good 
1.3-beta.  Limited resources is an important concern, and I think you'll 
conserve resources by more thorough documentation first and re-writing 
your code second.  

As long as you have deprecated methods then my own code base will be 
fine, so it makes little difference to me in the short term which way 
you go.  However I think in the long term, a 1.2.1 documentation release 
would improve the quality of the project overall, and set a good 
precedent for high quality javadocs in subsequent releases.  I get the 
sense that you are really keen to plow into re-writing all the code, 
what's the hurry?

>>This sounds like an interesting possibility, but I still need to
>>understand how the current parser handles all the existing messy tags.
>>I mean when you started talking through all the messy html examples I
>>though you were going to be showing me something that didn't work and
>>that you wanted some machine learning/AI to fix, but you ended up saying
>>that all the examples could be handled by the exisiting parser.  If that
>>is so, why do you want to change it, and are there examples of messy
>>code that can't be handled by the parser?  If there aren't and its all
>>working fine, what advantage do we gain by replacing the existing core
>>correction logic with some hypothetical AI alternative?
>>    
>>
>
>The main problem is that the current logic is complicated, and uses
>hard-coded logic. Such a system is hard to maintain, change. I was thinking
>on the lines of a rule-based system, where the rule-base could be dynamic.
>But, I will write a seperate mail on this issue as soon as I get some time.
>
Ah, I think I understand.  What you want is some framework for dealing 
with new types of mess as they come up.  A system that can have more 
rules added to without too much work, or even a framework for trying to 
guess which rules to apply in which document ...

>>>Seventh, it might be good to have a Wiki for the parser - as so many open
>>>source projects do. That way, the entire burden of documentation is not
>>>      
>>>
>on
>  
>
>>>any one person. We should be looking at options of having our own Wiki so
>>>      
>>>
>we
>  
>
>>>can add content easily and collaboratively.
>>>
>>>      
>>>
>>You are most welcome to use the wiki I have set up in the short term:
>>
>>http://www.neurogrid.net/devwiki/wikipages/HtmlParser
>>
>>It uses an open source java wiki (devwiki), which is not as fully
>>functional as it might be, but I've got it set up so everything gets
>>backed up to CVS and all changes get sent to the neurogrid-cvs mailing
>>list.  If nothing else it might serve as an example wiki environment.
>>    
>>
>
>Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with
>current website) as in logos, page color, etc..
>
I guess so.  I think that could be arranged.  Send me a copy of the logo 
etc., and I'll see what I can do, although it might take a couple of 
weeks for me to find time to do it.

CHEERS> SAM

RE: [Htmlparser-developer] toPlainTextString() feedback requested

From: Claude D. <CD...@ar...> - 2002-12-28 23:47:45

SWYgSSBtYXkuLi4gV2F0Y2ggb3V0IGZvciBmZWF0dXJlIGNyZWVwISBBZGRpbmcgQUkgc291bmRz
IGxpa2UgYSBwYW5hY2VhIGJ1dCBpdCdzIG5vdC4gSSBoYXZlIGNvbnNpZGVyYWJsZSBleHBlcmll
bmNlIHdpdGggQUkgbXlzZWxmIGFuZCBpdCByYXJlbHkgbWVldHMgZXhwZWN0YXRpb25zIHVubGVz
cyB5b3UgY2FuIGNsZWFybHkgZGVmaW5lIHRoZSBvYmplY3RpdmVzIHVwIGZyb250LiBJZiB5b3Ug
cGxhbiB0byB0cmFpbiBjbGFzc2lmZXJzLCB5b3UnbGwgbmVlZCB0byBjb2xsZWN0IGEgZmFpcmx5
IGxhcmdlLCByZWxldmFudCB0cmFpbmluZyBzZXQuIEV2ZW4gdGhlbiwgdGhpcyBtYXkgbm90IHNv
bHZlIGFzIG1hbnkgcHJvYmxlbXMgYXMgaXQgbWF5IGNhdXNlIGFuZCB5b3UgbmVlZCB0byBtYWtl
IGEgZ29vZCBkZWNpc2lvbiBiYXNlZCBvbiBmYWN0LiANCiANClNvbWUgb2YgeW91ciBtZXNzYWdl
cyBzdWdnZXN0IHlvdSBhcmUgZWFzaWx5IHRha2VuIGJ5IHdpei1iYW5nIHRlY2hub2xvZ2llcyBh
bmQgY29vbCBuZXcgZmVhdHVyZXMgKHdoaWNoIEkgc3VmZmVyIGZyb20gbXlzZWxmIGFuZCBuZWVk
IHRvIGJhbGFuY2UpLiBZb3UgbmVlZCB0byBjb25zaWRlciB0aGVzZSBmZWF0dXJlcyBlc3BlY2lh
bGx5IGNhcmVmdWxseSBhbmQgcmVhbGl6ZSB0aGF0IHRoaXMgYmlhcyBjYW4gYmUgZGV0cmltZW50
YWwgdW5kZXIgc29tZSBjaXJjdW1zdGFuY2VzLiBDZXJ0YWlubHksIHRoZSBwcm9qZWN0IG5lZWRz
IHRvIGJlIGZ1biwgYnV0IGRvbid0IGdldCB0b28gY2F1Z2h0IHVwIGluIGZlYXR1cmUgZW52eS4g
TWFrZSBzdXJlIHlvdSBrZWVwIHRoZSBjb3VwbGluZyBsb29zZSB0aHJvdWdoIGludGVyZmFjZXMs
IGVzcGVjaWFsbHkgaWYgeW91IGFkZCBBSSBjb21wb25lbnRzLCBhbmQgcGxlYXNlIGFsbG93IHVz
ZXJzIHRvIGRlY2lkZSB3aGljaCBmZWF0dXJlcyB0aGV5IHdhbnQgYnkga2VlcGluZyB0aGUgYXJj
aGl0ZWN0dXJlIGFzIHBsdWdnYWJsZSBhcyBwb3NzaWJsZS4gVGhpcyB3aWxsIGJlbmVmaXQgYm90
aCB5b3VyIHVzZXIgYW5kIGRldmVsb3BtZW50IGNvbW11bml0aWVzLg0KIA0KLS0tLS1PcmlnaW5h
bCBNZXNzYWdlLS0tLS0gDQpGcm9tOiBTb21payBSYWhhIFttYWlsdG86c29taWtAeWFob28uY29t
XSANClNlbnQ6IFNhdCAxMi8yOC8yMDAyIDEyOjIwIEFNIA0KVG86IGh0bWxwYXJzZXItZGV2ZWxv
cGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiANClN1YmplY3Q6IFJlOiBbSHRtbHBhcnNl
ci1kZXZlbG9wZXJdIHRvUGxhaW5UZXh0U3RyaW5nKCkgZmVlZGJhY2sgcmVxdWVzdGVkDQoNCg0K
DQoJSGkgQ2xhdWRlLA0KCT5JIHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mg
c2l6ZSBpcyBoaXR0aW5nIGNyaXRpY2FsIG1hc3MgYW5kDQoJdGhhdCBtaW5pbWFsIGZvcm1hbGl0
aWVzIGFyZSBub3cgcmVxdWlyZWQuLi4NCgkNCglHb29kIHRvIGhlYXIgZnJvbSB5b3UuIEkgcXVp
dGUgYWdyZWUgdGhhdCB0aGlzIHByb2plY3Qgc2VlbXMgdG8gYmUgcmVhY2hpbmcNCgljcml0aWNh
bCBtYXNzLiBXZSBuZWVkIHRvIHNoYXJlIGEgcGxhbiBvZiBhY3Rpb24gdGhhdCB3ZSBjb3VsZCBh
bGwgd29yayBvbg0KCXBhcmFsbGVsbHkuDQoJDQoJRmlyc3QsIEkndmUgYmVlbiBub3RpY2luZyB0
aGF0IHdlJ3JlIG5vdCBjb25zaXN0ZW50IG9uIHRoZSBjb2RpbmcNCgljb252ZW50aW9uLiBJbiB0
aGUgcGFzdCwgaWYgSSBzZWUgYSBjbGFzcyB0aGF0IGRvZXMgbm90IGZvbGxvdyB0aGUNCgljb252
ZW50aW9uLCBJJ2QgZ28gYWhlYWQgYW5kIGZpeCBpdC4gSG93ZXZlciwgYXMgc28gbWFueSBwZW9w
bGUgYXJlIG5vdw0KCXdvcmtpbmcgcGFyYWxsZWxseSwgaXRzIGltcG9ydGFudCB0aGF0IHdlIGFs
bCBzaGFyZSB0aGUgc2FtZSBjb2RpbmcNCgljb252ZW50aW9uLiBUaGUgb25lIEkgZm9sbG93IGlz
IHRoZSBzdGFuZGFyZCBKYXZhIGNvZGluZyBjb252ZW50aW9uLiBJbg0KCWFkZGl0aW9uLCBJIHBh
cnRpY3VsYXJseSBkb24ndCBsaWtlIHVzaW5nIG5hbWVzIHdpdGggc2luZ2xlIGNoYXJhY3RlcnMg
aW4NCgl0aGVtIC0gaS5lLiBJJ2QgcHJlZmVyICJ0YWciIHRvICJwVGFnIi4gVGhhdCBpcyB0aGUg
Y29udmVudGlvbiBmb2xsb3dlZCBpbg0KCW1vc3Qgb2YgdGhlIHBhcnNlci4gVGhlIGNvZGluZyBj
b252ZW50aW9uIGlzIG9mIGNvdXJzZSBvcGVuIGZvciBkaXNjdXNzaW9uDQoJaWYgYW55b25lIGZl
ZWxzIHRoYXQgd2Ugc2hvdWxkIGFjY29tb2RhdGUgYW55IHBhcnRpY3VsYXIgcHJhY3RpY2UuDQoJ
DQoJU2Vjb25kLCBJIGFtIHBsYW5uaW5nIHRvIHNpbXBsaWZ5IHRoZSBkZXNpZ24gb2YgdGhlIHNj
YW5uZXJzLiBBcyBEaGF2YWwNCgltZW50aW9uZWQgZWFybGllciwgYW5kIGFzIFNhbSBtZW50aW9u
ZWQsIGl0IHdhc24ndCBlYXN5IHRvIHdyaXRlIGEgbmV3DQoJc2Nhbm5lci4gSSBhZ3JlZSB0aGF0
IHRoZSBqYXZhZG9jIG9mIGdldFBhcnNlZCgpIHNob3VsZCBiZSBiZXR0ZXIgdGhhbiB3aGF0DQoJ
aXQgaXMuIEhvd2V2ZXIsIEkgdGhpbmsgdGhlIHByb2JsZW0gbGllcyB3aXRoIHRoZSBuYW1lIGl0
c2VsZi4gSW4gdGhlIHNlbnNlLA0KCWlmIHdlIHJlbmFtZSAicGFyc2VkIiB0byAicGFyYW1ldGVy
VGFibGUiIG9yICJhdHRyaWJ1dGVUYWJsZSIgLSBhbmQNCglnZXRQYXJzZWQoKSBiZWNvbWVzLCBn
ZXRBdHRyaWJ1dGVUYWJsZSgpIC0gdGhlIGphdmFkb2Mgd291bGQgYmVjb21lDQoJcmVkdW5kYW50
LiBZb3UgY291bGQgc2F5IHRoYXQgdGhpcyBpcyBhICJkb2N1bWVudGF0aW9uIiBjaGFuZ2UsIG9y
IHlvdSBjb3VsZA0KCWNhbGwgdGhpcyBhICJyZWZhY3RvcmluZyIuIEVpdGhlciB3YXksIGl0IGlz
IGRldmVsb3BlciBhbmQgdXNlci1mcmllbmRseS4gU28NCglJIHRoaW5rIFNhbSdzIHN1Z2dlc3Rp
b25zIGFuZCBvdXIgY3VycmVudCBkaXJlY3Rpb24gYXJlIG9uZSBhbmQgdGhlIHNhbWUuDQoJDQoJ
VGhpcmQsIHRoZXJlJ3Mgd2F5IHRvbyBtdWNoIGR1cGxpY2F0ZSBjb2RlIGluIHRoZSB0YWdzLiBB
bG1vc3QgZXZlcnkgdGFnDQoJcnVucyB0aHJ1IGl0cyBhdHRyaWJ1dGVzIHRvIHJlbmRlciBpdCBh
cyBodG1sIC0gd2hlcmVhcyB0aGlzIGNhbiBiZSBkb25lDQoJZnJvbSB0aGUgc3VwZXJjbGFzcyAt
IEhUTUxUYWcuIFRoaXMgaXMganVzdCBvbmUgZXhhbXBsZS4gSSd2ZSBub3QgYmVlbg0KCWZvY3Vz
c2luZyBtdWNoIG9uIHJlZHVjaW5nIGR1cGxpY2F0aW9uIC0gYW5kIHdlIGNhbiBzZWUgc2VyaW91
cyBjb2RlIHNtZWxscy4NCglIb3dldmVyLCBJIHRoaW5rIGl0IGlzIGltcG9ydGFudCB0byBiZSBh
YmxlIHRvIHF1YW50aWZ5IHdoYXQgd2UgaG9wZSB0bw0KCWFjaGlldmUgYnkgdW5kZXJ0YWtpbmcg
c2V2ZXJhbCByb3VuZHMgb2YgcmVmYWN0b3JpbmcuIEFzIHdlIGtlZXANCglyZWZhY3RvcmluZywg
SSB3aWxsIGtlZXAgYSB0YWIgb24gdGhlIHNpemUgb2YgdGhlIHBhcnNlciAodXNpbmcgTGluZXMg
b2YNCglDb2RlKS4gSWYgdGhlcmUncyBhbnkgb3RoZXIgbWV0cmljIHRoYXQgcGVvcGxlIHRoaW5r
IGlzIGltcG9ydGFudCwgd2UgY291bGQNCglpbmNsdWRlIHRoYXQgYXMgd2VsbC4gSXQgd2lsbCBi
ZSBpbnRlcmVzdGluZyB0byBzZWUgaG93IG11Y2ggY29kZSB3ZSBjYW4NCgl0YWtlIG91dC4NCgkN
CglGb3VydGgsIHRoaXMgaXMgb25lIHByb2plY3Qgd2hlcmUgY2hhbmdlIGlzIHdlbGNvbWUgLSBh
bGwgdGhlIHRpbWUuIFNpbXBseQ0KCWJjb3MgaXQgaGFzIGEgbWFzc2l2ZSBzdWl0ZSBvZiB0ZXN0
cyB3aGljaCByZXByZXNlbnRzIG1vc3Qgb2YgdGhlIHVzZXINCglyZXF1aXJlbWVudHMgdGhhdCB3
ZSd2ZSByZWNlaXZlZCB0aWxsIGRhdGUuIFRoZSByZWFsLXZhbHVlIGluIHRoZSBwYXJzZXIgaXMN
Cglub3QgcmVhbGx5IGl0cyBwcm9kdWN0aW9uIGNvZGUgLSBidXQgaXRzIHNvbGlkIHRlc3Qgc3Vp
dGUuIElmIHRoZXJlIGFyZQ0KCWludGVyZXN0aW5nIGRlc2lnbiBjaGFuZ2VzIHRoYXQgd2lsbCBt
YWtlIHRoZSBjdXJyZW50IHN5c3RlbSBzaW1wbGVyLCB0aGV5DQoJc2hvdWxkIGJlIHRyaWVkIG91
dCB3aXRob3V0IGZlYXIgLSBmb3IgaWYgd2UgYnJlYWsgYW55dGhpbmcsIHdlIHdpbGwga25vdw0K
CWJjb3Mgb2YgdGhlIGF1dG9tYXRlZCB0ZXN0aW5nLiBJdCBpcyBpbXBvcnRhbnQgdGhhdCB0aGUg
dGVzdGluZyBzdWl0ZSBpcw0KCXRoZXJlZm9yZSByZWdhcmRlZCBqdXN0IGFzIGltcG9ydGFudCAo
aWYgbm90IG1vcmUpIGFzIHRoZSBwcm9kdWN0aW9uIGNvZGUsDQoJYW5kIHN0ZXBzIHRha2VuIHRv
IGltcHJvdmUgaXRzIGRlc2lnbiwgYW5kIG1ha2UgaXQgdmVyeSBlYXN5IHRvIGFkZCBuZXcNCgl0
ZXN0cyBhbGwgdGhlIHRpbWUuDQoJDQoJRmlmdGgsIGFzIHRoZSBwcm9qZWN0IGdyb3dzIC0gbmV3
IGZlYXR1cmVzIHNob3VsZCBrZWVwIGNvbWluZyBpbiBhbGwgdGhlDQoJdGltZS4gRGVycmljayBP
c3dhbGQgaGFzIGJlZW4gd29ya2luZyBhd2F5IGF0IHNvIG1hbnkgY29vbCBuZXcgZmVhdHVyZXMg
LQ0KCWFuZCBoZSdzIGJlZW4gd3JpdGluZyB0ZXN0cyBmb3IgdGhlbSBhbGwgLSB0aGlzIGlzIGEg
bXVjaC1hcHByZWNpYXRlZA0KCWFjdGl2aXR5IGFuZCBtdXN0IGNvbnRpbnVlLiBJJ2QgYmUgaGFw
cHkgdG8gbm90IHRoaW5rIGFib3V0IG5ldyBmZWF0dXJlcyBhbmQNCgljb21wbGV0ZWx5IGZvY3Vz
IG9uIHNvbGlkaWZ5aW5nIHRoZSBleGlzdGluZyBzeXN0ZW0sIGFuZCBsZXQgRGVycmljayBmb2N1
cw0KCW9uIGFkZGluZyBuZXcgc3R1ZmYgLSBmb3IgdjEuMy4NCgkNCglTaXh0aCwgd2UgaGF2ZSBh
IHZlcnkgc2VyaW91cyBwb3NzaWJpbGl0eSBvZiB1c2luZyBBSSBpbiBtYWtpbmcgdGFnDQoJcmVj
b2duaXRpb25zLiBJIGhhdmUgdG8gZmluZCB0aGUgdGltZSB0byB3cml0ZSBhYm91dCB0aGUgY3Vy
cmVudCByZWNvZ25pdGlvbg0KCW1lY2hhbmlzbSwgYW5kIHBhc3MgdGhhdCBvbiB0byBTYW0gSm9z
ZXBoLiBTYW0gaGFzIGNvbnNpZGVyYWJsZSBleHBlcmllbmNlDQoJaW4gYXJ0aWZpY2lhbCBpbnRl
bGxpZ2VuY2UsIGFuZCBpdCB3aWxsIGJlIGdvb2QgZm9yIHRoZSBwcm9qZWN0IHRvIGhhdmUgaGlz
DQoJZXhwZXJ0aXNlIGluIGl0cyBjb3JlIGNvcnJlY3Rpb24gbG9naWMuDQoJDQoJU2V2ZW50aCwg
aXQgbWlnaHQgYmUgZ29vZCB0byBoYXZlIGEgV2lraSBmb3IgdGhlIHBhcnNlciAtIGFzIHNvIG1h
bnkgb3Blbg0KCXNvdXJjZSBwcm9qZWN0cyBkby4gVGhhdCB3YXksIHRoZSBlbnRpcmUgYnVyZGVu
IG9mIGRvY3VtZW50YXRpb24gaXMgbm90IG9uDQoJYW55IG9uZSBwZXJzb24uIFdlIHNob3VsZCBi
ZSBsb29raW5nIGF0IG9wdGlvbnMgb2YgaGF2aW5nIG91ciBvd24gV2lraSBzbyB3ZQ0KCWNhbiBh
ZGQgY29udGVudCBlYXNpbHkgYW5kIGNvbGxhYm9yYXRpdmVseS4NCgkNCglQbHMgZmVlbCBmcmVl
IHRvIGFkZC9tb2RpZnkgdGhpcyBsaXN0IGlmIEkndmUgbWlzc2VkIG91dCBvbiBhbnl0aGluZy4N
CgkNCglSZWdhcmRzLA0KCVNvbWlrDQoJDQoJDQoJDQoJLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRoaXMgc2YubmV0IGVtYWlsIGlzIHNw
b25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdlZWsgaGVhdmVuLg0KCWh0dHA6Ly90
aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f
X19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWlsaW5nIGxpc3QNCglIdG1scGFyc2Vy
LWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglodHRwczovL2xpc3RzLnNvdXJjZWZv
cmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRldmVsb3Blcg0KCQ0KDQo=

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-29 00:11:42

Hi Claude,

> If I may... Watch out for feature creep! Adding AI sounds like a panacea
but it's not. I have considerable experience with AI myself and it rarely
meets expectations unless you can clearly define the objectives up front. If
you plan to train classifers, you'll need to collect a fairly large,
relevant training set. Even then, this may not solve as many problems as it
may cause and you need to make a good decision based on fact.
>
> Some of your messages suggest you are easily taken by wiz-bang
technologies and cool new features (which I suffer from myself and need to
balance). You need to consider these features especially carefully and
realize that this bias can be detrimental under some circumstances.
Certainly, the project needs to be fun, but don't get too caught up in
feature envy. Make sure you keep the coupling loose through interfaces,
especially if you add AI components, and please allow users to decide which
features they want by keeping the architecture as pluggable as possible.
This will benefit both your user and development communities.

I agree with your opinion - I'm keeping the AI stuff a low priority for now,
as I mentioned - focussing only on refactoring for the moment. However, I am
hoping that we can do some evaluation, and decide if we should go for it in
v1.4. The real problem is, we keep finding real-life dirty html that just
does not follow any rules. Just sometime back, I found a bug in
HTMLStringNode..

If we have html like :

<script language="javascript">
var lower = '<%=lowerValue%>';
</script>

the StringNode does not realize that the jsp tag within quotes is not to be
handled as a tag but a string node. I found this problem while refactoring
today. Upon modifying the parsing automata and including a
PARSE_IGNORE_STATE for HTMLStringNode, I was able to handle this case. But
thanks to the tests, I found that we had another failing test, wherein we
had something like :

<A href="somelink.html">Kaarle's homepage</A>

As you can see, the single apostrophe caused the damage. Now, I had to code
in extra logic to also verify that the next character must be an opening tag
for the string node automaton to move into ignoring state.

Good news is that all tests are passing. But I am not really sure that this
philosophy of modifying the parsing automata logic is all that great. If
cases such as this can be coded outside the parser - perhaps with regular
expressions, we will no longer need to modify the code of the parser each
time. However, there'd probably be a performance hit as compared to the
system that we have now, and we've got to look at it really carefully before
we go for it.

We've been going vertical all this while, so I am just trying to go lateral
to see where that might take us - just to generate more possibilities. I'm
thinking that we could try a seperate implementation - writing the parsing
logic from scratch, and using the existing test mechanism to verify. This
could be a seperate module in the htmlparser project, for experimentation
only - to evaluate feasibility.

Regards,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 18:48:57

Hi Derrick,
    You've raised some interesting points..

    Looking at it from an end-user perspective, prior to using the visitor,
the code for collecting text from a page would be :

StringBuffer textCollectionBuffer = new StringBuffer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
}
String result = textCollectionBuffer.toString();

After putting in the visitor, the code looks like :

Renderer plainTextRenderer = new PlainTextRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(plainTextRenderer);
}
String result = plainTextRenderer.getResult();

IMO, from a user perspective, there's almost no extra complexity. You'd
anyway collect results into a string buffer, while the renderer now does
that for you.

If it was just for the plain text renderer, we would not be justified in
this refactoring. However, the code for getting html out would look like :
Renderer htmlRenderer = new HTMLRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(htmlRenderer);
}
String result = htmlRenderer.getResult();

And now, when ripping webpages, we would like to modify the links and images
to a base location in the local disk. We could have modifying html renderer,
like this:

Renderer  modifyingLinksAndImagesRenderer = new
ModifyingLinksAndImagesRenderer();
...

Any modification in the mechanism of rendering is now easily handled, and we
don't have to modify the parser per se, for every new user requirement.

> It breaks encapsulation.  It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

In light of the above, this is not a bad thing. At the same time, putting in
the visitor has caused/will cause refactoring of the tag classes so that
most of the logic is still encapsulated in the tags, and the visitor only
makes calls to the tag's concrete methods.
e.g. The visitor still does NOT know how to render a tag's html. Such logic
continues to be encapsulated within the tag. The visitor only deals with a
high-level abstraction.

> It makes the system brittle.  The visitor class has a method defined for
> every subclass of visited object.  Adding, removing or refactoring any
> visited class breaks the visitors.

A valid point indeed. To counter, in our imlpementation, there is a lot of
duplication, and quite a few classes really don't need their own toHTML()
method. The parent class - HTMLTag, should take care of single tags. Those
tags with children always have duplicated code to recurse through their
children. We can refactor such functionality out, so that, tags with
children conform to a recursion interface. Using such an interface, you wont
have the recursion loops in each and every node class, but you could handle
it uniformly outside. So, the visitor does not need to take care of each and
every subclass. It will only take care of some special ones only, while
visiting the rest as a simple HTMLTag.

> It's obscure.  Looking at the visit loop tells you nothing of what is
> being perfomed.  The only indication of the task might be the name of
> the visitor class.

That is again not a bad thing - for well-intentioned naming reduces the need
for excessive documentation. You should be able to look at a class name and
have a fairly good idea of what it does. A look at the API of the class
would confirm your guess.

At the same time, looking at the visit loop above, you'd see that it looks
intuitive enough - the loops is not in the visitor, it is still in the
client code.

> It's redundant.  In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

This is the point we're really thinking about. I've personally been thinking
on this for some months - for we did have realistic requirements of being
able to modify links and images in a ripper, in a uniform manner. In fact,
the current version of the parser already has a visitor. The Sample Programs
makes not mention of the word "visitor" to avoid controversy and confusion
:), but if you look at
http://htmlparser.sourceforge.net/samples/ripper.html - this is what's
happening. I must add that although this visitor does provide a solution, I
wasn't very happy with it. We've been able to refactor this out, and provide
a uniform mechanism now.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change)

I think we're talking the same language here - this is also our vision. The
code you see on the samples page is an example of that (although not a very
good one). We're working on those lines, and the first step was to introduce
such levels of abstraction.

Regards,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Kaarle K. <kaa...@ik...> - 2002-12-24 21:02:14

At 15:01 24.12.2002 -0500, you wrote:
>Somik Raha wrote:

You seem to be very active today!

So am I. Merry Christmas to you all!

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 21:49:42

Hi Kaarle,

> You seem to be very active today!
> 
> So am I. Merry Christmas to you all!
> 

Merry Christmas to everyone and to you! 

Cheers,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Derrick O. <Der...@ro...> - 2002-12-24 20:00:21

Somik Raha wrote:

>Hi Derrick,
>    You've raised some interesting points..
>
>    Looking at it from an end-user perspective, prior to using the visitor,
>the code for collecting text from a page would be :
>
>StringBuffer textCollectionBuffer = new StringBuffer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
>}
>String result = textCollectionBuffer.toString();
>
>After putting in the visitor, the code looks like :
>
>Renderer plainTextRenderer = new PlainTextRenderer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    e.nextHTMLNode().renderWith(plainTextRenderer);
>}
>String result = plainTextRenderer.getResult();
>
>IMO, from a user perspective, there's almost no extra complexity. You'd
>anyway collect results into a string buffer, while the renderer now does
>that for you.
>
>If it was just for the plain text renderer, we would not be justified in
>this refactoring. However, the code for getting html out would look like :
>Renderer htmlRenderer = new HTMLRenderer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    e.nextHTMLNode().renderWith(htmlRenderer);
>}
>String result = htmlRenderer.getResult();
>
>And now, when ripping webpages, we would like to modify the links and images
>to a base location in the local disk. We could have modifying html renderer,
>like this:
>
>Renderer  modifyingLinksAndImagesRenderer = new
>ModifyingLinksAndImagesRenderer();
>...
>
>Any modification in the mechanism of rendering is now easily handled, and we
>don't have to modify the parser per se, for every new user requirement.
>
The first two examples gain you nothing from using a visitor pattern. 
 The plainTextRenderer and HTMLRenderer call the toPlainTextString() and 
toHTML() methods because only the tag knows how to render itself into 
text or HTML (see your comment below).

Your last example indicates the visitor class knows which parameters 
have links and which parameters have images for each tag and I would 
suggest this knowledge should be in the tag class itself.  If the base 
class (HTMLTag I presume) had an abstract method signature 'toHTML(URL 
baseURL)', then code within the class would adjust relative link and 
image references based on it's value, and the current behaviour of 
'toHRML()' would just call toHTML(null).  The code is not duplicated, 
only refactored to handle a base URL.

>
>  
>
>>It breaks encapsulation.  It presumes that an exterior class (the
>>visitor) knows how to do something to a class better than the class
>>itself does.
>>    
>>
>
>In light of the above, this is not a bad thing. At the same time, putting in
>the visitor has caused/will cause refactoring of the tag classes so that
>most of the logic is still encapsulated in the tags, and the visitor only
>makes calls to the tag's concrete methods.
>e.g. The visitor still does NOT know how to render a tag's html. Such logic
>continues to be encapsulated within the tag. The visitor only deals with a
>high-level abstraction.
>
So why bother with the extra overhead of the visitor class?  Logically, 
either the visitor has code that differs from the tag code or it 
doesn't.  If it doesn't, then cut the Gordian knot and use the simplest 
mechanism and forget the visitor.  If it does differ, the code within 
the visitor either uses the methods of the visitee only or it has some 
new methods with logic and data that are external to the visitee.  In 
the former case, that code that only uses methods on the visitee could 
be a method on the vistee's superclass.  If it has new methods, either 
the visitors extra logic and data are specific to each vistee or they 
are of a generic nature.  If they are specific to the visitee, then they 
should be within the visitee class and not in a catch-all visitor class, 
otherwise if they are of a general nature they might as well be moved to 
the superclass (or a utility class if they don't belong there).  There 
is really no need for a visitor pattern.

>
>  
>
>>It makes the system brittle.  The visitor class has a method defined for
>>every subclass of visited object.  Adding, removing or refactoring any
>>visited class breaks the visitors.
>>    
>>
>
>A valid point indeed. To counter, in our imlpementation, there is a lot of
>duplication, and quite a few classes really don't need their own toHTML()
>method. The parent class - HTMLTag, should take care of single tags. Those
>tags with children always have duplicated code to recurse through their
>children. We can refactor such functionality out, so that, tags with
>children conform to a recursion interface. Using such an interface, you wont
>have the recursion loops in each and every node class, but you could handle
>it uniformly outside. So, the visitor does not need to take care of each and
>every subclass. It will only take care of some special ones only, while
>visiting the rest as a simple HTMLTag.
>  
>
This The duplication should be refactored to superclasses.  You may need 
is subclass of HTMLTag that has a 'container' interface, but why not put 
that interface on HTMLTag itself and have a default implementation that 
is an empty container.  Then FORM and FRAMESET and other container tags 
derive from an abstract class (HTMLContainerTag ?) that derives from 
HTMLTag but overrides this behaviour and provides real recursion support.

>  
>
>>It's obscure.  Looking at the visit loop tells you nothing of what is
>>being perfomed.  The only indication of the task might be the name of
>>the visitor class.
>>    
>>
>
>That is again not a bad thing - for well-intentioned naming reduces the need
>for excessive documentation. You should be able to look at a class name and
>have a fairly good idea of what it does. A look at the API of the class
>would confirm your guess.
>
>At the same time, looking at the visit loop above, you'd see that it looks
>intuitive enough - the loops is not in the visitor, it is still in the
>client code.
>  
>
But the StringBuffer is now hidden and it's not obvious that a 
getResult() method call returns it's contents, so you have an 
indirection to look up the visitor class.

>  
>
>>It's redundant.  In every example I've seen (I've never seen it used in
>>real applications), the result could have been coded in a simpler
>>fashion with correct class design.
>>    
>>
>
>This is the point we're really thinking about. I've personally been thinking
>on this for some months - for we did have realistic requirements of being
>able to modify links and images in a ripper, in a uniform manner. In fact,
>the current version of the parser already has a visitor. The Sample Programs
>makes not mention of the word "visitor" to avoid controversy and confusion
>:), but if you look at
>http://htmlparser.sourceforge.net/samples/ripper.html - this is what's
>happening. I must add that although this visitor does provide a solution, I
>wasn't very happy with it. We've been able to refactor this out, and provide
>a uniform mechanism now.
>  
>
That's not really a visitor example, that's a callback, which is 
obviated if the toHTML(URL) signature is used.

>  
>
>>I would only use the visitor pattern if either the set of public methods
>>on the visited nodes completely specify their possible behavior and
>>allowed the client to fully manipulate them at the highest level of
>>abstraction (i.e. everything you can do to the visited classes is
>>already coded and you don't want them to change)
>>    
>>
>
>I think we're talking the same language here - this is also our vision. The
>code you see on the samples page is an example of that (although not a very
>good one). We're working on those lines, and the first step was to introduce
>such levels of abstraction.
>
>Regards,
>Somik
>
>
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Htmlparser-developer mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>
>  
>

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Joshua K. <jo...@in...> - 2002-12-27 18:52:41

> IMHO the visitor pattern is not a good one. Here are the reasons.

The Visitor pattern is a fine pattern Derrick.  Ralph Johnson once said,
"Most of the time you don't need a Visitor, but when you need a Visitor, you
really need a Visitor."

> It breaks encapsulation.  It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

A good many of the Design Patterns "break encapsulation."   I guess those
aren't good patterns, huh?   Is there an OO police agency to report
instances of breaking encapsulation?   ;-)

> It makes the system brittle.  The visitor class has a method defined for
> every subclass of visited object.  Adding, removing or refactoring any
> visited class breaks the visitors.

Hard-coded, duplicated logic in nodes makes class hierarchies brittle.

We've found that the html parser code doesn't need special visit methods for
each tag type, which means it's easy to implement Visitor and it won't be
hard to accomodate visiting new tags in the future.

BTW, do you envision needing to add many new tag types to the html parser?

> It's obscure.  Looking at the visit loop tells you nothing of what is
> being perfomed.  The only indication of the task might be the name of
> the visitor class.

If you call a method that yields the plain text string for some html, and
that code happens to use a Visitor, do you care?   That is, a client won't
even know they're using the pattern.

> It's redundant.  In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

I agree that it's always best to seek out the simple design.  It takes a
committment to refactoring to find simplicity.  Presently, Somik and I are
removing duplication to get to a simpler design.  You have our promise that
no Visitor will be added if it complicates the design.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change), or the visitor pattern
> had already been coded into the visited nodes and for some reason they
> could not be modified further (i.e. the code was available in binary
> form only).  I don't think either of these applies here.

Visitor is a great pattern when you need various operations to be performed
over an object structure and you don't want to hard-code those operations
into the objects themselves.  So far, it seems to me you have just that need
in html parser.  I could be wrong.  We'll have to see how the code feels
with a Visitor implementation and make decisions based on real code.  CVS
allows us to go back to any previous implementation we like better.

best regards,
jk