Thread: Re: [Htmlparser-developer] toPlainTextString() feedback requested
Brought to you by:
derrickoswald
From: Derrick O. <Der...@ro...> - 2002-12-24 17:47:30
|
IMHO the visitor pattern is not a good one. Here are the reasons. It breaks encapsulation. It presumes that an exterior class (the visitor) knows how to do something to a class better than the class itself does. It makes the system brittle. The visitor class has a method defined for every subclass of visited object. Adding, removing or refactoring any visited class breaks the visitors. It's obscure. Looking at the visit loop tells you nothing of what is being perfomed. The only indication of the task might be the name of the visitor class. It's redundant. In every example I've seen (I've never seen it used in real applications), the result could have been coded in a simpler fashion with correct class design. I would only use the visitor pattern if either the set of public methods on the visited nodes completely specify their possible behavior and allowed the client to fully manipulate them at the highest level of abstraction (i.e. everything you can do to the visited classes is already coded and you don't want them to change), or the visitor pattern had already been coded into the visited nodes and for some reason they could not be modified further (i.e. the code was available in binary form only). I don't think either of these applies here. my $0.02 Derrick |
From: <dha...@or...> - 2002-12-26 09:32:50
Attachments:
BDY.RTF
|
Hi, I agree with Sam when he says that "don't fix it when its not broken". But at the same time I see the need to make code better, more readable and simpler. However as suggested below it seems that the Visitor pattern is going to make things difficult to understand. And I too like Sam had a problem with HTMLParser initially. It may be with the documentation but this whole tag-scanner things was extremely confusing in the beginning(though subsequently I have loved them so much that I have even written a few myself without any problem). Hence I think that even if the visitor pattern is used, the user must have some simple easy-to-use methods to get his work done rather than try to understand the Visitor pattern. Just like Same, this was my two cents. Cheers, Dhaval -----Original Message----- From: gaijin [mailto:ga...@yh...] Sent: Tuesday, December 24, 2002 12:21 PM To: htmlparser-developer Cc: gaijin; htmlparser-user Subject: Re: [Htmlparser-developer] toPlainTextString() feedback requested Hi Somik and Joshua Joshua Kerievsky wrote: >>Could you explain why you want to refactor these methods? Remember the >>danger of premature refactoring ... you lose flexibility that then has >>to be re-added later on, making more work in the long run. >> >> >There's a good deal of duplicate code in way the two toHTML methods and the >toPlainTextString method do their work. The central theme is information >accumulation/alteration. That involves outputing tag and node results and >recusing through tags. The refactoring to Visitor allows us to > >* remove many lines of duplicate code, spread across many classes >* remove hard-coded accumulation/alteration logic, thereby making it easier >for clients to get the data they need > >Visitor takes some getting used to. I rarely use the pattern. In this case, >IMO, it was a good fit. > The Visitor pattern sounds interesting, and I look forward to hearing more about it. However, duplicated code itself is not IMO necessarily an evil. It all depends on whether one thinks that the duplicated components are going to diverge in functionality in the future. If you are sure they are not, then fine, refactor away. I guess my surprise at your (or perhaps Somik's) focus on refactoring comes from the fact that while the htmlparser is a great piece of software, the javadocs and other documentation could use some attention. For example, I can't find any explanation in the javadocs or otherwise of how the filters are supposed to work with the different scanners, or what values they are allowed to take. I generally work to "if it's not broken don't fix it", but I often add "before you start fixing it, make sure your documentation is up to date". Using the Visitor pattern may make it easier for clients to get the data they need, but given that the htmlparser is "working" (well it works for me), I would say that the more urgent issue here is making sure all the documentation is up to date. I have a lot of positive things to say about htmlparser, so don't take it the wrong way, when I say that the biggest problem I've had in using it in the last few weeks is inadequate javadocs. >>Is there some efficiency reason why you want to refactor these methods >>or is it just for neatness? >> >> > >Duplication removal is reason #1. > As I mention above. One should be careful of duplication removal for the sake of it. > Removal of hard-coded logic is reason #2. > This is a good reason. However I get the feeling that introduction of these Visitor classes will make the system conceptually more difficult to use rather than easier. I would feel better if the current set up was more fully documented before more complexity was added. And even if the Visitor pattern is used, I would recommend leaving methods like toPlainTextString() etc in place, but just making them short cut implementations to certain kinds of visitor-using methods. This will allow people who have yet to grasp the Visitor pattern something to work with. If you are keen to see lots of people using htmlparser, I think that you don't want people to have to come to terms with too many new concepts at once. You say yourself that the Visitor pattern takes some getting used to. I think the whole scanner concept takes some getting used to .... >Simplicity is reason #3: there is little reason to fatten the interfaces of >tag and node classes with various data accumulation/alteration methods when >one method and a variety of concrete Visitors can do the job with much less >code. > well I would agree if you could guarantee that there will be no divergence whatsoever in how the different methods will be used. If you can create a flexible enough implementation of the Visitor pattern then I guess that will support any possible divergence in the separate methods. However, I think there is a reason to have a fatter interface, in that convenience methods lower the barrier to entry for new users. Perhaps ideally one has a well implemented Visitor pattern that supports a raw method access, and a number of convenience methods? A well implemented Visitor pattern will, I assume, support all sorts of different operations, but I would feel much happier if the htmlparser had a complete javadoc and documentation review before any refactoring took place. People are trying to use the existing system and having trouble not because of the lack of refactoring, but a lack of well described methods. Well I say people, I mean me, I don't know if anyone else feels the same. Maybe it's just me :-) Just my two cents. CHEERS> SAM ------------------------------------------------------- This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf _______________________________________________ Htmlparser-developer mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Somik R. <so...@ya...> - 2002-12-27 07:53:19
|
Hi Dhaval, Sam, > I agree with Sam when he says that "don't fix it when its not broken". I quite disagree with both of u on this philosophy. Looking at the code with Joshua has made me realize its actually ghastly. A serious code cleanup is needed, there is just too much duplication! Most of the scanner code is unreadable, while the tag constructors are a nightmare. I dont think I will be at peace till a couple of rounds of refactoring has been completed. I do not think the panic on the visitor is warranted. Like I said before, the current access methods will continue to be present. However, having a visitor will make life simpler, as in - there's so much code now that uses the same loop over and over again. We can replace code like : Vector links = new Vector(); for (Enumeration e = parser.elements();e.hasMoreElements();) { HTMLNode node = (HTMLNode)e.nextElement(); if (node instanceof HTMLLinkTag) { links.add(node); } } with : HTMLLinkVisitor linkVisitor = new HTMLLinkVisitor(); collectNodesWith(linkVisitor); Vector links = linkVisitor.getResult(); This looks so much more readable and simple. Of course, you could still use the old way - its just that you would have the option of making life easier. In the latest round of refactorings, we've put in a HTMLCompositeTag, from which the link, form and select tags inherit. The toPlainTextString() and toHTML() methods are now in the parent class. All tests passing (except the charset tests). Regards, Somik |
From: Claude D. <CD...@ar...> - 2002-12-28 03:35:04
|
SSB0cmllZCB0byByZWZyYWluIGZyb20ganVtcGluZyBpbiwgYnV0IGhlcmUgYXJlIGEgY291cGxl IG9mIGNlbnRzIHdvcnRoIG9mIG9waW5pb24sIGp1c3QgZm9yIHRoZSBob2xsaWRheXMgOy0pLg0K IA0KSW4gbXkgdmlldywgZGlzdGlsbGluZyBzb2Z0d2FyZSBkb3duIHRvIGVzc2VudGlhbHMgaXMg cGFydCBvZiBpbXByb3ZpbmcgZGVzaWduLiBJdCBjdXRzIGRvd24gb24gY29kZSB0aGF0IHdvdWxk IGxhdGVyIG5lZWQgdG8gYmUgbWFpbnRhaW5lZC4gUmVkdW5kYW5jeSBpcyBleHBlbnNpdmUuIE1v cmUgY29kZSBuZWVkcyB0byBjaGFuZ2UgZXZlcnkgdGltZSBhIGJ1ZyBpcyBmaXhlZCwgcG90ZW50 aWFsbHkgbWlzc2luZyBvbmUgb2YgdGhlIHZhcmlhbnRzIGFuZCBpbnRyb2R1Y2luZyBkaWZmaWN1 bHQgdG8gZGlhZ25vc2Ugc2l0dWF0aW9ucy4gTGVzcyBjb2RlIGlzIGJldHRlci4gQXMgZGVzaWdu cyBpbXByb3ZlLCB0aGUgY29kZSBiYXNlIHdpbGwgdHlwaWNhbGx5IHNocmluay4gUmVmYWN0b3Jp bmcgYW5kIGFwcGx5aW5nIGVmZmVjdGl2ZSBwYXR0ZXJucyBpcyBrZXkgdG8gbWFraW5nIHRoaXMg cG9zc2libGUuDQogDQpUaGF0IGJlaW5nIHNhaWQsIEkgdGhpbmsgZ29vZCBzb2Z0d2FyZSBkZXNp Z24gaGluZ2VzIG9uIHdlbGwgZGVmaW5lZCB1c2UgY2FzZXMuIElmIGNoYW5nZXMgbmVlZCB0byBi ZSBtYWRlIHRvIG1lZXQgdGhlIHJlcXVpcmVtZW50cywgdGhhdCdzIHdoYXQgbmVlZHMgdG8gaGFw cGVuLiBSZWZhY3RvcmluZywgZGlzdGlsbGluZywgc2ltcGxpZnlpbmcsIHJlbW92aW5nIGRlYWQg Y29kZSBhbmQgZG9jdW1lbnRhdGlvbiBhcmUgYWxsIHBhcnQgb2YgZXZlcnkgcmVsZWFzZS4gUmVm YWN0b3JpbmcgZm9yIHRoZSBzYWtlIG9mIHJlZmFjdG9yaW5nIHNob3VsZCBiZSBhdm9pZGVkLCBh cyB3ZWxsIGFzIGZvcmdpbmcgYWhlYWQgd2l0aG91dCBhIHdlbGwtY29uc2lkZXJlZCBhbmQgcmV2 aWV3ZWQgcGxhbi4gSWYgdGhlIHBsYW4gc2hvd3MgdGhhdCByZWZhY3RvcmluZyB3aWxsIGltcHJv dmUgZGVzaWduLCBkZWxpdmVyIGJldHRlciBzb2Z0d2FyZSwgbWVldCByZXF1aXJlbWVudHMgYW5k IHNhdGlzZnkgdGFyZ2V0IHVzZSBjYXNlcywgdGhlbiBpdCBpcyBwcm9iYWJseSBhIGdvb2QgcGxh biBhbmQgc2hvdWxkIGJlIGltcGxlbWVudGVkLCBzbyBsb25nIGFzIHRoZXJlJ3MgZW5vdWdoIGFn cmVlbWVudCB0byB2YWxpZGF0ZSB0aGUgb2JqZWN0aXZlcyBhbmQgc3RyYXRlZ3kuDQogDQpJIHRo aW5rIG1vc3Qgb2YgdGhpcyBkaXNjdXNzaW9uIGNlbnRlcnMgb24gYSBsYWNrIG9mIGRvY3VtZW50 YXRpb24sIGJ1dCBub3QgdXNlciBkb2N1bWVudGF0aW9uLiBUaGUgZ3JvdXAgaXMgbGFja2luZyBp biBwcm9jZXNzLiBJZiBhIHBsYW4gaXMgd29ydGggcGVyc3VpbmcsIGl0J3Mgd29ydGggd3JpdGlu ZyB1cCBhIHN1bW1hcnksIHdpdGggc3VmZmljaWVudCBqdXN0aWZpY2F0aW9uIHRvIHNhdGlmeSB0 aGUgZ3JvdXAuIElmIHRoYXQgcGxhbiBpcyB0aGVuIHN1YmplY3RlZCB0byByZXZpZXcsIEkgZXhw ZWN0IGl0IGNhbiBiZSB2YWxpZGF0ZWQgcXVpY2tseSBhbmQgZXZlcnlvbmUgY2FuIGZlZWwgY29u ZmlkZW50IHRoYXQgcHJvcG9zZWQgY2hhbmdlcyB3aWxsIGltcHJvdmUgdGhlIGRlc2lnbiBhbmQg bm90IHNpbXBseSBpbnRyb2R1Y2UgbW9yZSB3b3JrIG9yIGNoYW5nZSBmb3IgdGhlIHNha2Ugb2Yg Y2hhbmdlLg0KIA0KSW4gcHJpbmNpcGxlLCBJIGhhdmVudCByZWFkIGFueXRoaW5nIG9uIHRoZSBs aXN0IHRoYXQgbWFkZSBtZSBuZXJ2b3VzLiBTaWxlbmNlIHNob3VsZCBiZSB0YWtlbiBhcyBpbXBs aWNpdCBzdXBwb3J0LiBJZiBhIGRvY3VtZW50IGlzIGNpcmN1bGF0ZWQgYW5kIG5vIG9iamVjdGlv bnMgYXJlIG5vdGVkLCB0aGVyZSBpcyBubyByZWFzb24gdG8gd2FpdCBpbmRlZmluaXRlbHksIHRo b3VnaCBhIHJlYXNvbmFibGUgcmV2aWV3IHByb2Nlc3Mgc2hvdWxkIHByb2JhYmx5IGJlIGFwcGxp ZWQuIENvbnRyaWJ1dG9ycyBhcmUgZG9pbmcgdGhlIHdvcmsgYW5kIGhhdmUgYSByaWdodCB0byBk aXJlY3QgdGhlaXIgb3duIGVmZm9ydHMuIEluIG15IGV4cGVyaWVuY2UsIGhvd2V2ZXIsIGV2ZW4g c2hvcnQgcHJvY2VzcyBkb2N1bWVudHMgY2FuIG1ha2UgYSB3b3JsZCBvZiBkaWZmZXJlbmNlLiBJ IHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mgc2l6ZSBpcyBoaXR0aW5nIGNy aXRpY2FsIG1hc3MgYW5kIHRoYXQgbWluaW1hbCBmb3JtYWxpdGllcyBhcmUgbm93IHJlcXVpcmVk Li4uDQogDQpIYXBweSBOZXcgWWVhciBFdmVyeW9uZSENCiANCi0tLS0tT3JpZ2luYWwgTWVzc2Fn ZS0tLS0tIA0KRnJvbTogSm9zaHVhIEtlcmlldnNreSBbbWFpbHRvOmpvc2h1YUBpbmR1c3RyaWFs bG9naWMuY29tXSANClNlbnQ6IEZyaSAxMi8yNy8yMDAyIDEwOjMzIEFNIA0KVG86IGh0bWxwYXJz ZXItZGV2ZWxvcGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiBodG1scGFyc2VyLXVzZXJA bGlzdHMuc291cmNlZm9yZ2UubmV0IA0KU3ViamVjdDogUmU6IFtIdG1scGFyc2VyLWRldmVsb3Bl cl0gdG9QbGFpblRleHRTdHJpbmcoKSBmZWVkYmFjayByZXF1ZXN0ZWQNCg0KDQoNCglTYW0gd3Jv dGU6DQoJPiBUaGUgVmlzaXRvciBwYXR0ZXJuIHNvdW5kcyBpbnRlcmVzdGluZywgYW5kIEkgbG9v ayBmb3J3YXJkIHRvIGhlYXJpbmcNCgk+IG1vcmUgYWJvdXQgaXQuICBIb3dldmVyLCBkdXBsaWNh dGVkIGNvZGUgaXRzZWxmIGlzIG5vdCBJTU8gbmVjZXNzYXJpbHkNCgk+IGFuIGV2aWwuIEl0IGFs bCBkZXBlbmRzIG9uIHdoZXRoZXIgb25lIHRoaW5rcyB0aGF0IHRoZSBkdXBsaWNhdGVkDQoJPiBj b21wb25lbnRzIGFyZSBnb2luZyB0byBkaXZlcmdlIGluIGZ1bmN0aW9uYWxpdHkgaW4gdGhlIGZ1 dHVyZS4gIElmIHlvdQ0KCT4gYXJlIHN1cmUgdGhleSBhcmUgbm90LCB0aGVuIGZpbmUsIHJlZmFj dG9yIGF3YXkuDQoJDQoJQXQgdGhlIG1vbWVudCwgU29taWsgYW5kIEkgc3BlY3VsYXRlIHRoYXQg NDAlIHRvIDUwJSBvZiB0aGUgY3VycmVudCBodG1sDQoJcGFyc2VyIGNvZGUgYmFzZSBpcyBwdXJl IGZhdCAtLSB1bm5lY2Vzc2FyeSBjb2RlIHRoYXQgb25seSBzZXJ2ZXMgdG8gYmxvYXQNCgl0aGUg Y29kZSBiYXNlLiAgRHVwbGljYXRlIGNvZGUsIGluIHN1YnRsZSBvciBub3Qtc28tc3VidGxlIHZl cnNpb25zLCBpcw0KCW1vc3RseSByZXNwb25zaWJsZSBmb3IgdGhlIGJsb2F0LiAgSW4gbXkgZXhw ZXJpZW5jZSBpbiBpbmR1c3RyeSwgSSdkIHNheQ0KCXRoYXQgOTUlIG9mIHRoZSB0aW1lLCBkdXBs aWNhdGUgY29kZSBpcyBiYWQuDQoJDQoJPiBJIGd1ZXNzIG15IHN1cnByaXNlIGF0IHlvdXIgKG9y IHBlcmhhcHMgU29taWsncykgZm9jdXMgb24gcmVmYWN0b3JpbmcNCgk+IGNvbWVzIGZyb20gdGhl IGZhY3QgdGhhdCB3aGlsZSB0aGUgaHRtbHBhcnNlciBpcyBhIGdyZWF0IHBpZWNlIG9mDQoJPiBz b2Z0d2FyZSwgdGhlIGphdmFkb2NzIGFuZCBvdGhlciBkb2N1bWVudGF0aW9uIGNvdWxkIHVzZSBz b21lIGF0dGVudGlvbi4NCgk+ICBGb3IgZXhhbXBsZSwgSSBjYW4ndCBmaW5kIGFueSBleHBsYW5h dGlvbiBpbiB0aGUgamF2YWRvY3Mgb3Igb3RoZXJ3aXNlDQoJPiBvZiBob3cgdGhlIGZpbHRlcnMg YXJlIHN1cHBvc2VkIHRvIHdvcmsgd2l0aCB0aGUgZGlmZmVyZW50IHNjYW5uZXJzLCBvcg0KCT4g d2hhdCB2YWx1ZXMgdGhleSBhcmUgYWxsb3dlZCB0byB0YWtlLg0KCQ0KCVRoZSBodG1sIHBhcnNl ciBpcyBpbiBzb3JlIG5lZWQgb2YgcmVmYWN0b3JpbmcgU2FtLiAgSW4gZmFjdCwgc29tZXRpbWVz IGNvZGUNCgliZWNvbWVzIHVubmVjZXNzYXJpbHkgY29tcGxleCwgaW4gd2hpY2ggY2FzZSBwZW9w bGUgKnJlYWxseSogbmVlZA0KCWRvY3VtZW50YXRpb24uICBJIGNvbmZyb250IHN1Y2ggYSBwcm9i bGVtIGJ5IGZpcnN0IGFza2luZyBpZiB0aGUgY29kZSBjb3VsZA0KCWJlIHNpbXBsaWZpZWQgc28g d2UgZGlkbid0IG5lZWQgc28gbXVjaCBkb2N1bWVudGF0aW9uLiAgIEluIHRoZSBldmVudCB0aGF0 DQoJd2UgZG8gbmVlZCBkb2NzLCBpdCBpcyBiZXN0IHRvIHdyaXRlIGV4ZWN1dGFibGUgZG9jdW1l bnRhdGlvbi4gIEFyZSB5b3UNCglmYW1pbGlhciB3aXRoIGV4ZWN1dGFibGUgZG9jdW1lbnRhdGlv bj8NCgkNCgk+IEkgZ2VuZXJhbGx5IHdvcmsgdG8gImlmIGl0J3Mgbm90IGJyb2tlbiBkb24ndCBm aXggaXQiLCBidXQgSSBvZnRlbiBhZGQNCgk+ICJiZWZvcmUgeW91IHN0YXJ0IGZpeGluZyBpdCwg bWFrZSBzdXJlIHlvdXIgZG9jdW1lbnRhdGlvbiBpcyB1cCB0byBkYXRlIi4NCgkNCglTb2Z0d2Fy ZSBiZWNvbWVzIGJyaXR0bGUgYW5kIGJsb2F0ZWQgdW5kZXIgdGhlIHBoaWxvc29waHkgb2YgImlm IGl0IGFpbid0DQoJYnJva2UsIGRvbid0IGZpeCBpdC4iICAgV2UgaGF2ZSBjbGllbnRzIHdpdGgg Y29kZSBiYXNlcyB0aGF0IGFyZSAyIG1pbGxpb25zDQoJbGluZXMgb2YgIndvcmtpbmciIHNwZWdo ZXR0aSBjb2RlLiAgVGhleSBuZWVkIGxvdHMgb2YgaGVscCB0byBsZWFybiB0byBkbw0KCWNvbnRp bnVvdXMgcmVmYWN0b3JpbmcuDQoJDQoJPiBVc2luZyB0aGUgVmlzaXRvciBwYXR0ZXJuIG1heSBt YWtlIGl0IGVhc2llciBmb3IgY2xpZW50cyB0byBnZXQgdGhlIGRhdGENCgk+IHRoZXkgbmVlZCwg YnV0IGdpdmVuIHRoYXQgdGhlIGh0bWxwYXJzZXIgaXMgIndvcmtpbmciICh3ZWxsIGl0IHdvcmtz IGZvcg0KCT4gbWUpLCBJIHdvdWxkIHNheSB0aGF0IHRoZSBtb3JlIHVyZ2VudCBpc3N1ZSBoZXJl IGlzIG1ha2luZyBzdXJlIGFsbCB0aGUNCgk+IGRvY3VtZW50YXRpb24gaXMgdXAgdG8gZGF0ZS4g ICBJIGhhdmUgYSBsb3Qgb2YgcG9zaXRpdmUgdGhpbmdzIHRvIHNheQ0KCT4gYWJvdXQgaHRtbHBh cnNlciwgc28gZG9uJ3QgdGFrZSBpdCB0aGUgd3Jvbmcgd2F5LCB3aGVuIEkgc2F5IHRoYXQgdGhl DQoJPiBiaWdnZXN0IHByb2JsZW0gSSd2ZSBoYWQgaW4gdXNpbmcgaXQgaW4gdGhlIGxhc3QgZmV3 IHdlZWtzIGlzIGluYWRlcXVhdGUNCgk+IGphdmFkb2NzLg0KCQ0KCUknbSBub3QgY29udGVudCB0 byBsaXZlIHdpdGggdGhlIHdvcmxkIGFzIGl0IGlzIFNhbS4gIElmIHNvbWV0aGluZyBpc24ndA0K CWVhc3ksIGl0J3Mgd3JvbmcuICBUaGUgYmVzdCBzb2Z0d2FyZSBpbiB0aGUgd29ybGQgaXMgZWFz eSB0byB1c2UsIGl0J3MNCglzZWxmLWV4cGxhbmF0b3J5LiAgV2Ugc2hvdWxkIGFsd2F5cyBzdHJp dmUgZm9yIHRoYXQuICBJbiB0aGUgbWVhbnRpbWUsIGlmDQoJeW91IGhhdmUgdGFza3MgdG8gY29t cGxldGUgYW5kIGRvbid0IGtub3cgaG93IHRvIGRvIHRoZW0gYmVjYXVzZSBvZiBsYWNrIG9mDQoJ ZG9jcywgSSdkIHN1Z2dlc3QgeW91IGFzayBxdWVzdGlvbnMgaGVyZS4gIFRoZSBiZXN0IHJlc3Vs dCB3aWxsIGJlDQoJZXhlY3V0YWJsZSBkb2N1bWVudGF0aW9uIGZvciB0aGUgaHRtbCBwYXJzZXIu DQoJDQoJPiA+PklzIHRoZXJlIHNvbWUgZWZmaWNpZW5jeSByZWFzb24gd2h5IHlvdSB3YW50IHRv IHJlZmFjdG9yIHRoZXNlIG1ldGhvZHMNCgk+ID4+b3IgaXMgaXQganVzdCBmb3IgbmVhdG5lc3M/ DQoJPiA+Pg0KCT4gPj4NCgk+ID4NCgk+ID5EdXBsaWNhdGlvbiByZW1vdmFsIGlzIHJlYXNvbiAj MS4NCgk+ID4NCgk+IEFzIEkgbWVudGlvbiBhYm92ZS4gIE9uZSBzaG91bGQgYmUgY2FyZWZ1bCBv ZiBkdXBsaWNhdGlvbiByZW1vdmFsIGZvcg0KCT4gdGhlIHNha2Ugb2YgaXQuDQoJDQoJQW5kIGFz IEkgbWVudGlvbmVkIGFib3ZlLCBJIGNvbXBsZXRlbHkgZGlzYWdyZWUgd2l0aCB5b3UuICBJIHdv bmRlciwgaG93DQoJbXVjaCByZWZhY3RvcmluZyBoYXZlIHlvdSBkb25lIGluIHlvdXIgY2FyZWVy IGdpdmVuIHlvdSBwaGlsb3NvcGh5IG9mICJpZiBpdA0KCWFpbid0IGJyb2tlLCBkb24ndCBmaXgg aXQ/IiAgIEhhdmUgeW91IHJlYWQgTWFydGluIEZvd2xlcidzIGxhbmRtYXJrIGJvb2ssDQoJUmVm YWN0b3Jpbmc/ICAgSWYgbm90LCBJJ2Qgc3VnZ2VzdCB5b3Ugc3R1ZHkgaXQgdGhvcm91Z2hseSAt IHlvdSdsbCBiZSBhDQoJYmV0dGVyIHByb2dyYW1tZXIgZm9yIGl0Lg0KCQ0KCT4gPiBSZW1vdmFs IG9mIGhhcmQtY29kZWQgbG9naWMgaXMgcmVhc29uICMyLg0KCT4gPg0KCT4gVGhpcyBpcyBhIGdv b2QgcmVhc29uLiAgSG93ZXZlciBJIGdldCB0aGUgZmVlbGluZyB0aGF0IGludHJvZHVjdGlvbiBv Zg0KCT4gdGhlc2UgVmlzaXRvciBjbGFzc2VzIHdpbGwgbWFrZSB0aGUgc3lzdGVtIGNvbmNlcHR1 YWxseSBtb3JlIGRpZmZpY3VsdA0KCT4gdG8gdXNlIHJhdGhlciB0aGFuIGVhc2llci4gIEkgd291 bGQgZmVlbCBiZXR0ZXIgaWYgdGhlIGN1cnJlbnQgc2V0IHVwDQoJPiB3YXMgbW9yZSBmdWxseSBk b2N1bWVudGVkIGJlZm9yZSBtb3JlIGNvbXBsZXhpdHkgd2FzIGFkZGVkLg0KCQ0KCUFzIEkgYWxz byBzYWlkIGFib3ZlLCBpZiBpdCBhaW4ndCBzaW1wbGUsIGl0J3Mgd3JvbmcuICBPdXIgY2hhbmdl cyB3aWxsIG5vdA0KCWFkZCBjb21wbGV4aXR5IC0gdGhhdCB3b3VsZCBiZSBmb29saXNoLg0KCQ0K CT4gQW5kIGV2ZW4gaWYgdGhlIFZpc2l0b3IgcGF0dGVybiBpcyB1c2VkLCBJIHdvdWxkIHJlY29t bWVuZCBsZWF2aW5nDQoJPiBtZXRob2RzIGxpa2UgdG9QbGFpblRleHRTdHJpbmcoKSBldGMgaW4g cGxhY2UsIGJ1dCBqdXN0IG1ha2luZyB0aGVtDQoJPiBzaG9ydCBjdXQgaW1wbGVtZW50YXRpb25z IHRvIGNlcnRhaW4ga2luZHMgb2YgdmlzaXRvci11c2luZyBtZXRob2RzLg0KCT4gIFRoaXMgd2ls bCBhbGxvdyBwZW9wbGUgd2hvIGhhdmUgeWV0IHRvIGdyYXNwIHRoZSBWaXNpdG9yICBwYXR0ZXJu DQoJPiBzb21ldGhpbmcgdG8gd29yayB3aXRoLg0KCQ0KCVBlb3BsZSBjYW4gYWx3YXlzIGNhbGwg ZGVwcmVjYXRlZCBtZXRob2RzLg0KCQ0KCT4gSWYgeW91IGFyZSBrZWVuIHRvIHNlZSBsb3RzIG9m IHBlb3BsZSB1c2luZyBodG1scGFyc2VyLCBJIHRoaW5rIHRoYXQgeW91DQoJPiBkb24ndCB3YW50 IHBlb3BsZSB0byBoYXZlIHRvIGNvbWUgdG8gdGVybXMgd2l0aCB0b28gbWFueSBuZXcgY29uY2Vw dHMgYXQNCgk+IG9uY2UuICBZb3Ugc2F5IHlvdXJzZWxmIHRoYXQgdGhlIFZpc2l0b3IgcGF0dGVy biB0YWtlcyBzb21lIGdldHRpbmcgdXNlZA0KCT4gdG8uICBJIHRoaW5rIHRoZSB3aG9sZSBzY2Fu bmVyIGNvbmNlcHQgdGFrZXMgc29tZSBnZXR0aW5nIHVzZWQgdG8gLi4uLg0KCQ0KCU91ciBWaXNp dG9yIGltcGxlbWVudGF0aW9uIGlzIHNvIHRyaXZpYWwgdGhhdCBmb2xrcyB3b24ndCBldmVuIGtu b3cgdGhleSdyZQ0KCXVzaW5nIHRoZSBwYXR0ZXJuLg0KCQ0KCT4gPlNpbXBsaWNpdHkgaXMgcmVh c29uICMzOiB0aGVyZSBpcyBsaXR0bGUgcmVhc29uIHRvIGZhdHRlbiB0aGUgaW50ZXJmYWNlcw0K CW9mDQoJPiA+dGFnIGFuZCBub2RlIGNsYXNzZXMgd2l0aCB2YXJpb3VzIGRhdGEgYWNjdW11bGF0 aW9uL2FsdGVyYXRpb24gbWV0aG9kcw0KCXdoZW4NCgk+ID5vbmUgbWV0aG9kIGFuZCBhIHZhcmll dHkgb2YgY29uY3JldGUgVmlzaXRvcnMgY2FuIGRvIHRoZSBqb2Igd2l0aCBtdWNoDQoJbGVzcw0K CT4gPmNvZGUuDQoJPiA+DQoJPiB3ZWxsIEkgd291bGQgYWdyZWUgaWYgeW91IGNvdWxkIGd1YXJh bnRlZSB0aGF0IHRoZXJlIHdpbGwgYmUgbm8NCgk+IGRpdmVyZ2VuY2Ugd2hhdHNvZXZlciBpbiBo b3cgdGhlIGRpZmZlcmVudCBtZXRob2RzIHdpbGwgYmUgdXNlZC4gIElmIHlvdQ0KCT4gY2FuIGNy ZWF0ZSBhIGZsZXhpYmxlIGVub3VnaCBpbXBsZW1lbnRhdGlvbiBvZiB0aGUgVmlzaXRvciBwYXR0 ZXJuIHRoZW4NCgk+IEkgZ3Vlc3MgdGhhdCB3aWxsIHN1cHBvcnQgYW55IHBvc3NpYmxlIGRpdmVy Z2VuY2UgaW4gdGhlIHNlcGFyYXRlDQoJPiBtZXRob2RzLiBIb3dldmVyLCBJIHRoaW5rIHRoZXJl IGlzIGEgcmVhc29uIHRvIGhhdmUgYSBmYXR0ZXIgaW50ZXJmYWNlLA0KCT4gaW4gdGhhdCBjb252 ZW5pZW5jZSBtZXRob2RzIGxvd2VyIHRoZSBiYXJyaWVyIHRvIGVudHJ5IGZvciBuZXcgdXNlcnMu DQoJDQoJV291bGRuJ3QgaXQgYmUgbWFydmVsb3VzIGlmIHRoZSBiYXJyaWVyIHRvIGVudHJ5IHdh cyBsb3cgYW5kIG91ciBjb2RlIHdhc24ndA0KCWJsb2F0d2FyZT8NCgkNCgk+IFBlcmhhcHMgaWRl YWxseSBvbmUgaGFzIGEgd2VsbCBpbXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gdGhhdCBzdXBw b3J0cw0KCT4gYSByYXcgbWV0aG9kIGFjY2VzcywgYW5kIGEgbnVtYmVyIG9mIGNvbnZlbmllbmNl IG1ldGhvZHM/DQoJDQoJVGhlIGNoYW5nZXMgd2lsbCBtYWtlIHRoZSBjb2RlIGVhc2llciB0byB1 c2UgLSBpZiB0aGV5IGRvbid0LCB0aGV5IHJlYWxseQ0KCWNvb2wgdGhpbmcgYWJvdXQgc29mdHdh cmUgaXMgdGhhdCBpdCBpcyBzb2Z0IC0gd2UgY2FuIGNoYW5nZSBpdC4NCgkNCgk+IEEgd2VsbCBp bXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gd2lsbCwgSSBhc3N1bWUsIHN1cHBvcnQgYWxsIHNv cnRzIG9mDQoJPiBkaWZmZXJlbnQgb3BlcmF0aW9ucywgYnV0IEkgd291bGQgZmVlbCBtdWNoIGhh cHBpZXIgaWYgdGhlIGh0bWxwYXJzZXINCgk+IGhhZCBhIGNvbXBsZXRlIGphdmFkb2MgYW5kIGRv Y3VtZW50YXRpb24gcmV2aWV3IGJlZm9yZSBhbnkgcmVmYWN0b3JpbmcNCgk+IHRvb2sgcGxhY2Uu ICBQZW9wbGUgYXJlIHRyeWluZyB0byB1c2UgdGhlIGV4aXN0aW5nIHN5c3RlbSBhbmQgaGF2aW5n DQoJPiB0cm91YmxlIG5vdCBiZWNhdXNlIG9mIHRoZSBsYWNrIG9mIHJlZmFjdG9yaW5nLCBidXQg YSBsYWNrIG9mIHdlbGwNCgk+IGRlc2NyaWJlZCBtZXRob2RzLiAgV2VsbCBJIHNheSBwZW9wbGUs IEkgbWVhbiBtZSwgSSBkb24ndCBrbm93IGlmIGFueW9uZQ0KCT4gZWxzZSBmZWVscyB0aGUgc2Ft ZS4gIE1heWJlIGl0J3MganVzdCBtZSA6LSkNCgkNCglFeGVjdXRhYmxlIGRvY3VtZW50YXRpb24g aXMgYSB0ZXJtIHdlIHVzZSBmb3IgQ3VzdG9tdGVyIFRlc3RzLiAgIEEgQ3VzdG9tZXINCglUZXN0 cyBzaG93cyBob3cgYSBib2R5IG9mIGNvZGUgZ2V0cyB1c2VkIHRvIHBlcmZvcm0gcmVhbCB0YXNr cy4gICBZb3UgaGF2ZQ0KCXJlYWwtd29ybGQgdGFza3MgdG8gcGVyZm9ybS4gIFdlIGNhbiBwZXJm b3JtIHRob3NlIHRhc2tzIHZpYSBhdXRvbWF0ZWQNCgl0ZXN0cy4gICBUaGVuLCB3aGVuIHdlIHJl ZmFjdG9yLCB3ZSBtdXN0IG1ha2Ugc3VyZSBvdXIgZXhlY3V0YWJsZQ0KCWRvY3VtZW50YXRpb24g aXMgdXAgdG8gZGF0ZSAoaS5lLiBpdCBwYXNzZXMgaXRzIHRlc3RzKS4gICBXZSBmaW5kIHRoaXMg aWRlYWwNCglmb3IgbWFraW5nIHN1cmUgdGhhdCBkb2N1bWVudGF0aW9uIHJlZmxlY3RzIHdoYXQg dGhlIGNvZGUgaXMgYWN0dWFsbHkgZG9pbmcuDQoJVGhpcyBhbHNvIGxlYXZlcyByb29tIGZvciBh IGRvY3VtZW50IHRoYXQgZ2l2ZXMgc29tZSBncmFwaGljYWwNCglyZXByZXNlbnRhdGlvbiBvZiBo b3cgdGhpbmdzIHdvcmssIHdoaWNoIG11c3QgYmUgbWFpbnRhaW5lZCBieSBzb21lb25lLCBsZXN0 DQoJaXQgZ2V0IG91dCBvZiBkYXRlLiAgV2UganVzdCBtaW5pbWl6ZSB0aGUgbmVlZCBmb3Igc3Vj aCBkb2N1bWVudHMgYnkga2VlcGluZw0KCW91ciBjb2RlIHNpbXBsZSBhbmQgc21hbGwgYW5kIHN1 cnJvdW5kaW5nIGl0IHdpdGggZWFzeSB0byB1bmRlcnN0YW5kDQoJZXhlY3V0YWJsZSBkb2N1bWVu dGF0aW9uLg0KCQ0KCWJlc3QgcmVnYXJkcw0KCWprDQoJDQoJDQoJDQoJDQoJDQoJDQoJDQoJLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRo aXMgc2YubmV0IGVtYWlsIGlzIHNwb25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdl ZWsgaGVhdmVuLg0KCWh0dHA6Ly90aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWls aW5nIGxpc3QNCglIdG1scGFyc2VyLWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglo dHRwczovL2xpc3RzLnNvdXJjZWZvcmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRl dmVsb3Blcg0KCQ0KDQo= |
From: Somik R. <so...@ya...> - 2002-12-28 08:19:10
|
Hi Claude, >I think it may well be that the group's size is hitting critical mass and that minimal formalities are now required... Good to hear from you. I quite agree that this project seems to be reaching critical mass. We need to share a plan of action that we could all work on parallelly. First, I've been noticing that we're not consistent on the coding convention. In the past, if I see a class that does not follow the convention, I'd go ahead and fix it. However, as so many people are now working parallelly, its important that we all share the same coding convention. The one I follow is the standard Java coding convention. In addition, I particularly don't like using names with single characters in them - i.e. I'd prefer "tag" to "pTag". That is the convention followed in most of the parser. The coding convention is of course open for discussion if anyone feels that we should accomodate any particular practice. Second, I am planning to simplify the design of the scanners. As Dhaval mentioned earlier, and as Sam mentioned, it wasn't easy to write a new scanner. I agree that the javadoc of getParsed() should be better than what it is. However, I think the problem lies with the name itself. In the sense, if we rename "parsed" to "parameterTable" or "attributeTable" - and getParsed() becomes, getAttributeTable() - the javadoc would become redundant. You could say that this is a "documentation" change, or you could call this a "refactoring". Either way, it is developer and user-friendly. So I think Sam's suggestions and our current direction are one and the same. Third, there's way too much duplicate code in the tags. Almost every tag runs thru its attributes to render it as html - whereas this can be done from the superclass - HTMLTag. This is just one example. I've not been focussing much on reducing duplication - and we can see serious code smells. However, I think it is important to be able to quantify what we hope to achieve by undertaking several rounds of refactoring. As we keep refactoring, I will keep a tab on the size of the parser (using Lines of Code). If there's any other metric that people think is important, we could include that as well. It will be interesting to see how much code we can take out. Fourth, this is one project where change is welcome - all the time. Simply bcos it has a massive suite of tests which represents most of the user requirements that we've received till date. The real-value in the parser is not really its production code - but its solid test suite. If there are interesting design changes that will make the current system simpler, they should be tried out without fear - for if we break anything, we will know bcos of the automated testing. It is important that the testing suite is therefore regarded just as important (if not more) as the production code, and steps taken to improve its design, and make it very easy to add new tests all the time. Fifth, as the project grows - new features should keep coming in all the time. Derrick Oswald has been working away at so many cool new features - and he's been writing tests for them all - this is a much-appreciated activity and must continue. I'd be happy to not think about new features and completely focus on solidifying the existing system, and let Derrick focus on adding new stuff - for v1.3. Sixth, we have a very serious possibility of using AI in making tag recognitions. I have to find the time to write about the current recognition mechanism, and pass that on to Sam Joseph. Sam has considerable experience in artificial intelligence, and it will be good for the project to have his expertise in its core correction logic. Seventh, it might be good to have a Wiki for the parser - as so many open source projects do. That way, the entire burden of documentation is not on any one person. We should be looking at options of having our own Wiki so we can add content easily and collaboratively. Pls feel free to add/modify this list if I've missed out on anything. Regards, Somik |
From: Sam J. <ga...@yh...> - 2002-12-28 09:25:17
|
Hi Somik Somik Raha wrote: >Second, I am planning to simplify the design of the scanners. As Dhaval >mentioned earlier, and as Sam mentioned, it wasn't easy to write a new >scanner. I agree that the javadoc of getParsed() should be better than what >it is. However, I think the problem lies with the name itself. In the sense, >if we rename "parsed" to "parameterTable" or "attributeTable" - and >getParsed() becomes, getAttributeTable() - the javadoc would become >redundant. You could say that this is a "documentation" change, or you could >call this a "refactoring". Either way, it is developer and user-friendly. So >I think Sam's suggestions and our current direction are one and the same. > > I think that just changing the name of the method is not quite enough. Certainly changing the name of the method is a good step, but I think javadoc comments that actually show an example of what is being returned make a huge usability difference, particularly when the thing being returned is something as complex as a hashtable. If we want to call changing method names and adding documentation, refactoring then I guess I am a huge fan of refactoring :-) However the other part of what I was saying is that I would like to see the documentation additions before we change method names. I mean you can deprecate away so as not to break my existing code ... I still think you should release a version with a full javadoc "refactoring" before you release anything with refactored code. What's the harm in focusing on the documentation first? I mean we're on version 1.2 now right? What about a version 1.2.1 that included the documentation fixes, before moving on to a 1.3-beta that included the newly refactored code that you're so keen to start work on. What would be the downside of proceeding in this fashion? >Sixth, we have a very serious possibility of using AI in making tag >recognitions. I have to find the time to write about the current recognition >mechanism, and pass that on to Sam Joseph. Sam has considerable experience >in artificial intelligence, and it will be good for the project to have his >expertise in its core correction logic. > This sounds like an interesting possibility, but I still need to understand how the current parser handles all the existing messy tags. I mean when you started talking through all the messy html examples I though you were going to be showing me something that didn't work and that you wanted some machine learning/AI to fix, but you ended up saying that all the examples could be handled by the exisiting parser. If that is so, why do you want to change it, and are there examples of messy code that can't be handled by the parser? If there aren't and its all working fine, what advantage do we gain by replacing the existing core correction logic with some hypothetical AI alternative? >Seventh, it might be good to have a Wiki for the parser - as so many open >source projects do. That way, the entire burden of documentation is not on >any one person. We should be looking at options of having our own Wiki so we >can add content easily and collaboratively. > You are most welcome to use the wiki I have set up in the short term: http://www.neurogrid.net/devwiki/wikipages/HtmlParser It uses an open source java wiki (devwiki), which is not as fully functional as it might be, but I've got it set up so everything gets backed up to CVS and all changes get sent to the neurogrid-cvs mailing list. If nothing else it might serve as an example wiki environment. CHEERS> SAM |
From: Somik R. <so...@ya...> - 2002-12-28 09:39:39
|
Hi Sam, > If we want to call > changing method names and adding documentation, refactoring then I guess > I am a huge fan of refactoring :-) Changing method names is definitely refactoring. The idea is to use really good names that make the javadoc redundant. If you can explain what javadoc would achieve for a method like getAttributes() - I'd be happy to add javadoc for that. > However the other part of what I was > saying is that I would like to see the documentation additions before we > change method names. I mean you can deprecate away so as not to break > my existing code ... I still think you should release a version with a > full javadoc "refactoring" before you release anything with refactored > code. What's the harm in focusing on the documentation first? Simply bcos once the name is changed to a better one, docs are no longer necessary. But I understand your predicament. We could have a deprecated method name for important methods. > I mean we're on version 1.2 now right? What about a version 1.2.1 that > included the documentation fixes, before moving on to a 1.3-beta that > included the newly refactored code that you're so keen to start work on. > What would be the downside of proceeding in this fashion? Duplicate work. Maintaining parallel branches of code is the last thing we need to do, considering limited resources. I don't see a problem in going straight with 1.3-beta as we've got all our tests in place, and we'll also use deprecated methods. That should work for you, right ? > >Sixth, we have a very serious possibility of using AI in making tag > >recognitions. I have to find the time to write about the current recognition > >mechanism, and pass that on to Sam Joseph. Sam has considerable experience > >in artificial intelligence, and it will be good for the project to have his > >expertise in its core correction logic. > > > This sounds like an interesting possibility, but I still need to > understand how the current parser handles all the existing messy tags. > I mean when you started talking through all the messy html examples I > though you were going to be showing me something that didn't work and > that you wanted some machine learning/AI to fix, but you ended up saying > that all the examples could be handled by the exisiting parser. If that > is so, why do you want to change it, and are there examples of messy > code that can't be handled by the parser? If there aren't and its all > working fine, what advantage do we gain by replacing the existing core > correction logic with some hypothetical AI alternative? The main problem is that the current logic is complicated, and uses hard-coded logic. Such a system is hard to maintain, change. I was thinking on the lines of a rule-based system, where the rule-base could be dynamic. But, I will write a seperate mail on this issue as soon as I get some time. > >Seventh, it might be good to have a Wiki for the parser - as so many open > >source projects do. That way, the entire burden of documentation is not on > >any one person. We should be looking at options of having our own Wiki so we > >can add content easily and collaboratively. > > > You are most welcome to use the wiki I have set up in the short term: > > http://www.neurogrid.net/devwiki/wikipages/HtmlParser > > It uses an open source java wiki (devwiki), which is not as fully > functional as it might be, but I've got it set up so everything gets > backed up to CVS and all changes get sent to the neurogrid-cvs mailing > list. If nothing else it might serve as an example wiki environment. Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with current website) as in logos, page color, etc.. Cheers, Somik |
From: Sam J. <ga...@yh...> - 2002-12-28 10:23:17
|
Hi Somik Somik Raha wrote: >>If we want to call >>changing method names and adding documentation, refactoring then I guess >>I am a huge fan of refactoring :-) >> >> > >Changing method names is definitely refactoring. The idea is to use really >good names that make the javadoc redundant. If you can explain what javadoc >would achieve for a method like getAttributes() - I'd be happy to add >javadoc for that. > I don't think good method names make javadoc redundant, and I don't think they can. For example Hashtable getAttributes() tells me very little. What is an attribute? What are the classes in the hashtable? Give me three examples of what an attribute is and then you might have a good javadoc, e.g. [Gets a hashtable of attribute-value pairs from the html tag, e.g. type-->hidden, name-->address, value-->default would come from the tag <input type="hidden" name="address" value="default">. Attributes and values are all String classes. ] Now I think that the method getAttributes with the above javadoc would be much easier to use and understand for somebody trying to use htmlparser for the first or even the third time, no? I think you need to be careful about what is obvious for you, who have designed the system, and what is obvious for a novice user. >>However the other part of what I was >>saying is that I would like to see the documentation additions before we >>change method names. I mean you can deprecate away so as not to break >>my existing code ... I still think you should release a version with a >>full javadoc "refactoring" before you release anything with refactored >>code. What's the harm in focusing on the documentation first? >> >> > >Simply bcos once the name is changed to a better one, docs are no longer >necessary. > See my point above. I think you make more than a name change to make the api novice-user friendly. And my comments about the need for more detailed javadocs go right through the whole project, not just the few examples I've given. >> I mean we're on version 1.2 now right? What about a version 1.2.1 that >>included the documentation fixes, before moving on to a 1.3-beta that >>included the newly refactored code that you're so keen to start work on. >> What would be the downside of proceeding in this fashion? >> >> > >Duplicate work. Maintaining parallel branches of code is the last thing we >need to do, considering limited resources. I don't see a problem in going >straight with 1.3-beta as we've got all our tests in place, and we'll also >use deprecated methods. That should work for you, right ? > Sure, having deprecated methods works for me, but it's not just about me. I mean you've already released 1.2 People are already using it, building code around it. You have to maintain it anyway, or rather users will continue asking you questions about it. I'm suggesting that a 1.2.1 improved documentation release would mean less work in having to maintain things because the users could work from the javadocs. I would also think that going through and writing thorough javadocs for your existing version would put you in a much better position to write a good 1.3-beta. Limited resources is an important concern, and I think you'll conserve resources by more thorough documentation first and re-writing your code second. As long as you have deprecated methods then my own code base will be fine, so it makes little difference to me in the short term which way you go. However I think in the long term, a 1.2.1 documentation release would improve the quality of the project overall, and set a good precedent for high quality javadocs in subsequent releases. I get the sense that you are really keen to plow into re-writing all the code, what's the hurry? >>This sounds like an interesting possibility, but I still need to >>understand how the current parser handles all the existing messy tags. >>I mean when you started talking through all the messy html examples I >>though you were going to be showing me something that didn't work and >>that you wanted some machine learning/AI to fix, but you ended up saying >>that all the examples could be handled by the exisiting parser. If that >>is so, why do you want to change it, and are there examples of messy >>code that can't be handled by the parser? If there aren't and its all >>working fine, what advantage do we gain by replacing the existing core >>correction logic with some hypothetical AI alternative? >> >> > >The main problem is that the current logic is complicated, and uses >hard-coded logic. Such a system is hard to maintain, change. I was thinking >on the lines of a rule-based system, where the rule-base could be dynamic. >But, I will write a seperate mail on this issue as soon as I get some time. > Ah, I think I understand. What you want is some framework for dealing with new types of mess as they come up. A system that can have more rules added to without too much work, or even a framework for trying to guess which rules to apply in which document ... >>>Seventh, it might be good to have a Wiki for the parser - as so many open >>>source projects do. That way, the entire burden of documentation is not >>> >>> >on > > >>>any one person. We should be looking at options of having our own Wiki so >>> >>> >we > > >>>can add content easily and collaboratively. >>> >>> >>> >>You are most welcome to use the wiki I have set up in the short term: >> >>http://www.neurogrid.net/devwiki/wikipages/HtmlParser >> >>It uses an open source java wiki (devwiki), which is not as fully >>functional as it might be, but I've got it set up so everything gets >>backed up to CVS and all changes get sent to the neurogrid-cvs mailing >>list. If nothing else it might serve as an example wiki environment. >> >> > >Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with >current website) as in logos, page color, etc.. > I guess so. I think that could be arranged. Send me a copy of the logo etc., and I'll see what I can do, although it might take a couple of weeks for me to find time to do it. CHEERS> SAM |
From: Claude D. <CD...@ar...> - 2002-12-28 23:47:45
|
SWYgSSBtYXkuLi4gV2F0Y2ggb3V0IGZvciBmZWF0dXJlIGNyZWVwISBBZGRpbmcgQUkgc291bmRz IGxpa2UgYSBwYW5hY2VhIGJ1dCBpdCdzIG5vdC4gSSBoYXZlIGNvbnNpZGVyYWJsZSBleHBlcmll bmNlIHdpdGggQUkgbXlzZWxmIGFuZCBpdCByYXJlbHkgbWVldHMgZXhwZWN0YXRpb25zIHVubGVz cyB5b3UgY2FuIGNsZWFybHkgZGVmaW5lIHRoZSBvYmplY3RpdmVzIHVwIGZyb250LiBJZiB5b3Ug cGxhbiB0byB0cmFpbiBjbGFzc2lmZXJzLCB5b3UnbGwgbmVlZCB0byBjb2xsZWN0IGEgZmFpcmx5 IGxhcmdlLCByZWxldmFudCB0cmFpbmluZyBzZXQuIEV2ZW4gdGhlbiwgdGhpcyBtYXkgbm90IHNv bHZlIGFzIG1hbnkgcHJvYmxlbXMgYXMgaXQgbWF5IGNhdXNlIGFuZCB5b3UgbmVlZCB0byBtYWtl IGEgZ29vZCBkZWNpc2lvbiBiYXNlZCBvbiBmYWN0LiANCiANClNvbWUgb2YgeW91ciBtZXNzYWdl cyBzdWdnZXN0IHlvdSBhcmUgZWFzaWx5IHRha2VuIGJ5IHdpei1iYW5nIHRlY2hub2xvZ2llcyBh bmQgY29vbCBuZXcgZmVhdHVyZXMgKHdoaWNoIEkgc3VmZmVyIGZyb20gbXlzZWxmIGFuZCBuZWVk IHRvIGJhbGFuY2UpLiBZb3UgbmVlZCB0byBjb25zaWRlciB0aGVzZSBmZWF0dXJlcyBlc3BlY2lh bGx5IGNhcmVmdWxseSBhbmQgcmVhbGl6ZSB0aGF0IHRoaXMgYmlhcyBjYW4gYmUgZGV0cmltZW50 YWwgdW5kZXIgc29tZSBjaXJjdW1zdGFuY2VzLiBDZXJ0YWlubHksIHRoZSBwcm9qZWN0IG5lZWRz IHRvIGJlIGZ1biwgYnV0IGRvbid0IGdldCB0b28gY2F1Z2h0IHVwIGluIGZlYXR1cmUgZW52eS4g TWFrZSBzdXJlIHlvdSBrZWVwIHRoZSBjb3VwbGluZyBsb29zZSB0aHJvdWdoIGludGVyZmFjZXMs IGVzcGVjaWFsbHkgaWYgeW91IGFkZCBBSSBjb21wb25lbnRzLCBhbmQgcGxlYXNlIGFsbG93IHVz ZXJzIHRvIGRlY2lkZSB3aGljaCBmZWF0dXJlcyB0aGV5IHdhbnQgYnkga2VlcGluZyB0aGUgYXJj aGl0ZWN0dXJlIGFzIHBsdWdnYWJsZSBhcyBwb3NzaWJsZS4gVGhpcyB3aWxsIGJlbmVmaXQgYm90 aCB5b3VyIHVzZXIgYW5kIGRldmVsb3BtZW50IGNvbW11bml0aWVzLg0KIA0KLS0tLS1PcmlnaW5h bCBNZXNzYWdlLS0tLS0gDQpGcm9tOiBTb21payBSYWhhIFttYWlsdG86c29taWtAeWFob28uY29t XSANClNlbnQ6IFNhdCAxMi8yOC8yMDAyIDEyOjIwIEFNIA0KVG86IGh0bWxwYXJzZXItZGV2ZWxv cGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiANClN1YmplY3Q6IFJlOiBbSHRtbHBhcnNl ci1kZXZlbG9wZXJdIHRvUGxhaW5UZXh0U3RyaW5nKCkgZmVlZGJhY2sgcmVxdWVzdGVkDQoNCg0K DQoJSGkgQ2xhdWRlLA0KCT5JIHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mg c2l6ZSBpcyBoaXR0aW5nIGNyaXRpY2FsIG1hc3MgYW5kDQoJdGhhdCBtaW5pbWFsIGZvcm1hbGl0 aWVzIGFyZSBub3cgcmVxdWlyZWQuLi4NCgkNCglHb29kIHRvIGhlYXIgZnJvbSB5b3UuIEkgcXVp dGUgYWdyZWUgdGhhdCB0aGlzIHByb2plY3Qgc2VlbXMgdG8gYmUgcmVhY2hpbmcNCgljcml0aWNh bCBtYXNzLiBXZSBuZWVkIHRvIHNoYXJlIGEgcGxhbiBvZiBhY3Rpb24gdGhhdCB3ZSBjb3VsZCBh bGwgd29yayBvbg0KCXBhcmFsbGVsbHkuDQoJDQoJRmlyc3QsIEkndmUgYmVlbiBub3RpY2luZyB0 aGF0IHdlJ3JlIG5vdCBjb25zaXN0ZW50IG9uIHRoZSBjb2RpbmcNCgljb252ZW50aW9uLiBJbiB0 aGUgcGFzdCwgaWYgSSBzZWUgYSBjbGFzcyB0aGF0IGRvZXMgbm90IGZvbGxvdyB0aGUNCgljb252 ZW50aW9uLCBJJ2QgZ28gYWhlYWQgYW5kIGZpeCBpdC4gSG93ZXZlciwgYXMgc28gbWFueSBwZW9w bGUgYXJlIG5vdw0KCXdvcmtpbmcgcGFyYWxsZWxseSwgaXRzIGltcG9ydGFudCB0aGF0IHdlIGFs bCBzaGFyZSB0aGUgc2FtZSBjb2RpbmcNCgljb252ZW50aW9uLiBUaGUgb25lIEkgZm9sbG93IGlz IHRoZSBzdGFuZGFyZCBKYXZhIGNvZGluZyBjb252ZW50aW9uLiBJbg0KCWFkZGl0aW9uLCBJIHBh cnRpY3VsYXJseSBkb24ndCBsaWtlIHVzaW5nIG5hbWVzIHdpdGggc2luZ2xlIGNoYXJhY3RlcnMg aW4NCgl0aGVtIC0gaS5lLiBJJ2QgcHJlZmVyICJ0YWciIHRvICJwVGFnIi4gVGhhdCBpcyB0aGUg Y29udmVudGlvbiBmb2xsb3dlZCBpbg0KCW1vc3Qgb2YgdGhlIHBhcnNlci4gVGhlIGNvZGluZyBj b252ZW50aW9uIGlzIG9mIGNvdXJzZSBvcGVuIGZvciBkaXNjdXNzaW9uDQoJaWYgYW55b25lIGZl ZWxzIHRoYXQgd2Ugc2hvdWxkIGFjY29tb2RhdGUgYW55IHBhcnRpY3VsYXIgcHJhY3RpY2UuDQoJ DQoJU2Vjb25kLCBJIGFtIHBsYW5uaW5nIHRvIHNpbXBsaWZ5IHRoZSBkZXNpZ24gb2YgdGhlIHNj YW5uZXJzLiBBcyBEaGF2YWwNCgltZW50aW9uZWQgZWFybGllciwgYW5kIGFzIFNhbSBtZW50aW9u ZWQsIGl0IHdhc24ndCBlYXN5IHRvIHdyaXRlIGEgbmV3DQoJc2Nhbm5lci4gSSBhZ3JlZSB0aGF0 IHRoZSBqYXZhZG9jIG9mIGdldFBhcnNlZCgpIHNob3VsZCBiZSBiZXR0ZXIgdGhhbiB3aGF0DQoJ aXQgaXMuIEhvd2V2ZXIsIEkgdGhpbmsgdGhlIHByb2JsZW0gbGllcyB3aXRoIHRoZSBuYW1lIGl0 c2VsZi4gSW4gdGhlIHNlbnNlLA0KCWlmIHdlIHJlbmFtZSAicGFyc2VkIiB0byAicGFyYW1ldGVy VGFibGUiIG9yICJhdHRyaWJ1dGVUYWJsZSIgLSBhbmQNCglnZXRQYXJzZWQoKSBiZWNvbWVzLCBn ZXRBdHRyaWJ1dGVUYWJsZSgpIC0gdGhlIGphdmFkb2Mgd291bGQgYmVjb21lDQoJcmVkdW5kYW50 LiBZb3UgY291bGQgc2F5IHRoYXQgdGhpcyBpcyBhICJkb2N1bWVudGF0aW9uIiBjaGFuZ2UsIG9y IHlvdSBjb3VsZA0KCWNhbGwgdGhpcyBhICJyZWZhY3RvcmluZyIuIEVpdGhlciB3YXksIGl0IGlz IGRldmVsb3BlciBhbmQgdXNlci1mcmllbmRseS4gU28NCglJIHRoaW5rIFNhbSdzIHN1Z2dlc3Rp b25zIGFuZCBvdXIgY3VycmVudCBkaXJlY3Rpb24gYXJlIG9uZSBhbmQgdGhlIHNhbWUuDQoJDQoJ VGhpcmQsIHRoZXJlJ3Mgd2F5IHRvbyBtdWNoIGR1cGxpY2F0ZSBjb2RlIGluIHRoZSB0YWdzLiBB bG1vc3QgZXZlcnkgdGFnDQoJcnVucyB0aHJ1IGl0cyBhdHRyaWJ1dGVzIHRvIHJlbmRlciBpdCBh cyBodG1sIC0gd2hlcmVhcyB0aGlzIGNhbiBiZSBkb25lDQoJZnJvbSB0aGUgc3VwZXJjbGFzcyAt IEhUTUxUYWcuIFRoaXMgaXMganVzdCBvbmUgZXhhbXBsZS4gSSd2ZSBub3QgYmVlbg0KCWZvY3Vz c2luZyBtdWNoIG9uIHJlZHVjaW5nIGR1cGxpY2F0aW9uIC0gYW5kIHdlIGNhbiBzZWUgc2VyaW91 cyBjb2RlIHNtZWxscy4NCglIb3dldmVyLCBJIHRoaW5rIGl0IGlzIGltcG9ydGFudCB0byBiZSBh YmxlIHRvIHF1YW50aWZ5IHdoYXQgd2UgaG9wZSB0bw0KCWFjaGlldmUgYnkgdW5kZXJ0YWtpbmcg c2V2ZXJhbCByb3VuZHMgb2YgcmVmYWN0b3JpbmcuIEFzIHdlIGtlZXANCglyZWZhY3RvcmluZywg SSB3aWxsIGtlZXAgYSB0YWIgb24gdGhlIHNpemUgb2YgdGhlIHBhcnNlciAodXNpbmcgTGluZXMg b2YNCglDb2RlKS4gSWYgdGhlcmUncyBhbnkgb3RoZXIgbWV0cmljIHRoYXQgcGVvcGxlIHRoaW5r IGlzIGltcG9ydGFudCwgd2UgY291bGQNCglpbmNsdWRlIHRoYXQgYXMgd2VsbC4gSXQgd2lsbCBi ZSBpbnRlcmVzdGluZyB0byBzZWUgaG93IG11Y2ggY29kZSB3ZSBjYW4NCgl0YWtlIG91dC4NCgkN CglGb3VydGgsIHRoaXMgaXMgb25lIHByb2plY3Qgd2hlcmUgY2hhbmdlIGlzIHdlbGNvbWUgLSBh bGwgdGhlIHRpbWUuIFNpbXBseQ0KCWJjb3MgaXQgaGFzIGEgbWFzc2l2ZSBzdWl0ZSBvZiB0ZXN0 cyB3aGljaCByZXByZXNlbnRzIG1vc3Qgb2YgdGhlIHVzZXINCglyZXF1aXJlbWVudHMgdGhhdCB3 ZSd2ZSByZWNlaXZlZCB0aWxsIGRhdGUuIFRoZSByZWFsLXZhbHVlIGluIHRoZSBwYXJzZXIgaXMN Cglub3QgcmVhbGx5IGl0cyBwcm9kdWN0aW9uIGNvZGUgLSBidXQgaXRzIHNvbGlkIHRlc3Qgc3Vp dGUuIElmIHRoZXJlIGFyZQ0KCWludGVyZXN0aW5nIGRlc2lnbiBjaGFuZ2VzIHRoYXQgd2lsbCBt YWtlIHRoZSBjdXJyZW50IHN5c3RlbSBzaW1wbGVyLCB0aGV5DQoJc2hvdWxkIGJlIHRyaWVkIG91 dCB3aXRob3V0IGZlYXIgLSBmb3IgaWYgd2UgYnJlYWsgYW55dGhpbmcsIHdlIHdpbGwga25vdw0K CWJjb3Mgb2YgdGhlIGF1dG9tYXRlZCB0ZXN0aW5nLiBJdCBpcyBpbXBvcnRhbnQgdGhhdCB0aGUg dGVzdGluZyBzdWl0ZSBpcw0KCXRoZXJlZm9yZSByZWdhcmRlZCBqdXN0IGFzIGltcG9ydGFudCAo aWYgbm90IG1vcmUpIGFzIHRoZSBwcm9kdWN0aW9uIGNvZGUsDQoJYW5kIHN0ZXBzIHRha2VuIHRv IGltcHJvdmUgaXRzIGRlc2lnbiwgYW5kIG1ha2UgaXQgdmVyeSBlYXN5IHRvIGFkZCBuZXcNCgl0 ZXN0cyBhbGwgdGhlIHRpbWUuDQoJDQoJRmlmdGgsIGFzIHRoZSBwcm9qZWN0IGdyb3dzIC0gbmV3 IGZlYXR1cmVzIHNob3VsZCBrZWVwIGNvbWluZyBpbiBhbGwgdGhlDQoJdGltZS4gRGVycmljayBP c3dhbGQgaGFzIGJlZW4gd29ya2luZyBhd2F5IGF0IHNvIG1hbnkgY29vbCBuZXcgZmVhdHVyZXMg LQ0KCWFuZCBoZSdzIGJlZW4gd3JpdGluZyB0ZXN0cyBmb3IgdGhlbSBhbGwgLSB0aGlzIGlzIGEg bXVjaC1hcHByZWNpYXRlZA0KCWFjdGl2aXR5IGFuZCBtdXN0IGNvbnRpbnVlLiBJJ2QgYmUgaGFw cHkgdG8gbm90IHRoaW5rIGFib3V0IG5ldyBmZWF0dXJlcyBhbmQNCgljb21wbGV0ZWx5IGZvY3Vz IG9uIHNvbGlkaWZ5aW5nIHRoZSBleGlzdGluZyBzeXN0ZW0sIGFuZCBsZXQgRGVycmljayBmb2N1 cw0KCW9uIGFkZGluZyBuZXcgc3R1ZmYgLSBmb3IgdjEuMy4NCgkNCglTaXh0aCwgd2UgaGF2ZSBh IHZlcnkgc2VyaW91cyBwb3NzaWJpbGl0eSBvZiB1c2luZyBBSSBpbiBtYWtpbmcgdGFnDQoJcmVj b2duaXRpb25zLiBJIGhhdmUgdG8gZmluZCB0aGUgdGltZSB0byB3cml0ZSBhYm91dCB0aGUgY3Vy cmVudCByZWNvZ25pdGlvbg0KCW1lY2hhbmlzbSwgYW5kIHBhc3MgdGhhdCBvbiB0byBTYW0gSm9z ZXBoLiBTYW0gaGFzIGNvbnNpZGVyYWJsZSBleHBlcmllbmNlDQoJaW4gYXJ0aWZpY2lhbCBpbnRl bGxpZ2VuY2UsIGFuZCBpdCB3aWxsIGJlIGdvb2QgZm9yIHRoZSBwcm9qZWN0IHRvIGhhdmUgaGlz DQoJZXhwZXJ0aXNlIGluIGl0cyBjb3JlIGNvcnJlY3Rpb24gbG9naWMuDQoJDQoJU2V2ZW50aCwg aXQgbWlnaHQgYmUgZ29vZCB0byBoYXZlIGEgV2lraSBmb3IgdGhlIHBhcnNlciAtIGFzIHNvIG1h bnkgb3Blbg0KCXNvdXJjZSBwcm9qZWN0cyBkby4gVGhhdCB3YXksIHRoZSBlbnRpcmUgYnVyZGVu IG9mIGRvY3VtZW50YXRpb24gaXMgbm90IG9uDQoJYW55IG9uZSBwZXJzb24uIFdlIHNob3VsZCBi ZSBsb29raW5nIGF0IG9wdGlvbnMgb2YgaGF2aW5nIG91ciBvd24gV2lraSBzbyB3ZQ0KCWNhbiBh ZGQgY29udGVudCBlYXNpbHkgYW5kIGNvbGxhYm9yYXRpdmVseS4NCgkNCglQbHMgZmVlbCBmcmVl IHRvIGFkZC9tb2RpZnkgdGhpcyBsaXN0IGlmIEkndmUgbWlzc2VkIG91dCBvbiBhbnl0aGluZy4N CgkNCglSZWdhcmRzLA0KCVNvbWlrDQoJDQoJDQoJDQoJLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRoaXMgc2YubmV0IGVtYWlsIGlzIHNw b25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdlZWsgaGVhdmVuLg0KCWh0dHA6Ly90 aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f X19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWlsaW5nIGxpc3QNCglIdG1scGFyc2Vy LWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglodHRwczovL2xpc3RzLnNvdXJjZWZv cmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRldmVsb3Blcg0KCQ0KDQo= |
From: Somik R. <so...@ya...> - 2002-12-29 00:11:42
|
Hi Claude, > If I may... Watch out for feature creep! Adding AI sounds like a panacea but it's not. I have considerable experience with AI myself and it rarely meets expectations unless you can clearly define the objectives up front. If you plan to train classifers, you'll need to collect a fairly large, relevant training set. Even then, this may not solve as many problems as it may cause and you need to make a good decision based on fact. > > Some of your messages suggest you are easily taken by wiz-bang technologies and cool new features (which I suffer from myself and need to balance). You need to consider these features especially carefully and realize that this bias can be detrimental under some circumstances. Certainly, the project needs to be fun, but don't get too caught up in feature envy. Make sure you keep the coupling loose through interfaces, especially if you add AI components, and please allow users to decide which features they want by keeping the architecture as pluggable as possible. This will benefit both your user and development communities. I agree with your opinion - I'm keeping the AI stuff a low priority for now, as I mentioned - focussing only on refactoring for the moment. However, I am hoping that we can do some evaluation, and decide if we should go for it in v1.4. The real problem is, we keep finding real-life dirty html that just does not follow any rules. Just sometime back, I found a bug in HTMLStringNode.. If we have html like : <script language="javascript"> var lower = '<%=lowerValue%>'; </script> the StringNode does not realize that the jsp tag within quotes is not to be handled as a tag but a string node. I found this problem while refactoring today. Upon modifying the parsing automata and including a PARSE_IGNORE_STATE for HTMLStringNode, I was able to handle this case. But thanks to the tests, I found that we had another failing test, wherein we had something like : <A href="somelink.html">Kaarle's homepage</A> As you can see, the single apostrophe caused the damage. Now, I had to code in extra logic to also verify that the next character must be an opening tag for the string node automaton to move into ignoring state. Good news is that all tests are passing. But I am not really sure that this philosophy of modifying the parsing automata logic is all that great. If cases such as this can be coded outside the parser - perhaps with regular expressions, we will no longer need to modify the code of the parser each time. However, there'd probably be a performance hit as compared to the system that we have now, and we've got to look at it really carefully before we go for it. We've been going vertical all this while, so I am just trying to go lateral to see where that might take us - just to generate more possibilities. I'm thinking that we could try a seperate implementation - writing the parsing logic from scratch, and using the existing test mechanism to verify. This could be a seperate module in the htmlparser project, for experimentation only - to evaluate feasibility. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-24 18:48:57
|
Hi Derrick, You've raised some interesting points.. Looking at it from an end-user perspective, prior to using the visitor, the code for collecting text from a page would be : StringBuffer textCollectionBuffer = new StringBuffer(); for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString()); } String result = textCollectionBuffer.toString(); After putting in the visitor, the code looks like : Renderer plainTextRenderer = new PlainTextRenderer(); for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { e.nextHTMLNode().renderWith(plainTextRenderer); } String result = plainTextRenderer.getResult(); IMO, from a user perspective, there's almost no extra complexity. You'd anyway collect results into a string buffer, while the renderer now does that for you. If it was just for the plain text renderer, we would not be justified in this refactoring. However, the code for getting html out would look like : Renderer htmlRenderer = new HTMLRenderer(); for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { e.nextHTMLNode().renderWith(htmlRenderer); } String result = htmlRenderer.getResult(); And now, when ripping webpages, we would like to modify the links and images to a base location in the local disk. We could have modifying html renderer, like this: Renderer modifyingLinksAndImagesRenderer = new ModifyingLinksAndImagesRenderer(); ... Any modification in the mechanism of rendering is now easily handled, and we don't have to modify the parser per se, for every new user requirement. > It breaks encapsulation. It presumes that an exterior class (the > visitor) knows how to do something to a class better than the class > itself does. In light of the above, this is not a bad thing. At the same time, putting in the visitor has caused/will cause refactoring of the tag classes so that most of the logic is still encapsulated in the tags, and the visitor only makes calls to the tag's concrete methods. e.g. The visitor still does NOT know how to render a tag's html. Such logic continues to be encapsulated within the tag. The visitor only deals with a high-level abstraction. > It makes the system brittle. The visitor class has a method defined for > every subclass of visited object. Adding, removing or refactoring any > visited class breaks the visitors. A valid point indeed. To counter, in our imlpementation, there is a lot of duplication, and quite a few classes really don't need their own toHTML() method. The parent class - HTMLTag, should take care of single tags. Those tags with children always have duplicated code to recurse through their children. We can refactor such functionality out, so that, tags with children conform to a recursion interface. Using such an interface, you wont have the recursion loops in each and every node class, but you could handle it uniformly outside. So, the visitor does not need to take care of each and every subclass. It will only take care of some special ones only, while visiting the rest as a simple HTMLTag. > It's obscure. Looking at the visit loop tells you nothing of what is > being perfomed. The only indication of the task might be the name of > the visitor class. That is again not a bad thing - for well-intentioned naming reduces the need for excessive documentation. You should be able to look at a class name and have a fairly good idea of what it does. A look at the API of the class would confirm your guess. At the same time, looking at the visit loop above, you'd see that it looks intuitive enough - the loops is not in the visitor, it is still in the client code. > It's redundant. In every example I've seen (I've never seen it used in > real applications), the result could have been coded in a simpler > fashion with correct class design. This is the point we're really thinking about. I've personally been thinking on this for some months - for we did have realistic requirements of being able to modify links and images in a ripper, in a uniform manner. In fact, the current version of the parser already has a visitor. The Sample Programs makes not mention of the word "visitor" to avoid controversy and confusion :), but if you look at http://htmlparser.sourceforge.net/samples/ripper.html - this is what's happening. I must add that although this visitor does provide a solution, I wasn't very happy with it. We've been able to refactor this out, and provide a uniform mechanism now. > I would only use the visitor pattern if either the set of public methods > on the visited nodes completely specify their possible behavior and > allowed the client to fully manipulate them at the highest level of > abstraction (i.e. everything you can do to the visited classes is > already coded and you don't want them to change) I think we're talking the same language here - this is also our vision. The code you see on the samples page is an example of that (although not a very good one). We're working on those lines, and the first step was to introduce such levels of abstraction. Regards, Somik |
From: Kaarle K. <kaa...@ik...> - 2002-12-24 21:02:14
|
At 15:01 24.12.2002 -0500, you wrote: >Somik Raha wrote: You seem to be very active today! So am I. Merry Christmas to you all! Kaarle --------------------------------------------- Kaarle Kaila http://www.iki.fi/kaila mailto:kaa...@ik... tel: +358 50 3725844 |
From: Somik R. <so...@ya...> - 2002-12-24 21:49:42
|
Hi Kaarle, > You seem to be very active today! > > So am I. Merry Christmas to you all! > Merry Christmas to everyone and to you! Cheers, Somik |
From: Derrick O. <Der...@ro...> - 2002-12-24 20:00:21
|
Somik Raha wrote: >Hi Derrick, > You've raised some interesting points.. > > Looking at it from an end-user perspective, prior to using the visitor, >the code for collecting text from a page would be : > >StringBuffer textCollectionBuffer = new StringBuffer(); >for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { > textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString()); >} >String result = textCollectionBuffer.toString(); > >After putting in the visitor, the code looks like : > >Renderer plainTextRenderer = new PlainTextRenderer(); >for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { > e.nextHTMLNode().renderWith(plainTextRenderer); >} >String result = plainTextRenderer.getResult(); > >IMO, from a user perspective, there's almost no extra complexity. You'd >anyway collect results into a string buffer, while the renderer now does >that for you. > >If it was just for the plain text renderer, we would not be justified in >this refactoring. However, the code for getting html out would look like : >Renderer htmlRenderer = new HTMLRenderer(); >for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) { > e.nextHTMLNode().renderWith(htmlRenderer); >} >String result = htmlRenderer.getResult(); > >And now, when ripping webpages, we would like to modify the links and images >to a base location in the local disk. We could have modifying html renderer, >like this: > >Renderer modifyingLinksAndImagesRenderer = new >ModifyingLinksAndImagesRenderer(); >... > >Any modification in the mechanism of rendering is now easily handled, and we >don't have to modify the parser per se, for every new user requirement. > The first two examples gain you nothing from using a visitor pattern. The plainTextRenderer and HTMLRenderer call the toPlainTextString() and toHTML() methods because only the tag knows how to render itself into text or HTML (see your comment below). Your last example indicates the visitor class knows which parameters have links and which parameters have images for each tag and I would suggest this knowledge should be in the tag class itself. If the base class (HTMLTag I presume) had an abstract method signature 'toHTML(URL baseURL)', then code within the class would adjust relative link and image references based on it's value, and the current behaviour of 'toHRML()' would just call toHTML(null). The code is not duplicated, only refactored to handle a base URL. > > > >>It breaks encapsulation. It presumes that an exterior class (the >>visitor) knows how to do something to a class better than the class >>itself does. >> >> > >In light of the above, this is not a bad thing. At the same time, putting in >the visitor has caused/will cause refactoring of the tag classes so that >most of the logic is still encapsulated in the tags, and the visitor only >makes calls to the tag's concrete methods. >e.g. The visitor still does NOT know how to render a tag's html. Such logic >continues to be encapsulated within the tag. The visitor only deals with a >high-level abstraction. > So why bother with the extra overhead of the visitor class? Logically, either the visitor has code that differs from the tag code or it doesn't. If it doesn't, then cut the Gordian knot and use the simplest mechanism and forget the visitor. If it does differ, the code within the visitor either uses the methods of the visitee only or it has some new methods with logic and data that are external to the visitee. In the former case, that code that only uses methods on the visitee could be a method on the vistee's superclass. If it has new methods, either the visitors extra logic and data are specific to each vistee or they are of a generic nature. If they are specific to the visitee, then they should be within the visitee class and not in a catch-all visitor class, otherwise if they are of a general nature they might as well be moved to the superclass (or a utility class if they don't belong there). There is really no need for a visitor pattern. > > > >>It makes the system brittle. The visitor class has a method defined for >>every subclass of visited object. Adding, removing or refactoring any >>visited class breaks the visitors. >> >> > >A valid point indeed. To counter, in our imlpementation, there is a lot of >duplication, and quite a few classes really don't need their own toHTML() >method. The parent class - HTMLTag, should take care of single tags. Those >tags with children always have duplicated code to recurse through their >children. We can refactor such functionality out, so that, tags with >children conform to a recursion interface. Using such an interface, you wont >have the recursion loops in each and every node class, but you could handle >it uniformly outside. So, the visitor does not need to take care of each and >every subclass. It will only take care of some special ones only, while >visiting the rest as a simple HTMLTag. > > This The duplication should be refactored to superclasses. You may need is subclass of HTMLTag that has a 'container' interface, but why not put that interface on HTMLTag itself and have a default implementation that is an empty container. Then FORM and FRAMESET and other container tags derive from an abstract class (HTMLContainerTag ?) that derives from HTMLTag but overrides this behaviour and provides real recursion support. > > >>It's obscure. Looking at the visit loop tells you nothing of what is >>being perfomed. The only indication of the task might be the name of >>the visitor class. >> >> > >That is again not a bad thing - for well-intentioned naming reduces the need >for excessive documentation. You should be able to look at a class name and >have a fairly good idea of what it does. A look at the API of the class >would confirm your guess. > >At the same time, looking at the visit loop above, you'd see that it looks >intuitive enough - the loops is not in the visitor, it is still in the >client code. > > But the StringBuffer is now hidden and it's not obvious that a getResult() method call returns it's contents, so you have an indirection to look up the visitor class. > > >>It's redundant. In every example I've seen (I've never seen it used in >>real applications), the result could have been coded in a simpler >>fashion with correct class design. >> >> > >This is the point we're really thinking about. I've personally been thinking >on this for some months - for we did have realistic requirements of being >able to modify links and images in a ripper, in a uniform manner. In fact, >the current version of the parser already has a visitor. The Sample Programs >makes not mention of the word "visitor" to avoid controversy and confusion >:), but if you look at >http://htmlparser.sourceforge.net/samples/ripper.html - this is what's >happening. I must add that although this visitor does provide a solution, I >wasn't very happy with it. We've been able to refactor this out, and provide >a uniform mechanism now. > > That's not really a visitor example, that's a callback, which is obviated if the toHTML(URL) signature is used. > > >>I would only use the visitor pattern if either the set of public methods >>on the visited nodes completely specify their possible behavior and >>allowed the client to fully manipulate them at the highest level of >>abstraction (i.e. everything you can do to the visited classes is >>already coded and you don't want them to change) >> >> > >I think we're talking the same language here - this is also our vision. The >code you see on the samples page is an example of that (although not a very >good one). We're working on those lines, and the first step was to introduce >such levels of abstraction. > >Regards, >Somik > > > >------------------------------------------------------- >This sf.net email is sponsored by:ThinkGeek >Welcome to geek heaven. >http://thinkgeek.com/sf >_______________________________________________ >Htmlparser-developer mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > > |
From: Joshua K. <jo...@in...> - 2002-12-27 18:52:41
|
> IMHO the visitor pattern is not a good one. Here are the reasons. The Visitor pattern is a fine pattern Derrick. Ralph Johnson once said, "Most of the time you don't need a Visitor, but when you need a Visitor, you really need a Visitor." > It breaks encapsulation. It presumes that an exterior class (the > visitor) knows how to do something to a class better than the class > itself does. A good many of the Design Patterns "break encapsulation." I guess those aren't good patterns, huh? Is there an OO police agency to report instances of breaking encapsulation? ;-) > It makes the system brittle. The visitor class has a method defined for > every subclass of visited object. Adding, removing or refactoring any > visited class breaks the visitors. Hard-coded, duplicated logic in nodes makes class hierarchies brittle. We've found that the html parser code doesn't need special visit methods for each tag type, which means it's easy to implement Visitor and it won't be hard to accomodate visiting new tags in the future. BTW, do you envision needing to add many new tag types to the html parser? > It's obscure. Looking at the visit loop tells you nothing of what is > being perfomed. The only indication of the task might be the name of > the visitor class. If you call a method that yields the plain text string for some html, and that code happens to use a Visitor, do you care? That is, a client won't even know they're using the pattern. > It's redundant. In every example I've seen (I've never seen it used in > real applications), the result could have been coded in a simpler > fashion with correct class design. I agree that it's always best to seek out the simple design. It takes a committment to refactoring to find simplicity. Presently, Somik and I are removing duplication to get to a simpler design. You have our promise that no Visitor will be added if it complicates the design. > I would only use the visitor pattern if either the set of public methods > on the visited nodes completely specify their possible behavior and > allowed the client to fully manipulate them at the highest level of > abstraction (i.e. everything you can do to the visited classes is > already coded and you don't want them to change), or the visitor pattern > had already been coded into the visited nodes and for some reason they > could not be modified further (i.e. the code was available in binary > form only). I don't think either of these applies here. Visitor is a great pattern when you need various operations to be performed over an object structure and you don't want to hard-code those operations into the objects themselves. So far, it seems to me you have just that need in html parser. I could be wrong. We'll have to see how the code feels with a Visitor implementation and make decisions based on real code. CVS allows us to go back to any previous implementation we like better. best regards, jk |