htmlparser-developer Mailing List for HTML Parser (Page 20)
Brought to you by:
derrickoswald
You can subscribe to this list here.
2001 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
(4) |
Nov
(1) |
Dec
(4) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2002 |
Jan
(12) |
Feb
|
Mar
(7) |
Apr
(27) |
May
(14) |
Jun
(16) |
Jul
(27) |
Aug
(74) |
Sep
(1) |
Oct
(23) |
Nov
(12) |
Dec
(119) |
2003 |
Jan
(31) |
Feb
(23) |
Mar
(28) |
Apr
(59) |
May
(119) |
Jun
(10) |
Jul
(3) |
Aug
(17) |
Sep
(8) |
Oct
(38) |
Nov
(6) |
Dec
(1) |
2004 |
Jan
(4) |
Feb
(4) |
Mar
(1) |
Apr
(2) |
May
|
Jun
(7) |
Jul
(6) |
Aug
(1) |
Sep
|
Oct
|
Nov
|
Dec
|
2005 |
Jan
|
Feb
(1) |
Mar
|
Apr
(8) |
May
|
Jun
|
Jul
|
Aug
(2) |
Sep
(10) |
Oct
(4) |
Nov
(15) |
Dec
|
2006 |
Jan
|
Feb
(1) |
Mar
|
Apr
(4) |
May
(11) |
Jun
|
Jul
|
Aug
|
Sep
(2) |
Oct
|
Nov
|
Dec
|
2007 |
Jan
(3) |
Feb
(2) |
Mar
|
Apr
(2) |
May
|
Jun
|
Jul
(1) |
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2008 |
Jan
|
Feb
(1) |
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
(5) |
Oct
(1) |
Nov
|
Dec
|
2009 |
Jan
|
Feb
(1) |
Mar
|
Apr
(2) |
May
|
Jun
(4) |
Jul
|
Aug
(1) |
Sep
|
Oct
|
Nov
|
Dec
(2) |
2010 |
Jan
(1) |
Feb
|
Mar
|
Apr
(8) |
May
|
Jun
|
Jul
|
Aug
|
Sep
(6) |
Oct
|
Nov
(1) |
Dec
|
2011 |
Jan
|
Feb
|
Mar
|
Apr
|
May
(3) |
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2012 |
Jan
|
Feb
|
Mar
|
Apr
|
May
(1) |
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2014 |
Jan
|
Feb
|
Mar
|
Apr
|
May
(1) |
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
2015 |
Jan
|
Feb
|
Mar
|
Apr
(1) |
May
|
Jun
(1) |
Jul
|
Aug
|
Sep
|
Oct
|
Nov
(2) |
Dec
(1) |
2016 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
(2) |
Aug
|
Sep
|
Oct
|
Nov
(2) |
Dec
(2) |
From: Somik R. <so...@ya...> - 2003-01-02 00:14:50
|
Hi Derrick Sorry for the delay in replying. I've opened up the htmlparser project - any developer can add stuff into the site using secure copy tools (scp/filezilla). Feel free to add documentation or anything else. Regards, Somik ----- Original Message ----- From: "Derrick Oswald" <Der...@ro...> To: "Somik Raha" <so...@ya...> Sent: Monday, December 30, 2002 1:47 PM Subject: test directory > Somik, > > I was wondering if you could open up the directory permissions so others > could put files in /home/groups/h/ht/htmlparser/htdocs/test too. > Something like: > chmod g+w /home/groups/h/ht/htmlparser/htdocs/test > should allow people in the htmlparser group write access to it. > > Derrick |
From: Kaarle K. <kaa...@ik...> - 2002-12-31 22:01:56
|
At 10:34 31.12.2002 -0800, you wrote: >Great! That hasn't happened in quite some time. >Happy New Year to all! Happy new year to you! The new year 2003 started one minute ago here. For you Somik it would have begun quite some hours ago if you had stayed in Japan. regards Kaarle >Regards, >Somik >--- Derrick Oswald <Der...@ro...> wrote: > > HTML Parser made the top fifty most active projects > > last week at > > position 47 (99.526 percentile): > > > > https://sourceforge.net/top/mostactive.php?type=week > > > > > > > > >------------------------------------------------------- > > This sf.net email is sponsored by:ThinkGeek > > Welcome to geek heaven. > > http://thinkgeek.com/sf > > _______________________________________________ > > Htmlparser-developer mailing list > > Htm...@li... > > >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > >__________________________________________________ >Do you Yahoo!? >Yahoo! Mail Plus - Powerful. Affordable. Sign up now. >http://mailplus.yahoo.com > > >------------------------------------------------------- >This sf.net email is sponsored by:ThinkGeek >Welcome to geek heaven. >http://thinkgeek.com/sf >_______________________________________________ >Htmlparser-developer mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer --------------------------------------------- Kaarle Kaila http://www.iki.fi/kaila mailto:kaa...@ik... tel: +358 50 3725844 |
From: Somik R. <so...@ya...> - 2002-12-31 18:34:17
|
Great! That hasn't happened in quite some time. Happy New Year to all! Regards, Somik --- Derrick Oswald <Der...@ro...> wrote: > HTML Parser made the top fifty most active projects > last week at > position 47 (99.526 percentile): > > https://sourceforge.net/top/mostactive.php?type=week > > > > ------------------------------------------------------- > This sf.net email is sponsored by:ThinkGeek > Welcome to geek heaven. > http://thinkgeek.com/sf > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer __________________________________________________ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com |
From: Derrick O. <Der...@ro...> - 2002-12-31 15:55:18
|
HTML Parser made the top fifty most active projects last week at position 47 (99.526 percentile): https://sourceforge.net/top/mostactive.php?type=week |
From: Derrick O. <Der...@ro...> - 2002-12-30 19:46:53
|
I've dropped changes to use the junit swing runner by default. I hope that's OK. If you want the old behaviour (awt runner) use: java org.htmlparser.tests.AllTests -awt |
From: Somik R. <so...@ya...> - 2002-12-30 06:14:51
|
Sam Joseph wrote: > I would have thought the solution to this would be to define under which > circumstances a pair of apostrophes will indicate text. In the first > two examples you are inside an ANCHOR tag, and in the third you are > inside a SCRIPT tag. It seems to me that you should only be using > apostrophe's to indicate text strings inside a SCRIPT tag, no? If > that's true then set your parsing behaviour differently depending on the > tag type. You are right. The script scanner could set some static variable in the HTMLStringNode - which tells it to move into the ignore states if it encounters an apostrophe, and flag it off when its done. I'll probably do that for now.. Regards, Somik ----- Original Message ----- From: "Sam Joseph" <ga...@yh...> To: <htm...@li...> Sent: Sunday, December 29, 2002 10:09 PM Subject: Re: [Htmlparser-developer] AI - to be or not to be > Hi Somik, > > I would have thought the solution to this would be to define under which > circumstances a pair of apostrophes will indicate text. In the first > two examples you are inside an ANCHOR tag, and in the third you are > inside a SCRIPT tag. It seems to me that you should only be using > apostrophe's to indicate text strings inside a SCRIPT tag, no? If > that's true then set your parsing behaviour differently depending on the > tag type. > > CHEERS> SAM > > Somik Raha wrote: > > >Hi Folks, > > Derrick Oswald just checked in a test case that fails.. > >Here's a link tag : > > > ><a href="http://cbc.ca/artsCanada/stories/greatnorth271202" > >class="lgblacku">Vancouver schools plan 'Great Northern Way'</a> > > > >We are in a quandary now. When we have cases like : > > > ><a href="something.html">Kaarle's Page</a> > > > >we should accept the apostrophe without doing anything special. > > > >When we get links like, > ><script> > > var code = '<sometag>'; > ></script> > > > >We should not take the tag symbols after code seriously, as they are part of > >the string. > > > >Handling the last two cases causes a conflict with the first case -bcos the > >last case is handled by checking if there's a < after ' - and this causes > >the first case to go into an ignoring mode. > > > >How do we handle this problem ? Do we write smart code to handle this > >particular situation ? From human experience, even if we've not encountered > >these cases, we know how to differentiate between a string node and a tag. > >Can AI help us here ? Also, pls feel free to suggest any straightforward > >solutions as well. > > > >Regards > >Somik > > > > > > > > > >------------------------------------------------------- > >This sf.net email is sponsored by:ThinkGeek > >Welcome to geek heaven. > >http://thinkgeek.com/sf > >_______________________________________________ > >Htmlparser-developer mailing list > >Htm...@li... > >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > > > > > > > > > > > > > ------------------------------------------------------- > This sf.net email is sponsored by:ThinkGeek > Welcome to geek heaven. > http://thinkgeek.com/sf > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Sam J. <ga...@yh...> - 2002-12-30 05:54:14
|
Hi Somik, I would have thought the solution to this would be to define under which circumstances a pair of apostrophes will indicate text. In the first two examples you are inside an ANCHOR tag, and in the third you are inside a SCRIPT tag. It seems to me that you should only be using apostrophe's to indicate text strings inside a SCRIPT tag, no? If that's true then set your parsing behaviour differently depending on the tag type. CHEERS> SAM Somik Raha wrote: >Hi Folks, > Derrick Oswald just checked in a test case that fails.. >Here's a link tag : > ><a href="http://cbc.ca/artsCanada/stories/greatnorth271202" >class="lgblacku">Vancouver schools plan 'Great Northern Way'</a> > >We are in a quandary now. When we have cases like : > ><a href="something.html">Kaarle's Page</a> > >we should accept the apostrophe without doing anything special. > >When we get links like, ><script> > var code = '<sometag>'; ></script> > >We should not take the tag symbols after code seriously, as they are part of >the string. > >Handling the last two cases causes a conflict with the first case -bcos the >last case is handled by checking if there's a < after ' - and this causes >the first case to go into an ignoring mode. > >How do we handle this problem ? Do we write smart code to handle this >particular situation ? From human experience, even if we've not encountered >these cases, we know how to differentiate between a string node and a tag. >Can AI help us here ? Also, pls feel free to suggest any straightforward >solutions as well. > >Regards >Somik > > > > >------------------------------------------------------- >This sf.net email is sponsored by:ThinkGeek >Welcome to geek heaven. >http://thinkgeek.com/sf >_______________________________________________ >Htmlparser-developer mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > > > |
From: Somik R. <so...@ya...> - 2002-12-30 05:19:57
|
Hi Folks, Derrick Oswald just checked in a test case that fails.. Here's a link tag : <a href="http://cbc.ca/artsCanada/stories/greatnorth271202" class="lgblacku">Vancouver schools plan 'Great Northern Way'</a> We are in a quandary now. When we have cases like : <a href="something.html">Kaarle's Page</a> we should accept the apostrophe without doing anything special. When we get links like, <script> var code = '<sometag>'; </script> We should not take the tag symbols after code seriously, as they are part of the string. Handling the last two cases causes a conflict with the first case -bcos the last case is handled by checking if there's a < after ' - and this causes the first case to go into an ignoring mode. How do we handle this problem ? Do we write smart code to handle this particular situation ? From human experience, even if we've not encountered these cases, we know how to differentiate between a string node and a tag. Can AI help us here ? Also, pls feel free to suggest any straightforward solutions as well. Regards Somik |
From: Derrick O. <Der...@ro...> - 2002-12-29 15:50:41
|
Somik, Finally reproduced it by backing down my JVM to 1.2. Code fix and test modifications dropped. Sorry about that agent 99. Derrick Somik Raha wrote: > Hi Derrick, > Are you sure ? I have the latest version here too - but the two > tests are consistently failing. > Here's what I see : > Charset should be Shift_JIS but was ISO-8859-1 > > for both failures. > > Regards, > Somik |
From: Somik R. <so...@ya...> - 2002-12-29 08:09:57
|
Hi Folks, The integration release for this week is out. You can download it from http://htmlparser.sourceforge.net Integration build 1.3 - 20021228 -------------------------------- [1] Added URLConnection constructors to HTMLParser [2] Honour charset parameter on HTTP header and in HTML meta tag [3] Following tags now inherit from HTMLCompositeTag (i) HTMLFormTag (ii) HTMLLinkTag (iii) HTMLSelectTag (iv) HTMLFrameSetTag (v) HTMLTitleTag (vi) HTMLTextAreaTag (vii) HTMLStyleTag (viii) HTMLScriptTag (ix) HTMLAppletTag [4] Performed Refactoring "Introduce Parameter Object" on HTMLTag, HTMLCompositeTag, HTMLLinkTag, HTMLFormTag [5] Refactored HTMLFormTag, pulling up the search methods into HTMLCompositeTag [6] Added HTMLVector, which can return HTMLSimpleEnumeration - a no-exception flavor of HTMLEnumeration [7] Refactored HTMLEnumeration - created new interface - HTMLPeekingEnumeration Notes : HTMLVector is not yet integrated with the tags. That should happen in the next release. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-29 01:21:20
|
Hi Derrick, Are you sure ? I have the latest version here too - but the two = tests are consistently failing.=20 Here's what I see : Charset should be Shift_JIS but was ISO-8859-1 for both failures. =20 Regards, Somik ----- Original Message -----=20 From: Derrick Oswald=20 To: htm...@li...=20 Sent: Saturday, December 28, 2002 5:17 PM Subject: Re: [Htmlparser-developer] Charset tests failing Somik, Sorry, just got back from the pilgrimage to Bethlehem (Toronto). All 268 tests are running OK against the version I took just now. Perhaps www.ibm.co.jp or www.sony.co.jp weren't available for you when = you tried it. Try again, please. The tests can't be all in our repository, specifically the = testHTTPCharset relies on an HTTP header charset parameter being set (by = the HTTP server) to something other than ISO-8859-1 which the = sourceforge people are unlikely to set up for us. We could put the = testHTMLCharset under the sourceforge domain though. Will get to it as = time permits. Derrick Somik Raha wrote: Hi Derrick, I was working on the latest parser, and just found that the = charset tests are failing (I didnt modify anything in HTMLParser). Could you check what might have gone wrong ? Also, I was = wondering if it might not be better to have the charset pages on our = domain (http://htmlparser.sourceforge.net) - we could put up encoded = pages and they'd always be there. =20 Regards Somik =20 |
From: Derrick O. <Der...@ro...> - 2002-12-29 01:14:58
|
Somik, Sorry, just got back from the pilgrimage to Bethlehem (Toronto). All 268 tests are running OK against the version I took just now. Perhaps www.ibm.co.jp or www.sony.co.jp weren't available for you when you tried it. Try again, please. The tests can't be all in our repository, specifically the testHTTPCharset relies on an HTTP header charset parameter being set (by the HTTP server) to something other than ISO-8859-1 which the sourceforge people are unlikely to set up for us. We could put the testHTMLCharset under the sourceforge domain though. Will get to it as time permits. Derrick Somik Raha wrote: > Hi Derrick, > I was working on the latest parser, and just found that the > charset tests are failing (I didnt modify anything in HTMLParser). > Could you check what might have gone wrong ? Also, I was wondering > if it might not be better to have the charset pages on our domain > (http://htmlparser.sourceforge.net) - we could put up encoded pages > and they'd always be there. > > Regards > Somik > |
From: Somik R. <so...@ya...> - 2002-12-29 00:11:42
|
Hi Claude, > If I may... Watch out for feature creep! Adding AI sounds like a panacea but it's not. I have considerable experience with AI myself and it rarely meets expectations unless you can clearly define the objectives up front. If you plan to train classifers, you'll need to collect a fairly large, relevant training set. Even then, this may not solve as many problems as it may cause and you need to make a good decision based on fact. > > Some of your messages suggest you are easily taken by wiz-bang technologies and cool new features (which I suffer from myself and need to balance). You need to consider these features especially carefully and realize that this bias can be detrimental under some circumstances. Certainly, the project needs to be fun, but don't get too caught up in feature envy. Make sure you keep the coupling loose through interfaces, especially if you add AI components, and please allow users to decide which features they want by keeping the architecture as pluggable as possible. This will benefit both your user and development communities. I agree with your opinion - I'm keeping the AI stuff a low priority for now, as I mentioned - focussing only on refactoring for the moment. However, I am hoping that we can do some evaluation, and decide if we should go for it in v1.4. The real problem is, we keep finding real-life dirty html that just does not follow any rules. Just sometime back, I found a bug in HTMLStringNode.. If we have html like : <script language="javascript"> var lower = '<%=lowerValue%>'; </script> the StringNode does not realize that the jsp tag within quotes is not to be handled as a tag but a string node. I found this problem while refactoring today. Upon modifying the parsing automata and including a PARSE_IGNORE_STATE for HTMLStringNode, I was able to handle this case. But thanks to the tests, I found that we had another failing test, wherein we had something like : <A href="somelink.html">Kaarle's homepage</A> As you can see, the single apostrophe caused the damage. Now, I had to code in extra logic to also verify that the next character must be an opening tag for the string node automaton to move into ignoring state. Good news is that all tests are passing. But I am not really sure that this philosophy of modifying the parsing automata logic is all that great. If cases such as this can be coded outside the parser - perhaps with regular expressions, we will no longer need to modify the code of the parser each time. However, there'd probably be a performance hit as compared to the system that we have now, and we've got to look at it really carefully before we go for it. We've been going vertical all this while, so I am just trying to go lateral to see where that might take us - just to generate more possibilities. I'm thinking that we could try a seperate implementation - writing the parsing logic from scratch, and using the existing test mechanism to verify. This could be a seperate module in the htmlparser project, for experimentation only - to evaluate feasibility. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-28 23:55:11
|
Hi Folks, I'm done removing all duplication of toHTML(). Now, most tags either = inherit from HTMLCompositeTag or use the standard toHTML() in HTMLTag. = This was only the first step. But here are some interesting results : [1] Code size of the package org.htmlparser.tags reduced from 1218 to = 1082 lines (not counting tags and empty lines). A new class has been = added - org.htmlparser.tests.codeMetrics.LineCounter. It can recurse = through java files in a given directory and get the total count. [2] Due to the refactoring, I uncovered a bug in HTMLStringNode, in the = handling of single quotes. [3] There were a lot of complaints earlier (if I remember right, mostly = from Dhaval) about toHTML() putting in its own line breaks. That = shouldn't happen anymore. Any toHTML() bugs will be much easier to fix = as the code is in one place, and is uniform. [4] We have the added benefit of getting textarea contents in = toPlainTextString() automatically, as it now inherits from a composite = tag. We should be able to have a release tomorrow. I am currently waiting = for Derrick to fix the two tests that are failing. Derrick --> if you = dont find the time this week, I can comment out the two failing methods = in this week's version. Regards, Somik |
From: Claude D. <CD...@ar...> - 2002-12-28 23:47:45
|
SWYgSSBtYXkuLi4gV2F0Y2ggb3V0IGZvciBmZWF0dXJlIGNyZWVwISBBZGRpbmcgQUkgc291bmRz IGxpa2UgYSBwYW5hY2VhIGJ1dCBpdCdzIG5vdC4gSSBoYXZlIGNvbnNpZGVyYWJsZSBleHBlcmll bmNlIHdpdGggQUkgbXlzZWxmIGFuZCBpdCByYXJlbHkgbWVldHMgZXhwZWN0YXRpb25zIHVubGVz cyB5b3UgY2FuIGNsZWFybHkgZGVmaW5lIHRoZSBvYmplY3RpdmVzIHVwIGZyb250LiBJZiB5b3Ug cGxhbiB0byB0cmFpbiBjbGFzc2lmZXJzLCB5b3UnbGwgbmVlZCB0byBjb2xsZWN0IGEgZmFpcmx5 IGxhcmdlLCByZWxldmFudCB0cmFpbmluZyBzZXQuIEV2ZW4gdGhlbiwgdGhpcyBtYXkgbm90IHNv bHZlIGFzIG1hbnkgcHJvYmxlbXMgYXMgaXQgbWF5IGNhdXNlIGFuZCB5b3UgbmVlZCB0byBtYWtl IGEgZ29vZCBkZWNpc2lvbiBiYXNlZCBvbiBmYWN0LiANCiANClNvbWUgb2YgeW91ciBtZXNzYWdl cyBzdWdnZXN0IHlvdSBhcmUgZWFzaWx5IHRha2VuIGJ5IHdpei1iYW5nIHRlY2hub2xvZ2llcyBh bmQgY29vbCBuZXcgZmVhdHVyZXMgKHdoaWNoIEkgc3VmZmVyIGZyb20gbXlzZWxmIGFuZCBuZWVk IHRvIGJhbGFuY2UpLiBZb3UgbmVlZCB0byBjb25zaWRlciB0aGVzZSBmZWF0dXJlcyBlc3BlY2lh bGx5IGNhcmVmdWxseSBhbmQgcmVhbGl6ZSB0aGF0IHRoaXMgYmlhcyBjYW4gYmUgZGV0cmltZW50 YWwgdW5kZXIgc29tZSBjaXJjdW1zdGFuY2VzLiBDZXJ0YWlubHksIHRoZSBwcm9qZWN0IG5lZWRz IHRvIGJlIGZ1biwgYnV0IGRvbid0IGdldCB0b28gY2F1Z2h0IHVwIGluIGZlYXR1cmUgZW52eS4g TWFrZSBzdXJlIHlvdSBrZWVwIHRoZSBjb3VwbGluZyBsb29zZSB0aHJvdWdoIGludGVyZmFjZXMs IGVzcGVjaWFsbHkgaWYgeW91IGFkZCBBSSBjb21wb25lbnRzLCBhbmQgcGxlYXNlIGFsbG93IHVz ZXJzIHRvIGRlY2lkZSB3aGljaCBmZWF0dXJlcyB0aGV5IHdhbnQgYnkga2VlcGluZyB0aGUgYXJj aGl0ZWN0dXJlIGFzIHBsdWdnYWJsZSBhcyBwb3NzaWJsZS4gVGhpcyB3aWxsIGJlbmVmaXQgYm90 aCB5b3VyIHVzZXIgYW5kIGRldmVsb3BtZW50IGNvbW11bml0aWVzLg0KIA0KLS0tLS1PcmlnaW5h bCBNZXNzYWdlLS0tLS0gDQpGcm9tOiBTb21payBSYWhhIFttYWlsdG86c29taWtAeWFob28uY29t XSANClNlbnQ6IFNhdCAxMi8yOC8yMDAyIDEyOjIwIEFNIA0KVG86IGh0bWxwYXJzZXItZGV2ZWxv cGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiANClN1YmplY3Q6IFJlOiBbSHRtbHBhcnNl ci1kZXZlbG9wZXJdIHRvUGxhaW5UZXh0U3RyaW5nKCkgZmVlZGJhY2sgcmVxdWVzdGVkDQoNCg0K DQoJSGkgQ2xhdWRlLA0KCT5JIHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mg c2l6ZSBpcyBoaXR0aW5nIGNyaXRpY2FsIG1hc3MgYW5kDQoJdGhhdCBtaW5pbWFsIGZvcm1hbGl0 aWVzIGFyZSBub3cgcmVxdWlyZWQuLi4NCgkNCglHb29kIHRvIGhlYXIgZnJvbSB5b3UuIEkgcXVp dGUgYWdyZWUgdGhhdCB0aGlzIHByb2plY3Qgc2VlbXMgdG8gYmUgcmVhY2hpbmcNCgljcml0aWNh bCBtYXNzLiBXZSBuZWVkIHRvIHNoYXJlIGEgcGxhbiBvZiBhY3Rpb24gdGhhdCB3ZSBjb3VsZCBh bGwgd29yayBvbg0KCXBhcmFsbGVsbHkuDQoJDQoJRmlyc3QsIEkndmUgYmVlbiBub3RpY2luZyB0 aGF0IHdlJ3JlIG5vdCBjb25zaXN0ZW50IG9uIHRoZSBjb2RpbmcNCgljb252ZW50aW9uLiBJbiB0 aGUgcGFzdCwgaWYgSSBzZWUgYSBjbGFzcyB0aGF0IGRvZXMgbm90IGZvbGxvdyB0aGUNCgljb252 ZW50aW9uLCBJJ2QgZ28gYWhlYWQgYW5kIGZpeCBpdC4gSG93ZXZlciwgYXMgc28gbWFueSBwZW9w bGUgYXJlIG5vdw0KCXdvcmtpbmcgcGFyYWxsZWxseSwgaXRzIGltcG9ydGFudCB0aGF0IHdlIGFs bCBzaGFyZSB0aGUgc2FtZSBjb2RpbmcNCgljb252ZW50aW9uLiBUaGUgb25lIEkgZm9sbG93IGlz IHRoZSBzdGFuZGFyZCBKYXZhIGNvZGluZyBjb252ZW50aW9uLiBJbg0KCWFkZGl0aW9uLCBJIHBh cnRpY3VsYXJseSBkb24ndCBsaWtlIHVzaW5nIG5hbWVzIHdpdGggc2luZ2xlIGNoYXJhY3RlcnMg aW4NCgl0aGVtIC0gaS5lLiBJJ2QgcHJlZmVyICJ0YWciIHRvICJwVGFnIi4gVGhhdCBpcyB0aGUg Y29udmVudGlvbiBmb2xsb3dlZCBpbg0KCW1vc3Qgb2YgdGhlIHBhcnNlci4gVGhlIGNvZGluZyBj b252ZW50aW9uIGlzIG9mIGNvdXJzZSBvcGVuIGZvciBkaXNjdXNzaW9uDQoJaWYgYW55b25lIGZl ZWxzIHRoYXQgd2Ugc2hvdWxkIGFjY29tb2RhdGUgYW55IHBhcnRpY3VsYXIgcHJhY3RpY2UuDQoJ DQoJU2Vjb25kLCBJIGFtIHBsYW5uaW5nIHRvIHNpbXBsaWZ5IHRoZSBkZXNpZ24gb2YgdGhlIHNj YW5uZXJzLiBBcyBEaGF2YWwNCgltZW50aW9uZWQgZWFybGllciwgYW5kIGFzIFNhbSBtZW50aW9u ZWQsIGl0IHdhc24ndCBlYXN5IHRvIHdyaXRlIGEgbmV3DQoJc2Nhbm5lci4gSSBhZ3JlZSB0aGF0 IHRoZSBqYXZhZG9jIG9mIGdldFBhcnNlZCgpIHNob3VsZCBiZSBiZXR0ZXIgdGhhbiB3aGF0DQoJ aXQgaXMuIEhvd2V2ZXIsIEkgdGhpbmsgdGhlIHByb2JsZW0gbGllcyB3aXRoIHRoZSBuYW1lIGl0 c2VsZi4gSW4gdGhlIHNlbnNlLA0KCWlmIHdlIHJlbmFtZSAicGFyc2VkIiB0byAicGFyYW1ldGVy VGFibGUiIG9yICJhdHRyaWJ1dGVUYWJsZSIgLSBhbmQNCglnZXRQYXJzZWQoKSBiZWNvbWVzLCBn ZXRBdHRyaWJ1dGVUYWJsZSgpIC0gdGhlIGphdmFkb2Mgd291bGQgYmVjb21lDQoJcmVkdW5kYW50 LiBZb3UgY291bGQgc2F5IHRoYXQgdGhpcyBpcyBhICJkb2N1bWVudGF0aW9uIiBjaGFuZ2UsIG9y IHlvdSBjb3VsZA0KCWNhbGwgdGhpcyBhICJyZWZhY3RvcmluZyIuIEVpdGhlciB3YXksIGl0IGlz IGRldmVsb3BlciBhbmQgdXNlci1mcmllbmRseS4gU28NCglJIHRoaW5rIFNhbSdzIHN1Z2dlc3Rp b25zIGFuZCBvdXIgY3VycmVudCBkaXJlY3Rpb24gYXJlIG9uZSBhbmQgdGhlIHNhbWUuDQoJDQoJ VGhpcmQsIHRoZXJlJ3Mgd2F5IHRvbyBtdWNoIGR1cGxpY2F0ZSBjb2RlIGluIHRoZSB0YWdzLiBB bG1vc3QgZXZlcnkgdGFnDQoJcnVucyB0aHJ1IGl0cyBhdHRyaWJ1dGVzIHRvIHJlbmRlciBpdCBh cyBodG1sIC0gd2hlcmVhcyB0aGlzIGNhbiBiZSBkb25lDQoJZnJvbSB0aGUgc3VwZXJjbGFzcyAt IEhUTUxUYWcuIFRoaXMgaXMganVzdCBvbmUgZXhhbXBsZS4gSSd2ZSBub3QgYmVlbg0KCWZvY3Vz c2luZyBtdWNoIG9uIHJlZHVjaW5nIGR1cGxpY2F0aW9uIC0gYW5kIHdlIGNhbiBzZWUgc2VyaW91 cyBjb2RlIHNtZWxscy4NCglIb3dldmVyLCBJIHRoaW5rIGl0IGlzIGltcG9ydGFudCB0byBiZSBh YmxlIHRvIHF1YW50aWZ5IHdoYXQgd2UgaG9wZSB0bw0KCWFjaGlldmUgYnkgdW5kZXJ0YWtpbmcg c2V2ZXJhbCByb3VuZHMgb2YgcmVmYWN0b3JpbmcuIEFzIHdlIGtlZXANCglyZWZhY3RvcmluZywg SSB3aWxsIGtlZXAgYSB0YWIgb24gdGhlIHNpemUgb2YgdGhlIHBhcnNlciAodXNpbmcgTGluZXMg b2YNCglDb2RlKS4gSWYgdGhlcmUncyBhbnkgb3RoZXIgbWV0cmljIHRoYXQgcGVvcGxlIHRoaW5r IGlzIGltcG9ydGFudCwgd2UgY291bGQNCglpbmNsdWRlIHRoYXQgYXMgd2VsbC4gSXQgd2lsbCBi ZSBpbnRlcmVzdGluZyB0byBzZWUgaG93IG11Y2ggY29kZSB3ZSBjYW4NCgl0YWtlIG91dC4NCgkN CglGb3VydGgsIHRoaXMgaXMgb25lIHByb2plY3Qgd2hlcmUgY2hhbmdlIGlzIHdlbGNvbWUgLSBh bGwgdGhlIHRpbWUuIFNpbXBseQ0KCWJjb3MgaXQgaGFzIGEgbWFzc2l2ZSBzdWl0ZSBvZiB0ZXN0 cyB3aGljaCByZXByZXNlbnRzIG1vc3Qgb2YgdGhlIHVzZXINCglyZXF1aXJlbWVudHMgdGhhdCB3 ZSd2ZSByZWNlaXZlZCB0aWxsIGRhdGUuIFRoZSByZWFsLXZhbHVlIGluIHRoZSBwYXJzZXIgaXMN Cglub3QgcmVhbGx5IGl0cyBwcm9kdWN0aW9uIGNvZGUgLSBidXQgaXRzIHNvbGlkIHRlc3Qgc3Vp dGUuIElmIHRoZXJlIGFyZQ0KCWludGVyZXN0aW5nIGRlc2lnbiBjaGFuZ2VzIHRoYXQgd2lsbCBt YWtlIHRoZSBjdXJyZW50IHN5c3RlbSBzaW1wbGVyLCB0aGV5DQoJc2hvdWxkIGJlIHRyaWVkIG91 dCB3aXRob3V0IGZlYXIgLSBmb3IgaWYgd2UgYnJlYWsgYW55dGhpbmcsIHdlIHdpbGwga25vdw0K CWJjb3Mgb2YgdGhlIGF1dG9tYXRlZCB0ZXN0aW5nLiBJdCBpcyBpbXBvcnRhbnQgdGhhdCB0aGUg dGVzdGluZyBzdWl0ZSBpcw0KCXRoZXJlZm9yZSByZWdhcmRlZCBqdXN0IGFzIGltcG9ydGFudCAo aWYgbm90IG1vcmUpIGFzIHRoZSBwcm9kdWN0aW9uIGNvZGUsDQoJYW5kIHN0ZXBzIHRha2VuIHRv IGltcHJvdmUgaXRzIGRlc2lnbiwgYW5kIG1ha2UgaXQgdmVyeSBlYXN5IHRvIGFkZCBuZXcNCgl0 ZXN0cyBhbGwgdGhlIHRpbWUuDQoJDQoJRmlmdGgsIGFzIHRoZSBwcm9qZWN0IGdyb3dzIC0gbmV3 IGZlYXR1cmVzIHNob3VsZCBrZWVwIGNvbWluZyBpbiBhbGwgdGhlDQoJdGltZS4gRGVycmljayBP c3dhbGQgaGFzIGJlZW4gd29ya2luZyBhd2F5IGF0IHNvIG1hbnkgY29vbCBuZXcgZmVhdHVyZXMg LQ0KCWFuZCBoZSdzIGJlZW4gd3JpdGluZyB0ZXN0cyBmb3IgdGhlbSBhbGwgLSB0aGlzIGlzIGEg bXVjaC1hcHByZWNpYXRlZA0KCWFjdGl2aXR5IGFuZCBtdXN0IGNvbnRpbnVlLiBJJ2QgYmUgaGFw cHkgdG8gbm90IHRoaW5rIGFib3V0IG5ldyBmZWF0dXJlcyBhbmQNCgljb21wbGV0ZWx5IGZvY3Vz IG9uIHNvbGlkaWZ5aW5nIHRoZSBleGlzdGluZyBzeXN0ZW0sIGFuZCBsZXQgRGVycmljayBmb2N1 cw0KCW9uIGFkZGluZyBuZXcgc3R1ZmYgLSBmb3IgdjEuMy4NCgkNCglTaXh0aCwgd2UgaGF2ZSBh IHZlcnkgc2VyaW91cyBwb3NzaWJpbGl0eSBvZiB1c2luZyBBSSBpbiBtYWtpbmcgdGFnDQoJcmVj b2duaXRpb25zLiBJIGhhdmUgdG8gZmluZCB0aGUgdGltZSB0byB3cml0ZSBhYm91dCB0aGUgY3Vy cmVudCByZWNvZ25pdGlvbg0KCW1lY2hhbmlzbSwgYW5kIHBhc3MgdGhhdCBvbiB0byBTYW0gSm9z ZXBoLiBTYW0gaGFzIGNvbnNpZGVyYWJsZSBleHBlcmllbmNlDQoJaW4gYXJ0aWZpY2lhbCBpbnRl bGxpZ2VuY2UsIGFuZCBpdCB3aWxsIGJlIGdvb2QgZm9yIHRoZSBwcm9qZWN0IHRvIGhhdmUgaGlz DQoJZXhwZXJ0aXNlIGluIGl0cyBjb3JlIGNvcnJlY3Rpb24gbG9naWMuDQoJDQoJU2V2ZW50aCwg aXQgbWlnaHQgYmUgZ29vZCB0byBoYXZlIGEgV2lraSBmb3IgdGhlIHBhcnNlciAtIGFzIHNvIG1h bnkgb3Blbg0KCXNvdXJjZSBwcm9qZWN0cyBkby4gVGhhdCB3YXksIHRoZSBlbnRpcmUgYnVyZGVu IG9mIGRvY3VtZW50YXRpb24gaXMgbm90IG9uDQoJYW55IG9uZSBwZXJzb24uIFdlIHNob3VsZCBi ZSBsb29raW5nIGF0IG9wdGlvbnMgb2YgaGF2aW5nIG91ciBvd24gV2lraSBzbyB3ZQ0KCWNhbiBh ZGQgY29udGVudCBlYXNpbHkgYW5kIGNvbGxhYm9yYXRpdmVseS4NCgkNCglQbHMgZmVlbCBmcmVl IHRvIGFkZC9tb2RpZnkgdGhpcyBsaXN0IGlmIEkndmUgbWlzc2VkIG91dCBvbiBhbnl0aGluZy4N CgkNCglSZWdhcmRzLA0KCVNvbWlrDQoJDQoJDQoJDQoJLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRoaXMgc2YubmV0IGVtYWlsIGlzIHNw b25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdlZWsgaGVhdmVuLg0KCWh0dHA6Ly90 aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f X19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWlsaW5nIGxpc3QNCglIdG1scGFyc2Vy LWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglodHRwczovL2xpc3RzLnNvdXJjZWZv cmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRldmVsb3Blcg0KCQ0KDQo= |
From: Sam J. <ga...@yh...> - 2002-12-28 10:23:17
|
Hi Somik Somik Raha wrote: >>If we want to call >>changing method names and adding documentation, refactoring then I guess >>I am a huge fan of refactoring :-) >> >> > >Changing method names is definitely refactoring. The idea is to use really >good names that make the javadoc redundant. If you can explain what javadoc >would achieve for a method like getAttributes() - I'd be happy to add >javadoc for that. > I don't think good method names make javadoc redundant, and I don't think they can. For example Hashtable getAttributes() tells me very little. What is an attribute? What are the classes in the hashtable? Give me three examples of what an attribute is and then you might have a good javadoc, e.g. [Gets a hashtable of attribute-value pairs from the html tag, e.g. type-->hidden, name-->address, value-->default would come from the tag <input type="hidden" name="address" value="default">. Attributes and values are all String classes. ] Now I think that the method getAttributes with the above javadoc would be much easier to use and understand for somebody trying to use htmlparser for the first or even the third time, no? I think you need to be careful about what is obvious for you, who have designed the system, and what is obvious for a novice user. >>However the other part of what I was >>saying is that I would like to see the documentation additions before we >>change method names. I mean you can deprecate away so as not to break >>my existing code ... I still think you should release a version with a >>full javadoc "refactoring" before you release anything with refactored >>code. What's the harm in focusing on the documentation first? >> >> > >Simply bcos once the name is changed to a better one, docs are no longer >necessary. > See my point above. I think you make more than a name change to make the api novice-user friendly. And my comments about the need for more detailed javadocs go right through the whole project, not just the few examples I've given. >> I mean we're on version 1.2 now right? What about a version 1.2.1 that >>included the documentation fixes, before moving on to a 1.3-beta that >>included the newly refactored code that you're so keen to start work on. >> What would be the downside of proceeding in this fashion? >> >> > >Duplicate work. Maintaining parallel branches of code is the last thing we >need to do, considering limited resources. I don't see a problem in going >straight with 1.3-beta as we've got all our tests in place, and we'll also >use deprecated methods. That should work for you, right ? > Sure, having deprecated methods works for me, but it's not just about me. I mean you've already released 1.2 People are already using it, building code around it. You have to maintain it anyway, or rather users will continue asking you questions about it. I'm suggesting that a 1.2.1 improved documentation release would mean less work in having to maintain things because the users could work from the javadocs. I would also think that going through and writing thorough javadocs for your existing version would put you in a much better position to write a good 1.3-beta. Limited resources is an important concern, and I think you'll conserve resources by more thorough documentation first and re-writing your code second. As long as you have deprecated methods then my own code base will be fine, so it makes little difference to me in the short term which way you go. However I think in the long term, a 1.2.1 documentation release would improve the quality of the project overall, and set a good precedent for high quality javadocs in subsequent releases. I get the sense that you are really keen to plow into re-writing all the code, what's the hurry? >>This sounds like an interesting possibility, but I still need to >>understand how the current parser handles all the existing messy tags. >>I mean when you started talking through all the messy html examples I >>though you were going to be showing me something that didn't work and >>that you wanted some machine learning/AI to fix, but you ended up saying >>that all the examples could be handled by the exisiting parser. If that >>is so, why do you want to change it, and are there examples of messy >>code that can't be handled by the parser? If there aren't and its all >>working fine, what advantage do we gain by replacing the existing core >>correction logic with some hypothetical AI alternative? >> >> > >The main problem is that the current logic is complicated, and uses >hard-coded logic. Such a system is hard to maintain, change. I was thinking >on the lines of a rule-based system, where the rule-base could be dynamic. >But, I will write a seperate mail on this issue as soon as I get some time. > Ah, I think I understand. What you want is some framework for dealing with new types of mess as they come up. A system that can have more rules added to without too much work, or even a framework for trying to guess which rules to apply in which document ... >>>Seventh, it might be good to have a Wiki for the parser - as so many open >>>source projects do. That way, the entire burden of documentation is not >>> >>> >on > > >>>any one person. We should be looking at options of having our own Wiki so >>> >>> >we > > >>>can add content easily and collaboratively. >>> >>> >>> >>You are most welcome to use the wiki I have set up in the short term: >> >>http://www.neurogrid.net/devwiki/wikipages/HtmlParser >> >>It uses an open source java wiki (devwiki), which is not as fully >>functional as it might be, but I've got it set up so everything gets >>backed up to CVS and all changes get sent to the neurogrid-cvs mailing >>list. If nothing else it might serve as an example wiki environment. >> >> > >Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with >current website) as in logos, page color, etc.. > I guess so. I think that could be arranged. Send me a copy of the logo etc., and I'll see what I can do, although it might take a couple of weeks for me to find time to do it. CHEERS> SAM |
From: Somik R. <so...@ya...> - 2002-12-28 09:39:39
|
Hi Sam, > If we want to call > changing method names and adding documentation, refactoring then I guess > I am a huge fan of refactoring :-) Changing method names is definitely refactoring. The idea is to use really good names that make the javadoc redundant. If you can explain what javadoc would achieve for a method like getAttributes() - I'd be happy to add javadoc for that. > However the other part of what I was > saying is that I would like to see the documentation additions before we > change method names. I mean you can deprecate away so as not to break > my existing code ... I still think you should release a version with a > full javadoc "refactoring" before you release anything with refactored > code. What's the harm in focusing on the documentation first? Simply bcos once the name is changed to a better one, docs are no longer necessary. But I understand your predicament. We could have a deprecated method name for important methods. > I mean we're on version 1.2 now right? What about a version 1.2.1 that > included the documentation fixes, before moving on to a 1.3-beta that > included the newly refactored code that you're so keen to start work on. > What would be the downside of proceeding in this fashion? Duplicate work. Maintaining parallel branches of code is the last thing we need to do, considering limited resources. I don't see a problem in going straight with 1.3-beta as we've got all our tests in place, and we'll also use deprecated methods. That should work for you, right ? > >Sixth, we have a very serious possibility of using AI in making tag > >recognitions. I have to find the time to write about the current recognition > >mechanism, and pass that on to Sam Joseph. Sam has considerable experience > >in artificial intelligence, and it will be good for the project to have his > >expertise in its core correction logic. > > > This sounds like an interesting possibility, but I still need to > understand how the current parser handles all the existing messy tags. > I mean when you started talking through all the messy html examples I > though you were going to be showing me something that didn't work and > that you wanted some machine learning/AI to fix, but you ended up saying > that all the examples could be handled by the exisiting parser. If that > is so, why do you want to change it, and are there examples of messy > code that can't be handled by the parser? If there aren't and its all > working fine, what advantage do we gain by replacing the existing core > correction logic with some hypothetical AI alternative? The main problem is that the current logic is complicated, and uses hard-coded logic. Such a system is hard to maintain, change. I was thinking on the lines of a rule-based system, where the rule-base could be dynamic. But, I will write a seperate mail on this issue as soon as I get some time. > >Seventh, it might be good to have a Wiki for the parser - as so many open > >source projects do. That way, the entire burden of documentation is not on > >any one person. We should be looking at options of having our own Wiki so we > >can add content easily and collaboratively. > > > You are most welcome to use the wiki I have set up in the short term: > > http://www.neurogrid.net/devwiki/wikipages/HtmlParser > > It uses an open source java wiki (devwiki), which is not as fully > functional as it might be, but I've got it set up so everything gets > backed up to CVS and all changes get sent to the neurogrid-cvs mailing > list. If nothing else it might serve as an example wiki environment. Thanks, Sam! Is it possible to give it an HTMLParser look (consistent with current website) as in logos, page color, etc.. Cheers, Somik |
From: Sam J. <ga...@yh...> - 2002-12-28 09:25:17
|
Hi Somik Somik Raha wrote: >Second, I am planning to simplify the design of the scanners. As Dhaval >mentioned earlier, and as Sam mentioned, it wasn't easy to write a new >scanner. I agree that the javadoc of getParsed() should be better than what >it is. However, I think the problem lies with the name itself. In the sense, >if we rename "parsed" to "parameterTable" or "attributeTable" - and >getParsed() becomes, getAttributeTable() - the javadoc would become >redundant. You could say that this is a "documentation" change, or you could >call this a "refactoring". Either way, it is developer and user-friendly. So >I think Sam's suggestions and our current direction are one and the same. > > I think that just changing the name of the method is not quite enough. Certainly changing the name of the method is a good step, but I think javadoc comments that actually show an example of what is being returned make a huge usability difference, particularly when the thing being returned is something as complex as a hashtable. If we want to call changing method names and adding documentation, refactoring then I guess I am a huge fan of refactoring :-) However the other part of what I was saying is that I would like to see the documentation additions before we change method names. I mean you can deprecate away so as not to break my existing code ... I still think you should release a version with a full javadoc "refactoring" before you release anything with refactored code. What's the harm in focusing on the documentation first? I mean we're on version 1.2 now right? What about a version 1.2.1 that included the documentation fixes, before moving on to a 1.3-beta that included the newly refactored code that you're so keen to start work on. What would be the downside of proceeding in this fashion? >Sixth, we have a very serious possibility of using AI in making tag >recognitions. I have to find the time to write about the current recognition >mechanism, and pass that on to Sam Joseph. Sam has considerable experience >in artificial intelligence, and it will be good for the project to have his >expertise in its core correction logic. > This sounds like an interesting possibility, but I still need to understand how the current parser handles all the existing messy tags. I mean when you started talking through all the messy html examples I though you were going to be showing me something that didn't work and that you wanted some machine learning/AI to fix, but you ended up saying that all the examples could be handled by the exisiting parser. If that is so, why do you want to change it, and are there examples of messy code that can't be handled by the parser? If there aren't and its all working fine, what advantage do we gain by replacing the existing core correction logic with some hypothetical AI alternative? >Seventh, it might be good to have a Wiki for the parser - as so many open >source projects do. That way, the entire burden of documentation is not on >any one person. We should be looking at options of having our own Wiki so we >can add content easily and collaboratively. > You are most welcome to use the wiki I have set up in the short term: http://www.neurogrid.net/devwiki/wikipages/HtmlParser It uses an open source java wiki (devwiki), which is not as fully functional as it might be, but I've got it set up so everything gets backed up to CVS and all changes get sent to the neurogrid-cvs mailing list. If nothing else it might serve as an example wiki environment. CHEERS> SAM |
From: Somik R. <so...@ya...> - 2002-12-28 08:19:10
|
Hi Claude, >I think it may well be that the group's size is hitting critical mass and that minimal formalities are now required... Good to hear from you. I quite agree that this project seems to be reaching critical mass. We need to share a plan of action that we could all work on parallelly. First, I've been noticing that we're not consistent on the coding convention. In the past, if I see a class that does not follow the convention, I'd go ahead and fix it. However, as so many people are now working parallelly, its important that we all share the same coding convention. The one I follow is the standard Java coding convention. In addition, I particularly don't like using names with single characters in them - i.e. I'd prefer "tag" to "pTag". That is the convention followed in most of the parser. The coding convention is of course open for discussion if anyone feels that we should accomodate any particular practice. Second, I am planning to simplify the design of the scanners. As Dhaval mentioned earlier, and as Sam mentioned, it wasn't easy to write a new scanner. I agree that the javadoc of getParsed() should be better than what it is. However, I think the problem lies with the name itself. In the sense, if we rename "parsed" to "parameterTable" or "attributeTable" - and getParsed() becomes, getAttributeTable() - the javadoc would become redundant. You could say that this is a "documentation" change, or you could call this a "refactoring". Either way, it is developer and user-friendly. So I think Sam's suggestions and our current direction are one and the same. Third, there's way too much duplicate code in the tags. Almost every tag runs thru its attributes to render it as html - whereas this can be done from the superclass - HTMLTag. This is just one example. I've not been focussing much on reducing duplication - and we can see serious code smells. However, I think it is important to be able to quantify what we hope to achieve by undertaking several rounds of refactoring. As we keep refactoring, I will keep a tab on the size of the parser (using Lines of Code). If there's any other metric that people think is important, we could include that as well. It will be interesting to see how much code we can take out. Fourth, this is one project where change is welcome - all the time. Simply bcos it has a massive suite of tests which represents most of the user requirements that we've received till date. The real-value in the parser is not really its production code - but its solid test suite. If there are interesting design changes that will make the current system simpler, they should be tried out without fear - for if we break anything, we will know bcos of the automated testing. It is important that the testing suite is therefore regarded just as important (if not more) as the production code, and steps taken to improve its design, and make it very easy to add new tests all the time. Fifth, as the project grows - new features should keep coming in all the time. Derrick Oswald has been working away at so many cool new features - and he's been writing tests for them all - this is a much-appreciated activity and must continue. I'd be happy to not think about new features and completely focus on solidifying the existing system, and let Derrick focus on adding new stuff - for v1.3. Sixth, we have a very serious possibility of using AI in making tag recognitions. I have to find the time to write about the current recognition mechanism, and pass that on to Sam Joseph. Sam has considerable experience in artificial intelligence, and it will be good for the project to have his expertise in its core correction logic. Seventh, it might be good to have a Wiki for the parser - as so many open source projects do. That way, the entire burden of documentation is not on any one person. We should be looking at options of having our own Wiki so we can add content easily and collaboratively. Pls feel free to add/modify this list if I've missed out on anything. Regards, Somik |
From: Sam J. <ga...@yh...> - 2002-12-28 04:23:04
|
Hi, I realise that unless I get specific on the issue of documentation I am unlikely to be taken seriously http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/HTMLTag.html is from the java docs - here are some javadoc comments Hashtable getParsed() "Gets the parsed" static java.util.Vector getStrictTags() "Gets the strictTags" java.lang.String getTagLine() "Returns the tagLine" java.lang.String getTagName() "" Personally I would like to see a good couple of lines of text for each of these. Some are more important than others. I mean getTagName might seem self explanatory, but I would like to see more, e.g. Hashtable getParsed() "Gets a hashtable of attribute-value pairs from the html tag, e.g. type-->hidden, name-->address, value-->default" static java.util.Vector getStrictTags() "<not sure what strict tags are, would be nice if there was a description>" java.lang.String getTagLine() "Returns the line from the document that this tag was found on?" java.lang.String getTagName() "get the name of the tag itself, e.g. BODY, TR, TABLE etc." Personally I found the lack of javadocs on getParsed() particularly frustrating. I had to go into the code in some depth to find out what it was, and that it was the thing I needed. Naturally the beauty of open source is that I can actually do that. But I would have preferred to have discovered the information I needed from the javadocs. And I would prefer as a *user* that you would expend your efforts updating the documentation on the existing code base rather than refactoring. CHEERS> SAM Sam Joseph wrote: > Somik replied to my issues about documentation saying that he had been > working on it and asking me to point out other places where the > documentation could be improved. I don't have time right now to go > through and make those points, |
From: Sam J. <ga...@yh...> - 2002-12-28 04:02:55
|
Joshua Kerievsky wrote: >Sam wrote: > > >>As I mention above. One should be careful of duplication removal for >>the sake of it. >> >> > >And as I mentioned above, I completely disagree with you. I wonder, how >much refactoring have you done in your career given you philosophy of "if it >ain't broke, don't fix it?" Have you read Martin Fowler's landmark book, >Refactoring? If not, I'd suggest you study it thoroughly - you'll be a >better programmer for it. > I must say that I have yet to be convinced of the value of continous refactoring. I think you can have too much of a good thing. I refactor a lot in my code. I am constantly looking at how to design things more efficiently, how to implement a number of different pieces of code in a smaller, tidier way, and also I am well aware that too much refactoring can be dangerous. In your haste to refactor everything, consider that this is only the latest in a long line of computing fads, and none of them is a cure-all. Refactoring is just another technique, like OO, or unit testing. It will only help you if use it in moderation. Once you internalise this, you'll be a better programmer. Trust me. In my 15 years of coding experience I have seen projects collapse both from too much refactoring and too little. To believe that you will improve your project purely as a consequence of refactoring is just plain wrong in my experience. I am not saying that you shouldn't refactor in this case, I'm just saying that I think the effort spent on refactoring would be better spent on first improving the documentation. Somik replied to my issues about documentation saying that he had been working on it and asking me to point out other places where the documentation could be improved. I don't have time right now to go through and make those points, but as a *user* of htmlparser I would appreciate it if you would leave off any refactoring until you have gone through the entire code base and expanded every javadoc comment so that it really described what the method was doing in a manner understandable to someone unfamiliar with the code base. I think that if you direct your effort here first before refactoring you will have a greater increase in users in the long run. Don't get me wrong, I am not against refactoring, and I use the term "if it ain't broke don't fix it" partly in jest. My main point was about "documenting before fixing". I don't believe that any refactoring that you do will replace the need for good documentation, for the code base that *is already deployed and in use*. I am asking you as a *user* of htmlparser to focus on the documentation (and release a version including that latest documentation) before moving on with your refactoring. Remember it is dangerous for a software project to ignore the needs of its users. CHEERS> SAM |
From: Claude D. <CD...@ar...> - 2002-12-28 03:35:04
|
SSB0cmllZCB0byByZWZyYWluIGZyb20ganVtcGluZyBpbiwgYnV0IGhlcmUgYXJlIGEgY291cGxl IG9mIGNlbnRzIHdvcnRoIG9mIG9waW5pb24sIGp1c3QgZm9yIHRoZSBob2xsaWRheXMgOy0pLg0K IA0KSW4gbXkgdmlldywgZGlzdGlsbGluZyBzb2Z0d2FyZSBkb3duIHRvIGVzc2VudGlhbHMgaXMg cGFydCBvZiBpbXByb3ZpbmcgZGVzaWduLiBJdCBjdXRzIGRvd24gb24gY29kZSB0aGF0IHdvdWxk IGxhdGVyIG5lZWQgdG8gYmUgbWFpbnRhaW5lZC4gUmVkdW5kYW5jeSBpcyBleHBlbnNpdmUuIE1v cmUgY29kZSBuZWVkcyB0byBjaGFuZ2UgZXZlcnkgdGltZSBhIGJ1ZyBpcyBmaXhlZCwgcG90ZW50 aWFsbHkgbWlzc2luZyBvbmUgb2YgdGhlIHZhcmlhbnRzIGFuZCBpbnRyb2R1Y2luZyBkaWZmaWN1 bHQgdG8gZGlhZ25vc2Ugc2l0dWF0aW9ucy4gTGVzcyBjb2RlIGlzIGJldHRlci4gQXMgZGVzaWdu cyBpbXByb3ZlLCB0aGUgY29kZSBiYXNlIHdpbGwgdHlwaWNhbGx5IHNocmluay4gUmVmYWN0b3Jp bmcgYW5kIGFwcGx5aW5nIGVmZmVjdGl2ZSBwYXR0ZXJucyBpcyBrZXkgdG8gbWFraW5nIHRoaXMg cG9zc2libGUuDQogDQpUaGF0IGJlaW5nIHNhaWQsIEkgdGhpbmsgZ29vZCBzb2Z0d2FyZSBkZXNp Z24gaGluZ2VzIG9uIHdlbGwgZGVmaW5lZCB1c2UgY2FzZXMuIElmIGNoYW5nZXMgbmVlZCB0byBi ZSBtYWRlIHRvIG1lZXQgdGhlIHJlcXVpcmVtZW50cywgdGhhdCdzIHdoYXQgbmVlZHMgdG8gaGFw cGVuLiBSZWZhY3RvcmluZywgZGlzdGlsbGluZywgc2ltcGxpZnlpbmcsIHJlbW92aW5nIGRlYWQg Y29kZSBhbmQgZG9jdW1lbnRhdGlvbiBhcmUgYWxsIHBhcnQgb2YgZXZlcnkgcmVsZWFzZS4gUmVm YWN0b3JpbmcgZm9yIHRoZSBzYWtlIG9mIHJlZmFjdG9yaW5nIHNob3VsZCBiZSBhdm9pZGVkLCBh cyB3ZWxsIGFzIGZvcmdpbmcgYWhlYWQgd2l0aG91dCBhIHdlbGwtY29uc2lkZXJlZCBhbmQgcmV2 aWV3ZWQgcGxhbi4gSWYgdGhlIHBsYW4gc2hvd3MgdGhhdCByZWZhY3RvcmluZyB3aWxsIGltcHJv dmUgZGVzaWduLCBkZWxpdmVyIGJldHRlciBzb2Z0d2FyZSwgbWVldCByZXF1aXJlbWVudHMgYW5k IHNhdGlzZnkgdGFyZ2V0IHVzZSBjYXNlcywgdGhlbiBpdCBpcyBwcm9iYWJseSBhIGdvb2QgcGxh biBhbmQgc2hvdWxkIGJlIGltcGxlbWVudGVkLCBzbyBsb25nIGFzIHRoZXJlJ3MgZW5vdWdoIGFn cmVlbWVudCB0byB2YWxpZGF0ZSB0aGUgb2JqZWN0aXZlcyBhbmQgc3RyYXRlZ3kuDQogDQpJIHRo aW5rIG1vc3Qgb2YgdGhpcyBkaXNjdXNzaW9uIGNlbnRlcnMgb24gYSBsYWNrIG9mIGRvY3VtZW50 YXRpb24sIGJ1dCBub3QgdXNlciBkb2N1bWVudGF0aW9uLiBUaGUgZ3JvdXAgaXMgbGFja2luZyBp biBwcm9jZXNzLiBJZiBhIHBsYW4gaXMgd29ydGggcGVyc3VpbmcsIGl0J3Mgd29ydGggd3JpdGlu ZyB1cCBhIHN1bW1hcnksIHdpdGggc3VmZmljaWVudCBqdXN0aWZpY2F0aW9uIHRvIHNhdGlmeSB0 aGUgZ3JvdXAuIElmIHRoYXQgcGxhbiBpcyB0aGVuIHN1YmplY3RlZCB0byByZXZpZXcsIEkgZXhw ZWN0IGl0IGNhbiBiZSB2YWxpZGF0ZWQgcXVpY2tseSBhbmQgZXZlcnlvbmUgY2FuIGZlZWwgY29u ZmlkZW50IHRoYXQgcHJvcG9zZWQgY2hhbmdlcyB3aWxsIGltcHJvdmUgdGhlIGRlc2lnbiBhbmQg bm90IHNpbXBseSBpbnRyb2R1Y2UgbW9yZSB3b3JrIG9yIGNoYW5nZSBmb3IgdGhlIHNha2Ugb2Yg Y2hhbmdlLg0KIA0KSW4gcHJpbmNpcGxlLCBJIGhhdmVudCByZWFkIGFueXRoaW5nIG9uIHRoZSBs aXN0IHRoYXQgbWFkZSBtZSBuZXJ2b3VzLiBTaWxlbmNlIHNob3VsZCBiZSB0YWtlbiBhcyBpbXBs aWNpdCBzdXBwb3J0LiBJZiBhIGRvY3VtZW50IGlzIGNpcmN1bGF0ZWQgYW5kIG5vIG9iamVjdGlv bnMgYXJlIG5vdGVkLCB0aGVyZSBpcyBubyByZWFzb24gdG8gd2FpdCBpbmRlZmluaXRlbHksIHRo b3VnaCBhIHJlYXNvbmFibGUgcmV2aWV3IHByb2Nlc3Mgc2hvdWxkIHByb2JhYmx5IGJlIGFwcGxp ZWQuIENvbnRyaWJ1dG9ycyBhcmUgZG9pbmcgdGhlIHdvcmsgYW5kIGhhdmUgYSByaWdodCB0byBk aXJlY3QgdGhlaXIgb3duIGVmZm9ydHMuIEluIG15IGV4cGVyaWVuY2UsIGhvd2V2ZXIsIGV2ZW4g c2hvcnQgcHJvY2VzcyBkb2N1bWVudHMgY2FuIG1ha2UgYSB3b3JsZCBvZiBkaWZmZXJlbmNlLiBJ IHRoaW5rIGl0IG1heSB3ZWxsIGJlIHRoYXQgdGhlIGdyb3VwJ3Mgc2l6ZSBpcyBoaXR0aW5nIGNy aXRpY2FsIG1hc3MgYW5kIHRoYXQgbWluaW1hbCBmb3JtYWxpdGllcyBhcmUgbm93IHJlcXVpcmVk Li4uDQogDQpIYXBweSBOZXcgWWVhciBFdmVyeW9uZSENCiANCi0tLS0tT3JpZ2luYWwgTWVzc2Fn ZS0tLS0tIA0KRnJvbTogSm9zaHVhIEtlcmlldnNreSBbbWFpbHRvOmpvc2h1YUBpbmR1c3RyaWFs bG9naWMuY29tXSANClNlbnQ6IEZyaSAxMi8yNy8yMDAyIDEwOjMzIEFNIA0KVG86IGh0bWxwYXJz ZXItZGV2ZWxvcGVyQGxpc3RzLnNvdXJjZWZvcmdlLm5ldCANCkNjOiBodG1scGFyc2VyLXVzZXJA bGlzdHMuc291cmNlZm9yZ2UubmV0IA0KU3ViamVjdDogUmU6IFtIdG1scGFyc2VyLWRldmVsb3Bl cl0gdG9QbGFpblRleHRTdHJpbmcoKSBmZWVkYmFjayByZXF1ZXN0ZWQNCg0KDQoNCglTYW0gd3Jv dGU6DQoJPiBUaGUgVmlzaXRvciBwYXR0ZXJuIHNvdW5kcyBpbnRlcmVzdGluZywgYW5kIEkgbG9v ayBmb3J3YXJkIHRvIGhlYXJpbmcNCgk+IG1vcmUgYWJvdXQgaXQuICBIb3dldmVyLCBkdXBsaWNh dGVkIGNvZGUgaXRzZWxmIGlzIG5vdCBJTU8gbmVjZXNzYXJpbHkNCgk+IGFuIGV2aWwuIEl0IGFs bCBkZXBlbmRzIG9uIHdoZXRoZXIgb25lIHRoaW5rcyB0aGF0IHRoZSBkdXBsaWNhdGVkDQoJPiBj b21wb25lbnRzIGFyZSBnb2luZyB0byBkaXZlcmdlIGluIGZ1bmN0aW9uYWxpdHkgaW4gdGhlIGZ1 dHVyZS4gIElmIHlvdQ0KCT4gYXJlIHN1cmUgdGhleSBhcmUgbm90LCB0aGVuIGZpbmUsIHJlZmFj dG9yIGF3YXkuDQoJDQoJQXQgdGhlIG1vbWVudCwgU29taWsgYW5kIEkgc3BlY3VsYXRlIHRoYXQg NDAlIHRvIDUwJSBvZiB0aGUgY3VycmVudCBodG1sDQoJcGFyc2VyIGNvZGUgYmFzZSBpcyBwdXJl IGZhdCAtLSB1bm5lY2Vzc2FyeSBjb2RlIHRoYXQgb25seSBzZXJ2ZXMgdG8gYmxvYXQNCgl0aGUg Y29kZSBiYXNlLiAgRHVwbGljYXRlIGNvZGUsIGluIHN1YnRsZSBvciBub3Qtc28tc3VidGxlIHZl cnNpb25zLCBpcw0KCW1vc3RseSByZXNwb25zaWJsZSBmb3IgdGhlIGJsb2F0LiAgSW4gbXkgZXhw ZXJpZW5jZSBpbiBpbmR1c3RyeSwgSSdkIHNheQ0KCXRoYXQgOTUlIG9mIHRoZSB0aW1lLCBkdXBs aWNhdGUgY29kZSBpcyBiYWQuDQoJDQoJPiBJIGd1ZXNzIG15IHN1cnByaXNlIGF0IHlvdXIgKG9y IHBlcmhhcHMgU29taWsncykgZm9jdXMgb24gcmVmYWN0b3JpbmcNCgk+IGNvbWVzIGZyb20gdGhl IGZhY3QgdGhhdCB3aGlsZSB0aGUgaHRtbHBhcnNlciBpcyBhIGdyZWF0IHBpZWNlIG9mDQoJPiBz b2Z0d2FyZSwgdGhlIGphdmFkb2NzIGFuZCBvdGhlciBkb2N1bWVudGF0aW9uIGNvdWxkIHVzZSBz b21lIGF0dGVudGlvbi4NCgk+ICBGb3IgZXhhbXBsZSwgSSBjYW4ndCBmaW5kIGFueSBleHBsYW5h dGlvbiBpbiB0aGUgamF2YWRvY3Mgb3Igb3RoZXJ3aXNlDQoJPiBvZiBob3cgdGhlIGZpbHRlcnMg YXJlIHN1cHBvc2VkIHRvIHdvcmsgd2l0aCB0aGUgZGlmZmVyZW50IHNjYW5uZXJzLCBvcg0KCT4g d2hhdCB2YWx1ZXMgdGhleSBhcmUgYWxsb3dlZCB0byB0YWtlLg0KCQ0KCVRoZSBodG1sIHBhcnNl ciBpcyBpbiBzb3JlIG5lZWQgb2YgcmVmYWN0b3JpbmcgU2FtLiAgSW4gZmFjdCwgc29tZXRpbWVz IGNvZGUNCgliZWNvbWVzIHVubmVjZXNzYXJpbHkgY29tcGxleCwgaW4gd2hpY2ggY2FzZSBwZW9w bGUgKnJlYWxseSogbmVlZA0KCWRvY3VtZW50YXRpb24uICBJIGNvbmZyb250IHN1Y2ggYSBwcm9i bGVtIGJ5IGZpcnN0IGFza2luZyBpZiB0aGUgY29kZSBjb3VsZA0KCWJlIHNpbXBsaWZpZWQgc28g d2UgZGlkbid0IG5lZWQgc28gbXVjaCBkb2N1bWVudGF0aW9uLiAgIEluIHRoZSBldmVudCB0aGF0 DQoJd2UgZG8gbmVlZCBkb2NzLCBpdCBpcyBiZXN0IHRvIHdyaXRlIGV4ZWN1dGFibGUgZG9jdW1l bnRhdGlvbi4gIEFyZSB5b3UNCglmYW1pbGlhciB3aXRoIGV4ZWN1dGFibGUgZG9jdW1lbnRhdGlv bj8NCgkNCgk+IEkgZ2VuZXJhbGx5IHdvcmsgdG8gImlmIGl0J3Mgbm90IGJyb2tlbiBkb24ndCBm aXggaXQiLCBidXQgSSBvZnRlbiBhZGQNCgk+ICJiZWZvcmUgeW91IHN0YXJ0IGZpeGluZyBpdCwg bWFrZSBzdXJlIHlvdXIgZG9jdW1lbnRhdGlvbiBpcyB1cCB0byBkYXRlIi4NCgkNCglTb2Z0d2Fy ZSBiZWNvbWVzIGJyaXR0bGUgYW5kIGJsb2F0ZWQgdW5kZXIgdGhlIHBoaWxvc29waHkgb2YgImlm IGl0IGFpbid0DQoJYnJva2UsIGRvbid0IGZpeCBpdC4iICAgV2UgaGF2ZSBjbGllbnRzIHdpdGgg Y29kZSBiYXNlcyB0aGF0IGFyZSAyIG1pbGxpb25zDQoJbGluZXMgb2YgIndvcmtpbmciIHNwZWdo ZXR0aSBjb2RlLiAgVGhleSBuZWVkIGxvdHMgb2YgaGVscCB0byBsZWFybiB0byBkbw0KCWNvbnRp bnVvdXMgcmVmYWN0b3JpbmcuDQoJDQoJPiBVc2luZyB0aGUgVmlzaXRvciBwYXR0ZXJuIG1heSBt YWtlIGl0IGVhc2llciBmb3IgY2xpZW50cyB0byBnZXQgdGhlIGRhdGENCgk+IHRoZXkgbmVlZCwg YnV0IGdpdmVuIHRoYXQgdGhlIGh0bWxwYXJzZXIgaXMgIndvcmtpbmciICh3ZWxsIGl0IHdvcmtz IGZvcg0KCT4gbWUpLCBJIHdvdWxkIHNheSB0aGF0IHRoZSBtb3JlIHVyZ2VudCBpc3N1ZSBoZXJl IGlzIG1ha2luZyBzdXJlIGFsbCB0aGUNCgk+IGRvY3VtZW50YXRpb24gaXMgdXAgdG8gZGF0ZS4g ICBJIGhhdmUgYSBsb3Qgb2YgcG9zaXRpdmUgdGhpbmdzIHRvIHNheQ0KCT4gYWJvdXQgaHRtbHBh cnNlciwgc28gZG9uJ3QgdGFrZSBpdCB0aGUgd3Jvbmcgd2F5LCB3aGVuIEkgc2F5IHRoYXQgdGhl DQoJPiBiaWdnZXN0IHByb2JsZW0gSSd2ZSBoYWQgaW4gdXNpbmcgaXQgaW4gdGhlIGxhc3QgZmV3 IHdlZWtzIGlzIGluYWRlcXVhdGUNCgk+IGphdmFkb2NzLg0KCQ0KCUknbSBub3QgY29udGVudCB0 byBsaXZlIHdpdGggdGhlIHdvcmxkIGFzIGl0IGlzIFNhbS4gIElmIHNvbWV0aGluZyBpc24ndA0K CWVhc3ksIGl0J3Mgd3JvbmcuICBUaGUgYmVzdCBzb2Z0d2FyZSBpbiB0aGUgd29ybGQgaXMgZWFz eSB0byB1c2UsIGl0J3MNCglzZWxmLWV4cGxhbmF0b3J5LiAgV2Ugc2hvdWxkIGFsd2F5cyBzdHJp dmUgZm9yIHRoYXQuICBJbiB0aGUgbWVhbnRpbWUsIGlmDQoJeW91IGhhdmUgdGFza3MgdG8gY29t cGxldGUgYW5kIGRvbid0IGtub3cgaG93IHRvIGRvIHRoZW0gYmVjYXVzZSBvZiBsYWNrIG9mDQoJ ZG9jcywgSSdkIHN1Z2dlc3QgeW91IGFzayBxdWVzdGlvbnMgaGVyZS4gIFRoZSBiZXN0IHJlc3Vs dCB3aWxsIGJlDQoJZXhlY3V0YWJsZSBkb2N1bWVudGF0aW9uIGZvciB0aGUgaHRtbCBwYXJzZXIu DQoJDQoJPiA+PklzIHRoZXJlIHNvbWUgZWZmaWNpZW5jeSByZWFzb24gd2h5IHlvdSB3YW50IHRv IHJlZmFjdG9yIHRoZXNlIG1ldGhvZHMNCgk+ID4+b3IgaXMgaXQganVzdCBmb3IgbmVhdG5lc3M/ DQoJPiA+Pg0KCT4gPj4NCgk+ID4NCgk+ID5EdXBsaWNhdGlvbiByZW1vdmFsIGlzIHJlYXNvbiAj MS4NCgk+ID4NCgk+IEFzIEkgbWVudGlvbiBhYm92ZS4gIE9uZSBzaG91bGQgYmUgY2FyZWZ1bCBv ZiBkdXBsaWNhdGlvbiByZW1vdmFsIGZvcg0KCT4gdGhlIHNha2Ugb2YgaXQuDQoJDQoJQW5kIGFz IEkgbWVudGlvbmVkIGFib3ZlLCBJIGNvbXBsZXRlbHkgZGlzYWdyZWUgd2l0aCB5b3UuICBJIHdv bmRlciwgaG93DQoJbXVjaCByZWZhY3RvcmluZyBoYXZlIHlvdSBkb25lIGluIHlvdXIgY2FyZWVy IGdpdmVuIHlvdSBwaGlsb3NvcGh5IG9mICJpZiBpdA0KCWFpbid0IGJyb2tlLCBkb24ndCBmaXgg aXQ/IiAgIEhhdmUgeW91IHJlYWQgTWFydGluIEZvd2xlcidzIGxhbmRtYXJrIGJvb2ssDQoJUmVm YWN0b3Jpbmc/ICAgSWYgbm90LCBJJ2Qgc3VnZ2VzdCB5b3Ugc3R1ZHkgaXQgdGhvcm91Z2hseSAt IHlvdSdsbCBiZSBhDQoJYmV0dGVyIHByb2dyYW1tZXIgZm9yIGl0Lg0KCQ0KCT4gPiBSZW1vdmFs IG9mIGhhcmQtY29kZWQgbG9naWMgaXMgcmVhc29uICMyLg0KCT4gPg0KCT4gVGhpcyBpcyBhIGdv b2QgcmVhc29uLiAgSG93ZXZlciBJIGdldCB0aGUgZmVlbGluZyB0aGF0IGludHJvZHVjdGlvbiBv Zg0KCT4gdGhlc2UgVmlzaXRvciBjbGFzc2VzIHdpbGwgbWFrZSB0aGUgc3lzdGVtIGNvbmNlcHR1 YWxseSBtb3JlIGRpZmZpY3VsdA0KCT4gdG8gdXNlIHJhdGhlciB0aGFuIGVhc2llci4gIEkgd291 bGQgZmVlbCBiZXR0ZXIgaWYgdGhlIGN1cnJlbnQgc2V0IHVwDQoJPiB3YXMgbW9yZSBmdWxseSBk b2N1bWVudGVkIGJlZm9yZSBtb3JlIGNvbXBsZXhpdHkgd2FzIGFkZGVkLg0KCQ0KCUFzIEkgYWxz byBzYWlkIGFib3ZlLCBpZiBpdCBhaW4ndCBzaW1wbGUsIGl0J3Mgd3JvbmcuICBPdXIgY2hhbmdl cyB3aWxsIG5vdA0KCWFkZCBjb21wbGV4aXR5IC0gdGhhdCB3b3VsZCBiZSBmb29saXNoLg0KCQ0K CT4gQW5kIGV2ZW4gaWYgdGhlIFZpc2l0b3IgcGF0dGVybiBpcyB1c2VkLCBJIHdvdWxkIHJlY29t bWVuZCBsZWF2aW5nDQoJPiBtZXRob2RzIGxpa2UgdG9QbGFpblRleHRTdHJpbmcoKSBldGMgaW4g cGxhY2UsIGJ1dCBqdXN0IG1ha2luZyB0aGVtDQoJPiBzaG9ydCBjdXQgaW1wbGVtZW50YXRpb25z IHRvIGNlcnRhaW4ga2luZHMgb2YgdmlzaXRvci11c2luZyBtZXRob2RzLg0KCT4gIFRoaXMgd2ls bCBhbGxvdyBwZW9wbGUgd2hvIGhhdmUgeWV0IHRvIGdyYXNwIHRoZSBWaXNpdG9yICBwYXR0ZXJu DQoJPiBzb21ldGhpbmcgdG8gd29yayB3aXRoLg0KCQ0KCVBlb3BsZSBjYW4gYWx3YXlzIGNhbGwg ZGVwcmVjYXRlZCBtZXRob2RzLg0KCQ0KCT4gSWYgeW91IGFyZSBrZWVuIHRvIHNlZSBsb3RzIG9m IHBlb3BsZSB1c2luZyBodG1scGFyc2VyLCBJIHRoaW5rIHRoYXQgeW91DQoJPiBkb24ndCB3YW50 IHBlb3BsZSB0byBoYXZlIHRvIGNvbWUgdG8gdGVybXMgd2l0aCB0b28gbWFueSBuZXcgY29uY2Vw dHMgYXQNCgk+IG9uY2UuICBZb3Ugc2F5IHlvdXJzZWxmIHRoYXQgdGhlIFZpc2l0b3IgcGF0dGVy biB0YWtlcyBzb21lIGdldHRpbmcgdXNlZA0KCT4gdG8uICBJIHRoaW5rIHRoZSB3aG9sZSBzY2Fu bmVyIGNvbmNlcHQgdGFrZXMgc29tZSBnZXR0aW5nIHVzZWQgdG8gLi4uLg0KCQ0KCU91ciBWaXNp dG9yIGltcGxlbWVudGF0aW9uIGlzIHNvIHRyaXZpYWwgdGhhdCBmb2xrcyB3b24ndCBldmVuIGtu b3cgdGhleSdyZQ0KCXVzaW5nIHRoZSBwYXR0ZXJuLg0KCQ0KCT4gPlNpbXBsaWNpdHkgaXMgcmVh c29uICMzOiB0aGVyZSBpcyBsaXR0bGUgcmVhc29uIHRvIGZhdHRlbiB0aGUgaW50ZXJmYWNlcw0K CW9mDQoJPiA+dGFnIGFuZCBub2RlIGNsYXNzZXMgd2l0aCB2YXJpb3VzIGRhdGEgYWNjdW11bGF0 aW9uL2FsdGVyYXRpb24gbWV0aG9kcw0KCXdoZW4NCgk+ID5vbmUgbWV0aG9kIGFuZCBhIHZhcmll dHkgb2YgY29uY3JldGUgVmlzaXRvcnMgY2FuIGRvIHRoZSBqb2Igd2l0aCBtdWNoDQoJbGVzcw0K CT4gPmNvZGUuDQoJPiA+DQoJPiB3ZWxsIEkgd291bGQgYWdyZWUgaWYgeW91IGNvdWxkIGd1YXJh bnRlZSB0aGF0IHRoZXJlIHdpbGwgYmUgbm8NCgk+IGRpdmVyZ2VuY2Ugd2hhdHNvZXZlciBpbiBo b3cgdGhlIGRpZmZlcmVudCBtZXRob2RzIHdpbGwgYmUgdXNlZC4gIElmIHlvdQ0KCT4gY2FuIGNy ZWF0ZSBhIGZsZXhpYmxlIGVub3VnaCBpbXBsZW1lbnRhdGlvbiBvZiB0aGUgVmlzaXRvciBwYXR0 ZXJuIHRoZW4NCgk+IEkgZ3Vlc3MgdGhhdCB3aWxsIHN1cHBvcnQgYW55IHBvc3NpYmxlIGRpdmVy Z2VuY2UgaW4gdGhlIHNlcGFyYXRlDQoJPiBtZXRob2RzLiBIb3dldmVyLCBJIHRoaW5rIHRoZXJl IGlzIGEgcmVhc29uIHRvIGhhdmUgYSBmYXR0ZXIgaW50ZXJmYWNlLA0KCT4gaW4gdGhhdCBjb252 ZW5pZW5jZSBtZXRob2RzIGxvd2VyIHRoZSBiYXJyaWVyIHRvIGVudHJ5IGZvciBuZXcgdXNlcnMu DQoJDQoJV291bGRuJ3QgaXQgYmUgbWFydmVsb3VzIGlmIHRoZSBiYXJyaWVyIHRvIGVudHJ5IHdh cyBsb3cgYW5kIG91ciBjb2RlIHdhc24ndA0KCWJsb2F0d2FyZT8NCgkNCgk+IFBlcmhhcHMgaWRl YWxseSBvbmUgaGFzIGEgd2VsbCBpbXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gdGhhdCBzdXBw b3J0cw0KCT4gYSByYXcgbWV0aG9kIGFjY2VzcywgYW5kIGEgbnVtYmVyIG9mIGNvbnZlbmllbmNl IG1ldGhvZHM/DQoJDQoJVGhlIGNoYW5nZXMgd2lsbCBtYWtlIHRoZSBjb2RlIGVhc2llciB0byB1 c2UgLSBpZiB0aGV5IGRvbid0LCB0aGV5IHJlYWxseQ0KCWNvb2wgdGhpbmcgYWJvdXQgc29mdHdh cmUgaXMgdGhhdCBpdCBpcyBzb2Z0IC0gd2UgY2FuIGNoYW5nZSBpdC4NCgkNCgk+IEEgd2VsbCBp bXBsZW1lbnRlZCBWaXNpdG9yIHBhdHRlcm4gd2lsbCwgSSBhc3N1bWUsIHN1cHBvcnQgYWxsIHNv cnRzIG9mDQoJPiBkaWZmZXJlbnQgb3BlcmF0aW9ucywgYnV0IEkgd291bGQgZmVlbCBtdWNoIGhh cHBpZXIgaWYgdGhlIGh0bWxwYXJzZXINCgk+IGhhZCBhIGNvbXBsZXRlIGphdmFkb2MgYW5kIGRv Y3VtZW50YXRpb24gcmV2aWV3IGJlZm9yZSBhbnkgcmVmYWN0b3JpbmcNCgk+IHRvb2sgcGxhY2Uu ICBQZW9wbGUgYXJlIHRyeWluZyB0byB1c2UgdGhlIGV4aXN0aW5nIHN5c3RlbSBhbmQgaGF2aW5n DQoJPiB0cm91YmxlIG5vdCBiZWNhdXNlIG9mIHRoZSBsYWNrIG9mIHJlZmFjdG9yaW5nLCBidXQg YSBsYWNrIG9mIHdlbGwNCgk+IGRlc2NyaWJlZCBtZXRob2RzLiAgV2VsbCBJIHNheSBwZW9wbGUs IEkgbWVhbiBtZSwgSSBkb24ndCBrbm93IGlmIGFueW9uZQ0KCT4gZWxzZSBmZWVscyB0aGUgc2Ft ZS4gIE1heWJlIGl0J3MganVzdCBtZSA6LSkNCgkNCglFeGVjdXRhYmxlIGRvY3VtZW50YXRpb24g aXMgYSB0ZXJtIHdlIHVzZSBmb3IgQ3VzdG9tdGVyIFRlc3RzLiAgIEEgQ3VzdG9tZXINCglUZXN0 cyBzaG93cyBob3cgYSBib2R5IG9mIGNvZGUgZ2V0cyB1c2VkIHRvIHBlcmZvcm0gcmVhbCB0YXNr cy4gICBZb3UgaGF2ZQ0KCXJlYWwtd29ybGQgdGFza3MgdG8gcGVyZm9ybS4gIFdlIGNhbiBwZXJm b3JtIHRob3NlIHRhc2tzIHZpYSBhdXRvbWF0ZWQNCgl0ZXN0cy4gICBUaGVuLCB3aGVuIHdlIHJl ZmFjdG9yLCB3ZSBtdXN0IG1ha2Ugc3VyZSBvdXIgZXhlY3V0YWJsZQ0KCWRvY3VtZW50YXRpb24g aXMgdXAgdG8gZGF0ZSAoaS5lLiBpdCBwYXNzZXMgaXRzIHRlc3RzKS4gICBXZSBmaW5kIHRoaXMg aWRlYWwNCglmb3IgbWFraW5nIHN1cmUgdGhhdCBkb2N1bWVudGF0aW9uIHJlZmxlY3RzIHdoYXQg dGhlIGNvZGUgaXMgYWN0dWFsbHkgZG9pbmcuDQoJVGhpcyBhbHNvIGxlYXZlcyByb29tIGZvciBh IGRvY3VtZW50IHRoYXQgZ2l2ZXMgc29tZSBncmFwaGljYWwNCglyZXByZXNlbnRhdGlvbiBvZiBo b3cgdGhpbmdzIHdvcmssIHdoaWNoIG11c3QgYmUgbWFpbnRhaW5lZCBieSBzb21lb25lLCBsZXN0 DQoJaXQgZ2V0IG91dCBvZiBkYXRlLiAgV2UganVzdCBtaW5pbWl6ZSB0aGUgbmVlZCBmb3Igc3Vj aCBkb2N1bWVudHMgYnkga2VlcGluZw0KCW91ciBjb2RlIHNpbXBsZSBhbmQgc21hbGwgYW5kIHN1 cnJvdW5kaW5nIGl0IHdpdGggZWFzeSB0byB1bmRlcnN0YW5kDQoJZXhlY3V0YWJsZSBkb2N1bWVu dGF0aW9uLg0KCQ0KCWJlc3QgcmVnYXJkcw0KCWprDQoJDQoJDQoJDQoJDQoJDQoJDQoJDQoJLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KCVRo aXMgc2YubmV0IGVtYWlsIGlzIHNwb25zb3JlZCBieTpUaGlua0dlZWsNCglXZWxjb21lIHRvIGdl ZWsgaGVhdmVuLg0KCWh0dHA6Ly90aGlua2dlZWsuY29tL3NmDQoJX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX19fX18NCglIdG1scGFyc2VyLWRldmVsb3BlciBtYWls aW5nIGxpc3QNCglIdG1scGFyc2VyLWRldmVsb3BlckBsaXN0cy5zb3VyY2Vmb3JnZS5uZXQNCglo dHRwczovL2xpc3RzLnNvdXJjZWZvcmdlLm5ldC9saXN0cy9saXN0aW5mby9odG1scGFyc2VyLWRl dmVsb3Blcg0KCQ0KDQo= |
From: Joshua K. <jo...@in...> - 2002-12-27 18:52:41
|
> IMHO the visitor pattern is not a good one. Here are the reasons. The Visitor pattern is a fine pattern Derrick. Ralph Johnson once said, "Most of the time you don't need a Visitor, but when you need a Visitor, you really need a Visitor." > It breaks encapsulation. It presumes that an exterior class (the > visitor) knows how to do something to a class better than the class > itself does. A good many of the Design Patterns "break encapsulation." I guess those aren't good patterns, huh? Is there an OO police agency to report instances of breaking encapsulation? ;-) > It makes the system brittle. The visitor class has a method defined for > every subclass of visited object. Adding, removing or refactoring any > visited class breaks the visitors. Hard-coded, duplicated logic in nodes makes class hierarchies brittle. We've found that the html parser code doesn't need special visit methods for each tag type, which means it's easy to implement Visitor and it won't be hard to accomodate visiting new tags in the future. BTW, do you envision needing to add many new tag types to the html parser? > It's obscure. Looking at the visit loop tells you nothing of what is > being perfomed. The only indication of the task might be the name of > the visitor class. If you call a method that yields the plain text string for some html, and that code happens to use a Visitor, do you care? That is, a client won't even know they're using the pattern. > It's redundant. In every example I've seen (I've never seen it used in > real applications), the result could have been coded in a simpler > fashion with correct class design. I agree that it's always best to seek out the simple design. It takes a committment to refactoring to find simplicity. Presently, Somik and I are removing duplication to get to a simpler design. You have our promise that no Visitor will be added if it complicates the design. > I would only use the visitor pattern if either the set of public methods > on the visited nodes completely specify their possible behavior and > allowed the client to fully manipulate them at the highest level of > abstraction (i.e. everything you can do to the visited classes is > already coded and you don't want them to change), or the visitor pattern > had already been coded into the visited nodes and for some reason they > could not be modified further (i.e. the code was available in binary > form only). I don't think either of these applies here. Visitor is a great pattern when you need various operations to be performed over an object structure and you don't want to hard-code those operations into the objects themselves. So far, it seems to me you have just that need in html parser. I could be wrong. We'll have to see how the code feels with a Visitor implementation and make decisions based on real code. CVS allows us to go back to any previous implementation we like better. best regards, jk |
From: Joshua K. <jo...@in...> - 2002-12-27 18:33:44
|
Sam wrote: > The Visitor pattern sounds interesting, and I look forward to hearing > more about it. However, duplicated code itself is not IMO necessarily > an evil. It all depends on whether one thinks that the duplicated > components are going to diverge in functionality in the future. If you > are sure they are not, then fine, refactor away. At the moment, Somik and I speculate that 40% to 50% of the current html parser code base is pure fat -- unnecessary code that only serves to bloat the code base. Duplicate code, in subtle or not-so-subtle versions, is mostly responsible for the bloat. In my experience in industry, I'd say that 95% of the time, duplicate code is bad. > I guess my surprise at your (or perhaps Somik's) focus on refactoring > comes from the fact that while the htmlparser is a great piece of > software, the javadocs and other documentation could use some attention. > For example, I can't find any explanation in the javadocs or otherwise > of how the filters are supposed to work with the different scanners, or > what values they are allowed to take. The html parser is in sore need of refactoring Sam. In fact, sometimes code becomes unnecessarily complex, in which case people *really* need documentation. I confront such a problem by first asking if the code could be simplified so we didn't need so much documentation. In the event that we do need docs, it is best to write executable documentation. Are you familiar with executable documentation? > I generally work to "if it's not broken don't fix it", but I often add > "before you start fixing it, make sure your documentation is up to date". Software becomes brittle and bloated under the philosophy of "if it ain't broke, don't fix it." We have clients with code bases that are 2 millions lines of "working" speghetti code. They need lots of help to learn to do continuous refactoring. > Using the Visitor pattern may make it easier for clients to get the data > they need, but given that the htmlparser is "working" (well it works for > me), I would say that the more urgent issue here is making sure all the > documentation is up to date. I have a lot of positive things to say > about htmlparser, so don't take it the wrong way, when I say that the > biggest problem I've had in using it in the last few weeks is inadequate > javadocs. I'm not content to live with the world as it is Sam. If something isn't easy, it's wrong. The best software in the world is easy to use, it's self-explanatory. We should always strive for that. In the meantime, if you have tasks to complete and don't know how to do them because of lack of docs, I'd suggest you ask questions here. The best result will be executable documentation for the html parser. > >>Is there some efficiency reason why you want to refactor these methods > >>or is it just for neatness? > >> > >> > > > >Duplication removal is reason #1. > > > As I mention above. One should be careful of duplication removal for > the sake of it. And as I mentioned above, I completely disagree with you. I wonder, how much refactoring have you done in your career given you philosophy of "if it ain't broke, don't fix it?" Have you read Martin Fowler's landmark book, Refactoring? If not, I'd suggest you study it thoroughly - you'll be a better programmer for it. > > Removal of hard-coded logic is reason #2. > > > This is a good reason. However I get the feeling that introduction of > these Visitor classes will make the system conceptually more difficult > to use rather than easier. I would feel better if the current set up > was more fully documented before more complexity was added. As I also said above, if it ain't simple, it's wrong. Our changes will not add complexity - that would be foolish. > And even if the Visitor pattern is used, I would recommend leaving > methods like toPlainTextString() etc in place, but just making them > short cut implementations to certain kinds of visitor-using methods. > This will allow people who have yet to grasp the Visitor pattern > something to work with. People can always call deprecated methods. > If you are keen to see lots of people using htmlparser, I think that you > don't want people to have to come to terms with too many new concepts at > once. You say yourself that the Visitor pattern takes some getting used > to. I think the whole scanner concept takes some getting used to .... Our Visitor implementation is so trivial that folks won't even know they're using the pattern. > >Simplicity is reason #3: there is little reason to fatten the interfaces of > >tag and node classes with various data accumulation/alteration methods when > >one method and a variety of concrete Visitors can do the job with much less > >code. > > > well I would agree if you could guarantee that there will be no > divergence whatsoever in how the different methods will be used. If you > can create a flexible enough implementation of the Visitor pattern then > I guess that will support any possible divergence in the separate > methods. However, I think there is a reason to have a fatter interface, > in that convenience methods lower the barrier to entry for new users. Wouldn't it be marvelous if the barrier to entry was low and our code wasn't bloatware? > Perhaps ideally one has a well implemented Visitor pattern that supports > a raw method access, and a number of convenience methods? The changes will make the code easier to use - if they don't, they really cool thing about software is that it is soft - we can change it. > A well implemented Visitor pattern will, I assume, support all sorts of > different operations, but I would feel much happier if the htmlparser > had a complete javadoc and documentation review before any refactoring > took place. People are trying to use the existing system and having > trouble not because of the lack of refactoring, but a lack of well > described methods. Well I say people, I mean me, I don't know if anyone > else feels the same. Maybe it's just me :-) Executable documentation is a term we use for Customter Tests. A Customer Tests shows how a body of code gets used to perform real tasks. You have real-world tasks to perform. We can perform those tasks via automated tests. Then, when we refactor, we must make sure our executable documentation is up to date (i.e. it passes its tests). We find this ideal for making sure that documentation reflects what the code is actually doing. This also leaves room for a document that gives some graphical representation of how things work, which must be maintained by someone, lest it get out of date. We just minimize the need for such documents by keeping our code simple and small and surrounding it with easy to understand executable documentation. best regards jk |
From: Somik R. <so...@ya...> - 2002-12-27 07:53:19
|
Hi Dhaval, Sam, > I agree with Sam when he says that "don't fix it when its not broken". I quite disagree with both of u on this philosophy. Looking at the code with Joshua has made me realize its actually ghastly. A serious code cleanup is needed, there is just too much duplication! Most of the scanner code is unreadable, while the tag constructors are a nightmare. I dont think I will be at peace till a couple of rounds of refactoring has been completed. I do not think the panic on the visitor is warranted. Like I said before, the current access methods will continue to be present. However, having a visitor will make life simpler, as in - there's so much code now that uses the same loop over and over again. We can replace code like : Vector links = new Vector(); for (Enumeration e = parser.elements();e.hasMoreElements();) { HTMLNode node = (HTMLNode)e.nextElement(); if (node instanceof HTMLLinkTag) { links.add(node); } } with : HTMLLinkVisitor linkVisitor = new HTMLLinkVisitor(); collectNodesWith(linkVisitor); Vector links = linkVisitor.getResult(); This looks so much more readable and simple. Of course, you could still use the old way - its just that you would have the option of making life easier. In the latest round of refactorings, we've put in a HTMLCompositeTag, from which the link, form and select tags inherit. The toPlainTextString() and toHTML() methods are now in the parent class. All tests passing (except the charset tests). Regards, Somik |