htmlparser-developer Mailing List for HTML Parser (Page 21)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-12-27 07:17:56
Hi Derrick,

I was working on the latest parser, and just found that the charset tests are failing (I didn't modify anything in HTMLParser). Could you check what might have gone wrong?

Also, I was wondering if it might not be better to have the charset pages on our domain (http://htmlparser.sourceforge.net) - we could put up encoded pages and they'd always be there.

Regards,
Somik
From: <dha...@or...> - 2002-12-26 09:32:50
Hi,

I agree with Sam when he says "don't fix it when it's not broken". But at the same time I see the need to make code better, more readable and simpler. However, as suggested below, it seems that the Visitor pattern is going to make things difficult to understand.

Like Sam, I too had a problem with HTMLParser initially. It may be the documentation, but this whole tag-scanner thing was extremely confusing in the beginning (though subsequently I have come to love scanners so much that I have even written a few myself without any problem).

Hence I think that even if the Visitor pattern is used, the user must have some simple, easy-to-use methods to get his work done rather than having to understand the Visitor pattern.

Just like Sam, this was my two cents.

Cheers,
Dhaval

-----Original Message-----
From: gaijin [mailto:ga...@yh...]
Sent: Tuesday, December 24, 2002 12:21 PM
To: htmlparser-developer
Cc: gaijin; htmlparser-user
Subject: Re: [Htmlparser-developer] toPlainTextString() feedback requested

-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
From: Somik R. <so...@ya...> - 2002-12-24 21:49:42
Hi Kaarle,

> You seem to be very active today!
>
> So am I. Merry Christmas to you all!

Merry Christmas to everyone and to you!

Cheers,
Somik
From: Kaarle K. <kaa...@ik...> - 2002-12-24 21:02:14
At 15:01 24.12.2002 -0500, you wrote:
>Somik Raha wrote:

You seem to be very active today!

So am I. Merry Christmas to you all!

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844
From: Derrick O. <Der...@ro...> - 2002-12-24 20:00:21
Somik Raha wrote:

>Hi Derrick,
>    You've raised some interesting points..
>
>    Looking at it from an end-user perspective, prior to using the visitor,
>the code for collecting text from a page would be :
>
>StringBuffer textCollectionBuffer = new StringBuffer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
>}
>String result = textCollectionBuffer.toString();
>
>After putting in the visitor, the code looks like :
>
>Renderer plainTextRenderer = new PlainTextRenderer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    e.nextHTMLNode().renderWith(plainTextRenderer);
>}
>String result = plainTextRenderer.getResult();
>
>IMO, from a user perspective, there's almost no extra complexity. You'd
>anyway collect results into a string buffer, while the renderer now does
>that for you.
>
>If it was just for the plain text renderer, we would not be justified in
>this refactoring. However, the code for getting html out would look like :
>Renderer htmlRenderer = new HTMLRenderer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    e.nextHTMLNode().renderWith(htmlRenderer);
>}
>String result = htmlRenderer.getResult();
>
>And now, when ripping webpages, we would like to modify the links and images
>to a base location in the local disk. We could have modifying html renderer,
>like this:
>
>Renderer modifyingLinksAndImagesRenderer = new
>ModifyingLinksAndImagesRenderer();
>...
>
>Any modification in the mechanism of rendering is now easily handled, and we
>don't have to modify the parser per se, for every new user requirement.
>

The first two examples gain you nothing from using a visitor pattern. The plainTextRenderer and HTMLRenderer call the toPlainTextString() and toHTML() methods because only the tag knows how to render itself into text or HTML (see your comment below).

Your last example indicates the visitor class knows which parameters have links and which parameters have images for each tag, and I would suggest this knowledge should be in the tag class itself. If the base class (HTMLTag I presume) had an abstract method signature 'toHTML(URL baseURL)', then code within the class would adjust relative link and image references based on its value, and the current behaviour of 'toHTML()' would just call toHTML(null). The code is not duplicated, only refactored to handle a base URL.

>>It breaks encapsulation. It presumes that an exterior class (the
>>visitor) knows how to do something to a class better than the class
>>itself does.
>
>In light of the above, this is not a bad thing. At the same time, putting in
>the visitor has caused/will cause refactoring of the tag classes so that
>most of the logic is still encapsulated in the tags, and the visitor only
>makes calls to the tag's concrete methods.
>e.g. The visitor still does NOT know how to render a tag's html. Such logic
>continues to be encapsulated within the tag. The visitor only deals with a
>high-level abstraction.
>

So why bother with the extra overhead of the visitor class?

Logically, either the visitor has code that differs from the tag code or it doesn't. If it doesn't, then cut the Gordian knot, use the simplest mechanism and forget the visitor.

If it does differ, the code within the visitor either uses the methods of the visitee only, or it has some new methods with logic and data that are external to the visitee. In the former case, the code that only uses methods on the visitee could be a method on the visitee's superclass.

If it has new methods, either the visitor's extra logic and data are specific to each visitee or they are of a generic nature. If they are specific to the visitee, then they should be within the visitee class and not in a catch-all visitor class; otherwise, if they are of a general nature, they might as well be moved to the superclass (or a utility class if they don't belong there).

There is really no need for a visitor pattern.

>>It makes the system brittle. The visitor class has a method defined for
>>every subclass of visited object. Adding, removing or refactoring any
>>visited class breaks the visitors.
>
>A valid point indeed. To counter, in our implementation, there is a lot of
>duplication, and quite a few classes really don't need their own toHTML()
>method. The parent class - HTMLTag, should take care of single tags. Those
>tags with children always have duplicated code to recurse through their
>children. We can refactor such functionality out, so that, tags with
>children conform to a recursion interface. Using such an interface, you won't
>have the recursion loops in each and every node class, but you could handle
>it uniformly outside. So, the visitor does not need to take care of each and
>every subclass. It will take care of only some special ones, while
>visiting the rest as a simple HTMLTag.
>

The duplication should be refactored to superclasses. You may need a subclass of HTMLTag that has a 'container' interface, but why not put that interface on HTMLTag itself and have a default implementation that is an empty container? Then FORM and FRAMESET and other container tags derive from an abstract class (HTMLContainerTag ?) that derives from HTMLTag but overrides this behaviour and provides real recursion support.

>>It's obscure. Looking at the visit loop tells you nothing of what is
>>being performed. The only indication of the task might be the name of
>>the visitor class.
>
>That is again not a bad thing - for well-intentioned naming reduces the need
>for excessive documentation. You should be able to look at a class name and
>have a fairly good idea of what it does. A look at the API of the class
>would confirm your guess.
>
>At the same time, looking at the visit loop above, you'd see that it looks
>intuitive enough - the loop is not in the visitor, it is still in the
>client code.
>

But the StringBuffer is now hidden and it's not obvious that a getResult() method call returns its contents, so you have an indirection to look up the visitor class.

>>It's redundant. In every example I've seen (I've never seen it used in
>>real applications), the result could have been coded in a simpler
>>fashion with correct class design.
>
>This is the point we're really thinking about. I've personally been thinking
>on this for some months - for we did have realistic requirements of being
>able to modify links and images in a ripper, in a uniform manner. In fact,
>the current version of the parser already has a visitor. The Sample Programs
>make no mention of the word "visitor" to avoid controversy and confusion
>:), but if you look at
>http://htmlparser.sourceforge.net/samples/ripper.html - this is what's
>happening. I must add that although this visitor does provide a solution, I
>wasn't very happy with it. We've been able to refactor this out, and provide
>a uniform mechanism now.
>

That's not really a visitor example, that's a callback, which is obviated if the toHTML(URL) signature is used.

>>I would only use the visitor pattern if either the set of public methods
>>on the visited nodes completely specify their possible behavior and
>>allowed the client to fully manipulate them at the highest level of
>>abstraction (i.e. everything you can do to the visited classes is
>>already coded and you don't want them to change)
>
>I think we're talking the same language here - this is also our vision. The
>code you see on the samples page is an example of that (although not a very
>good one). We're working on those lines, and the first step was to introduce
>such levels of abstraction.
>
>Regards,
>Somik
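The alternative Derrick describes above, a `toHTML(URL baseURL)` method on the tag with the no-argument `toHTML()` delegating to `toHTML(null)`, might look roughly like the Java sketch below. It is only an illustration under stated assumptions: `SketchTag`, `SketchImageTag` and the `resolve` helper are hypothetical names, not HTMLParser classes, and `java.net.URI` is swapped in for `URL` to keep the example free of checked exceptions.

```java
import java.net.URI;

// Hypothetical stand-in for HTMLTag; not the real HTMLParser class.
abstract class SketchTag {
    // Each tag renders itself; a null base leaves references untouched.
    abstract String toHTML(URI baseURL);

    // The existing no-argument behaviour simply delegates.
    String toHTML() {
        return toHTML(null);
    }

    // Shared helper so subclasses do not duplicate the resolution logic.
    protected static String resolve(URI baseURL, String reference) {
        return baseURL == null ? reference : baseURL.resolve(reference).toString();
    }
}

// Hypothetical image tag: it alone knows that "src" holds a reference.
class SketchImageTag extends SketchTag {
    private final String src;

    SketchImageTag(String src) {
        this.src = src;
    }

    @Override
    String toHTML(URI baseURL) {
        return "<img src=\"" + resolve(baseURL, src) + "\">";
    }
}

class BaseUrlDemo {
    public static void main(String[] args) {
        SketchTag img = new SketchImageTag("images/register.gif");
        System.out.println(img.toHTML());
        System.out.println(img.toHTML(URI.create("http://example.com/site/")));
    }
}
```

In this arrangement the knowledge of which attributes carry references stays inside each tag class, which is the encapsulation argument made in the message; the cost is that every new kind of rewriting needs a new method or parameter on the tag hierarchy.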
From: Somik R. <so...@ya...> - 2002-12-24 18:48:57
Hi Derrick,

You've raised some interesting points..

Looking at it from an end-user perspective, prior to using the visitor, the code for collecting text from a page would be :

StringBuffer textCollectionBuffer = new StringBuffer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
}
String result = textCollectionBuffer.toString();

After putting in the visitor, the code looks like :

Renderer plainTextRenderer = new PlainTextRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(plainTextRenderer);
}
String result = plainTextRenderer.getResult();

IMO, from a user perspective, there's almost no extra complexity. You'd anyway collect results into a string buffer, while the renderer now does that for you.

If it was just for the plain text renderer, we would not be justified in this refactoring. However, the code for getting html out would look like :

Renderer htmlRenderer = new HTMLRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(htmlRenderer);
}
String result = htmlRenderer.getResult();

And now, when ripping webpages, we would like to modify the links and images to a base location in the local disk. We could have a modifying html renderer, like this:

Renderer modifyingLinksAndImagesRenderer = new ModifyingLinksAndImagesRenderer();
...

Any modification in the mechanism of rendering is now easily handled, and we don't have to modify the parser per se, for every new user requirement.

> It breaks encapsulation. It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

In light of the above, this is not a bad thing. At the same time, putting in the visitor has caused/will cause refactoring of the tag classes so that most of the logic is still encapsulated in the tags, and the visitor only makes calls to the tag's concrete methods. e.g. The visitor still does NOT know how to render a tag's html. Such logic continues to be encapsulated within the tag. The visitor only deals with a high-level abstraction.

> It makes the system brittle. The visitor class has a method defined for
> every subclass of visited object. Adding, removing or refactoring any
> visited class breaks the visitors.

A valid point indeed. To counter, in our implementation, there is a lot of duplication, and quite a few classes really don't need their own toHTML() method. The parent class - HTMLTag, should take care of single tags. Those tags with children always have duplicated code to recurse through their children. We can refactor such functionality out, so that tags with children conform to a recursion interface. Using such an interface, you won't have the recursion loops in each and every node class, but you could handle it uniformly outside. So, the visitor does not need to take care of each and every subclass. It will take care of only some special ones, while visiting the rest as a simple HTMLTag.

> It's obscure. Looking at the visit loop tells you nothing of what is
> being performed. The only indication of the task might be the name of
> the visitor class.

That is again not a bad thing - for well-intentioned naming reduces the need for excessive documentation. You should be able to look at a class name and have a fairly good idea of what it does. A look at the API of the class would confirm your guess.

At the same time, looking at the visit loop above, you'd see that it looks intuitive enough - the loop is not in the visitor, it is still in the client code.

> It's redundant. In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

This is the point we're really thinking about. I've personally been thinking on this for some months - for we did have realistic requirements of being able to modify links and images in a ripper, in a uniform manner. In fact, the current version of the parser already has a visitor. The Sample Programs make no mention of the word "visitor" to avoid controversy and confusion :), but if you look at http://htmlparser.sourceforge.net/samples/ripper.html - this is what's happening. I must add that although this visitor does provide a solution, I wasn't very happy with it. We've been able to refactor this out, and provide a uniform mechanism now.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change)

I think we're talking the same language here - this is also our vision. The code you see on the samples page is an example of that (although not a very good one). We're working on those lines, and the first step was to introduce such levels of abstraction.

Regards,
Somik
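For readers who want to see the shape of the renderer idea in the message above, here is a minimal, self-contained Java sketch. Only the names `PlainTextRenderer`, `HTMLRenderer`, `renderWith` and `getResult` come from the message; the interfaces `SketchRenderer` and `SketchNode`, the two node classes, and every method body are assumptions made for illustration, not the HTMLParser implementation.

```java
// Hypothetical renderer interface; the method split (text vs. tag) is a guess.
interface SketchRenderer {
    void text(String text);   // a text node reports its content
    void tag(String html);    // a tag reports HTML that the tag itself built
    String getResult();
}

// Hypothetical node interface with the renderWith() entry point.
interface SketchNode {
    void renderWith(SketchRenderer renderer);
}

class SketchTextNode implements SketchNode {
    private final String text;
    SketchTextNode(String text) { this.text = text; }
    public void renderWith(SketchRenderer renderer) { renderer.text(text); }
}

class SketchTagNode implements SketchNode {
    private final String html;
    SketchTagNode(String html) { this.html = html; }
    public void renderWith(SketchRenderer renderer) { renderer.tag(html); }
}

// Accumulates text only; markup is dropped.
class PlainTextRenderer implements SketchRenderer {
    private final StringBuilder out = new StringBuilder();
    public void text(String text) { out.append(text); }
    public void tag(String html) { /* ignored for plain text */ }
    public String getResult() { return out.toString(); }
}

// Accumulates both text and markup.
class HTMLRenderer implements SketchRenderer {
    private final StringBuilder out = new StringBuilder();
    public void text(String text) { out.append(text); }
    public void tag(String html) { out.append(html); }
    public String getResult() { return out.toString(); }
}

class RendererDemo {
    public static void main(String[] args) {
        SketchNode[] nodes = {
            new SketchTagNode("<b>"), new SketchTextNode("Hello"), new SketchTagNode("</b>")
        };
        SketchRenderer plain = new PlainTextRenderer();
        SketchRenderer html = new HTMLRenderer();
        for (SketchNode node : nodes) {   // the loop stays in client code
            node.renderWith(plain);
            node.renderWith(html);
        }
        System.out.println(plain.getResult());  // Hello
        System.out.println(html.getResult());   // <b>Hello</b>
    }
}
```

A link-rewriting renderer would, under the same assumptions, be one more `SketchRenderer` implementation, which is the extensibility argument being made; whether that knowledge belongs in the renderer or in the tag is exactly what the thread disputes.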
From: Derrick O. <Der...@ro...> - 2002-12-24 17:47:30
IMHO the visitor pattern is not a good one. Here are the reasons.

It breaks encapsulation. It presumes that an exterior class (the visitor) knows how to do something to a class better than the class itself does.

It makes the system brittle. The visitor class has a method defined for every subclass of visited object. Adding, removing or refactoring any visited class breaks the visitors.

It's obscure. Looking at the visit loop tells you nothing of what is being performed. The only indication of the task might be the name of the visitor class.

It's redundant. In every example I've seen (I've never seen it used in real applications), the result could have been coded in a simpler fashion with correct class design.

I would only use the visitor pattern if either the set of public methods on the visited nodes completely specify their possible behavior and allowed the client to fully manipulate them at the highest level of abstraction (i.e. everything you can do to the visited classes is already coded and you don't want them to change), or the visitor pattern had already been coded into the visited nodes and for some reason they could not be modified further (i.e. the code was available in binary form only). I don't think either of these applies here.

my $0.02
Derrick
From: Sam J. <ga...@yh...> - 2002-12-24 17:05:00
Hi Somik

Somik Raha wrote:

>Hi Sam,
>
>>Can you give me an example of the hard coded rules you are using now,
>>and a couple of examples of dirty html pages that cause them to be
>>sub-optimal.
>
>Here are some tags :
>[1] From neurogrid.com (debugging last year :)
><a href="mailto:sa...@ne...?subject=Site Comments">Mail Us<a>
>
>[2] From freshmeat.net
><a>revision</a>
>
>[3] From fedpage.com
><a href="registration.asp?EventID=1272"><img border="0"
>src="\images\register.gif"</a>
>
>[4] From yahoo.com
><a href=s/8741><img
>src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16
>width=16 border=0></img></td><td nowrap>
><a href=s/7509><b>Yahoo! Movies</b></a>
>
>As you can see, dirty html hardly looks predictable. Especially when links
>are not closed correctly, the scanner has to guess when it should close the
>tag.
>
>And this is only for the link tag. For normal tags,
>[1] <sometag key1=value key2="value2 key3 = value3>
>[2] <sometag key1="<sometag>" key2="<!-- skdlskld -->">
>
>The above two tags demonstrate a classic dilemma. If we ignore inverted
>commas, we cannot handle case 2, where the contents within inverted commas
>is valid text and not tags. All these examples are accepted by IE.
>
>All of these problems are currently handled by the parser, and I was looking
>to simplify the brain of the parser.
>

Could you explain how all of these examples are handled by the current parser? Are you using some kind of specific rule to handle each case? Perhaps you can cut and paste a bit of code to the list to illustrate.

The more precisely you can describe the operation of the existing parser when handling these kinds of cases, the more likely I can come up with a learner that will meet your needs.

CHEERS> SAM
From: Somik R. <so...@ya...> - 2002-12-24 07:28:40
Hi Sam,

> Can you give me an example of the hard coded rules you are using now,
> and a couple of examples of dirty html pages that cause them to be
> sub-optimal.

Here are some tags :

[1] From neurogrid.com (debugging last year :)
<a href="mailto:sa...@ne...?subject=Site Comments">Mail Us<a>

[2] From freshmeat.net
<a>revision</a>

[3] From fedpage.com
<a href="registration.asp?EventID=1272"><img border="0" src="\images\register.gif"</a>

[4] From yahoo.com
<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap>
<a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, dirty html hardly looks predictable. Especially when links are not closed correctly, the scanner has to guess when it should close the tag.

And this is only for the link tag. For normal tags,

[1] <sometag key1=value key2="value2 key3 = value3>
[2] <sometag key1="<sometag>" key2="<!-- skdlskld -->">

The above two tags demonstrate a classic dilemma. If we ignore inverted commas, we cannot handle case 2, where the contents within inverted commas is valid text and not tags. If we honour them, case 1 never sees its closing quote. All these examples are accepted by IE.

All of these problems are currently handled by the parser, and I was looking to simplify the brain of the parser.

> Using learning in a system to increase efficiency is usually very
> difficult to do well. Learning systems basically have more flexibility
> than other systems, but as a consequence you have more free parameters.
> It is easy to add a learning framework but then spend all your time
> just trying to adjust the system parameters, and then to discover that
> exploring the space of possible parameters for your learner is just too
> expensive.
>
> Nonetheless I am always fascinated by the problem of adding learning to a
> system, precisely because it is so difficult to do well. If you can
> give me some concrete examples, I will do my best to help you select an
> appropriate learning mechanism.

Thanks!

> Interesting. The ant scripts for neurogrid were originally made by Rick
> Knowles, and I'm still only just getting a really good feel for ant. I
> remember trying to set up something in my ng scripts that would add the
> date to the jar file name, like you sometimes do, and failing. My own
> fault really; I rarely read the manual and always try to learn by
> modifying the operation of an existing system (kind of an evolutionary
> approach ....)

That's what I did too - but I started with the examples in the ant website (I think they believe in the approach too)

Cheers,
Somik
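The quoting dilemma in the two `<sometag ...>` examples above can be made concrete with a small Java sketch. This is one possible heuristic, assumed purely for illustration and not a description of how HTMLParser resolves it: first look for the closing `>` while treating quoted values as opaque, and if a quote is never closed, fall back to the first `>` ignoring quotes. The class and method names (`TagEndFinder`, `findTagEnd`, `scan`) are hypothetical.

```java
// Hypothetical helper; not part of HTMLParser.
class TagEndFinder {

    // Returns the index of the '>' assumed to close the tag that begins at
    // 'start', or -1 if the input contains no '>' at all from that point.
    static int findTagEnd(String html, int start) {
        int quoteAware = scan(html, start, true);
        if (quoteAware >= 0) {
            return quoteAware;
        }
        // A quote was opened but never closed: retry, ignoring quotes.
        return scan(html, start, false);
    }

    private static int scan(String html, int start, boolean honourQuotes) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;          // closing quote found
                }
            } else if (honourQuotes && (c == '"' || c == '\'')) {
                quote = c;              // entering a quoted value
            } else if (c == '>') {
                return i;               // unquoted '>' ends the tag
            }
        }
        return -1;
    }
}

class TagEndDemo {
    public static void main(String[] args) {
        String unclosedQuote = "<sometag key1=value key2=\"value2 key3 = value3>";
        String quotedMarkup = "<sometag key1=\"<sometag>\" key2=\"<!-- skdlskld -->\">";
        System.out.println(TagEndFinder.findTagEnd(unclosedQuote, 0));
        System.out.println(TagEndFinder.findTagEnd(quotedMarkup, 0));
    }
}
```

On these two isolated strings the sketch picks the final `>` in both cases. It is not robust on a full page: an unclosed quote can pair up with a quote that appears much later in the document, which is part of why the scanner "has to guess", as the message puts it.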
From: Somik R. <so...@ya...> - 2002-12-24 07:01:48
Hi Sam,

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
> For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The last few weeks, I've been only adding docs... The Sample Programs are new - you will find an example of using filters here :
http://htmlparser.sourceforge.net/samples/linksEmbedded.html

From the javadoc :
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/HTMLNode.html (check collectInto)

I guess more javadoc should be written about this, but I didn't bother to write it because the filters were mostly used for demonstration from the command line. When you type java -jar htmlparser.jar, you would see a help menu of each filter. It's only with the introduction of Collection Parameter that we've been using filter strings for actual collection of data, and that has been documented. If it's not important, it's not documented :).

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".
> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date. I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

Sam - feel free to post as often as you like on the list. I'd be glad to help you out. The lack of adequate documentation is of course my responsibility and, in light of my explanation above, if you can suggest other areas of inadequate docs, I'll look into it.

> >Duplication removal is reason #1.
> >
> As I mention above. One should be careful of duplication removal for
> the sake of it.
>
> > Removal of hard-coded logic is reason #2.
> >
> This is a good reason. However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier. I would feel better if the current set up
> was more fully documented before more complexity was added.
>
> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
> This will allow people who have yet to grasp the Visitor pattern
> something to work with.

Don't worry, we'll leave toPlainTextString() alone - in fact, we've already begun doing the short cut implementations. :) The purpose of my initial mail was not to alarm you, but to know more about realistic "customer" stories.

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once. You say yourself that the Visitor pattern takes some getting used
> to. I think the whole scanner concept takes some getting used to ....

By the way, have you gone through the Sample Programs? I should've thought that this new addition would make life very simple. If not, we'd probably need more docs..

Regards,
Somik
From: Sam J. <ga...@yh...> - 2002-12-24 06:35:40
Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods? Remember the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>>
>There's a good deal of duplicate code in the way the two toHTML methods and the
>toPlainTextString method do their work. The central theme is information
>accumulation/alteration. That involves outputting tag and node results and
>recursing through tags. The refactoring to Visitor allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it easier
>for clients to get the data they need
>
>Visitor takes some getting used to. I rarely use the pattern. In this case,
>IMO, it was a good fit.
>

The Visitor pattern sounds interesting, and I look forward to hearing more about it. However, duplicated code itself is not IMO necessarily an evil. It all depends on whether one thinks that the duplicated components are going to diverge in functionality in the future. If you are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring comes from the fact that while the htmlparser is a great piece of software, the javadocs and other documentation could use some attention. For example, I can't find any explanation in the javadocs or otherwise of how the filters are supposed to work with the different scanners, or what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add "before you start fixing it, make sure your documentation is up to date". Using the Visitor pattern may make it easier for clients to get the data they need, but given that the htmlparser is "working" (well it works for me), I would say that the more urgent issue here is making sure all the documentation is up to date. I have a lot of positive things to say about htmlparser, so don't take it the wrong way when I say that the biggest problem I've had in using it in the last few weeks is inadequate javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>>
>
>Duplication removal is reason #1.
>

As I mention above. One should be careful of duplication removal for the sake of it.

> Removal of hard-coded logic is reason #2.
>

This is a good reason. However I get the feeling that introduction of these Visitor classes will make the system conceptually more difficult to use rather than easier. I would feel better if the current set up was more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving methods like toPlainTextString() etc in place, but just making them short cut implementations to certain kinds of visitor-using methods. This will allow people who have yet to grasp the Visitor pattern something to work with.

If you are keen to see lots of people using htmlparser, I think that you don't want people to have to come to terms with too many new concepts at once. You say yourself that the Visitor pattern takes some getting used to. I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the interfaces of
>tag and node classes with various data accumulation/alteration methods when
>one method and a variety of concrete Visitors can do the job with much less
>code.
>

Well, I would agree if you could guarantee that there will be no divergence whatsoever in how the different methods will be used. If you can create a flexible enough implementation of the Visitor pattern then I guess that will support any possible divergence in the separate methods.

However, I think there is a reason to have a fatter interface, in that convenience methods lower the barrier to entry for new users. Perhaps ideally one has a well implemented Visitor pattern that supports raw method access, and a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of different operations, but I would feel much happier if the htmlparser had a complete javadoc and documentation review before any refactoring took place. People are trying to use the existing system and having trouble not because of the lack of refactoring, but a lack of well described methods. Well I say people, I mean me, I don't know if anyone else feels the same. Maybe it's just me :-)

Just my two cents.

CHEERS> SAM
From: Joshua K. <jo...@in...> - 2002-12-24 05:40:22
|
> Could you explain why you want to refactor these methods? Remember the > danger of premature refactoring ... you lose flexibility that then has > to be re-added later on, making more work in the long run. Hi Sam, There's a good deal of duplicate code in the way the two toHTML methods and the toPlainTextString method do their work. The central theme is information accumulation/alteration. That involves outputting tag and node results and recursing through tags. The refactoring to Visitor allows us to * remove many lines of duplicate code, spread across many classes * remove hard-coded accumulation/alteration logic, thereby making it easier for clients to get the data they need Visitor takes some getting used to. I rarely use the pattern. In this case, IMO, it was a good fit. > Is there some efficiency reason why you want to refactor these methods > or is it just for neatness? Duplication removal is reason #1. Removal of hard-coded logic is reason #2. Simplicity is reason #3: there is little reason to fatten the interfaces of tag and node classes with various data accumulation/alteration methods when one method and a variety of concrete Visitors can do the job with much less code. best regards jk |
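For readers new to the pattern, here is a minimal Java sketch of the kind of refactoring described in this message. Every name in it (NodeVisitor, TextNode, TagNode, PlainTextVisitor, HtmlVisitor) is a hypothetical stand-in and not the HTMLParser API; the thread only tells us that the real method was to be acceptVisitor(HTMLVisitor). The point it tries to show is that the recursion through tags is written once in accept(), while each kind of output lives in its own small visitor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is a sketch, not the HTMLParser API.
interface NodeVisitor {
    void visitText(TextNode text);
    void visitTag(TagNode tag);   // called before the tag's children
    void leaveTag(TagNode tag);   // called after the tag's children
}

abstract class Node {
    abstract void accept(NodeVisitor visitor);
}

class TextNode extends Node {
    final String text;
    TextNode(String text) { this.text = text; }
    @Override void accept(NodeVisitor visitor) { visitor.visitText(this); }
}

class TagNode extends Node {
    final String name;
    final List<Node> children = new ArrayList<>();
    TagNode(String name, Node... kids) {
        this.name = name;
        for (Node kid : kids) children.add(kid);
    }
    // The recursion through child nodes is written once, here.
    @Override void accept(NodeVisitor visitor) {
        visitor.visitTag(this);
        for (Node child : children) child.accept(visitor);
        visitor.leaveTag(this);
    }
}

// Accumulates only the text, ignoring tags.
class PlainTextVisitor implements NodeVisitor {
    private final StringBuilder out = new StringBuilder();
    public void visitText(TextNode t) { out.append(t.text); }
    public void visitTag(TagNode t) { }
    public void leaveTag(TagNode t) { }
    String result() { return out.toString(); }
}

// Accumulates text plus simplified open/close tags (no attributes).
class HtmlVisitor implements NodeVisitor {
    private final StringBuilder out = new StringBuilder();
    public void visitText(TextNode t) { out.append(t.text); }
    public void visitTag(TagNode t) { out.append('<').append(t.name).append('>'); }
    public void leaveTag(TagNode t) { out.append("</").append(t.name).append('>'); }
    String result() { return out.toString(); }
}

class VisitorSketch {
    static Node sample() {
        return new TagNode("p", new TextNode("Hello "),
                new TagNode("b", new TextNode("world")));
    }
    static String plainText(Node n) {
        PlainTextVisitor v = new PlainTextVisitor();
        n.accept(v);
        return v.result();
    }
    static String html(Node n) {
        HtmlVisitor v = new HtmlVisitor();
        n.accept(v);
        return v.result();
    }
    public static void main(String[] args) {
        System.out.println(plainText(sample())); // Hello world
        System.out.println(html(sample()));      // <p>Hello <b>world</b></p>
    }
}
```

A client who needs a different kind of output would write another NodeVisitor rather than ask for a new method on every node class, which is the "fat interface" trade-off discussed in the replies.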
From: Joshua K. <jo...@in...> - 2002-12-24 04:29:33
|
Thanks for the intro Somik. I'd hardly consider myself a legend of OO - merely a practitioner. I'm excited to help out here. Thanks for having me. --jk |
From: Somik R. <so...@ya...> - 2002-12-24 03:37:22
|
Hi Sam, > Could you explain why you want to refactor these > methods? Remember the > danger of premature refactoring ... you lose > flexibility that then has > to be re-added later on, making more work in the > long run. There seems to be so much duplication in the parser that it seems like Merciless_Refactoring alone can help clean the code. > Is there some efficiency reason why you want to > refactor these methods > or is it just for neatness? Readability and efficiency are important factors - far more important is to have a simple way of getting data from the parser. The idea is to pass your own visitors into the parser, which could extract either text or html. If you need to modify them, then you can modify these visitors to customize behaviour. The weight of the parser would reduce, code duplication would not be there, and it would be easier to customize. > Either way, I would certainly recommend a > deprecation step. Will keep that in mind. Cheers, Somik __________________________________________________ Do you Yahoo!? Yahoo! Mail Plus - Powerful. Affordable. Sign up now. http://mailplus.yahoo.com |
From: Sam J. <ga...@yh...> - 2002-12-24 02:39:57
|
Hi Somik, I tend to use toPlainTextString() in all three ways described below. I use the first to generate a plain text overview, the second for debugging statements, and the final one for when I want to add my own additional formatting to the final text. Could you explain why you want to refactor these methods? Remember the danger of premature refactoring ... you lose flexibility that then has to be re-added later on, making more work in the long run. Is there some efficiency reason why you want to refactor these methods or is it just for neatness? Either way, I would certainly recommend a deprecation step. CHEERS> SAM Somik Raha wrote: >Hi Folks, > We are performing a series of refactorings, which >will involve merging of the functionality of >toPlainTextString() and toHTML(HTMLRenderer), and >perhaps, toHTML() itself.. > > We'd like to know user stories of how you've been >using toPlainTextString(). > >Do you enumerate through the nodes and : >[1] collect all the text data using >toPlainTextString() into a buffer? > >[2] make calls to toPlainTextString() but don't >collect the data ? > >[3] collect all text data using toPlainTextString() >into a buffer, but only after having processed the >string in some way ? > >Another question - if we remove toPlainTextString() >[and provide you an alternative], will it hurt, or >would you prefer a deprecation step (in 1.3) before >removal (in 1.4) ? > >It would be good for us to know more details about >your applications, so that we can perform our design >modifications correctly. Feel free to tell us your >stories in detail. (Remember to click Reply All so the >replies go to both lists). > >Regards, >Somik |
From: Somik R. <so...@ya...> - 2002-12-24 02:26:39
|
Hi Folks, It gives me great pleasure to introduce Joshua Kerievsky, founder of Industrial Logic, Inc. Josh is a legendary OO programmer, and a great XP coach. Josh is actually writing a book on refactoring to patterns (http://www.industriallogic.com/xp/refactoring/), and he is considering using the parser as an example of a framework where refactoring to a visitor could be applied. We in fact spent much of today improving the design and merging the functionality of toPlainTextString() and toHTML(HTMLRenderer) into a single acceptVisitor(HTMLVisitor) method. We are also looking at merging functionality of toHTML() itself.. Pairing with Josh has given me insight into so many possible refactorings on the current codebase. Josh -> Welcome to the developer team of HTMLParser. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-24 02:20:37
|
Hi Folks, We are performing a series of refactorings, which will involve merging of the functionality of toPlainTextString() and toHTML(HTMLRenderer), and perhaps, toHTML() itself.. We'd like to know user stories of how you've been using toPlainTextString(). Do you enumerate through the nodes and : [1] collect all the text data using toPlainTextString() into a buffer? [2] make calls to toPlainTextString() but don't collect the data ? [3] collect all text data using toPlainTextString() into a buffer, but only after having processed the string in some way ? Another question - if we remove toPlainTextString() [and provide you an alternative], will it hurt, or would you prefer a deprecation step (in 1.3) before removal (in 1.4) ? It would be good for us to know more details about your applications, so that we can perform our design modifications correctly. Feel free to tell us your stories in detail. (Remember to click Reply All so the replies go to both lists). Regards, Somik |
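To make the deprecation question concrete, here is a hedged Java sketch of what a deprecation step could look like: the old method stays for one release as a thin shortcut over a more general entry point. SketchNode, TextCollector and collectText are hypothetical names invented for this illustration; only toPlainTextString() comes from the thread. The collect helper mirrors usage patterns [1] and [3] above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for illustration; these are not HTMLParser classes.
interface TextCollector {
    void text(String s);
}

class SketchNode {
    private final String text;

    SketchNode(String text) { this.text = text; }

    // The general entry point that would remain after the refactoring.
    void collectText(TextCollector collector) {
        collector.text(text);
    }

    /** @deprecated shortcut kept for one release; delegates to collectText. */
    @Deprecated
    String toPlainTextString() {
        StringBuilder sb = new StringBuilder();
        collectText(sb::append);
        return sb.toString();
    }
}

class UsageSketch {
    // Patterns [1] and [3]: collect text into a buffer,
    // optionally processing each string before it is appended.
    static String collect(List<SketchNode> nodes, UnaryOperator<String> process) {
        StringBuilder buffer = new StringBuilder();
        for (SketchNode node : nodes) {
            buffer.append(process.apply(node.toPlainTextString()));
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        List<SketchNode> nodes =
                Arrays.asList(new SketchNode(" Hello "), new SketchNode("world"));
        System.out.println(collect(nodes, UnaryOperator.identity())); // pattern [1]
        System.out.println(collect(nodes, String::trim));             // pattern [3]
    }
}
```

Existing callers keep compiling (with a deprecation warning) during the transition release, and the shortcut can be deleted in the following one without touching collectText.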
From: Somik R. <so...@ya...> - 2002-12-24 02:05:00
|
Hi Derrick, Great going! > 1) Why aren't the standard scanners registered by > the HTMLParser > constructor? It would seem that proper parsing of > HTML would require > these always be registered. Unless there's a set for > HTML 3.2 and one > for HTML 4.1 or something. Some of the tests rely on > this 'empty scanner > list' behaviour to return the expected number of > nodes. That is because we allow folks to custom-build their own parser, e.g. if you are only interested in text content, you wouldn't wish to register any of the scanners at all. If you are interested only in links, you would register only the link scanner. Passing control to a scanner implies extra processing time (sometimes twice that of the no-scanner scenario). > 2) Mark and reset are still supported somewhat. If > you mark before > calling elements() and reset() after exhausting the > input stream, all is > well. But mark() in the midst of parsing is useless > and dangerous > because of the BufferedReader. Perhaps the > documentation should reflect > this, or a way to do it properly worked out. > Hmm.. I thought I'd added this doc in. If it's not enough, feel free to modify. > Aside notes: > > Nobody's checking that the contents of the document > are "text/html", > perhaps this check should be added. > > These changes still leave the 'Reader' constructors > as a separate group > which is used by a majority of the test cases. The > 'new testing > framework' should try to use the same code path as > most of the > real-world use cases. Perhaps, test cases could be > loaded from > htmlparser.sourceforge.net like > HTMLParserTest.testURLWithSpaces() does. > That would cause a serious performance hit while running 284 tests - I'd simply stop testing if it took too long to run the tests. Regards, Somik |
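The mark/reset caveat in point 2 can be illustrated with plain java.io classes. This is one plausible reading of the problem, not a description of the parser's internals: a BufferedReader reads ahead, so once parsing has started the underlying reader's position no longer matches what the consumer has seen. MarkResetSketch and its method names are made up for this example, and the first method assumes the input is smaller than BufferedReader's default buffer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetSketch {
    // Reads ONE character through the wrapper, then asks the underlying
    // reader for its next character. For short inputs the wrapper has
    // already pulled everything into its buffer, so the source is at EOF.
    static int underlyingAfterOneBufferedRead(String html) {
        try {
            Reader underlying = new StringReader(html);
            BufferedReader buffered = new BufferedReader(underlying);
            buffered.read();          // the consumer has seen one character
            return underlying.read(); // -1: the source was drained by read-ahead
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Marking before any read and resetting after exhausting the input
    // restarts cleanly, as long as the read-ahead limit covers the input.
    static String rereadFromStart(String html) {
        try {
            BufferedReader buffered = new BufferedReader(new StringReader(html));
            buffered.mark(html.length() + 1);
            while (buffered.read() != -1) {
                // first pass: consume everything
            }
            buffered.reset();
            StringBuilder second = new StringBuilder();
            int c;
            while ((c = buffered.read()) != -1) {
                second.append((char) c);
            }
            return second.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(underlyingAfterOneBufferedRead("<p>hi</p>"));
        System.out.println(rereadFromStart("<p>hi</p>"));
    }
}
```

Marking the underlying reader in the middle of a parse would therefore record a position well past the point the parser has actually reached, which is why the safe pattern is to mark before the first read.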
From: Somik R. <so...@ya...> - 2002-12-24 02:00:26
|
Hi Derrick, --- Derrick Oswald <Der...@ro...> wrote: > g'day, > > I've dropped code to handle charset parameters, but > this raised a couple > of issues. First an explanation of the changes. > > HTMLParser constructors were added that take a > URLConnection, so that > the input stream could be reacquired as needed. > For consistency the 'from file' handling is > identical to a URL now. > > The HTTP header is now examined for the charset > parameter and the input > stream is converted to a reader with that encoding. > However, a lot of sites lie. They don't specify the > charset in the HTTP > header, even though the HTML is encoded with a > non-standard encoding. > So, we have to pre-read the header portion (meta > tags) and restart if > the charset in the meta tags is different than in > the HTTP header. > > Ya, ya, pre-reading is fraught with peril, and > that's what I'm on about > here... > > 1) Why aren't the standard scanners registered by > the HTMLParser > constructor? It would seem that proper parsing of > HTML would require > these always be registered. Unless there's a set for > HTML 3.2 and one > for HTML 4.1 or something. Some of the tests rely on > this 'empty scanner > list' behaviour to return the expected number of > nodes. > > 2) Mark and reset are still supported somewhat. If > you mark before > calling elements() and reset() after exhausting the > input stream, all is > well. But mark() in the midst of parsing is useless > and dangerous > because of the BufferedReader. Perhaps the > documentation should reflect > this, or a way to do it properly worked out. > > Aside notes: > > Nobody's checking that the contents of the document > are "text/html", > perhaps this check should be added. > > These changes still leave the 'Reader' constructors > as a separate group > which is used by a majority of the test cases. The > 'new testing > framework' should try to use the same code path as > most of the > real-world use cases. 
Perhaps, test cases could be > loaded from > htmlparser.sourceforge.net like > HTMLParserTest.testURLWithSpaces() does. > > Derrick |
From: Derrick O. <Der...@ro...> - 2002-12-23 20:47:40
|
g'day, I've dropped code to handle charset parameters, but this raised a couple of issues. First an explanation of the changes. HTMLParser constructors were added that take a URLConnection, so that the input stream could be reacquired as needed. For consistency the 'from file' handling is identical to a URL now. The HTTP header is now examined for the charset parameter and the input stream is converted to a reader with that encoding. However, a lot of sites lie. They don't specify the charset in the HTTP header, even though the HTML is encoded with a non-standard encoding. So, we have to pre-read the header portion (meta tags) and restart if the charset in the meta tags is different than in the HTTP header. Ya, ya, pre-reading is fraught with peril, and that's what I'm on about here... 1) Why aren't the standard scanners registered by the HTMLParser constructor? It would seem that proper parsing of HTML would require these always be registered. Unless there's a set for HTML 3.2 and one for HTML 4.1 or something. Some of the tests rely on this 'empty scanner list' behaviour to return the expected number of nodes. 2) Mark and reset are still supported somewhat. If you mark before calling elements() and reset() after exhausting the input stream, all is well. But mark() in the midst of parsing is useless and dangerous because of the BufferedReader. Perhaps the documentation should reflect this, or a way to do it properly worked out. Aside notes: Nobody's checking that the contents of the document are "text/html", perhaps this check should be added. These changes still leave the 'Reader' constructors as a separate group which is used by a majority of the test cases. The 'new testing framework' should try to use the same code path as most of the real-world use cases. Perhaps, test cases could be loaded from htmlparser.sourceforge.net like HTMLParserTest.testURLWithSpaces() does. Derrick |
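The header-versus-meta check described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the code that was committed: CharsetSketch, charsetOf and mustRestart are hypothetical names, the fallback of ISO-8859-1 is assumed here because it was the historical HTTP default for text types, and the header string is assumed to come from something like URLConnection.getContentType().

```java
import java.util.Locale;

class CharsetSketch {
    // Assumed fallback when the HTTP header names no charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns null when none is declared.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // True when the meta tag declares a charset that differs from the one
    // the stream was opened with, i.e. the input must be re-read.
    static boolean mustRestart(String httpContentType, String metaContent) {
        String fromHeader = charsetOf(httpContentType);
        String inUse = fromHeader != null ? fromHeader : DEFAULT_CHARSET;
        String fromMeta = charsetOf(metaContent);
        return fromMeta != null && !fromMeta.equalsIgnoreCase(inUse);
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(mustRestart("text/html", "text/html; charset=Shift_JIS"));
        System.out.println(mustRestart("text/html; charset=utf-8", "text/html; charset=UTF-8"));
    }
}
```

The restart itself (reacquiring the input stream from the URLConnection and wrapping it in a reader with the new encoding) is left out, since the thread does not show how the parser does it.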
From: Sam J. <ga...@yh...> - 2002-12-23 06:07:54
|
Hi Somik, Somik Raha wrote: >Bytway... I'd written earlier about this - what is your opinion on using >Bayesian networks to have a rule-based learning system, that gets better >over time ? i.e. right now the tag identification mechanism is linear- >there is only so far that can go. But with the sort of dirty html we get, >the system has to be self-learning. I am thinking of an approach where we'd >try to eliminate a lot of the hard-coded rules, with a learning network. Of >course, we'd have our tests to verify that we haven't broken anything, and >from there, it should only get better. It would be great to have your >insight on this. > I think I don't quite understand enough about htmlparser to see how Bayesian networks would be applicable. I have only recently worked out how your scanners work, or rather, that you have scanners for different types of tags and can then avoid processing those tags that you are not interested in. You say above that your tag identification mechanism is linear, but linear with respect to what? Can you give me an example of the hard coded rules you are using now, and a couple of examples of dirty html pages that cause them to be sub-optimal. Using learning in a system to increase efficiency is usually very difficult to do well. Learning systems basically have more flexibility than other systems, but as a consequence you have more free parameters. It is easy to add a learning framework but then spend all your time just trying to adjust the system parameters, and then to discover that exploring the space of possible parameters for your learner is just too expensive. Nonetheless I am always fascinated by the problem of adding learning to a system, precisely because it is so difficult to do well. If you can give me some concrete examples, I will do my best to help you select an appropriate learning mechanism. >>p.s. I'm impressed by the frequency with which you are releasing >>htmlParser, and your process of having multiple candidates etc. 
I >>struggle to release often as the release process itself still seems a >>little cumbersome (sourceforge has got better) .... have you any tips >>for streamlining it ....? I guess what I really need is an ant methods >> >> >like > > >>ant release-bug-fix version >>ant create-new-version-release >>ant create-new-candidate-release >> >>which handle all the necessary communication with sourceforge, >>uploading, packaging and handling of release numbers .... >> >> > >Ha ha! I am not sure if you'll believe this, but I was inspired to structure >the htmlparser project based on the neurogrid project- you had ant scripts >long before we did. > Interesting. The ant scripts for neurogrid were originally made by Rick Knowles, and I'm still only just getting a really good feel for ant. I remember trying to set up something in my ng scripts that would add the date to the jar file name, like you sometimes do, and failing. My own fault really; I rarely read the manual and always try to learn by modifying the operation of an existing system (kind of an evolutionary approach ....) >Of course, ant scripts are so important to do the job >automatically - but I like keeping things simple -in the sense, there is no >separate bug-fix version, but the next integration release (Candidate). > >I am not yet a fan of branches - they're ok if they don't live more than two >weeks (I've been thinking real hard about it for a while). I'm planning to >get the production release out this week - so we can all move on to 1.3 >(instead of having two versions - we'll live with 1.3 integration releases). >I'd hate to make the same bug fixes twice. > ok, but how much do you use the ant tasks to collect together all the things required for a release, or do you access sourceforge yourself each time? Given that the code is ready to go, how long does it take you to do a release? 5 minutes, 30? I'm imagining an ant task that would require one command and then you'd leave it to run .... 
Really I will have to sit down and spend some time looking at your ant scripts again to work that out, but I thought it might be interesting to hear from you about whether the release process feels cumbersome or not... CHEERS> SAM |
From: Somik R. <so...@ya...> - 2002-12-22 03:24:26
|
Hi Folks, Finally, after 8 months of hard work, we have the next production release of the parser. 1.2 has tons of bug fixes and features. The change log difference between 1.2 and 1.1 is too big to be listed in this mail - check the change log when you are downloading (it's also in the download package). Documentation has been considerably improved (the Sample programs would be the place to start). There's a section on the patterns in action as well. You can modify the rendering process for links and images, as well as provide collecting parameters to pick up nodes that you wish (currently images and links supported). Below is the change log (as compared to last week's integration release) : Production Release 1.2 ---------------------- [1] Rewrote HTMLLinkProcessor.extract() so URL class does all the heavy lifting [2] Partially fixed bug 654746 - HTMLLinkScanner error, code review needed [3] Rendering bug fixed - allowing uniform rendering for links and images [4] Fixed bug 655917, made HTMLParameterParser.parseParameters() thread-safe [5] Refactored HTMLFormTag (introduced POST and GET static members) [6] Bug fixed in HTMLFormTag.getInputTag() (NullPointerException when input tag has no name) [7] Added ability to get textarea tag from HTMLFormTag. [8] Added search capability in HTMLFormTag [9] Fixed bug 655627 - JSP tags with < sign (for loops) were not being parsed correctly [10] Fixed bug 655603 - JSP tags within src of script not recognized correctly when using single apostrophes [11] Fixed bug 655580 - JSP tags within title tags not recognized correctly [12] Fixed bug 655599 - Erroneous end-of-line characters were being added in string nodes [13] Fixed bug 656870 - HTMLFormScanner goes into infinite loop if a previous link has not been closed Thanks to Derrick Oswald and Dhaval Udani for their work on the last few releases. Thanks to Joe Robins for pointing out an important bug in HTMLFormScanner. 
A special mention for Dhaval - all his bug reports come with testcases making it really easy for us to reproduce the bug and fix them. Regards, Somik |
From: Derrick O. <Der...@ro...> - 2002-12-22 03:10:18
|
Wow, Somik's been busy. He wiped the slate of bugs clean and has officially released version 1.2: https://sourceforge.net/project/showfiles.php?group_id=24399&release_id=129477 Let's give him a round of applause. Now roll up your sleeves and make a glorious 1.3! |
From: Somik R. <so...@ya...> - 2002-12-21 08:47:08
|
Hi Sam > It seems that really we want something like: > > 1. Specify use case, and data-in/data-out > 2. Automatically generate test code and shell implementation code > 3. Run test code to fail > 4. fill in implementaton details ...etc > I like your idea. You can check this week's release - you'll find searching support for form tags which allow you to pick up input tags, textarea tags, select tags, ... By the way... I'd written earlier about this - what is your opinion on using Bayesian networks to have a rule-based learning system, that gets better over time ? i.e. right now the tag identification mechanism is linear - there is only so far that can go. But with the sort of dirty html we get, the system has to be self-learning. I am thinking of an approach where we'd try to replace a lot of the hard-coded rules with a learning network. Of course, we'd have our tests to verify that we haven't broken anything, and from there, it should only get better. It would be great to have your insight on this. > p.s. I'm impressed by the frequency with which you are releasing > htmlParser, and your process of having multiple candidates etc. I > struggle to release often as the release process itself still seems a > little cumbersome (sourceforge has got better) .... have you any tips > for streamlining it ....? I guess what I really need is an ant methods like > > ant release-bug-fix version > ant create-new-version-release > ant create-new-candiate-release > > which handle all the necessary communication with sourceforge, > uploading, packaging and handling of release numbers .... Ha ha! I am not sure if you'll believe this, but I was inspired to structure the htmlparser project based on the neurogrid project - you had ant scripts long before we did. Of course, ant scripts are so important to do the job automatically - but I like keeping things simple - in the sense, there is no separate bug-fix version, but the next integration release (Candidate). 
I am not yet a fan of branches - they're ok if they don't live more than two weeks (I've been thinking real hard about it for a while). I'm planning to get the production release out this week - so we can all move on to 1.3 (instead of having two versions - we'll live with 1.3 integration releases). I'd hate to make the same bug fixes twice. Regards, Somik |
From: Sam J. <ga...@yh...> - 2002-12-21 06:39:59
|
Hi Somik Somik Raha wrote: >Hi Sam, > This is interesting - bcos I was doing much the same >thing last evening. I've added a lot of searching >methods into HTMLFormTag - as I was using this to do >test-first development of an XSLT stylesheet. > I was happy with the results - I was actually able >to develop the stylesheet test-first. > Sounds cool. >>I have started creating a TestCodeGenerator using >>HtmlParser. I am >>using an HTMLFormScanner to strip out all the form >>details, and from >>this I hope to generate testing and handling java >>classes .... >> >> > >I will probably add some more utility methods into >HTMLParserTestCase that will make life easier - but >even in its current form, you might find it useful. > I think it will be. At first I was thinking I wanted lots of additional methods, but actually I can predicate behaviour based on the TYPE of the input tags. I'm getting output like this: [java] DEBUG [main] (TestCodeGenerator.java:296) - pTagName: FORM [java] DEBUG [main] (TestCodeGenerator.java:302) - attr: ACTION, value: $FORM_ACTION [java] DEBUG [main] (TestCodeGenerator.java:302) - attr: NAME, value: $EDIT_FORM [java] DEBUG [main] (TestCodeGenerator.java:302) - attr: METHOD, value: POST [java] DEBUG [main] (TestCodeGenerator.java:306) - x_inputs.size(): 7 [java] DEBUG [main] (TestCodeGenerator.java:321) - pTagName: INPUT [java] DEBUG [main] (TestCodeGenerator.java:327) - attr: VALUE, value: add_bookmark [java] DEBUG [main] (TestCodeGenerator.java:327) - attr: NAME, value: $ACTION [java] DEBUG [main] (TestCodeGenerator.java:327) - attr: TYPE, value: hidden and I think I can generate the necessary test and code shells from this. >I've documented it here : >http://htmlparser.sourceforge.net/design/tests.html > Interesting. I'm becoming more and more committed to a test-first philosophy myself, i.e. 1. write test 2. run it and watch it fail to confirm test will fail when implementation is not present 3. write the implementation 4. 
try and get implementation to pass test 5. repeat as appropriate Although actually my post was more about automatic test creation rather than testing itself. I recently found this: http://www.junitdoclet.org/ which automatically creates test shells out of code. And I found myself wanting the equivalent thing for httpUnit. Either way round I think if people are going to do testing they need support to make it easier to create the tests. And if they are working on the tests first they need support creating the implementation. It seems that really we want something like: 1. Specify use case, and data-in/data-out 2. Automatically generate test code and shell implementation code 3. Run test code to fail 4. fill in implementation details ...etc Anyways..... CHEERS> SAM p.s. I'm impressed by the frequency with which you are releasing htmlParser, and your process of having multiple candidates etc. I struggle to release often as the release process itself still seems a little cumbersome (sourceforge has got better) .... have you any tips for streamlining it ....? I guess what I really need is some ant methods like ant release-bug-fix version ant create-new-version-release ant create-new-candidate-release which handle all the necessary communication with sourceforge, uploading, packaging and handling of release numbers .... |