Thread: [Htmlparser-developer] RE: question about using HTMLParser in Apache JMeter
From: Derrick O. <der...@au...> - 2003-09-29 20:00:13
Peter,

Yes, you have permission. In fact, we would be honoured and will endeavour to assist you in any way necessary.

It's funny you should mention images and DOM. The latest versions of htmlparser include an example application that does a very similar task: getting the images behind thumbnails (see lib/thumbelina.jar or package org.htmlparser.lexerapplications.thumbelina). It uses the low-level Lexer package to avoid having to form the entire document model. I would check to see if something like this meets your needs.

If you need more than that (e.g. table parsing, balancing end tags, etc.) you'll have to go with the full parser. Unfortunately, the Lexer hasn't been completely integrated into the parser yet and the current CVS snapshot is a bit of a mess. With a bit of patience, this too will come to pass.

As far as performance comparisons go, I've only heard anecdotal evidence that htmlparser is faster. I suppose this could be an area of investigation.

Derrick

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: September 29, 2003 8:53 AM
To: Derrick Oswald
Subject: question about using HTMLParser in Apache JMeter

Hi Derrick,

I am a committer on Apache's Jakarta JMeter project. I was wondering if we can get permission to use HTMLParser. Since the Apache foundation can't use LGPL code without permission, I'm hoping you're open to the idea.

Here is a quick description of how I want to use it. JMeter currently is a load testing tool for HTTP, FTP, JDBC and Java. The HTTP plugin uses JTidy to parse the HTML and extract the images for download.

Test plans with more than 20 clients perform poorly because of the high cost of DOM. JTidy generates DOM documents. One trick is to turn off image downloading in JMeter, but that doesn't solve the real problem. I want to replace JTidy with HTMLParser. I haven't done any performance comparison yet, but I'm guessing it should use less memory.

Has anyone done a performance comparison between JTidy and HTMLParser?

peter lin
From: Derrick O. <der...@au...> - 2003-10-01 14:06:07
Are there any opinions regarding Peter Lin's proposal to make htmlparser an official Jakarta project?

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: September 30, 2003 11:39 PM
To: Derrick Oswald
Subject: RE: question about using HTMLParser in Apache JMeter

I haven't found out the exact policy. Assuming the policy as I described it is the official policy, is that OK with the developers of HTMLParser? I would like to help make HTMLParser an official Jakarta project. Does that sound appealing to you? I don't know the process for making it an official Jakarta project, but I can look into it and get the details to you.

Thanks again for your kindness and assistance. I know you and the other developers have put a lot of blood and sweat into the code. Plus, having it as an official Jakarta project would give it a ton of exposure, since Jakarta now accounts for a huge percentage of Apache's traffic. Also, I believe the ScrapeTags in the taglib project could benefit from HTMLParser. If I remember correctly, it uses Tidy also and suffers from the same performance limitations.

peter lin

Derrick Oswald <der...@au...> wrote:

So, you're taking a snapshot? I would have thought you would just include the jar file, and build it into the JMeter project, i.e. use Ant's zipfileset. If not, what's the procedure for updates?

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: September 30, 2003 9:48 AM
To: Derrick Oswald
Subject: RE: question about using HTMLParser in Apache JMeter

Hi Derrick,

I talked to the maintainer of JMeter and got the information on the process. From my understanding of the Apache guidelines as explained by Mike Stover, it goes something like this:

1. Add the Apache license to the source files
2. Make sure all license and copyright information required by the HTMLParser developers is present
3. Post a big, huge thanks to the HTMLParser developers on JMeter
4. I will do code cleanup so it conforms to the JMeter code guidelines
5. Check in the code to JMeter CVS
6. Change the relevant code in JMeter to use HTMLParser

Basically, Apache requires that donations give the foundation a non-exclusive license to the software. If that is OK with all the developers, I will continue with the process. I've started running some benchmarks. When I am done I will send you the full results with source, so you can post them on the HTMLParser site. Thanks again for your generosity.

peter lin
From: Joshua K. <jo...@in...> - 2003-10-01 19:32:10
Derrick Oswald wrote:
> Are there any opinions regarding Peter Lin's proposal to make htmlparser
> an official Jakarta project?

Sounds like a great idea.

BTW, I had integrated the NodeFactory into the code a while back. It allows one to add decorators to things like StringNodes. I haven't had time to look at the latest code -- does it still retain that feature with the introduction of the lexer?

thanks
jk

--
I n d u s t r i a l L o g i c , I n c .
Joshua Kerievsky
Founder, Extreme Programmer & Coach
http://industriallogic.com
http://industrialxp.org
866-540-8336 (toll free)
510-540-8336 (phone)
Berkeley, California
From: Derrick O. <Der...@Ro...> - 2003-10-02 02:13:51
The StringNodeFactory you added is currently sidelined by the more generic NodeFactory. It would be easy to add it back in.

Derrick
From: Joshua K. <jo...@in...> - 2003-10-02 04:27:06
Derrick Oswald wrote:
> The StringNodeFactory you added is currently sidelined by the more
> generic NodeFactory. It would be easy to add it back in.
> Derrick

I deliberately added StringNodeFactory to the parser, not a generic NodeFactory, because I had no need for a generic NodeFactory. Have you found a need for a generic NodeFactory?

--jk
From: Derrick O. <Der...@Ro...> - 2003-10-02 11:44:50
Joshua,

Yes. In the transition from using a straight Lexer to get basic nodes (lexer.nodes package), to using the Parser to get nodes that can be visited (htmlparser package), the Lexer needs to generate nodes it was not compiled with. Hence the Parser replaces the Lexer as the NodeFactory that the Lexer calls when it needs to create a Node.

I'm thinking this concept should be augmented in the Parser's createTagNode to look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near-duplicate code from the scanners, and allow end users to plug in their own tags via a call like

    setTagFor ("BODY", new myBodyTag())

on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Derrick
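The prototype lookup proposed in this message can be sketched roughly as follows. This is only an illustration under stated assumptions: `ProtoTag`, `MyBodyTag` and `TagRegistry` are hypothetical stand-ins, not htmlparser classes, and only the `setTagFor` name and the idea of cloning empty tags from a table keyed by tag name come from the message itself.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the parser's Tag class; not the real htmlparser API. */
class ProtoTag implements Cloneable {
    private String name = "";

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    @Override
    public ProtoTag clone() {
        try {
            // Object.clone() makes a shallow copy and preserves the runtime subclass.
            return (ProtoTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: ProtoTag is Cloneable
        }
    }
}

/** A user-supplied tag type, as in the setTagFor ("BODY", new myBodyTag()) example. */
class MyBodyTag extends ProtoTag { }

/** Prototype table keyed by upper-cased tag name (the "mBlastocyst" idea). */
class TagRegistry {
    private final Map<String, ProtoTag> prototypes = new HashMap<>();

    /** Register an empty tag to be cloned whenever this tag name is seen. */
    void setTagFor(String name, ProtoTag prototype) {
        prototypes.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    /** Clone the registered prototype, or fall back to a generic tag. */
    ProtoTag createTag(String name) {
        ProtoTag prototype = prototypes.get(name.toUpperCase(Locale.ROOT));
        ProtoTag tag = (prototype == null) ? new ProtoTag() : prototype.clone();
        tag.setName(name);
        return tag;
    }
}
```

With this shape, registering `new MyBodyTag()` for "BODY" means every body tag comes back as a `MyBodyTag`, while unregistered names fall back to the generic class, so no scanner has to be replaced to get custom tags out.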
From: Joshua K. <jo...@in...> - 2003-10-06 04:39:42
Derrick Oswald wrote:
> Yes. In the transition from using a straight Lexer to get basic nodes
> (lexer.nodes package), to using the Parser to get nodes that can be
> visited (htmlparser package), the Lexer needs to generate nodes it was
> not compiled with. Hence the Parser replaces the Lexer as the
> NodeFactory that the Lexer calls when it needs to create a Node.

IMO, the NodeFactory is better off as its own object. The Parser can use a default instance of it. Clients can configure the Parser to use a specific NodeFactory. This is important for decorating nodes and tags. In addition, we don't want to give the Parser too many responsibilities, as it complicates its design.

At present, we've made some choices about which tags are visitable - i.e. visitable nodes and tags are hard-coded into our NodeVisitor class. I'm not sure what you mean above when you write "using the Parser to get nodes that can be visited"?

> I'm thinking this concept should be augmented in the Parser's
> createTagNode to look up the name of the node (from the attribute list
> provided), and create specific types of tags (FormTag, TableTag etc.)
> by cloning empty tags from a Hashtable of possible tag types (possibly
> called mBlastocyst in reference to undifferentiated stem cells).

Sounds like the Prototype pattern. The trouble with this approach is getting the right data into the node/tag. You can clone a tag that has no data, but then you have to get the right data into the tag. Since different tags have different data needs, it gets complicated. Have you considered these issues?

> This would provide a concrete implementation of createTag in
> CompositeTagScanner, removing a lot of near duplicate code from the
> scanners, and allow end users to plug in their own tags via a call like
> setTagFor ("BODY", new myBodyTag())
> on the Parser. Details on interaction with the scanners have to be
> worked out, but it seems the end user wouldn't have to replace the
> scanner to get their own tags out.

When you say "this would provide a concrete ...." I don't follow. Why is a Prototype-based createTagNode method a prerequisite for removing near duplicate code in the scanners? i.e. couldn't that be done regardless of whether a Prototype solution is used? What am I missing?

best regards
jk
From: Derrick O. <Der...@Ro...> - 2003-10-06 10:24:25
was subject: Re: [Htmlparser-developer] RE: question about using HTMLParser in Apache JMeter

Joshua,

The parser can be a NodeFactory with just three additional methods. It's still replaceable because the factory is set on the Lexer, i.e. clients can still create and set their own NodeFactory, even using the parser as a delegate for methods they don't want to handle. A major benefit of interface design is to avoid spurious trivial classes.

A node that's visitable has a signature:

    void accept (NodeVisitor visitor)

By incorporating that signature, because the NodeVisitor class knows about specific high level composite node types (why only Image, Link and Title?), the low level Lexer jar file would have to drag in a whole lot of other stuff. So currently the low level tags only implement (vacuously):

    void accept (Object visitor)

and then the high level Tag class thunks up to the more specific signature with an up-cast. If NodeVisitor were to only handle base types (String, Remark and Tag) this could be avoided. The fact that the NodeVisitor class knows about ImageTag, LinkTag and TitleTag makes it less useful in the presence of user supplied node types; but that's its inherent flaw.

Getting data into user supplied nodes is easy: each tag is presented with the attributes and children found by the scanner, what else is there? The current implementation does it the other way: each scanner is the one that figures out the special data and then creates a new specialized tag by some byzantine constructor taking arguments that only it can understand. The tag is reduced to regurgitating the simple strings it was given. Typical example: FrameScanner has extractFrameLocn() and extractFrameName(), which it passes into the FrameTag constructor. Why not have FrameTag figure this stuff out?

The TagScanner class is abstract, partly because of the signature:

    protected abstract Tag createTag(TagData tagData, Tag tag, String url) throws ParserException;

Each scanner has code like:

    public Tag createTag(TagData tagData, CompositeTagData compositeTagData) throws ParserException
    {
        return new BulletList(tagData,compositeTagData);
    }

With a 'Prototype' solution, the TagScanner class could implement:

    public Tag createTag(TagData tagData, CompositeTagData compositeTagData) throws ParserException
    {
        Tag tag = mBlastocyst.get (tagData.getTagName ());
        if (null == tag)
            tag = new Tag (tagData, compositeTagData); // should use the NodeFactory
        else
        {
            tag = (Tag)tag.clone ();
            tag.setData (tagData, compositeTagData);
        }
        return (tag);
    }

which would remove the need for each class to implement it. How would you remove the createTag() code from all the scanners without prototypes?

The above is couched in the current TagData format, but in reality it would be more like:

    tag = (Tag)tag.clone ();
    tag.setAttributes (attributes);
    tag.setChildren (children);

Derrick
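The suggestion that a tag such as FrameTag should work out its own data from the attributes it is handed might look something like the sketch below. Everything here is assumed for illustration: `AttributeTag` and `SketchFrameTag` are hypothetical classes, the attribute map with upper-cased keys is an assumption about what a scanner would supply, and only the clone-then-`setAttributes` sequence follows the message.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical base tag that simply stores the attributes the scanner found. */
class AttributeTag implements Cloneable {
    private Map<String, String> attributes = new HashMap<>();

    /** Copies the map, so a clone never shares state with its prototype. */
    void setAttributes(Map<String, String> attributes) {
        this.attributes = new HashMap<>(attributes);
    }

    String getAttribute(String name) { return attributes.get(name); }

    @Override
    public AttributeTag clone() {
        try {
            return (AttributeTag) super.clone(); // keeps the runtime subclass
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: class is Cloneable
        }
    }
}

/** The tag, not the scanner, knows which attributes carry its special data. */
class SketchFrameTag extends AttributeTag {
    /** Frame location taken from SRC; empty string when the attribute is absent. */
    String getFrameLocation() {
        String src = getAttribute("SRC");
        return (src == null) ? "" : src;
    }

    /** Frame name taken from NAME; null when the attribute is absent. */
    String getFrameName() {
        return getAttribute("NAME");
    }
}
```

A scanner using this would clone the registered empty `SketchFrameTag`, call `setAttributes`, and never need frame-specific extraction methods of its own.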
From: Joshua K. <jo...@in...> - 2003-10-06 18:56:49
Derrick Oswald wrote:
> The parser can be a NodeFactory with just three additional methods.
> It's still replaceable because the factory is set on the Lexer, i.e.
> clients can still create and set their own NodeFactory, even using the
> parser as a delegate for methods they don't want to handle. A major
> benefit of interface design is to avoid spurious trivial classes.

Let me see if I can understand your design. You want a user of the parser to first get access to the Lexer to then set which NodeFactory to use? I must be misunderstanding something. Most users of the parser shouldn't even know the Lexer exists, right? It's a low-level detail to average parser users.

A NodeFactory encapsulates data and methods used in node/tag creation - nothing spurious or trivial about it. In fact, small classes (such as NodeFactory) which have one responsibility are easier to understand, extend and maintain. Furthermore, one method on the parser is all it takes to let parser users set a NodeFactory instance. On the other hand, the current implementation has three separate methods to handle node/tag creation. I dislike that design because:

* it bloats the Parser interface, which is already heavily bloated with too many methods
* it gives the Parser a new responsibility which it has no business having: node/tag creation
* it adds code to an already fat Parser class that's overburdened with responsibilities.

I'm sensing that you prefer to build and work with Large Classes. Is that correct? If so, are you aware that Large Class is a smell? See Refactoring: Improving the Design of Existing Code, by Martin Fowler. The chapter on smells was co-written by Kent Beck and Martin Fowler.

> A node that's visitable has a signature:
> void accept (NodeVisitor visitor)

Yeah, I'm the guy who popularized the use of Visitors in the parser - remember? You were against their usage. Have you come around to the dark side?

> By incorporating that signature, because the NodeVisitor class knows
> about specific high level composite node types (why only Image, Link
> and Title?), the low level Lexer jar file would have to drag in a
> whole lot of other stuff. So currently the low level tags only
> implement (vacuously):
> void accept (Object visitor)
> and then the high level Tag class thunks up to the more specific
> signature with an up-cast. If NodeVisitor were to only handle base
> types (String, Remark and Tag) this could be avoided. The fact that
> the NodeVisitor class knows about ImageTag, LinkTag and TitleTag makes
> it less useful in the presence of user supplied node types; but that's
> its inherent flaw.

When I wrote NodeVisitor, I deliberately avoided making it aware of nodes or tags beyond StringNode, Tag and EndTag. The reason? To be able to visit other node/tag types, scanners must be registered, and a Visitor, being separate from the whole scanner mechanism, cannot guarantee that a given scanner is registered.

Over time, people started adding visitXYZ methods to the NodeVisitor interface, such as visitLink, etc. Was that necessary? I don't think so. If one needs information about Links, Images, etc., one doesn't need to use a Visitor.

If we use reflection, we can likely make a NodeVisitor that could visit any node/tag type. That would perhaps be slow, since reflection is slow, but it would be a useful experiment. In addition, those who use a reflection-based Visitor may not care about speed.

> Getting data into user supplied nodes is easy: each tag is presented
> with the attributes and children found by the scanner, what else is
> there? The current implementation does it the other way: each scanner
> is the one that figures out the special data and then creates a new
> specialized tag by some byzantine constructor taking arguments that
> only it can understand. The tag is reduced to regurgitating the simple
> strings it was given. Typical example: FrameScanner has
> extractFrameLocn() and extractFrameName(), which it passes into the
> FrameTag constructor. Why not have FrameTag figure this stuff out?
>
> The TagScanner class is abstract, partly because of the signature:
>     protected abstract Tag createTag(TagData tagData, Tag tag, String
> url) throws ParserException;
> Each scanner has code like:
>     public Tag createTag(TagData tagData, CompositeTagData
> compositeTagData) throws ParserException
>     {
>         return new BulletList(tagData,compositeTagData);
>     }
> With a 'Prototype' solution, the TagScanner class could implement:
>     public Tag createTag(TagData tagData, CompositeTagData
> compositeTagData) throws ParserException
>     {
>         Tag tag = mBlastocyst.get (tagData.getTagName ());
>         if (null == tag)
>             tag = new Tag (tagData, compositeTagData); // should use
> the NodeFactory
>         else
>         {
>             tag = (Tag)tag.clone ();
>             tag.setData (tagData, compositeTagData);
>         }
>         return (tag);
>     }
> which would remove the need for each class to implement it. How would
> you remove the createTag() code from all the scanners without prototypes?

How would a prototype approach account for the stack in the following code, from the OptionTagScanner:

    public Tag createTag(
        TagData tagData,
        CompositeTagData compositeTagData) {
        if (!stack.empty () && (this == stack.peek ()))
            stack.pop ();
        return new OptionTag(tagData,compositeTagData);
    }

BTW, FormTagScanner has a similar stack.

Believe it or not, Derrick, I like the Prototype pattern and have even considered using it within StringNodeFactory - I didn't proceed because I didn't find a genuine need. Now you've uncovered a possible real need for Prototype in the parser -- I'm all for exploring it. I just want to be clear about what we're doing. You say we could remove a lot of duplicated code in the scanners - I can see lots of code that creates specific tag instances and yes, Prototype can help make that code go away. However, the scanners also appear to do useful work (like usage of a stack or implementing the evaluate method) and I'm not seeing how that would easily transfer to the node/tag classes without making those classes overly complex.

best regards,
jk
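The reflection experiment floated in this message could start from something like the following sketch. It is not the project's NodeVisitor: `RNode`, `RLinkTag`, `RImageTag`, `ReflectiveVisitor` and `LinkCounter` are hypothetical names, and the dispatch rule (pick the most specific public `visit` overload, walking up the class hierarchy) is one assumed design among several possible ones.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Hypothetical node types standing in for the parser's nodes and tags. */
class RNode { }
class RLinkTag extends RNode { }
class RImageTag extends RNode { }

/** Visitor base class that finds the right visit(X) overload at run time. */
abstract class ReflectiveVisitor {
    /**
     * Calls the most specific public visit(...) overload declared by the
     * concrete visitor, trying the node's class first and then each
     * superclass. Nodes with no matching overload are silently skipped.
     */
    public final void dispatch(Object node) {
        for (Class<?> type = node.getClass(); type != null; type = type.getSuperclass()) {
            try {
                Method method = getClass().getMethod("visit", type);
                method.setAccessible(true); // visitor classes need not be public
                method.invoke(this, node);
                return;
            } catch (NoSuchMethodException e) {
                // No overload for this exact type; try the superclass next.
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}

/** Example visitor: handles links specially, everything else generically. */
class LinkCounter extends ReflectiveVisitor {
    int links;
    int others;

    public void visit(RLinkTag link) { links++; }
    public void visit(RNode node) { others++; }
}
```

The cost is a method lookup per node, which is the slowness the message anticipates; the benefit is that a user-supplied node type needs no change to the visitor base class.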
From: Derrick O. <Der...@Ro...> - 2003-10-07 01:07:36
Joshua,

As it stands, the NodeFactory is set automatically by the Lexer or Parser to itself. It's only if someone wants to *change* the node classes being returned that they would need to access the Lexer and set the NodeFactory property:

    parser.getLexer ().setNodeFactory (myfactory);

Not something for the casual user. True, most users won't see the Lexer. But if their needs are fast, linear, lightweight access, they would use just the Lexer and ignore the parser (see the Thumbelina lexer application for example). Then the node factory accessor is on the primary object. It's only when you add another level to the parsing that the accessor becomes indirect.

The parser class is large. But, of the forty or so methods it has, a quarter of them are dealing with the scanner list and a quarter of them are convenience pass-through methods to the lexer. Don't think large or small, think useful. Let me understand your position: the parser shouldn't be doing node/tag creation? What is a parser then? A shell for a gaggle of deus ex machina pulling the levers behind the curtain? No, it's not overburdened, it's just doing its job. All the rest of the classes are spurious artefacts.

To end users, the smell is in memory-inefficient, slow programs; they really couldn't care less what's under the hood. Every object created has an overhead in time and memory, so the fewer the better. To programmers though, to whom we cater, the design has to be clean and simple. Two classes, where one would do, is not necessarily clean or simple. In fact, I'm thinking that each tag should be its own scanner, folding the whole scanner tree into the tag tree. What better object to understand how to parse it than the tag itself? This would mean the prototype list *is* the scanner list, and the parser doesn't have to get larger. Currently, changing the code in two places means extra effort. Not keeping the two in sync can lead to bugs. Currently, most of the scanners are just baskets to hold the MATCH_NAME, ENDERS and END_TAG_ENDERS lists. Shouldn't the tag be the one responsible for knowing its own name, terminators and place in the DTD? Larger classes can mean easier maintenance, if what they are replacing is a plethora of trivial interrelated classes.

And don't get me going about the 'refactoring' that spawned the 'helpers'. All the overhead of creating a new CompositeTagScannerHelper for each node scanned is horrendous -- all to avoid a 'large' CompositeTagScanner class. Sorry, I heartily disagree with making classes smaller in an attempt to avoid a smell, when the smell only shows up when it's 'refactored'.

The stack example in the option class should be handled by the ENDERS list. Running into a new <OPTION> while parsing the previous one should close the original and open another. In this case, using a stack to track the recursion is overkill, I think, and could be handled in a more straightforward manner.

// Derrick
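The arrangement described in this message, where the Lexer is its own NodeFactory by default and a client swaps in a factory that delegates back for anything it does not want to handle, can be sketched as below. The types are hypothetical and heavily simplified: the method names other than `createTagNode`, `getLexer` and `setNodeFactory` are assumptions, and the string-based signatures are not the library's.

```java
/** Minimal node: just a kind and its text. Hypothetical, for illustration only. */
final class SketchNode {
    final String kind;
    final String text;

    SketchNode(String kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

/** The three creation methods; names other than createTagNode are assumed. */
interface SketchNodeFactory {
    SketchNode createStringNode(String text);
    SketchNode createRemarkNode(String text);
    SketchNode createTagNode(String name);
}

/** The lexer acts as its own factory until a client replaces it. */
class SketchLexer implements SketchNodeFactory {
    private SketchNodeFactory factory = this;

    SketchNodeFactory getNodeFactory() { return factory; }
    void setNodeFactory(SketchNodeFactory factory) { this.factory = factory; }

    public SketchNode createStringNode(String text) { return new SketchNode("string", text); }
    public SketchNode createRemarkNode(String text) { return new SketchNode("remark", text); }
    public SketchNode createTagNode(String name) { return new SketchNode("tag", name); }

    /** Stand-in for real lexing: always goes through the configured factory. */
    SketchNode emitString(String text) { return factory.createStringNode(text); }
}

/** The parser exposes its lexer, as in parser.getLexer ().setNodeFactory (...). */
class SketchParser {
    private final SketchLexer lexer = new SketchLexer();

    SketchLexer getLexer() { return lexer; }
}

/** Client factory: customises string nodes, delegates everything else. */
class TrimmingFactory implements SketchNodeFactory {
    private final SketchNodeFactory delegate;

    TrimmingFactory(SketchNodeFactory delegate) { this.delegate = delegate; }

    public SketchNode createStringNode(String text) {
        return delegate.createStringNode(text.trim());
    }
    public SketchNode createRemarkNode(String text) { return delegate.createRemarkNode(text); }
    public SketchNode createTagNode(String name) { return delegate.createTagNode(name); }
}
```

A client would write `parser.getLexer().setNodeFactory(new TrimmingFactory(parser.getLexer()))`. Because the delegate's creation methods build nodes directly rather than going back through the configured factory, the delegation does not loop.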
From: Joshua K. <jo...@in...> - 2003-10-07 16:55:06
|
Derrick Oswald wrote:

> As it stands the NodeFactory is set automatically by the Lexer or Parser
> to itself. It's only if someone wants to *change* the node classes being
> returned that they would need to access the Lexer and set the
> NodeFactory property:
> parser.getLexer ().setNodeFactory (myfactory);
> Not something for the casual user.

*Changing* the node classes being returned is PRECISELY why the NodeFactory must (and will) be accessible.

Today, users of the parser must write their own code to remove white spaces, remove escape characters, decode strings. That's no longer necessary. You can now configure the parser to remove white spaces, escape chars, etc., by means of node decorators. Those decorators are currently configured within StringNodeFactory, which is going to become NodeFactory. Decorating non-StringNodes, such as RemarkNodes, is already possible.

The easiest way to popularize this use of decorators in the parser is to give parser users access to the NodeFactory and make it easy for them to configure it, or subclass it, as they like.

> The parser class is large. But, of the forty or so methods it has, a
> quarter of them are dealing with the scanner list and a quarter of them
> are convenience pass through methods to the lexer. Don't think large or
> small, think useful. Let me understand your position; the parser
> shouldn't be doing node/tag creation? What is a parser then? A shell for
> a gaggle of deus ex machina pulling the levers behind the curtain? No,
> it's not overburdened, it's just doing its job. All the rest of the
> classes are spurious artefacts.

See the above for why the parser should delegate node creation.

> To end users, the smell is in memory inefficient, slow programs; they
> really couldn't care less what's under the hood. Every object created
> has an overhead in time and memory, so the fewer the better.

Do you use a profiler, Derrick? Because the above utterance is something I'd expect to hear from an inexperienced programmer. Or maybe an old C programmer. Hey, I programmed in C once. It's been a long time since I programmed in C. But not long enough.

> To programmers though, to which we cater, the design has to be clean and
> simple. Two classes, where one would do, is not necessarily clean or
> simple. In fact I'm thinking that each tag should be its own scanner,
> folding the whole scanner tree into the tag tree. What better object to
> understand how to parse it than the tag itself. This would mean the
> prototype list *is* the scanner list, and the parser doesn't have to get
> larger. Currently, changing the code in two places means extra effort.
> Not keeping the two in sync can lead to bugs. Currently, most of the
> scanners are just baskets to hold the MATCH_NAME, ENDERS and
> END_TAG_ENDERS lists. Shouldn't the tag be the one responsible for
> knowing its own name, terminators and place in the dtd? Larger classes
> can mean easier maintenance, if what they are replacing is a plethora of
> trivial interrelated classes.

Lazy Class is another smell in Refactoring -- it refers to a class that isn't pulling its own weight. We "inline" classes when they don't pull their own weight.

When I joined this project -- less than a year ago -- the scanners were a mess. Tons of duplicate code. Inlining these scanners into the tags at that point would've been foolish, as it would've bloated the tags.

So Somik and I began removing duplication from the scanners. Sometimes when you remove duplication, you get to the point where you see that the classes are no longer necessary -- they aren't doing enough to justify their existence. I believe we've now reached that point with the scanners, so I support the inlining of them into the tags.

> And don't get me going about the 'refactoring' that spawned the
> 'helpers'. All the overhead of creating a new CompositeTagScannerHelper
> for each node scanned is horrendous -- all to avoid a 'large'
> CompositeTagScanner class. Sorry, I heartily disagree with making
> classes smaller in an attempt to avoid a smell, when the smell only
> shows up when it's 'refactored'.

Most of the code in the parser pre-dates my appearance on this project, so I know not how/why the helpers got added.

What attracted me to this project was the messy, unrefactored code, which happened to have pretty good test coverage. This project is ripe for all sorts of refactorings, which the tests make a whole lot easier to do.

Without tests, it's hard to refactor. I remember looking at your StringBean class before it was made into a Visitor. I had just added Visitor to the parser and was using it for useful things. I looked at StringBean and said "uhhhhhhh, now that's crying out to be a Visitor." Yet I didn't make it a Visitor. Why? Because it had no tests. Without tests, refactoring is hard.

Creating tests for code is a no-brainer if you practice Test-Driven Development (TDD). Derrick, do you practice TDD?

> The stack example in the option class should be handled by the ENDERS
> list. Running into a new <OPTION> while parsing the previous one, should
> close the original and open another. In this case, using a stack to
> track the recursion is overkill, I think, and could be handled in a more
> straightforward manner.

I believe there are good tests for that recursion, so it shouldn't be hard to come up with other ways to do it. BTW, when will the tests be green again? I can't do much when they are running red.

--jk |
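The ENDERS suggestion quoted at the end of the message above (a new <OPTION> closes the one still open, so no stack is needed) can be sketched roughly as follows. This is only an illustration under stated assumptions: OptionGrouper, its group method and the flat token list are hypothetical and are not part of htmlparser; only the idea of an ENDERS list comes from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not the htmlparser API.
final class OptionGrouper {
    // Tags that implicitly terminate an open OPTION (the "ENDERS" idea).
    private static final Set<String> ENDERS =
            new HashSet<>(Arrays.asList("OPTION", "/SELECT"));

    /** Collects the text of each OPTION from a flat token list, without a stack. */
    static List<String> group(List<String> tokens) {
        List<String> options = new ArrayList<>();
        StringBuilder current = null; // null means no OPTION is open
        for (String token : tokens) {
            boolean isTag = token.startsWith("<") && token.endsWith(">");
            String name = isTag
                    ? token.substring(1, token.length() - 1).toUpperCase()
                    : null;
            // An explicit end tag or any ender closes the OPTION that is open.
            if (isTag && current != null
                    && (name.equals("/OPTION") || ENDERS.contains(name))) {
                options.add(current.toString().trim());
                current = null;
            }
            if (isTag && name.equals("OPTION")) {
                current = new StringBuilder(); // open a new OPTION
            } else if (!isTag && current != null) {
                current.append(token);         // text inside the open OPTION
            }
        }
        if (current != null) {
            options.add(current.toString().trim()); // input ended with an OPTION open
        }
        return options;
    }
}
```

A single "currently open" slot is enough here because OPTION elements cannot nest; tags that can nest would still need some record of the open ancestors.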
From: Somik R. <so...@ya...> - 2003-10-08 02:10:41
|
Derrick Oswald wrote:

> > And don't get me going about the 'refactoring' that spawned the
> > 'helpers'. All the overhead of creating a new CompositeTagScannerHelper
> > for each node scanned is horrendous -- all to avoid a 'large'
> > CompositeTagScanner class. Sorry, I heartily disagree with making
> > classes smaller in an attempt to avoid a smell, when the smell only
> > shows up when it's 'refactored'.

Just to set the record straight, some of the scanners were an utter mess, primarily because I wanted them stateless, so that one scanner object could be used throughout the life of a parser. That made refactoring hard. But moving functionality out into a helper class allowed the creation of state within the helper on every parse, and that allowed refactoring of large and obscure methods. At certain points, I even threw out the refactored code and wrote it from scratch.

I am not sure what you mean by your last statement - it looks a heck of a lot better than when the scanners had all the code. Sure, there is some penalty for creating objects every time, but that happens only when the scan is essential (triggered on identification of a tag).

I am not in favor of getting rid of the scanner hierarchy - clients can rig up a parser of their choice by including a scanner of their choice. Data files describing scanners could be used to remove some of the scanner classes, but that needs to be explored.

I could be wrong - maybe I haven't fully understood what removal of the scanner hierarchy will buy us. This is a good debate to have.

Regards,
Somik |
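The trade-off Somik describes (a stateless scanner that can be reused for the life of the parser, with all mutable state pushed into a helper created per scan) might look something like the sketch below. The names CompositeScanner and ScanHelper are made up for illustration and do not reproduce CompositeTagScanner or CompositeTagScannerHelper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of the pattern, not the htmlparser classes.
final class CompositeScanner {
    // Only immutable configuration lives here, so one instance can be reused.
    private final String endTag;

    CompositeScanner(String endTag) {
        this.endTag = endTag;
    }

    /** Collects child tokens from start up to (not including) the end tag. */
    List<String> scan(List<String> tokens, int start) {
        // A fresh helper per scan holds all mutable state; this is the
        // allocation cost being argued about in the thread.
        return new ScanHelper(tokens, start).run();
    }

    private final class ScanHelper {
        private final List<String> tokens;
        private final List<String> children = new ArrayList<>();
        private int position;

        ScanHelper(List<String> tokens, int start) {
            this.tokens = tokens;
            this.position = start;
        }

        List<String> run() {
            while (position < tokens.size()
                    && !tokens.get(position).equalsIgnoreCase(endTag)) {
                children.add(tokens.get(position++));
            }
            return children;
        }
    }
}
```

Because the scanner keeps no per-scan fields, calling scan twice on the same instance gives the same result; the price is one short-lived helper object per scanned tag.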
From: Derrick O. <Der...@Ro...> - 2003-10-08 02:38:13
|
Joshua,

The NodeFactory is an interface; you don't subclass an interface, you implement it. People can write their own three method class to do that, or like I said, delegate to the parser for things they don't want to handle.

But delegation comes at the cost of time and memory. The current string decorator architecture, for example, delegates through three wrapping objects (with associated memory overhead) to do the job the StringNode should do in the first place. If one knows a priori that whitespace removal, reference translation and so on are useful functions, build them into the StringNode. Provide the factory mechanism only for things you can't anticipate in advance. Leaving the decorator code as a caboose grafted on to the base implementation is only acceptable if it's a transient fix.

I use profilers all the time. And memory analysis. And javap to examine the byte code. That's why the lexer code runs 60% faster than the old NodeReader code and uses a quarter of the memory. Are you saying the creation of objects doesn't take time and memory? Maybe you should run a profiler comparing the 'decorative wrapping' approach with a 'built in functionality' approach. Or better yet, just step through it (and I mean step *in* to every method) in a debugger and see where all the time is spent allocating and wandering between objects without getting any of the work done, and repeatedly copying out strings. It's a real eye-opener for someone who only works at the 30,000 foot level.

Yes, I'm a big advocate of test driven development. The bean classes were dropped with their tests - BeanTest, so I don't see why you couldn't refactor it. The lexer package has its tests. There are currently 35 outstanding failures out of 448 and I'm currently trying to re-integrate the test cases that got shunted off to the temporaryFailures package -- because someone wanted a green bar. You can still do test driven development without a green bar by monitoring the number of failures...
...like a doctor, you just ensure you 'do no harm'.

Derrick |
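Derrick's point that a NodeFactory is implemented rather than subclassed, and that an implementation can delegate whatever it does not want to handle, can be illustrated with a deliberately simplified sketch. The interfaces below are stand-ins written for this example: the three method names and the String parameters are assumptions and do not reproduce the real htmlparser signatures.

```java
// Simplified, hypothetical stand-ins; the real htmlparser signatures differ.
interface Node {
    String text();
}

interface NodeFactory {
    Node createStringNode(String text);
    Node createRemarkNode(String text);
    Node createTagNode(String text);
}

// Plays the role of the default factory (the Lexer or Parser in the thread).
final class DefaultFactory implements NodeFactory {
    public Node createStringNode(String text) { return () -> text; }
    public Node createRemarkNode(String text) { return () -> "<!--" + text + "-->"; }
    public Node createTagNode(String text)    { return () -> "<" + text + ">"; }
}

/** Changes only string nodes; the other two methods delegate to a fallback. */
final class TrimmingFactory implements NodeFactory {
    private final NodeFactory fallback;

    TrimmingFactory(NodeFactory fallback) {
        this.fallback = fallback;
    }

    public Node createStringNode(String text) {
        String trimmed = text.trim(); // the one behaviour this factory customises
        return () -> trimmed;
    }

    public Node createRemarkNode(String text) { return fallback.createRemarkNode(text); }
    public Node createTagNode(String text)    { return fallback.createTagNode(text); }
}
```

In the thread, a factory like this would be installed with the call Derrick quotes, parser.getLexer ().setNodeFactory (myfactory); the sketch leaves that wiring out because it depends on the real API.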
From: Joshua K. <jo...@in...> - 2003-10-09 07:58:09
|
Derrick Oswald wrote:

> The NodeFactory is an interface,

Your NodeFactory is an interface. I'm not interested in your NodeFactory, as I don't like its implementation. When I refer to NodeFactory, I mean my StringNodeFactory under the name NodeFactory.

> you don't subclass an interface, you implement it.

Have you considered a career in teaching the Java language?

> People can write their own three method class to do that, or like I
> said, delegate to the parser for things they don't want to handle.

Nah. People will clearly see a *class* called NodeFactory, which will have numerous methods on it for configuring the primitive types, like StringNode, Tag, etc. I know you love your new lexer, but we aren't gonna make folks get access to the lexer to set a NodeFactory -- that's just plain awkward, unintuitive and downright strange!

> But delegation comes at the cost of time and memory.

Spoken like a premature optimizer! Hey, what else does the Parser delegate to? Maybe we could fold all sorts of classes into the Parser to create one big monster class that would be so utterly efficient. Wait, do you think we should re-write this thing in C?

Show me how having a client call the parser's NodeFactory object is so much slower and more memory intensive than having the client call the parser's create methods directly. If you cannot show a huge difference in time and memory, I won't prematurely optimize my code.

> The current string decorator architecture, for example, delegates
> through three wrapping objects (with associated memory overhead) to do
> the job the StringNode should do in the first place.

Easily optimized without dumping the code into StringNode -- see what I say about Flyweights, below.

> If one knows a priori that whitespace removal, reference translation
> and so on are useful functions, build them into the StringNode.
> Provide the factory mechanism only for things you can't anticipate in
> advance. Leaving the decorator code as a caboose grafted on to the
> base implementation is only acceptable if it's a transient fix.

I know 5 useful functions for StringNodes. It would be poor design to shove them into the StringNode, which is better off being primitive. The useful functions I know about are options for a StringNode. Most folks don't use them. Those that do can decide to turn them on if they need them. They can do that easily through a NodeFactory, since that is the perfect place to say "here's the kind of StringNodes I want you to create during the parse."

> I use profilers all the time. And memory analysis. And javap to
> examine the byte code. That's why the lexer code runs 60% faster than
> the old NodeReader code and uses a quarter of the memory.

Are users complaining about the speed and memory usage of the parser?

> Are you saying the creation of objects doesn't take time and memory?
> Maybe you should run a profiler comparing the 'decorative wrapping'
> approach with a 'built in functionality' approach. Or better yet, just
> step through it (and I mean step *in* to every method) in a debugger
> and see where all the time is spent allocating and wandering between
> objects without getting any of the work done, and repeatedly copying
> out strings. It's a real eye-opener for someone who only works at the
> 30,000 foot level.

The decorators on this project can easily be made into Flyweights, which means 3 objects can do *all* of the decoration for all of the StringNode objects. That's an easy change to make and doesn't involve *shoving* behavior into StringNode and bloating that class with code it doesn't need. It also leaves room for the Decorators to decorate other node types, like RemarkNode. Or would you rather shove embellishments into that class as well?

> Yes, I'm a big advocate of test driven development. The bean classes
> were dropped with their tests - BeanTest, so I don't see why you
> couldn't refactor it.

BeanTest! I looked at that one. It barely tested a fraction of StringNode's behavior. It tested if you could successfully serialize a StringNode. Hee haaa! If you used TDD to program StringBean, you're doing something very wrong.

> The lexer package has its tests. There are currently 35 outstanding
> failures out of 448 and I'm currently trying to re-integrate the test
> cases that got shunted off to the temporaryFailures package -- because
> someone wanted a green bar. You can still do test driven development
> without a green bar by monitoring the number of failures...
> ...like a doctor, you just ensure you 'do no harm'.

If you had evolved the lexer into the parser, as I suggested to you in a private email, the bar would be green today, it would've been green yesterday and the day before that. But evolutionary design isn't something you get, or, more probably, isn't something you even want to learn.

A green bar is necessary for refactoring. Monitoring failures is not a habit I plan to adopt, as I find it annoying. You don't mind making us all live with red tests for days on end. So please delete the temporaryFailures package -- it's useless given your style of programming.

--jk |
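One possible reading of the Flyweight suggestion above is sketched below: instead of wrapping every node in three decorator objects, three shared, stateless transforms are applied when a node's text is requested. This is an interpretation, not the project's code; TextDecorators, TextNode and TextNodeFactory are hypothetical names, and the three transforms are simplified stand-ins for decoding, escape-character removal and whitespace removal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch; class and method names are illustrative only.
final class TextDecorators {
    // Stateless, so a single shared instance of each can serve every node.
    static final UnaryOperator<String> DECODE =
            s -> s.replace("&amp;", "&");
    static final UnaryOperator<String> STRIP_ESCAPES =
            s -> s.replace("\t", "").replace("\n", "");
    static final UnaryOperator<String> COLLAPSE_SPACES =
            s -> s.trim().replaceAll(" +", " ");
}

final class TextNode {
    private final String raw;
    private final List<UnaryOperator<String>> decorators; // shared, not per-node wrappers

    TextNode(String raw, List<UnaryOperator<String>> decorators) {
        this.raw = raw;
        this.decorators = decorators;
    }

    /** Applies the shared decorators, in order, to the raw text. */
    String getText() {
        String text = raw;
        for (UnaryOperator<String> decorator : decorators) {
            text = decorator.apply(text);
        }
        return text;
    }
}

/** The factory is where a client chooses which decorations its nodes get. */
final class TextNodeFactory {
    private final List<UnaryOperator<String>> decorators;

    TextNodeFactory(List<UnaryOperator<String>> decorators) {
        this.decorators = decorators;
    }

    TextNode create(String raw) {
        return new TextNode(raw, decorators);
    }
}
```

A client would configure it with something like new TextNodeFactory(Arrays.asList(TextDecorators.DECODE, TextDecorators.STRIP_ESCAPES, TextDecorators.COLLAPSE_SPACES)). Each node then costs one object plus a reference to the shared list, which addresses the per-node wrapper overhead Derrick raises while keeping the options out of the node class, though the transforms still run on every getText call.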
From: Joshua K. <jo...@in...> - 2003-10-01 19:45:21
|
Derrick Oswald wrote:

> Are there any opinions regarding Peter Lin's proposal to make htmlparser
> an official Jakarta project?

It occurred to me to mention that Somik is on vacation in India -- he'll be back mid-October. He may check email so I'll mention this to him.

--jk

-- I n d u s t r i a l L o g i c , I n c . Joshua Kerievsky Founder, Extreme Programmer & Coach http://industriallogic.com http://industrialxp.org 866-540-8336 (toll free) 510-540-8336 (phone) Berkeley, California |
From: Derrick O. <der...@au...> - 2003-10-09 15:01:57
|
Peter,

I was just thinking about you this morning. Nobody objected. Nobody answered. I guess it's that way with open source. I think you can go ahead.

Derrick

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: October 9, 2003 10:43 AM
To: Derrick Oswald
Subject: RE: question about using HTMLParser in Apache JMeter

Hi derrick,

Were there any objections from the developers? I did a test implementation and tested it against the current release using Tidy. You might be interested in the results.

http://tao.altern8.net:8080/comparison_summary.pdf

peter lin |
From: Derrick O. <der...@au...> - 2003-10-09 15:15:29
|
Peter,

I can't think of anything special that needs to go in the license, other than a reference to the htmlparser project on sourceforge. Since this is only a snapshot, I don't think anybody needs commit privileges for JMeter. Any updates will go into the htmlparser project and when another major revision is released you can get everything at once and reintegrate it at your leisure.

Derrick

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: October 9, 2003 11:08 AM
To: Derrick Oswald
Subject: RE: question about using HTMLParser in Apache JMeter

Hi derrick,

Thanks for the assistance. I really do appreciate it. Feel free to publish the early benchmark results with the test implementation for JMeter. In case you don't have time to read it: using HtmlParser increases the throughput of JMeter by 2-3x. The memory and CPU usage are consistently less, but only by 1-5% if incremental GC isn't used. If incremental GC is used, the memory usage is half that of using Tidy.

I would like to include a huge thanks and acknowledgement in the license for all the source files. Is there anything in particular you would like included in the license?

I haven't had time to get in touch with the person responsible for creating new projects, but we will make HtmlParser a sub-project of JMeter. Do you or any of the other developers want commit privileges for JMeter in the meantime? If you're too busy, I am happy to take responsibility for merging updates and patches.

peter lin |