htmlparser-developer Mailing List for HTML Parser (Page 7)
Brought to you by:
derrickoswald
From: Joshua K. <jo...@in...> - 2003-10-22 01:26:55
|
Derrick,

Is it me, or are there duplicates of StringNode, RemarkNode, etc. between the org.htmlparser package and the org.htmlparser.lexer.nodes package?

I also noticed that the NodeFactory's creation methods take the lexer as an argument, yet *all* of those methods and the methods they call rely on lexer.getPage(). Have you considered simply passing in a page instance rather than a lexer instance? That will work well for some further refactoring I have in mind.

--jk
From: Derrick O. <Der...@Ro...> - 2003-10-20 01:56:28
|
Removed the data package from the parser level tags. Out went TagData, CompositeTagData, LinkData and FormData. This means the createTag call is now bloated with arguments, but this too shall pass.

Moved a lot of the functionality from the scanners to the tags. Whereas before, the scanner would extract all sorts of stuff and pass it to special tag constructors and the tag would just hold it, the tag now performs these tasks when asked. I also removed a lot of member variables so the tags get and set attribute values directly, which means it comes out in the toHtml() call without any special work.

Removed lexer level AbstractNode, so there is a Page property on the org.htmlparser.AbstractNode now.

Separated tag creation from recursion in NodeFactory interface, so people who want to create their own tags won't need to worry about the scanning recursion.

It passes 508 of 522 unit tests.

TODO
=====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close, it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. The end user wouldn't have to create or replace a scanner to get their own tags out.

Getting rid of the data package cleared up a lot of questions regarding the interaction scanners have with tags. In general, the scanner now creates the tag in a very straightforward bean-like manner:

    ret = new Div ();
    ret.setPage (page);
    ret.setStartPosition (start);
    ret.setEndPosition (end);
    ret.setAttributesEx (attributes);
    ret.setStartTag (startTag);
    ret.setEndTag (endTag);
    ret.setChildren (children);

This is nearly always the same in every scanner, only the tag name is different. The oddball cases have been highlighted with a // special step here... comment in the code. These special steps mostly revolve around meta-information available in scanners only (i.e. base href), or handling of nesting with a stack construct. It shouldn't be too much trouble to make these all go away.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Examples:

    testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
    testEmptyComment - <!--> was considered a valid remark node

Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list and committing and notifying the developer list after each of them. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
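The prototype idea in the Node Factory item above (clone an empty tag looked up by name, and let users register their own) could be sketched roughly as follows. This is only an illustration under stated assumptions: SketchTag, MyBodyTag and TagPrototypeRegistry are hypothetical names, not classes from the parser, and a HashMap stands in for the Hashtable mentioned in the message.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a parser tag; not the library's actual Tag class.
class SketchTag implements Cloneable {
    private String name = "";

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    @Override
    public SketchTag clone() {
        try {
            // Object.clone() keeps the runtime class, so subclasses clone as themselves.
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: Cloneable is implemented
        }
    }
}

// A user-supplied tag type, as in setTagFor ("BODY", new myBodyTag()).
class MyBodyTag extends SketchTag { }

// Hypothetical registry of empty prototype tags, keyed by upper-case tag name.
class TagPrototypeRegistry {
    private final Map<String, SketchTag> prototypes = new HashMap<>();

    void setTagFor(String name, SketchTag prototype) {
        prototypes.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    // Clone the registered prototype, or fall back to a generic tag.
    SketchTag createTag(String name) {
        String key = name.toUpperCase(Locale.ROOT);
        SketchTag prototype = prototypes.get(key);
        SketchTag tag = (prototype != null) ? prototype.clone() : new SketchTag();
        tag.setName(key);
        return tag;
    }
}
```

With this shape, registering a prototype is the only step a user needs; no scanner has to be created or replaced to get a custom tag class back.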
From: Derrick O. <Der...@Ro...> - 2003-10-13 22:09:27
|
It now passes 499 tests out of 521. The remaining 22 failures indicate changed functionality that needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

I've eliminated the ParserHelper static class, only one 'helper' left to go.

I reinstated the tests in the test.temporaryFailures package; they're really lexer package tests, so I put them in there. The good news is, they all pass now. In other words, the reason for them being relegated to the temporaryFailures package no longer exists. The bad news is, I've given up on the JSP test cases for the nonce.

These tests pointed out that the old attribute parser was handling some pretty awful attributes, so I added a fixAttributes() on Lexer to handle these bad tags. Chiefly the changes were to provide for whitespace either side of an equals sign (between the attribute name and the value) in Attribute and then recognize where it was needed in parseTag by calling fixAttributes.

I also fell back to providing unquoted values in the special getAttributes() hashtable. This is to provide backwards compatibility. Reverting it messed up a lot of tests that I had 'fixed' already.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if Tags were more bean-like and had zero args constructors with appropriate accessors.

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the CompositeTagScannerHelper. It's close, it just needs some more untangling.

AbstractNode
------------
Drop org.htmlparser.lexer.nodes.AbstractNode, fold functionality into org.htmlparser.AbstractNode.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like setTagFor ("BODY", new myBodyTag()) on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Examples:

    testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
    testEmptyComment - <!--> was considered a valid remark node

Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list and committing and notifying the developer list after each of them. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
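The attribute handling described in this message (tolerating whitespace on either side of the equals sign, and exposing unquoted values for backwards compatibility) might look something like the sketch below. It is not the Lexer's fixAttributes() code; LenientAttributeSketch and its regular expression are assumptions made purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses "name = value" pairs leniently.
class LenientAttributeSketch {
    // A name, then optionally: whitespace, '=', whitespace, and a quoted or bare value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([^\\s=]+)(?:\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s]+))?");

    // Returns attribute names mapped to unquoted values; valueless
    // attributes (e.g. "checked") map to null.
    static Map<String, String> parse(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(text);
        while (matcher.find()) {
            String value = matcher.group(2);
            if (value != null && value.length() >= 2) {
                char first = value.charAt(0);
                char last = value.charAt(value.length() - 1);
                // Strip only a matching pair of surrounding quotes.
                if ((first == '"' || first == '\'') && first == last) {
                    value = value.substring(1, value.length() - 1);
                }
            }
            attributes.put(matcher.group(1), value);
        }
        return attributes;
    }
}
```

For example, parsing `href = "a b"  width= 100 checked alt ='x'` yields four entries, with the quotes removed from the href and alt values.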
From: Derrick O. <der...@au...> - 2003-10-10 13:44:59
|
Looks fine.

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: October 10, 2003 9:42 AM
To: Derrick Oswald
Subject: apache license

hi derrick,

here is the license text.

peter

-------------------------------------------------

/*
 * ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001-2003 The Apache Software Foundation. All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *    "Apache JMeter" must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact ap...@ap....
 *
 * 5. Products derived from this software may not be called "Apache",
 *    "Apache JMeter", nor may "Apache" appear in their name, without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation. For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 *
 */

// The developers of JMeter and Apache are grateful to the developers
// of HTMLParser for giving Apache Software Foundation a non-exclusive
// license. The performance benefits of HTMLParser are clear and the
// users of JMeter will benefit from the hard work of the HTMLParser
// team. For detailed information about HTMLParser, the project is
// hosted on sourceforge at http://htmlparser.sourceforge.net/.
//
// HTMLParser was originally created by Somik Raha in 2000. Since then
// a healthy community of users has formed and helped refine the
// design so that it is able to tackle the difficult task of parsing
// dirty HTML. Derrick Oswald is the current lead developer and was kind
// enough to assist JMeter.
From: Derrick O. <der...@au...> - 2003-10-09 15:15:29
|
Peter,

I can't think of anything special that needs to go in the license, other than a reference to the htmlparser project on sourceforge.

Since this is only a snapshot, I don't think anybody needs commit privileges for JMeter. Any updates will go into the htmlparser project and when another major revision is released you can get everything at once and reintegrate it at your leisure.

Derrick

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: October 9, 2003 11:08 AM
To: Derrick Oswald
Subject: RE: question about using HTMLParser in Apache JMeter

Hi derrick,

thanks for the assistance. I really do appreciate it. feel free to publish the early benchmark results with the test implementation for JMeter.

in case you don't have time to read it: using HtmlParser increases the throughput of JMeter by 2-3x. the memory and cpu usage are consistently less, but only by 1-5% if incremental GC isn't used. if incremental GC is used, the memory usage is half that of using Tidy.

I would like to include a huge thanks and acknowledgement in the license for all the source files. Is there anything in particular you would like included in the license?

I haven't had time to get in touch with the person responsible for creating new projects, but we will make HtmlParser a sub-project of JMeter. Do you or any of the other developers want commit privileges for JMeter in the mean time? If you're too busy, I am happy to take responsibility for merging updates and patches.

peter lin
From: Derrick O. <der...@au...> - 2003-10-09 15:01:57
|
Peter,

I was just thinking about you this morning. Nobody objected. Nobody answered. I guess it's that way with open source. I think you can go ahead.

Derrick

-----Original Message-----
From: peter lin [mailto:jmw...@ya...]
Sent: October 9, 2003 10:43 AM
To: Derrick Oswald
Subject: RE: question about using HTMLParser in Apache JMeter

Hi derrick,

Were there any objections from the developers? I did a test implementation and tested it against the current release using Tidy. You might be interested in the results.

http://tao.altern8.net:8080/comparison_summary.pdf

peter lin
From: Joshua K. <jo...@in...> - 2003-10-09 07:58:09
|
Derrick Oswald wrote:

> The NodeFactory is an interface,

Your NodeFactory is an interface. I'm not interested in your NodeFactory, as I don't like its implementation. When I refer to NodeFactory, it's my StringNodeFactory with the name, NodeFactory.

> you don't subclass an interface, you implement it.

Have you considered a career in teaching the Java language?

> People can write their own three method class to do that, or like I
> said, delegate to the parser for things they don't want to handle.

Nah. People will clearly see a *class* called NodeFactory, which will have numerous methods on it for configuring the primitive types, like StringNode, Tag, etc. I know you love your new lexer, but we aren't gonna make folks get access to the lexer to set a NodeFactory -- that's just plain awkward, unintuitive and downright strange!

> But delegation comes at the cost of time and memory.

Spoken like a premature optimizer! Hey, what else does the Parser delegate to? Maybe we could fold all sorts of classes into the Parser to create one big monster class that would be so utterly efficient. Wait, do you think we should re-write this thing in C?

Show me how having a client call the parser's NodeFactory object is so much slower and more memory intensive than having the client call the parser's create methods directly. If you cannot show a huge difference in time and memory, I won't prematurely optimize my code.

> The current string decorator architecture, for example, delegates
> through three wrapping objects (with associated memory overhead) to do
> the job the StringNode should do in the first place.

Easily optimized without dumping the code into StringNode -- see what I say about Flyweights, below.

> If one knows a priori that whitespace removal, reference translation
> and so on are useful functions, build them into the StringNode.
> Provide the factory mechanism only for things you can't anticipate in
> advance. Leaving the decorator code as a caboose grafted on to the
> base implementation is only acceptable if it's a transient fix.

I know 5 useful functions for StringNodes. It would be poor design to shove them into the StringNode, which is better off being primitive. The useful functions I know about are options for a StringNode. Most folks don't use them. Those that do, can decide to turn them on if they need them. They can do that easily through a NodeFactory, since that is the perfect place to say "here's the kind of StringNodes I want you to create during the parse."

> I use profilers all the time. And memory analysis. And javap to
> examine the byte code. That's why the lexer code runs 60% faster than
> the old NodeReader code and uses a quarter of the memory.

Are users complaining about the speed and memory usage of the parser?

> Are you saying the creation of objects doesn't take time and memory?
> Maybe you should run a profiler comparing the 'decorative wrapping'
> approach with a 'built in functionality' approach. Or better yet, just
> step through it (and I mean step *in* to every method) in a debugger
> and see where all the time is spent allocating and wandering between
> objects without getting any of the work done, and repeatedly copying
> out strings. It's a real eye-opener for someone who only works at the
> 30,000 foot level.

The decorators on this project can easily be made into Flyweights, which means 3 objects can do *all* of the decoration for all of the StringNode objects. That's an easy change to make and doesn't involve *shoving* behavior into StringNode and bloating that class with code it doesn't need. It also leaves room for the Decorators to decorate other node types, like RemarkNode. Or would you rather shove embellishments into that class as well?

> Yes, I'm a big advocate of test driven development. The bean classes
> were dropped with their tests - BeanTest, so I don't see why you
> couldn't refactor it.

BeanTest! I looked at that one. It barely tested a fraction of StringNode's behavior. It tested if you could successfully serialize a StringNode. Hee haaa! If you used TDD to program StringBean, you're doing something very wrong.

> The lexer package has its tests. There are currently 35 outstanding
> failures out of 448 and I'm currently trying to re-integrate the test
> cases that got shunted off to the temporaryFailures package -- because
> someone wanted a green bar. You can still do test driven development
> without a green bar by monitoring the number of failures...
> ...like a doctor, you just ensure you 'do no harm'.

If you had evolved the lexer into the parser, as I suggested to you in a private email, the bar would be green today, it would've been green yesterday and the day before that. But evolutionary design isn't something you get, or, more probably, isn't something you even want to learn.

A green bar is necessary for refactoring. Monitoring failures is not a habit I plan to adopt, as I find it annoying. You don't mind making us all live with red tests for days on end. So please delete the temporaryFailures package -- it's useless given your style of programming.

--jk
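The Flyweight argument in this message can be made concrete with a small sketch: each decoration holds no per-node state, so one shared instance per kind serves every node. The names below (TextDecoration, SharedDecoration, FlyweightDecoratingFactory) are hypothetical and are not the project's decorator classes; the entity decoding is deliberately simplistic.

```java
// Hypothetical stateless decoration: the text is passed in, nothing is stored.
interface TextDecoration {
    String decorate(String text);
}

// Three shared, immutable instances cover all nodes (the Flyweight idea).
enum SharedDecoration implements TextDecoration {
    REMOVE_ESCAPE_CHARACTERS {
        @Override
        public String decorate(String text) {
            return text.replace("\n", "").replace("\r", "").replace("\t", "");
        }
    },
    COLLAPSE_WHITESPACE {
        @Override
        public String decorate(String text) {
            return text.trim().replaceAll("\\s+", " ");
        }
    },
    DECODE_REFERENCES {
        @Override
        public String decorate(String text) {
            // Only three references, with &amp; last to avoid double decoding.
            return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
        }
    }
}

// Hypothetical factory configured once with the decorations to apply.
class FlyweightDecoratingFactory {
    private final TextDecoration[] decorations;

    FlyweightDecoratingFactory(TextDecoration... decorations) {
        this.decorations = decorations.clone();
    }

    // Applies each shared decoration in order; no wrapper object per node.
    String createText(String raw) {
        String text = raw;
        for (TextDecoration decoration : decorations) {
            text = decoration.decorate(text);
        }
        return text;
    }
}
```

The trade-off under debate stays visible here: the decorations remain optional and separate from the node class, while the per-node allocation cost of wrapper objects is avoided.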
From: Derrick O. <Der...@Ro...> - 2003-10-08 02:38:13
|
Joshua,

The NodeFactory is an interface, you don't subclass an interface, you implement it. People can write their own three method class to do that, or like I said, delegate to the parser for things they don't want to handle.

But delegation comes at the cost of time and memory. The current string decorator architecture, for example, delegates through three wrapping objects (with associated memory overhead) to do the job the StringNode should do in the first place. If one knows a priori that whitespace removal, reference translation and so on are useful functions, build them into the StringNode. Provide the factory mechanism only for things you can't anticipate in advance. Leaving the decorator code as a caboose grafted on to the base implementation is only acceptable if it's a transient fix.

I use profilers all the time. And memory analysis. And javap to examine the byte code. That's why the lexer code runs 60% faster than the old NodeReader code and uses a quarter of the memory.

Are you saying the creation of objects doesn't take time and memory? Maybe you should run a profiler comparing the 'decorative wrapping' approach with a 'built in functionality' approach. Or better yet, just step through it (and I mean step *in* to every method) in a debugger and see where all the time is spent allocating and wandering between objects without getting any of the work done, and repeatedly copying out strings. It's a real eye-opener for someone who only works at the 30,000 foot level.

Yes, I'm a big advocate of test driven development. The bean classes were dropped with their tests - BeanTest, so I don't see why you couldn't refactor it.

The lexer package has its tests. There are currently 35 outstanding failures out of 448 and I'm currently trying to re-integrate the test cases that got shunted off to the temporaryFailures package -- because someone wanted a green bar. You can still do test driven development without a green bar by monitoring the number of failures...
...like a doctor, you just ensure you 'do no harm'.

Derrick

Joshua Kerievsky wrote:

> Derrick Oswald wrote:
>
>> As it stands the NodeFactory is set automatically by the Lexer or
>> Parser to itself. It's only if someone wants to *change* the node
>> classes being returned that they would need to access the Lexer and
>> set the NodeFactory property:
>>     parser.getLexer ().setNodeFactory (myfactory);
>> Not something for the casual user.
>
> *Changing* the node classes being returned is PRECISELY why the
> NodeFactory must (and will) be accessible.
>
> Today, users of the parser must write their own code to remove white
> spaces, remove escape characters, decode strings. That's no longer
> necessary. You can now configure the parser to remove white spaces,
> escape chars, etc., by means of node decorators. Those decorators are
> currently configured within StringNodeFactory, which is going to
> become NodeFactory. Decorating non-StringNodes, such as RemarkNodes,
> is already possible.
>
> The easiest way to popularize this use of decorators in the parser is
> to give parser users access to the NodeFactory and make it easy for
> them to configure it, or subclass it, as they like.
>
>> The parser class is large. But, of the forty or so methods it has, a
>> quarter of them are dealing with the scanner list and a quarter of
>> them are convenience pass through methods to the lexer. Don't think
>> large or small, think useful. Let me understand your position; the
>> parser shouldn't be doing node/tag creation? What is a parser then? A
>> shell for a gaggle of deus ex machina pulling the levers behind the
>> curtain? No, it's not overburdened, it's just doing its job. All the
>> rest of the classes are spurious artefacts.
>
> See the above for why the parser should delegate node creation.
>
>> To end users, the smell is in memory inefficient, slow programs; they
>> really couldn't care less what's under the hood. Every object created
>> has an overhead in time and memory, so the fewer the better.
>
> Do you use a profiler Derrick? Because the above utterance is
> something I'd expect to hear from an inexperienced programmer. Or
> maybe an old C programmer. Hey, I programmed in C once. It's been a
> long time since I programmed in C. But not long enough.
>
>> To programmers though, to which we cater, the design has to be clean
>> and simple. Two classes, where one would do, is not necessarily clean
>> or simple. In fact I'm thinking that each tag should be its own
>> scanner, folding the whole scanner tree into the tag tree. What
>> better object to understand how to parse it than the tag itself. This
>> would mean the prototype list *is* the scanner list, and the parser
>> doesn't have to get larger. Currently, changing the code in two
>> places means extra effort. Not keeping the two in sync can lead to
>> bugs. Currently, most of the scanners are just baskets to hold the
>> MATCH_NAME, ENDERS and END_TAG_ENDERS lists. Shouldn't the tag be the
>> one responsible for knowing its own name, terminators and place in
>> the dtd? Larger classes can mean easier maintenance, if what they are
>> replacing is a plethora of trivial interrelated classes.
>
> Lazy Class is another smell in Refactoring -- it refers to a class
> that isn't pulling its own weight. We "inline" classes when they
> don't pull their own weight.
>
> When I joined this project -- less than a year ago -- the scanners
> were a mess. Tons of duplicate code. Inlining these scanners into
> the tags at that point would've been foolish, as it would've bloated
> the tags.
> So Somik and I began removing duplication from the scanners.
> Sometimes when you remove duplication, you get to the point where you
> see that the classes are no longer necessary -- they aren't doing
> enough to justify their existence. I believe we've now reached that
> point with the scanners, so I support the inlining of them into the tags.
>
>> And don't get me going about the 'refactoring' that spawned the
>> 'helpers'. All the overhead of creating a new
>> CompositeTagScannerHelper for each node scanned is horrendous -- all
>> to avoid a 'large' CompositeTagScanner class. Sorry, I heartily
>> disagree with making classes smaller in an attempt to avoid a smell,
>> when the smell only shows up when it's 'refactored'.
>
> Most of the code in the parser pre-dates my appearance on this
> project, so I know not how/why the helpers got added.
>
> What attracted me to this project was the messy, unrefactored code,
> which happened to have pretty good test coverage. This project is
> ripe for all sorts of refactorings, which the tests make a whole lot
> easier to do.
>
> Without tests, it's hard to refactor. I remember looking at your
> StringBean class before it was made into a Visitor. I had just added
> Visitor to the parser and was using it for useful things. I looked at
> StringBean and said "uhhhhhhh, now that's crying out to be a Visitor."
> Yet I didn't make it a Visitor. Why? Because it had no tests.
> Without tests, refactoring is hard.
>
> Creating tests for code is a no-brainer if you practice Test-Driven
> Development (TDD). Derrick, do you practice TDD?
>
>> The stack example in the option class should be handled by the ENDERS
>> list. Running into a new <OPTION> while parsing the previous one,
>> should close the original and open another. In this case, using a
>> stack to track the recursion is overkill, I think, and could be
>> handled in a more straightforward manner.
>
> I believe there are good tests for that recursion, so it shouldn't be
> hard to come up with other ways to do it. BTW, when will the tests be
> green again? I can't do much when they are running red.
>
> --jk
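Derrick's suggestion at the top of this message (implement the three-method factory interface yourself and delegate whatever you don't want to handle) might be sketched as below. The interface shown is a simplified assumption: DemoNodeFactory, DemoNode and both implementations are hypothetical, and the real NodeFactory signatures are not reproduced here.

```java
// Hypothetical, simplified node: just a kind and its text.
class DemoNode {
    final String kind;
    final String text;

    DemoNode(String kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

// Hypothetical three-method factory interface (signatures are assumptions).
interface DemoNodeFactory {
    DemoNode createStringNode(String text);
    DemoNode createRemarkNode(String text);
    DemoNode createTagNode(String name);
}

// Stands in for the parser acting as its own default factory.
class DefaultDemoFactory implements DemoNodeFactory {
    @Override public DemoNode createStringNode(String text) { return new DemoNode("string", text); }
    @Override public DemoNode createRemarkNode(String text) { return new DemoNode("remark", text); }
    @Override public DemoNode createTagNode(String name) { return new DemoNode("tag", name); }
}

// A user factory that customizes string nodes and delegates the rest.
class TrimmingDemoFactory implements DemoNodeFactory {
    private final DemoNodeFactory fallback;

    TrimmingDemoFactory(DemoNodeFactory fallback) {
        this.fallback = fallback;
    }

    @Override public DemoNode createStringNode(String text) {
        return new DemoNode("string", text.trim()); // the only customized behavior
    }

    @Override public DemoNode createRemarkNode(String text) {
        return fallback.createRemarkNode(text); // delegate unchanged
    }

    @Override public DemoNode createTagNode(String name) {
        return fallback.createTagNode(name); // delegate unchanged
    }
}
```

The cost Derrick raises is the extra call through the fallback for the delegated methods; whether that matters in practice is exactly what the thread is arguing about.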
From: Somik R. <so...@ya...> - 2003-10-08 02:10:41
|
Derrick Oswald wrote:

> > And don't get me going about the 'refactoring' that spawned the
> > 'helpers'. All the overhead of creating a new CompositeTagScannerHelper
> > for each node scanned is horrendous -- all to avoid a 'large'
> > CompositeTagScanner class. Sorry, I heartily disagree with making
> > classes smaller in an attempt to avoid a smell, when the smell only
> > shows up when it's 'refactored'.

Just to set the record straight, some of the scanners were an utter mess. Primarily because I wanted them stateless- so that one scanner object could be used throughout the life of a parser. That made refactoring hard. But moving out functionality into a helper class allowed the creation of state within the helper on every parse - that allowed refactoring of large and obscure methods. At certain points, I even threw out the refactored code and wrote it from scratch.

I am not sure what you mean by your last statement - it looks heck of a lot better than when the scanners had all the code. Sure, there is some penalty for creating objects every time.. but that happens only when the scan is essential (triggered off on identification of a tag).

I am not in favor of getting rid of the scanner hierarchy - clients can rig up a parser of their choice by including a scanner of their choice. Data files describing scanners could be used to remove some of the scanner classes.. But it needs to be explored.

I could be wrong - maybe I haven't understood fully what removal of the scanner hierarchy will buy us.. This is a good debate to have.

Regards,
Somik
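The design Somik describes (a stateless scanner reused for the life of the parser, with a throwaway helper that owns the per-scan state) can be illustrated with a toy example. StatelessScannerSketch and ScanStateHelper are hypothetical names, and the nesting-depth calculation is only a stand-in for real scanning work.

```java
// Hypothetical per-scan helper: all mutable state lives here.
class ScanStateHelper {
    private final String input;
    private int position;
    private int depth;
    private int maxDepth;

    ScanStateHelper(String input) {
        this.input = input;
    }

    // Walks the input once, tracking how deeply tags are nested.
    int run() {
        while (position < input.length()) {
            if (input.startsWith("</", position)) {
                depth--;            // an end tag closes one level
                position += 2;
            } else if (input.charAt(position) == '<') {
                depth++;            // a start tag opens one level
                maxDepth = Math.max(maxDepth, depth);
                position++;
            } else {
                position++;
            }
        }
        return maxDepth;
    }
}

// Hypothetical scanner with no mutable fields, so one instance can be
// shared across every scan a parser performs.
class StatelessScannerSketch {
    int maxDepth(String html) {
        return new ScanStateHelper(html).run(); // fresh state per scan
    }
}
```

The object creation Derrick objects to is the `new ScanStateHelper(...)` call; the benefit Somik points to is that the scanner itself never needs resetting between scans.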
From: Derrick O. <Der...@Ro...> - 2003-10-08 01:25:01
|
Once you unpack the zip file the program is ready to run, but there's no command/script file to run it. If htmlparser.jar is in the current directory, use something like:

    java -classpath htmlparser.jar org.htmlparser.parserapplications.StringExtractor http://whatever

Derrick

lakshmi narasimhan wrote:

>Hi everybody!!
>I am new to the parsing community and have downloaded the HTML Parser. Unfortunately I am unable to run
>the programs. Basically I want to extract the text from a given URL, for which I came across a program called
>StringExtractor.java. But unfortunately I don't know the exact directory the file should be placed in.
>Can you help me in this regard?
>Thanks a lot in advance
>Regards
>Chari
From: Derrick O. <Der...@Ro...> - 2003-10-08 01:13:05
|
If you've been following the developer threads, Joshua and I are still thrashing out the details on how that would work ;-)
It will be extendable.

Couball, James wrote:

>Regarding your note about having TagFactory have signatures for all
>possible tags... how will TagFactory be extended to account for new,
>user defined tags? Is it intended to be user extendable?
>
>Thanks for the great work!
>
>Sincerely,
>James.
>
>-----Original Message-----
>From: Derrick Oswald [mailto:Der...@Ro...]
>Sent: Sunday, September 28, 2003 12:33 PM
>To: htm...@li...
>Subject: [Htmlparser-developer] lexer integration - added back
>visitEndTag
>
>Fixed up the broken visitor logic.
>Added some docos on NodeVisitor.
>
>TODO
>=====
>
>Serializable
>--------------
>The Parser needs to be made serializable again. This involves a
>transient field down on the Source, I think, rather than having the
>whole Lexer transient in the Parser.
>
>TagData
>-------
>This has been reworked to allow it to limp along under the new system,
>but it should really be removed. I think the reason for it (reduce the
>number of arguments to tag constructors) no longer applies, and a lot of
>the code could be easier to read if the Tag was more bean-like and had a
>zero args constructor with appropriate accessors.
>
>Helpers
>-------
>I desperately want to get rid of these 'helper' classes. They are just
>obfuscating the code.
>
>Node Factory
>------------
>The factory concept needs to be extended with a TagFactory (extending
>NodeFactory) that has the signatures for creating all the possible types
>of tags there are, and then this needs to be used by all the scanners to
>create their specific tags.
>
>Scanners
>--------
>The scanners may not be working, hard to tell without the unit tests
>running. I'm not sure that CompositeTagScanner is completely all right
>yet. It probably needs to be reworked based on the lexer.
>
>Unit Tests
>----------
>As mentioned, many of the unit tests expect toHtml() to produce
>capitalized and rearranged output. And parseAndAssertNodeCount() is
>expected not to include so many whitespace nodes. These need to be
>addressed.
>
>Documentation
>-------------
>As of now, it's more likely that the javadocs are lying to you than
>providing any helpful advice. This needs to be reworked completely.
>
>As you can see there's lots of work to do, so anyone with a death wish
>can jump in. I'll be working my way from top to bottom of the TODO list
>and committing and notifying the developer list after each of them. So
>go ahead and do a take from CVS and jump in the middle with anything
>that appeals. Keep the list posted and update your CVS tree often (or
>subscribe to the htmlparser-cvs mailing list for interrupt driven
>notification rather than polled notification).
From: Joshua K. <jo...@in...> - 2003-10-07 16:55:06
|
Derrick Oswald wrote:
> As it stands the NodeFactory is set automatically by the Lexer or Parser
> to itself. It's only if someone wants to *change* the node classes being
> returned that they would need to access the Lexer and set the
> NodeFactory property:
> parser.getLexer ().setNodeFactory (myfactory);
> Not something for the casual user.

*Changing* the node classes being returned is PRECISELY why the NodeFactory must (and will) be accessible. Until recently, users of the parser had to write their own code to remove white space, remove escape characters, or decode strings. That's no longer necessary. You can now configure the parser to remove white space, escape chars, etc., by means of node decorators. Those decorators are currently configured within StringNodeFactory, which is going to become NodeFactory. Decorating non-StringNodes, such as RemarkNodes, is already possible. The easiest way to popularize this use of decorators in the parser is to give parser users access to the NodeFactory and make it easy for them to configure it, or subclass it, as they like.

> The parser class is large. But, of the forty or so methods it has, a
> quarter of them are dealing with the scanner list and a quarter of them
> are convenience pass through methods to the lexer. Don't think large or
> small, think useful. Let me understand your position; the parser
> shouldn't be doing node/tag creation? What is a parser then? A shell for
> a gaggle of deus ex machina pulling the levers behind the curtain? No,
> it's not overburdened, it's just doing it's job. All the rest of the
> classes are spurious artefacts.

See the above for why the parser should delegate node creation.

> To end users, the smell is in memory inefficient, slow programs; they
> really couldn't care less what's under the hood. Every object created
> has an overhead in time and memory, so the fewer the better.

Do you use a profiler, Derrick? Because the above utterance is something I'd expect to hear from an inexperienced programmer. Or maybe an old C programmer. Hey, I programmed in C once. It's been a long time since I programmed in C. But not long enough.

> To
> programmers though, to which we cater, the design has to be clean and
> simple. Two classes, where one would do, is not necessarily clean or
> simple. In fact I'm thinking that each tag should be it's own scanner,
> folding the whole scanner tree into the tag tree. What better object to
> understand how to parse it than the tag itself. This would mean the
> prototype list *is* the scanner list, and the parser doesn't have to get
> larger. Currently, changing the code in two places means extra effort.
> Not keeping the two in sync can lead to bugs. Currently, most of the
> scanners are just baskets to hold the MATCH_NAME, ENDERS and
> END_TAG_ENDERS lists. Shouldn't the tag be the one responsible for
> knowing it's own name, terminators and place in the dtd? Larger classes
> can mean easier maintenance, if what they are replacing is a plethora of
> trivial interrelated classes.

Lazy Class is another smell in Refactoring -- it refers to a class that isn't pulling its own weight. We "inline" classes when they don't pull their own weight. When I joined this project -- less than a year ago -- the scanners were a mess. Tons of duplicate code. Inlining these scanners into the tags at that point would've been foolish, as it would've bloated the tags. So Somik and I began removing duplication from the scanners. Sometimes when you remove duplication, you get to the point where you see that the classes are no longer necessary -- they aren't doing enough to justify their existence. I believe we've now reached that point with the scanners, so I support the inlining of them into the tags.

> And don't get me going about the 'refactoring' that spawned the
> 'helpers'. All the overhead of creating a new CompositeTagScannerHelper
> for each node scanned is horrendous -- all to avoid a 'large'
> CompositeTagScanner class. Sorry, I heartily disagree with making
> classes smaller in an attempt to avoid a smell, when the smell only
> shows up when it's 'refactored'.

Most of the code in the parser pre-dates my appearance on this project, so I don't know how or why the helpers got added. What attracted me to this project was the messy, unrefactored code, which happened to have pretty good test coverage. This project is ripe for all sorts of refactorings, which the tests make a whole lot easier to do. Without tests, it's hard to refactor. I remember looking at your StringBean class before it was made into a Visitor. I had just added Visitor to the parser and was using it for useful things. I looked at StringBean and said "uhhhhhhh, now that's crying out to be a Visitor." Yet I didn't make it a Visitor. Why? Because it had no tests. Without tests, refactoring is hard. Creating tests for code is a no-brainer if you practice Test-Driven Development (TDD). Derrick, do you practice TDD?

> The stack example in the option class should be handled by the ENDERS
> list. Running into a new <OPTION> while parsing the previous one, should
> close the original and open another. In this case, using a stack to
> track the recursion is overkill, I think, and could be handled in a more
> straight forward manner.

I believe there are good tests for that recursion, so it shouldn't be hard to come up with other ways to do it. BTW, when will the tests be green again? I can't do much when they are running red.

--jk |
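The decorator idea in the message above (a factory that wraps each text node so that whitespace removal or entity decoding happens on the way out) can be sketched roughly as follows. This is only an illustration under stated assumptions: `TextNode`, `PlainTextNode`, `TransformingNode` and `DecoratingNodeFactory` are hypothetical names invented here, not the parser's actual StringNode or StringNodeFactory classes, and the entity handling covers just three entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a text node; not the parser's real StringNode. */
interface TextNode {
    String getText();
}

/** The undecorated node: returns its text exactly as it was lexed. */
final class PlainTextNode implements TextNode {
    private final String text;

    PlainTextNode(String text) {
        this.text = text;
    }

    public String getText() {
        return text;
    }
}

/** Decorator: wraps another node and transforms its text on the way out. */
final class TransformingNode implements TextNode {
    private final TextNode delegate;
    private final UnaryOperator<String> transform;

    TransformingNode(TextNode delegate, UnaryOperator<String> transform) {
        this.delegate = delegate;
        this.transform = transform;
    }

    public String getText() {
        return transform.apply(delegate.getText());
    }
}

/** Hypothetical factory: the client chooses which decorators are applied. */
final class DecoratingNodeFactory {
    private final List<UnaryOperator<String>> transforms = new ArrayList<>();

    DecoratingNodeFactory collapseWhitespace() {
        transforms.add(s -> s.replaceAll("\\s+", " ").trim());
        return this;
    }

    DecoratingNodeFactory decodeBasicEntities() {
        // &amp; is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
        transforms.add(s -> s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&"));
        return this;
    }

    /** Wraps the raw text in one decorator per configured transform, in order. */
    TextNode createTextNode(String raw) {
        TextNode node = new PlainTextNode(raw);
        for (UnaryOperator<String> transform : transforms) {
            node = new TransformingNode(node, transform);
        }
        return node;
    }
}

class DecoratorDemo {
    public static void main(String[] args) {
        DecoratingNodeFactory factory =
                new DecoratingNodeFactory().collapseWhitespace().decodeBasicEntities();
        TextNode node = factory.createTextNode("  Tom   &amp;\n Jerry ");
        System.out.println(node.getText());
    }
}
```

With no decorators configured the factory hands back the raw text untouched, which is the point of making the factory something the user can reach and configure.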
From: Couball, J. <jam...@co...> - 2003-10-07 15:49:50
|
Regarding your note about having TagFactory have signatures for all possible tags... how will TagFactory be extended to account for new, user defined tags? Is it intended to be user extendable?

Thanks for the great work!

Sincerely,
James.

-----Original Message-----
From: Derrick Oswald [mailto:Der...@Ro...]
Sent: Sunday, September 28, 2003 12:33 PM
To: htm...@li...
Subject: [Htmlparser-developer] lexer integration - added back visitEndTag

Fixed up the broken visitor logic.
Added some docos on NodeVisitor.

TODO
=====

Serializable
--------------
The Parser needs to be made serializable again. This involves a transient field down on the Source, I think, rather than having the whole Lexer transient in the Parser.

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working, hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet; it probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list and committing and notifying the developer list after each of them. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).

_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: lakshmi n. <sim...@ly...> - 2003-10-07 15:38:39
|
Hi everybody!!
I am new to the parsing community and have downloaded the HTML Parser. Unfortunately I am unable to run the programs. Basically I want to extract the text from a given URL, for which I came across a program called StringExtractor.java. But unfortunately I don't know the exact directory the file should be placed in.
Can you help me in this regard?
Thanks a lot in advance
Regards
Chari |
From: Derrick O. <Der...@Ro...> - 2003-10-07 01:07:36
|
Joshua, As it stands the NodeFactory is set automatically by the Lexer or Parser to itself. It's only if someone wants to *change* the node classes being returned that they would need to access the Lexer and set the NodeFactory property: parser.getLexer ().setNodeFactory (myfactory); Not something for the casual user. True, most users won't see the Lexer. But if their needs are fast linear lightweight access, they would use just the Lexer, and ignore the parser (see the Thumbelina lexer application for example). Then the node factory accessor is on the primary object. It's only when you add another level to the parsing that the accessor becomes indirect. The parser class is large. But, of the forty or so methods it has, a quarter of them are dealing with the scanner list and a quarter of them are convenience pass through methods to the lexer. Don't think large or small, think useful. Let me understand your position; the parser shouldn't be doing node/tag creation? What is a parser then? A shell for a gaggle of deus ex machina pulling the levers behind the curtain? No, it's not overburdened, it's just doing it's job. All the rest of the classes are spurious artefacts. To end users, the smell is in memory inefficient, slow programs; they really couldn't care less what's under the hood. Every object created has an overhead in time and memory, so the fewer the better. To programmers though, to which we cater, the design has to be clean and simple. Two classes, where one would do, is not necessarily clean or simple. In fact I'm thinking that each tag should be it's own scanner, folding the whole scanner tree into the tag tree. What better object to understand how to parse it than the tag itself. This would mean the prototype list *is* the scanner list, and the parser doesn't have to get larger. Currently, changing the code in two places means extra effort. Not keeping the two in sync can lead to bugs. 
Currently, most of the scanners are just baskets to hold the MATCH_NAME, ENDERS and END_TAG_ENDERS lists. Shouldn't the tag be the one responsible for knowing it's own name, terminators and place in the dtd? Larger classes can mean easier maintenance, if what they are replacing is a plethora of trivial interrelated classes. And don't get me going about the 'refactoring' that spawned the 'helpers'. All the overhead of creating a new CompositeTagScannerHelper for each node scanned is horrendous -- all to avoid a 'large' CompositeTagScanner class. Sorry, I heartily disagree with making classes smaller in an attempt to avoid a smell, when the smell only shows up when it's 'refactored'. The stack example in the option class should be handled by the ENDERS list. Running into a new <OPTION> while parsing the previous one, should close the original and open another. In this case, using a stack to track the recursion is overkill, I think, and could be handled in a more straight forward manner. // Derrick Joshua Kerievsky wrote: > Derrick Oswald wrote: > >> The parser can be a NodeFactory with just three additional methods. >> It's still replaceable because the factory is set on the Lexer, i.e. >> clients can still create and set their own NodeFactory, even using >> the parser as a delegate for methods they don't want to handle. A >> major benefit of interface design is to avoid spurious trivial classes. > > > Let me see if I can understand your design. You want a user of the > parser to first get access to the Lexar to then set which NodeFactory > to use? I must be misunderstanding something. Most users of the > parser shouldn't even know the Lexar exists, right? It's a low-level > detail to average parser users. > A NodeFactory encapsulates data and methods used in node/tag creation > - nothing spurious or trivial about it. In fact, small classes (such > as NodeFactory) which have one responsibility are easier to > understand, extend and maintain. 
Furthermore, one method on the > parser is all it takes to let parser users set a NodeFactory > instance. On the other hand, the current implementation has three > separate methods to handle node/tag creation. I dislike that design > because: > > * it bloats the Parser interface, which is already heavily bloated > with too many methods > * it gives the Parser a new responsibility which it has no business > having: node/tag creation > * it adds code to an already fat Parser class that's overburdened with > responsibilities. > > I'm sensing that you prefer to build and work with Large Classes. Is > that correct? If so, are you aware that Large Class is a smell? > See Refactoring: Improving the Design of Existing Code, by Martin > Fowler. The chapter on smells was co-written by Kent Beck and Martin > Fowler. > >> A node that's visitable has a signature: >> void accept (NodeVisitor visitor) > > > Yeah, I'm the guy who popularized the use of Visitors in the parser - > remember? You were against their usage. Have you come around to the > dark side? > >> By incorporating that signature, because the NodeVisitor class knows >> about specific high level composite node types (why only Image, Link >> and Title?), the low level Lexer jar file would have to drag in a >> whole lot of other stuff. So currently the low level tags only >> implement (vacuously): >> void accept (Object visitor) >> and then the high level Tag class thunks up to the more specific >> signature with an up-cast. If NodeVisitor were to only handle base >> types (String, Remark and Tag) this could be avoided. The fact that >> the NodeVisitor class knows about ImageTag, LinkTag and TitleTag >> makes it less useful in the presence of user supplied node types; but >> that's it's inherent flaw. > > > When I wrote NodeVisitor, I deliberately avoided making it aware of > nodes or tags beyond StringNode, Tag and EndTag. The reason? 
To be > able to visit other node/tag types, scanners must be registered and a > Visitor, being separate from the whole scanner mechanism, cannot > guarantee that a given scanner is registered. > Over time, people started adding visitXYZ methods to the NodeVisitor > interface, such as visitLink, etc. Was that necessary? I don't think > so. If one needs information about Links, Images, etc., one doesn't > need to use a Visitor. > > If we use reflection, we can likely make a NodeVisitor that could > visit any node/tag type. That would perhaps be slow, since reflection > is slow, but it would be a useful experiment. In addition, those who > use a reflection-based Visitor may not care about speed. > >> Getting data into user supplied nodes is easy: each tag is presented >> with the attributes and children found by the scanner, what else is >> there? The current implementation does it the other way, each scanner >> is the one that figures out the special data and then creates a new >> specialized tag by some byzantine constructor taking arguments that >> only it can understand. The tag is reduced to regurgitating the >> simple strings it was given. Typical example; FrameScanner has >> extractFrameLocn() and extractFrameName() which it passes into the >> FrameTag constructor. Why not have FrameTag figure this stuff out? 
>> >> The TagScanner class is abstract, partly because of the signature: >> protected abstract Tag createTag(TagData tagData, Tag tag, String >> url) throws ParserException; >> Each scanner has code like: >> public Tag createTag(TagData tagData, CompositeTagData >> compositeTagData) throws ParserException >> { >> return new BulletList(tagData,compositeTagData); >> } >> With a 'Prototype' solution, the TagScanner class could implement: >> public Tag createTag(TagData tagData, CompositeTagData >> compositeTagData) throws ParserException >> { >> Tag tag = mBlastocyst.get (tagData.getTagName ()); >> if (null == tag) >> tag = new Tag (tagData, compositeTagData); // should use >> the NodeFactory >> else >> { >> tag = (Tag)tag.clone (); >> tag.setData (tagData, compositeTagData); >> } >> return (tag); >> } >> which would remove the need for each class to implement it. How would >> you remove the createTag() code from all the scanners without >> prototypes? > > > How would a prototype approach account for the stack in the following > code, from the OptionTagScanner: > > public Tag createTag( > TagData tagData, > CompositeTagData compositeTagData) { > if (!stack.empty () && (this == stack.peek ())) > stack.pop (); > return new OptionTag(tagData,compositeTagData); > } > > BTW, FormTagScanner has a similar stack. > > Believe it or not, Derrick, I like the Prototype pattern and have even > considered using it within StringNodeFactory - I didn't proceed > because I didn't find a genuine need. Now you've uncovered a possible > real need for Prototype in the parser -- I'm all for exploring it. I > just want to be clear about what we're doing. You say we could remove > a lot of duplicated code in the scanners - I can see lots of code that > creates specific tag instances and yes, Prototype can help make that > code go away. 
However the scanners also appear to do useful work > (like usage of a stack or implementing the evaluate method) and I'm > not seeing how that would easily transfer to the node/tag classes > without making those classes overly complex. > best regards, > jk > |
From: Joshua K. <jo...@in...> - 2003-10-06 18:56:49
|
Derrick Oswald wrote: > The parser can be a NodeFactory with just three additional methods. > It's still replaceable because the factory is set on the Lexer, i.e. > clients can still create and set their own NodeFactory, even using the > parser as a delegate for methods they don't want to handle. A major > benefit of interface design is to avoid spurious trivial classes. Let me see if I can understand your design. You want a user of the parser to first get access to the Lexar to then set which NodeFactory to use? I must be misunderstanding something. Most users of the parser shouldn't even know the Lexar exists, right? It's a low-level detail to average parser users. A NodeFactory encapsulates data and methods used in node/tag creation - nothing spurious or trivial about it. In fact, small classes (such as NodeFactory) which have one responsibility are easier to understand, extend and maintain. Furthermore, one method on the parser is all it takes to let parser users set a NodeFactory instance. On the other hand, the current implementation has three separate methods to handle node/tag creation. I dislike that design because: * it bloats the Parser interface, which is already heavily bloated with too many methods * it gives the Parser a new responsibility which it has no business having: node/tag creation * it adds code to an already fat Parser class that's overburdened with responsibilities. I'm sensing that you prefer to build and work with Large Classes. Is that correct? If so, are you aware that Large Class is a smell? See Refactoring: Improving the Design of Existing Code, by Martin Fowler. The chapter on smells was co-written by Kent Beck and Martin Fowler. > A node that's visitable has a signature: > void accept (NodeVisitor visitor) Yeah, I'm the guy who popularized the use of Visitors in the parser - remember? You were against their usage. Have you come around to the dark side? 
> By incorporating that signature, because the NodeVisitor class knows > about specific high level composite node types (why only Image, Link > and Title?), the low level Lexer jar file would have to drag in a > whole lot of other stuff. So currently the low level tags only > implement (vacuously): > void accept (Object visitor) > and then the high level Tag class thunks up to the more specific > signature with an up-cast. If NodeVisitor were to only handle base > types (String, Remark and Tag) this could be avoided. The fact that > the NodeVisitor class knows about ImageTag, LinkTag and TitleTag makes > it less useful in the presence of user supplied node types; but that's > it's inherent flaw. When I wrote NodeVisitor, I deliberately avoided making it aware of nodes or tags beyond StringNode, Tag and EndTag. The reason? To be able to visit other node/tag types, scanners must be registered and a Visitor, being separate from the whole scanner mechanism, cannot guarantee that a given scanner is registered. Over time, people started adding visitXYZ methods to the NodeVisitor interface, such as visitLink, etc. Was that necessary? I don't think so. If one needs information about Links, Images, etc., one doesn't need to use a Visitor. If we use reflection, we can likely make a NodeVisitor that could visit any node/tag type. That would perhaps be slow, since reflection is slow, but it would be a useful experiment. In addition, those who use a reflection-based Visitor may not care about speed. > Getting data into user supplied nodes is easy: each tag is presented > with the attributes and children found by the scanner, what else is > there? The current implementation does it the other way, each scanner > is the one that figures out the special data and then creates a new > specialized tag by some byzantine constructor taking arguments that > only it can understand. The tag is reduced to regurgitating the simple > strings it was given. 
Typical example; FrameScanner has > extractFrameLocn() and extractFrameName() which it passes into the > FrameTag constructor. Why not have FrameTag figure this stuff out? > > The TagScanner class is abstract, partly because of the signature: > protected abstract Tag createTag(TagData tagData, Tag tag, String > url) throws ParserException; > Each scanner has code like: > public Tag createTag(TagData tagData, CompositeTagData > compositeTagData) throws ParserException > { > return new BulletList(tagData,compositeTagData); > } > With a 'Prototype' solution, the TagScanner class could implement: > public Tag createTag(TagData tagData, CompositeTagData > compositeTagData) throws ParserException > { > Tag tag = mBlastocyst.get (tagData.getTagName ()); > if (null == tag) > tag = new Tag (tagData, compositeTagData); // should use > the NodeFactory > else > { > tag = (Tag)tag.clone (); > tag.setData (tagData, compositeTagData); > } > return (tag); > } > which would remove the need for each class to implement it. How would > you remove the createTag() code from all the scanners without prototypes? How would a prototype approach account for the stack in the following code, from the OptionTagScanner: public Tag createTag( TagData tagData, CompositeTagData compositeTagData) { if (!stack.empty () && (this == stack.peek ())) stack.pop (); return new OptionTag(tagData,compositeTagData); } BTW, FormTagScanner has a similar stack. Believe it or not, Derrick, I like the Prototype pattern and have even considered using it within StringNodeFactory - I didn't proceed because I didn't find a genuine need. Now you've uncovered a possible real need for Prototype in the parser -- I'm all for exploring it. I just want to be clear about what we're doing. You say we could remove a lot of duplicated code in the scanners - I can see lots of code that creates specific tag instances and yes, Prototype can help make that code go away. 
However the scanners also appear to do useful work (like usage of a stack or implementing the evaluate method) and I'm not seeing how that would easily transfer to the node/tag classes without making those classes overly complex. best regards, jk |
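The reflection-based visitor floated in the message above could look something like the sketch below. It is an experiment under assumptions, not the parser's NodeVisitor: the `Node`, `Tag`, `LinkTag` and `FrameTag` classes are minimal hypothetical stand-ins, and the `visit` + simple class name convention is an invented one. The dispatcher walks up the node's class hierarchy until it finds a matching handler, so a visitor only declares methods for the types it cares about and user-supplied tag types need no change to the visitor base class.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal node hierarchy used only for this sketch.
class Node {}

class Tag extends Node {
    final String name;

    Tag(String name) {
        this.name = name;
    }
}

class LinkTag extends Tag {
    final String href;

    LinkTag(String href) {
        super("A");
        this.href = href;
    }
}

class FrameTag extends Tag {
    FrameTag() {
        super("FRAME");
    }
}

abstract class ReflectiveVisitor {
    /**
     * Calls the most specific public visitXxx(Xxx) method found on this
     * visitor, trying the node's own class first and then its superclasses.
     * Nodes with no matching handler are silently ignored.
     */
    public final void dispatch(Node node) {
        for (Class<?> c = node.getClass();
                c != null && Node.class.isAssignableFrom(c);
                c = c.getSuperclass()) {
            try {
                Method handler = getClass().getMethod("visit" + c.getSimpleName(), c);
                handler.setAccessible(true); // visitor classes need not be public
                handler.invoke(this, node);
                return;
            } catch (NoSuchMethodException e) {
                // No handler for this type; fall through to the superclass.
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}

/** Example visitor: one handler for links, one catch-all for other tags. */
class LinkCollector extends ReflectiveVisitor {
    final List<String> seen = new ArrayList<>();

    public void visitLinkTag(LinkTag link) {
        seen.add("link:" + link.href);
    }

    public void visitTag(Tag tag) {
        seen.add("tag:" + tag.name);
    }
}
```

The method lookup on every dispatch is the cost Joshua anticipates; caching the resolved `Method` per node class would be the obvious next step if speed mattered.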
From: Derrick O. <Der...@Ro...> - 2003-10-06 10:24:25
|
was subject: Re: [Htmlparser-developer] RE: question about using HTMLParser in Apache JMeter

Joshua,

The parser can be a NodeFactory with just three additional methods. It's still replaceable because the factory is set on the Lexer, i.e. clients can still create and set their own NodeFactory, even using the parser as a delegate for methods they don't want to handle. A major benefit of interface design is to avoid spurious trivial classes.

A node that's visitable has a signature:
    void accept (NodeVisitor visitor)
By incorporating that signature, because the NodeVisitor class knows about specific high level composite node types (why only Image, Link and Title?), the low level Lexer jar file would have to drag in a whole lot of other stuff. So currently the low level tags only implement (vacuously):
    void accept (Object visitor)
and then the high level Tag class thunks up to the more specific signature with an up-cast. If NodeVisitor were to only handle base types (String, Remark and Tag) this could be avoided. The fact that the NodeVisitor class knows about ImageTag, LinkTag and TitleTag makes it less useful in the presence of user supplied node types; but that's its inherent flaw.

Getting data into user supplied nodes is easy: each tag is presented with the attributes and children found by the scanner, what else is there? The current implementation does it the other way, each scanner is the one that figures out the special data and then creates a new specialized tag by some byzantine constructor taking arguments that only it can understand. The tag is reduced to regurgitating the simple strings it was given. Typical example; FrameScanner has extractFrameLocn() and extractFrameName() which it passes into the FrameTag constructor. Why not have FrameTag figure this stuff out?

The TagScanner class is abstract, partly because of the signature:
    protected abstract Tag createTag(TagData tagData, Tag tag, String url) throws ParserException;
Each scanner has code like:
    public Tag createTag(TagData tagData, CompositeTagData compositeTagData) throws ParserException
    {
        return new BulletList(tagData,compositeTagData);
    }
With a 'Prototype' solution, the TagScanner class could implement:
    public Tag createTag(TagData tagData, CompositeTagData compositeTagData) throws ParserException
    {
        // look up the empty prototype registered for this tag name
        Tag tag = (Tag)mBlastocyst.get (tagData.getTagName ());
        if (null == tag)
            tag = new Tag (tagData, compositeTagData); // should use the NodeFactory
        else
        {
            tag = (Tag)tag.clone ();
            tag.setData (tagData, compositeTagData);
        }
        return (tag);
    }
which would remove the need for each class to implement it. How would you remove the createTag() code from all the scanners without prototypes?

The above is couched in current TagData format, but in reality it would be more like:
    tag = (Tag)tag.clone ();
    tag.setAttributes (attributes);
    tag.setChildren (children);

Derrick

Joshua Kerievsky wrote:
> Derrick Oswald wrote:
>
>> Yes. In the transition from using a straight Lexer to get basic
>> nodes (lexer.nodes package), to using the Parser to get nodes that
>> can be visited (htmlparser package), the Lexer needs to generate
>> nodes it was not compiled with. Hence the Parser replaces the Lexer
>> as the NodeFactory that the Lexer calls when it needs to create a Node.
>
> IMO, the NodeFactory is better off as its own object. The Parser can
> use a default instance of it. Clients can configure the Parser to use
> a specific NodeFactory. This is important for decorating nodes and
> tags. In addition, we don't want to give the Parser too many
> responsibilities, as it complicates its design.
>
> At present, we've made some choices about which tags are visitable -
> i.e. visitable nodes and tags are hard-coded into our NodeVisitor
> class. I'm not sure what you mean above when you write "using the
> Parser to get nodes that can be visited"?
>
>> I'm thinking this concept should be augmented in the Parser's
>> createTagNode to look up the name of the node (from the attribute
>> list provided), and create specific types of tags (FormTag, TableTag
>> etc.) by cloning empty tags from a Hashtable of possible tag types
>> (possibly called mBlastocyst in reference to undifferentiated stem
>> cells).
>
> Sounds like the Prototype pattern. The trouble with this approach is
> getting the right data into the node/tag. You can clone a tag that
> has no data, then you've got to get the right data into the tag. Since
> different tags have different data needs, it gets complicated. Have
> you considered these issues?
>
>> This would provide a concrete implementation of createTag in
>> CompositeTagScanner, removing a lot of near duplicate code from the
>> scanners, and allow end users to plug in their own tags via a call like
>> setTagFor ("BODY", new myBodyTag())
>> on the Parser. Details on interaction with the scanners have to be
>> worked out, but it seems the end user wouldn't have to replace the
>> scanner to get their own tags out.
>
> When you say "this would provide a concrete ...." I don't follow. Why
> is a Prototype-based createTagNode method a prerequisite for removing
> near duplicate code in the scanners? i.e. couldn't that be done
> regardless of whether a Prototype solution is used? What am I missing?
>
> best regards
> jk
> |
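The prototype registry discussed in the message above, with bean-like tags that are cloned and then handed their attributes and children, might be sketched as below. Everything here is hypothetical and named to make that obvious: `SketchTag`, `SketchFrameTag` and `TagPrototypes` are invented for illustration, attributes are simplified to a string map, and `setTagFor` mirrors only the call shape proposed in the thread, not an existing Parser method. The frame tag derives its own location from its attributes, which is the division of labour Derrick argues for.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical bean-like tag: zero-arg constructor plus accessors. */
class SketchTag implements Cloneable {
    private Map<String, String> attributes = new HashMap<>();
    private List<SketchTag> children = Collections.emptyList();

    public void setAttributes(Map<String, String> attributes) {
        this.attributes = new HashMap<>(attributes); // copy, so clones never share state
    }

    public void setChildren(List<SketchTag> children) {
        this.children = new ArrayList<>(children);
    }

    public String getAttribute(String name) {
        return attributes.get(name);
    }

    public List<SketchTag> getChildren() {
        return children;
    }

    @Override
    public SketchTag clone() {
        try {
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

/** A specialized tag works out its own data instead of a scanner doing it. */
class SketchFrameTag extends SketchTag {
    public String getFrameLocation() {
        return getAttribute("SRC");
    }
}

/** Registry of empty prototype tags, keyed by upper-cased tag name. */
class TagPrototypes {
    private final Map<String, SketchTag> blastocyst = new HashMap<>();

    /** Lets a client plug in its own tag class for a given tag name. */
    public void setTagFor(String name, SketchTag prototype) {
        blastocyst.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    /** Clones the registered prototype, or falls back to a generic tag. */
    public SketchTag createTag(String name, Map<String, String> attributes,
                               List<SketchTag> children) {
        SketchTag prototype = blastocyst.get(name.toUpperCase(Locale.ROOT));
        SketchTag tag = (prototype == null) ? new SketchTag() : prototype.clone();
        tag.setAttributes(attributes);
        tag.setChildren(children);
        return tag;
    }
}
```

A client would register `new SketchFrameTag()` under `"FRAME"` once and then receive frame tags from `createTag` without any per-tag scanner code. The sketch does not address scanner-held state such as the stack in OptionTagScanner, which is the open question raised in the replies.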
From: Joshua K. <jo...@in...> - 2003-10-06 04:39:42
|
Derrick Oswald wrote:

> Yes. In the transition from using a straight Lexer to get basic nodes
> (lexer.nodes package), to using the Parser to get nodes that can be
> visited (htmlparser package), the Lexer needs to generate nodes it was
> not compiled with. Hence the Parser replaces the Lexer as the
> NodeFactory that the Lexer calls when it needs to create a Node.

IMO, the NodeFactory is better off as its own object. The Parser can use a default instance of it. Clients can configure the Parser to use a specific NodeFactory. This is important for decorating nodes and tags. In addition, we don't want to give the Parser too many responsibilities, as that complicates its design.

At present, we've made some choices about which tags are visitable - i.e. visitable nodes and tags are hard-coded into our NodeVisitor class. I'm not sure what you mean above when you write "using the Parser to get nodes that can be visited"?

> I'm thinking this concept should be augmented in the Parser's
> createTagNode to look up the name of the node (from the attribute list
> provided), and create specific types of tags (FormTag, TableTag etc.)
> by cloning empty tags from a Hashtable of possible tag types (possibly
> called mBlastocyst in reference to undifferentiated stem cells).

Sounds like the Prototype pattern. The trouble with this approach is getting the right data into the node/tag. You can clone a tag that has no data, but then you've got to get the right data into the tag. Since different tags have different data needs, it gets complicated. Have you considered these issues?

> This would provide a concrete implementation of createTag in
> CompositeTagScanner, removing a lot of near duplicate code from the
> scanners, and allow end users to plug in their own tags via a call like
>     setTagFor ("BODY", new myBodyTag())
> on the Parser. Details on interaction with the scanners have to be
> worked out, but it seems the end user wouldn't have to replace the
> scanner to get their own tags out.

When you say "this would provide a concrete ...." I don't follow. Why is a Prototype-based createTagNode method a prerequisite for removing near duplicate code in the scanners? i.e. couldn't that be done regardless of whether a Prototype solution is used? What am I missing?

best regards
jk
|
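For readers following the thread, here is a minimal sketch of the arrangement Joshua argues for above: the factory is its own object, the parser starts with a default instance, and a client can swap in a factory that decorates the nodes it creates. Every name in it (SketchTextNode, SketchNodeFactory, SketchParser, setNodeFactory and so on) is hypothetical and chosen for illustration only; none of it is taken from the htmlparser sources.

```java
// Hypothetical names throughout; this is an illustration, not the htmlparser API.

/** Minimal stand-in for a text node. */
interface SketchTextNode {
    String getText();
}

/** The undecorated node. */
class PlainTextNode implements SketchTextNode {
    private final String text;
    PlainTextNode(String text) { this.text = text; }
    public String getText() { return text; }
}

/** A decorator: wraps another node and trims its text on the way out. */
class TrimmedTextNode implements SketchTextNode {
    private final SketchTextNode inner;
    TrimmedTextNode(SketchTextNode inner) { this.inner = inner; }
    public String getText() { return inner.getText().trim(); }
}

/** The factory is its own object rather than a role played by the parser. */
interface SketchNodeFactory {
    SketchTextNode createTextNode(String text);
}

/** Default behaviour: hand back a plain node. */
class DefaultSketchNodeFactory implements SketchNodeFactory {
    public SketchTextNode createTextNode(String text) {
        return new PlainTextNode(text);
    }
}

/** Client-supplied factory that decorates whatever the default one creates. */
class TrimmingNodeFactory implements SketchNodeFactory {
    private final SketchNodeFactory delegate = new DefaultSketchNodeFactory();
    public SketchTextNode createTextNode(String text) {
        return new TrimmedTextNode(delegate.createTextNode(text));
    }
}

/** The parser holds a factory, defaults it, and lets clients replace it. */
class SketchParser {
    private SketchNodeFactory factory = new DefaultSketchNodeFactory();

    void setNodeFactory(SketchNodeFactory factory) { this.factory = factory; }

    /** Stands in for the point where real parsing would need a new node. */
    SketchTextNode textNode(String raw) { return factory.createTextNode(raw); }
}

class NodeFactoryDemo {
    public static void main(String[] args) {
        SketchParser parser = new SketchParser();
        System.out.println("[" + parser.textNode("  hi  ").getText() + "]"); // prints "[  hi  ]"

        parser.setNodeFactory(new TrimmingNodeFactory());
        System.out.println("[" + parser.textNode("  hi  ").getText() + "]"); // prints "[hi]"
    }
}
```

The parser in this sketch knows only the factory interface, so decorating nodes needs no change to the parser itself.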
From: Derrick O. <Der...@Ro...> - 2003-10-06 02:11:44
|
I've fixed the easily fixed tests now. The remaining 40 or so indicate changed functionality that needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if Tags were more bean-like and had zero args constructors with appropriate accessors.

Helpers
-------
I desperately want to get rid of the two remaining 'helper' classes. They are just obfuscating the code. The CompositeTagScannerHelper is close to being folded back into the CompositeTagScanner. It just needs some more untangling.

AbstractNode
------------
Drop org.htmlparser.lexer.nodes.AbstractNode, fold functionality into org.htmlparser.AbstractNode.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like
    setTagFor ("BODY", new myBodyTag())
on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality. Examples:
    testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
    testInvertedCommas - <tag attribute = whatever> used to be acceptable
    testEmptyComment - <!--> was considered a valid remark node
Each needs to be examined, a decision on the 'correct' behaviour made, and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list and committing and notifying the developer list after each of them. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
|
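The "Node Factory" item above can be pictured with a small sketch, given here under stated assumptions: tags are looked up by name in a table of empty prototypes and cloned, and an end user registers a replacement with a setTagFor-style call. The class names (ProtoTag, MyBodyTag, ProtoTagRegistry) are hypothetical, a HashMap stands in for the Hashtable mentioned in the message, and the interaction with the scanners, which the message leaves open, is not modelled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not the htmlparser API.

/** A tag that can be cloned while empty and filled in afterwards. */
class ProtoTag implements Cloneable {
    private String name = "";

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    @Override
    public ProtoTag clone() {
        try {
            // Object.clone() preserves the runtime class, so subclasses clone as themselves.
            return (ProtoTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class implements Cloneable
        }
    }
}

/** An end-user tag type plugged in for BODY. */
class MyBodyTag extends ProtoTag { }

/** Table of empty prototype tags keyed by upper-cased tag name. */
class ProtoTagRegistry {
    private final Map<String, ProtoTag> prototypes = new HashMap<String, ProtoTag>();

    /** Register the prototype to clone whenever a tag with this name is needed. */
    void setTagFor(String name, ProtoTag prototype) {
        prototypes.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    /** Clone the registered prototype, or fall back to a generic tag. */
    ProtoTag createTagNode(String name) {
        String key = name.toUpperCase(Locale.ROOT);
        ProtoTag prototype = prototypes.get(key);
        ProtoTag tag = (prototype == null) ? new ProtoTag() : prototype.clone();
        tag.setName(key); // the data still has to be set after cloning
        return tag;
    }
}

class PrototypeDemo {
    public static void main(String[] args) {
        ProtoTagRegistry registry = new ProtoTagRegistry();
        registry.setTagFor("BODY", new MyBodyTag());

        System.out.println(registry.createTagNode("body") instanceof MyBodyTag); // prints "true"
        System.out.println(registry.createTagNode("p") instanceof MyBodyTag);    // prints "false"
    }
}
```

Cloning yields only an empty tag of the right class. Whatever data a given tag type needs still has to be supplied afterwards through accessors, which this sketch reduces to a single setName call.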
From: Derrick O. <Der...@Ro...> - 2003-10-05 14:00:45
|
Made progress on nearly all the TODO items. The tasks aren't as separable as I thought. There are still 133 failing tests. I'll make a stab at the easy ones next.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reduce the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if Tags were more bean-like and had zero args constructors with appropriate accessors.

Helpers
-------
I desperately want to get rid of the two remaining 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode should look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like
    setTagFor ("BODY", new myBodyTag())
on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Scanners
--------
The script scanner has been replaced. It can be considered as a first pass at what needs to be done to replace the generic CompositeTagScanner. The use of the underlying lexer makes these specialty scanners much easier.

Unit Tests
----------
Many of the unit tests expect toHtml() to produce capitalized and rearranged output. And parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the JUnit errors list and committing and notifying the developer list after each of them. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt driven notification rather than polled notification).
|
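As a rough picture of what "more bean-like" in the TagData item above could look like, here is a minimal sketch of a tag with a zero-argument constructor and plain accessors. The class name BeanTag and its three properties are assumptions made up for the example; the fields a real tag would carry are not specified in the message.

```java
// Hypothetical class and properties; this is an illustration, not the htmlparser API.

/** Bean-style tag: no-argument constructor, state supplied through accessors. */
class BeanTag {
    private String name = "";
    private int startPosition;
    private int endPosition;

    BeanTag() { }

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    int getStartPosition() { return startPosition; }
    void setStartPosition(int startPosition) { this.startPosition = startPosition; }

    int getEndPosition() { return endPosition; }
    void setEndPosition(int endPosition) { this.endPosition = endPosition; }
}

class BeanTagDemo {
    public static void main(String[] args) {
        // Each value is named at the call site instead of being one of many
        // positional constructor arguments bundled into a separate data object.
        BeanTag tag = new BeanTag();
        tag.setName("TITLE");
        tag.setStartPosition(0);
        tag.setEndPosition(7);
        System.out.println(tag.getName() + " " + tag.getStartPosition() + "-" + tag.getEndPosition()); // prints "TITLE 0-7"
    }
}
```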
From: Derrick O. <Der...@Ro...> - 2003-10-02 11:44:50
|
Joshua,

Yes. In the transition from using a straight Lexer to get basic nodes (lexer.nodes package), to using the Parser to get nodes that can be visited (htmlparser package), the Lexer needs to generate nodes it was not compiled with. Hence the Parser replaces the Lexer as the NodeFactory that the Lexer calls when it needs to create a Node.

I'm thinking this concept should be augmented in the Parser's createTagNode to look up the name of the node (from the attribute list provided), and create specific types of tags (FormTag, TableTag etc.) by cloning empty tags from a Hashtable of possible tag types (possibly called mBlastocyst in reference to undifferentiated stem cells). This would provide a concrete implementation of createTag in CompositeTagScanner, removing a lot of near duplicate code from the scanners, and allow end users to plug in their own tags via a call like
    setTagFor ("BODY", new myBodyTag())
on the Parser. Details on interaction with the scanners have to be worked out, but it seems the end user wouldn't have to replace the scanner to get their own tags out.

Derrick

Joshua Kerievsky wrote:

> Derrick Oswald wrote:
>
>> The StringNodeFactory you added is currently sidelined by the more
>> generic NodeFactory. It would be easy to add it back in.
>> Derrick
>
> I deliberately added StringNodeFactory to the parser, not a generic
> NodeFactory, because I had no need for a generic NodeFactory. Have
> you found a need for a generic NodeFactory? --jk
|
From: Joshua K. <jo...@in...> - 2003-10-02 04:27:06
|
Derrick Oswald wrote:

> The StringNodeFactory you added is currently sidelined by the more
> generic NodeFactory. It would be easy to add it back in.
> Derrick

I deliberately added StringNodeFactory to the parser, not a generic NodeFactory, because I had no need for a generic NodeFactory. Have you found a need for a generic NodeFactory?

--jk
|
From: Derrick O. <Der...@Ro...> - 2003-10-02 02:13:51
|
The StringNodeFactory you added is currently sidelined by the more generic NodeFactory. It would be easy to add it back in.

Derrick

Joshua Kerievsky wrote:

> Derrick Oswald wrote:
>
>> Are there any opinions regarding Peter Lin's proposal to make
>> htmlparser an official Jakarta project?
>
> Sounds like a great idea.
>
> BTW, I had integrated the NodeFactory into the code a while back. It
> allows one to add decorators to things like StringNodes. I haven't
> had time to look at the latest code -- does it still retain that
> feature with the introduction of the lexer?
>
> thanks
> jk
|
From: Joshua K. <jo...@in...> - 2003-10-01 19:45:21
|
Derrick Oswald wrote:

> Are there any opinions regarding Peter Lin's proposal to make htmlparser
> an official Jakarta project?

It occurred to me to mention that Somik is on vacation in India -- he'll be back mid-October. He may check email so I'll mention this to him.

--jk

--
I n d u s t r i a l L o g i c , I n c .
Joshua Kerievsky
Founder, Extreme Programmer & Coach
http://industriallogic.com
http://industrialxp.org
866-540-8336 (toll free)
510-540-8336 (phone)
Berkeley, California
|
From: Joshua K. <jo...@in...> - 2003-10-01 19:32:10
|
Derrick Oswald wrote:

> Are there any opinions regarding Peter Lin's proposal to make htmlparser
> an official Jakarta project?

Sounds like a great idea.

BTW, I had integrated the NodeFactory into the code a while back. It allows one to add decorators to things like StringNodes. I haven't had time to look at the latest code -- does it still retain that feature with the introduction of the lexer?

thanks
jk

--
I n d u s t r i a l L o g i c , I n c .
Joshua Kerievsky
Founder, Extreme Programmer & Coach
http://industriallogic.com
http://industrialxp.org
866-540-8336 (toll free)
510-540-8336 (phone)
Berkeley, California
|