Re: [Htmlparser-developer] factories and prototypes

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Joshua,

The NodeFactory is an interface, you don't subclass an interface, you 
implement it. People can write their own three method class to do that, 
or like I said, delegate to the parser for things they don't want to 
handle. But delegation comes at the cost of time and memory. The current 
string decorator architecture, for example, delegates through three 
wrapping objects (with associated memory overhead) to do the job the 
StringNode should do in the first place. If one knows a priori that 
whitespace removal, reference translation and so on are useful 
functions, build them into the StringNode. Provide the factory mechanism 
only for things you can't anticipate in advance. Leaving the decorator 
code as a caboose grafted on to the base implementation is only 
acceptable if it's a transient fix.

I use profilers all the time. And memory analysis. And javap to examine 
the byte code. That's why the lexer code runs 60% faster than the old 
NodeReader code and uses a quarter of the memory. Are you saying the 
creation of objects doesn't take time and memory? Maybe you should run a 
profiler comparing the 'decorative wrapping' approach with a 'built in 
functionality' approach. Or better yet, just step through it (and I mean 
step *in* to every method) in a debugger and see where all the time is 
spent allocating and wandering between objects without getting any of 
the work done, and repeatedly copying out strings. It's a real 
eye-opener for someone who only works at the 30,000 foot level.

Yes, I'm a big advocate of test driven development. The bean classes 
were dropped with their tests - BeanTest, so I don't see why you 
couldn't refactor it. The lexer package has it's tests. There are 
currently 35 outstanding failures out of 448 and I'm currently trying to 
re-integrate the test cases that got shunted off to the 
temporaryFailures package -- because someone wanted a green bar. You can 
still do test driven development without a green bar by monitoring the 
number of failures...
...like a doctor, you just ensure you 'do no harm'.

Derrick

Joshua Kerievsky wrote:

> Derrick Oswald wrote:
>
>> As it stands the NodeFactory is set automatically by the Lexer or 
>> Parser to itself. It's only if someone wants to *change* the node 
>> classes being returned that they would need to access the Lexer and 
>> set the NodeFactory property:
>>    parser.getLexer ().setNodeFactory (myfactory);
>> Not something for the casual user. 
>
>
> *Changing* the node classes being returned is PRECISELY why the 
> NodeFactory must (and will) be accessible.
>
> Today, users of the parser must write their own code to remove white 
> spaces, remove escape characters, decode strings.  That's no longer 
> necessary.  You can now configure the parser to remove white spaces, 
> escape chars, etc., by means of node decorators.  Those decorators are 
> currently configured within StringNodeFactory, which is going to 
> become NodeFactory.   Decorating non-StringNodes, such as RemarkNodes, 
> is already possible.
>
> The easiest way to popularize this use of decorators in the parser is 
> to give parser users access to the NodeFactory and make it easy for 
> them to configure it, or subclass it, as they like.
>
>> The parser class is large. But, of the forty or so methods it has, a 
>> quarter of them are dealing with the scanner list and a quarter of 
>> them are convenience pass through methods to the lexer. Don't think 
>> large or small, think useful. Let me understand your position; the 
>> parser shouldn't be doing node/tag creation? What is a parser then? A 
>> shell for a gaggle of deus ex machina pulling the levers behind the 
>> curtain? No, it's not overburdened, it's just doing it's job. All the 
>> rest of the classes are spurious artefacts.
>
>
> See the above for why the parser should delegate node creation.
>
>> To end users, the smell is in memory inefficient, slow programs; they 
>> really couldn't care less what's under the hood. Every object created 
>> has an overhead in time and memory, so the fewer the better.
>
>
> Do you use a profiler Derrick?  Because the above utterance is 
> something I'd expect to hear from an inexperienced programmer.  Or 
> maybe an old C programmer.  Hey, I programmed in C once.  It's been a 
> long time since I programmed in C.  But not long enough.
>
>> To programmers though, to which we cater, the design has to be clean 
>> and simple. Two classes, where one would do, is not necessarily clean 
>> or simple. In fact I'm thinking that each tag should be it's own 
>> scanner, folding the whole scanner tree into the tag tree. What 
>> better object to understand how to parse it than the tag itself. This 
>> would mean the prototype list *is* the scanner list, and the parser 
>> doesn't have to get larger. Currently, changing the code in two 
>> places means extra effort. Not keeping the two in sync can lead to 
>> bugs. Currently, most of the scanners are just baskets to hold the 
>> MATCH_NAME, ENDERS and END_TAG_ENDERS lists. Shouldn't the tag be the 
>> one responsible for knowing it's own name, terminators and place in 
>> the dtd? Larger classes can mean easier maintenance, if what they are 
>> replacing is a plethora of trivial interrelated classes.
>
>
> Lazy Class is another smell in Refactoring -- it refers to a class 
> that isn't pulling its own weight.  We "inline" classes when they 
> don't pull their own weight.
>
> When I joined this project -- less than a year ago -- the scanners 
> were a mess.  Tons of duplicate code.  Inlining these scanners into 
> the tags at that point would've been foolish, as it would've bloated 
> the tags.
> So Somik and I began removing duplication from the scanners.  
> Sometimes when you remove duplication, you get to the point where you 
> see that the classes are no longer necessary -- they aren't doing 
> enough to justify their existence.  I believe we've now reached that 
> point with the scanners, so I support the inlining of them into the tags.
>
>> And don't get me going about the 'refactoring' that spawned the 
>> 'helpers'. All the overhead of creating a new 
>> CompositeTagScannerHelper for each node scanned is horrendous -- all 
>> to avoid a 'large' CompositeTagScanner class. Sorry, I heartily 
>> disagree with making classes smaller in an attempt to avoid a smell, 
>> when the smell only shows up when it's 'refactored'.
>
>
> Most of the code in the parser pre-dates my appearance on this 
> project, so I know not how/why the helpers got added.
>
> What attracked me to this project was the messy, unrefactored code, 
> which happened to have pretty good test coverage.   This project is 
> ripe for all sorts of refactorings, which the tests make a whole lot 
> easier to do.
>
> Without tests, it's hard to refactor.  I remember looking at your 
> StringBean class before it was made into a Visitor.  I had just added 
> Visitor to the parser and was using it for useful things.  I looked at 
> StringBean and say "uhhhhhhh, now that's crying out to be a Visitor." 
> Yet I didn't make it a Visitor.  Why?  Because it had no tests. 
> Without tests, refactoring is hard.
>
> Creating tests for code is a no-brainer if you practice Test-Driven 
> Development (TDD).  Derrick, do you practice TDD?
>
>> The stack example in the option class should be handled by the ENDERS 
>> list. Running into a new <OPTION> while parsing the previous one, 
>> should close the original and open another. In this case, using a 
>> stack to track the recursion is overkill, I think, and could be 
>> handled in a more straight forward manner.
>
>
> I believe there are good tests for that recursion, so it shouldn't be 
> hard to come up with other ways to do it.  BTW, when will the tests be 
> green again?  I can't do much when they are running red.
>
> --jk
>