Re: [Htmlparser-developer] factories and prototypes

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Derrick Oswald wrote:
> As it stands the NodeFactory is set automatically by the Lexer or Parser 
> to itself. It's only if someone wants to *change* the node classes being 
> returned that they would need to access the Lexer and set the 
> NodeFactory property:
>    parser.getLexer ().setNodeFactory (myfactory);
> Not something for the casual user. 

*Changing* the node classes being returned is PRECISELY why the 
NodeFactory must (and will) be accessible.

Today, users of the parser must write their own code to remove white 
spaces, remove escape characters, decode strings.  That's no longer 
necessary.  You can now configure the parser to remove white spaces, 
escape chars, etc., by means of node decorators.  Those decorators are 
currently configured within StringNodeFactory, which is going to become 
NodeFactory.   Decorating non-StringNodes, such as RemarkNodes, is 
already possible.

The easiest way to popularize this use of decorators in the parser is to 
give parser users access to the NodeFactory and make it easy for them to 
configure it, or subclass it, as they like.

> The parser class is large. But, of the forty or so methods it has, a 
> quarter of them are dealing with the scanner list and a quarter of them 
> are convenience pass through methods to the lexer. Don't think large or 
> small, think useful. Let me understand your position; the parser 
> shouldn't be doing node/tag creation? What is a parser then? A shell for 
> a gaggle of deus ex machina pulling the levers behind the curtain? No, 
> it's not overburdened, it's just doing it's job. All the rest of the 
> classes are spurious artefacts.

See the above for why the parser should delegate node creation.

> To end users, the smell is in memory inefficient, slow programs; they 
> really couldn't care less what's under the hood. Every object created 
> has an overhead in time and memory, so the fewer the better.

Do you use a profiler Derrick?  Because the above utterance is something 
I'd expect to hear from an inexperienced programmer.  Or maybe an old C 
programmer.  Hey, I programmed in C once.  It's been a long time since I 
programmed in C.  But not long enough.

> To 
> programmers though, to which we cater, the design has to be clean and 
> simple. Two classes, where one would do, is not necessarily clean or 
> simple. In fact I'm thinking that each tag should be it's own scanner, 
> folding the whole scanner tree into the tag tree. What better object to 
> understand how to parse it than the tag itself. This would mean the 
> prototype list *is* the scanner list, and the parser doesn't have to get 
> larger. Currently, changing the code in two places means extra effort. 
> Not keeping the two in sync can lead to bugs. Currently, most of the 
> scanners are just baskets to hold the MATCH_NAME, ENDERS and 
> END_TAG_ENDERS lists. Shouldn't the tag be the one responsible for 
> knowing it's own name, terminators and place in the dtd? Larger classes 
> can mean easier maintenance, if what they are replacing is a plethora of 
> trivial interrelated classes.

Lazy Class is another smell in Refactoring -- it refers to a class that 
isn't pulling its own weight.  We "inline" classes when they don't pull 
their own weight.

When I joined this project -- less than a year ago -- the scanners were 
a mess.  Tons of duplicate code.  Inlining these scanners into the tags 
at that point would've been foolish, as it would've bloated the tags.
So Somik and I began removing duplication from the scanners.  Sometimes 
when you remove duplication, you get to the point where you see that the 
classes are no longer necessary -- they aren't doing enough to justify 
their existence.  I believe we've now reached that point with the 
scanners, so I support the inlining of them into the tags.

> And don't get me going about the 'refactoring' that spawned the 
> 'helpers'. All the overhead of creating a new CompositeTagScannerHelper 
> for each node scanned is horrendous -- all to avoid a 'large' 
> CompositeTagScanner class. Sorry, I heartily disagree with making 
> classes smaller in an attempt to avoid a smell, when the smell only 
> shows up when it's 'refactored'.

Most of the code in the parser pre-dates my appearance on this project, 
so I know not how/why the helpers got added.

What attracked me to this project was the messy, unrefactored code, 
which happened to have pretty good test coverage.   This project is ripe 
for all sorts of refactorings, which the tests make a whole lot easier 
to do.

Without tests, it's hard to refactor.  I remember looking at your 
StringBean class before it was made into a Visitor.  I had just added 
Visitor to the parser and was using it for useful things.  I looked at 
StringBean and say "uhhhhhhh, now that's crying out to be a Visitor." 
Yet I didn't make it a Visitor.  Why?  Because it had no tests. 
Without tests, refactoring is hard.

Creating tests for code is a no-brainer if you practice Test-Driven 
Development (TDD).  Derrick, do you practice TDD?

> The stack example in the option class should be handled by the ENDERS 
> list. Running into a new <OPTION> while parsing the previous one, should 
> close the original and open another. In this case, using a stack to 
> track the recursion is overkill, I think, and could be handled in a more 
> straight forward manner.

I believe there are good tests for that recursion, so it shouldn't be 
hard to come up with other ways to do it.  BTW, when will the tests be 
green again?  I can't do much when they are running red.

--jk