Re: [Htmlparser-developer] factories and prototypes

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Joshua,

As it stands the NodeFactory is set automatically by the Lexer or Parser 
to itself. It's only if someone wants to *change* the node classes being 
returned that they would need to access the Lexer and set the 
NodeFactory property:
    parser.getLexer ().setNodeFactory (myfactory);
Not something for the casual user. True, most users won't see the Lexer. 
But if their needs are fast linear lightweight access, they would use 
just the Lexer, and ignore the parser (see the Thumbelina lexer 
application for example). Then the node factory accessor is on the 
primary object. It's only when you add another level to the parsing that 
the accessor becomes indirect.

The parser class is large. But, of the forty or so methods it has, a 
quarter of them are dealing with the scanner list and a quarter of them 
are convenience pass through methods to the lexer. Don't think large or 
small, think useful. Let me understand your position; the parser 
shouldn't be doing node/tag creation? What is a parser then? A shell for 
a gaggle of deus ex machina pulling the levers behind the curtain? No, 
it's not overburdened, it's just doing it's job. All the rest of the 
classes are spurious artefacts.

To end users, the smell is in memory inefficient, slow programs; they 
really couldn't care less what's under the hood. Every object created 
has an overhead in time and memory, so the fewer the better. To 
programmers though, to which we cater, the design has to be clean and 
simple. Two classes, where one would do, is not necessarily clean or 
simple. In fact I'm thinking that each tag should be it's own scanner, 
folding the whole scanner tree into the tag tree. What better object to 
understand how to parse it than the tag itself. This would mean the 
prototype list *is* the scanner list, and the parser doesn't have to get 
larger. Currently, changing the code in two places means extra effort. 
Not keeping the two in sync can lead to bugs. Currently, most of the 
scanners are just baskets to hold the MATCH_NAME, ENDERS and 
END_TAG_ENDERS lists. Shouldn't the tag be the one responsible for 
knowing it's own name, terminators and place in the dtd? Larger classes 
can mean easier maintenance, if what they are replacing is a plethora of 
trivial interrelated classes.

And don't get me going about the 'refactoring' that spawned the 
'helpers'. All the overhead of creating a new CompositeTagScannerHelper 
for each node scanned is horrendous -- all to avoid a 'large' 
CompositeTagScanner class. Sorry, I heartily disagree with making 
classes smaller in an attempt to avoid a smell, when the smell only 
shows up when it's 'refactored'.

The stack example in the option class should be handled by the ENDERS 
list. Running into a new <OPTION> while parsing the previous one, should 
close the original and open another. In this case, using a stack to 
track the recursion is overkill, I think, and could be handled in a more 
straight forward manner.
//
Derrick

Joshua Kerievsky wrote:

> Derrick Oswald wrote:
>
>> The parser can be a NodeFactory with just three additional methods. 
>> It's still replaceable because the factory is set on the Lexer, i.e. 
>> clients can still create and set their own NodeFactory, even using 
>> the parser as a delegate for methods they don't want to handle. A 
>> major benefit of interface design is to avoid spurious trivial classes.
>
>
> Let me see if I can understand your design.  You want a user of the 
> parser to first get access to the Lexar to then set which NodeFactory 
> to use?   I must be misunderstanding something.  Most users of the 
> parser shouldn't even know the Lexar exists, right?  It's a low-level 
> detail to average parser users.
> A NodeFactory encapsulates data and methods used in node/tag creation 
> - nothing spurious or trivial about it.  In fact, small classes (such 
> as NodeFactory) which have one responsibility are easier to 
> understand, extend and maintain.  Furthermore, one method on the 
> parser is all it takes to let parser users set a NodeFactory 
> instance.  On the other hand, the current implementation has three 
> separate methods to handle node/tag creation.  I dislike that design 
> because:
>
> * it bloats the Parser interface, which is already heavily bloated 
> with too many methods
> * it gives the Parser a new responsibility which it has no business 
> having: node/tag creation
> * it adds code to an already fat Parser class that's overburdened with 
> responsibilities.
>
> I'm sensing that you prefer to build and work with Large Classes.  Is 
> that correct?   If so, are you aware that Large Class is a smell?   
> See Refactoring: Improving the Design of Existing Code, by Martin 
> Fowler. The chapter on smells was co-written by Kent Beck and Martin 
> Fowler.
>
>> A node that's visitable has a signature:
>>    void accept (NodeVisitor visitor) 
>
>
> Yeah, I'm the guy who popularized the use of Visitors in the parser - 
> remember?   You were against their usage.  Have you come around to the 
> dark side?
>
>> By incorporating that signature, because the NodeVisitor class knows 
>> about specific high level composite node types (why only Image, Link 
>> and Title?), the low level Lexer jar file would have to drag in a 
>> whole lot of other stuff. So currently the low level tags only 
>> implement (vacuously):
>>    void accept (Object visitor)
>> and then the high level Tag class thunks up to the more specific 
>> signature with an up-cast. If NodeVisitor were to only handle base 
>> types (String, Remark and Tag) this could be avoided. The fact that 
>> the NodeVisitor class knows about ImageTag, LinkTag and TitleTag 
>> makes it less useful in the presence of user supplied node types; but 
>> that's it's inherent flaw. 
>
>
> When I wrote NodeVisitor, I deliberately avoided making it aware of 
> nodes or tags beyond StringNode, Tag and EndTag.  The reason?  To be 
> able to visit other node/tag types, scanners must be registered and a 
> Visitor, being separate from the whole scanner mechanism, cannot 
> guarantee that  a given scanner is registered.
> Over time, people started adding visitXYZ methods to the NodeVisitor 
> interface, such as visitLink, etc.  Was that necessary?  I don't think 
> so.  If one needs information about Links, Images, etc., one doesn't 
> need to use a Visitor.
>
> If we use reflection, we can likely make a NodeVisitor that could 
> visit any node/tag type.  That would perhaps be slow, since reflection 
> is slow, but it would be a useful experiment.  In addition, those who 
> use a reflection-based Visitor may not care about speed.
>
>> Getting data into user supplied nodes is easy: each tag is presented 
>> with the attributes and children found by the scanner, what else is 
>> there? The current implementation does it the other way, each scanner 
>> is the one that figures out the special data and then creates a new 
>> specialized tag by some byzantine constructor taking arguments that 
>> only it can understand. The tag is reduced to regurgitating the 
>> simple strings it was given. Typical example; FrameScanner has 
>> extractFrameLocn() and extractFrameName() which it passes into the 
>> FrameTag constructor. Why not have FrameTag figure this stuff out?
>>
>> The TagScanner class is abstract, partly because of the signature:
>>   protected abstract Tag createTag(TagData tagData, Tag tag, String 
>> url) throws ParserException;
>> Each scanner has code like:
>>    public Tag createTag(TagData tagData, CompositeTagData 
>> compositeTagData) throws ParserException
>>    {
>>        return new BulletList(tagData,compositeTagData);
>>    }
>> With a 'Prototype' solution, the TagScanner class could implement:
>>    public Tag createTag(TagData tagData, CompositeTagData 
>> compositeTagData) throws ParserException
>>    {
>>        Tag tag = mBlastocyst.get (tagData.getTagName ());
>>        if (null == tag)
>>            tag = new Tag (tagData, compositeTagData); // should use 
>> the NodeFactory
>>        else
>>        {
>>            tag = (Tag)tag.clone ();
>>            tag.setData (tagData, compositeTagData);
>>        }
>>        return (tag);
>>    }
>> which would remove the need for each class to implement it. How would 
>> you remove the createTag() code from all the scanners without 
>> prototypes? 
>
>
> How would a prototype approach account for the stack in the following 
> code, from the OptionTagScanner:
>
>    public Tag createTag(
>        TagData tagData,
>        CompositeTagData compositeTagData) {
>        if (!stack.empty () && (this == stack.peek ()))
>            stack.pop ();
>        return new OptionTag(tagData,compositeTagData);
>    }
>
> BTW, FormTagScanner has a similar stack.
>
> Believe it or not, Derrick, I like the Prototype pattern and have even 
> considered using it within StringNodeFactory - I didn't proceed 
> because I didn't find a genuine need.  Now you've uncovered a possible 
> real need for Prototype in the parser -- I'm all for exploring it.  I 
> just want to be clear about what we're doing.  You say we could remove 
> a lot of duplicated code in the scanners - I can see lots of code that 
> creates specific tag instances and yes, Prototype can help make that 
> code go away.  However the scanners also appear to do useful work 
> (like usage of a stack or implementing the evaluate method) and I'm 
> not seeing how that would easily transfer to the node/tag classes 
> without making those classes overly complex. 
> best regards,
> jk
>