[Htmlparser-developer] factories and prototypes


was subject: Re: [Htmlparser-developer] RE: question about using 
HTMLParser in Apache JMeter
Joshua,

The parser can be a NodeFactory with just three additional methods. It's 
still replaceable because the factory is set on the Lexer, i.e. clients 
can still create and set their own NodeFactory, even using the parser as 
a delegate for methods they don't want to handle. A major benefit of 
interface design is to avoid spurious trivial classes.
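
A rough sketch of that delegation arrangement might look like the following. Only createTagNode is named in this thread; the other two method names, the one-argument signatures, and every class name here are assumptions made to keep the example self-contained, not the library's actual API.

```java
// Hypothetical, simplified stand-ins: the real factory methods take
// different arguments; these one-argument forms are assumptions.
interface SketchNode {
    String describe();
}

interface SketchNodeFactory {
    SketchNode createStringNode(String text);
    SketchNode createRemarkNode(String text);
    SketchNode createTagNode(String name);
}

// Plays the role of the parser acting as the default factory.
class ParserLikeFactory implements SketchNodeFactory {
    public SketchNode createStringNode(String text) { return () -> "String:" + text; }
    public SketchNode createRemarkNode(String text) { return () -> "Remark:" + text; }
    public SketchNode createTagNode(String name) { return () -> "Tag:" + name; }
}

// A client factory that only cares about tags and hands the other
// two methods to its delegate.
class CustomTagFactory implements SketchNodeFactory {
    private final SketchNodeFactory delegate;

    CustomTagFactory(SketchNodeFactory delegate) { this.delegate = delegate; }

    public SketchNode createStringNode(String text) { return delegate.createStringNode(text); }
    public SketchNode createRemarkNode(String text) { return delegate.createRemarkNode(text); }
    public SketchNode createTagNode(String name) { return () -> "CustomTag:" + name; }
}

class DelegationDemo {
    public static void main(String[] args) {
        SketchNodeFactory factory = new CustomTagFactory(new ParserLikeFactory());
        System.out.println(factory.createStringNode("hello").describe());
        System.out.println(factory.createTagNode("BODY").describe());
    }
}
```

The client writes one class with one interesting method; no separate default-factory class is needed because the parser-like object already fills that role.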

A node that's visitable has a signature:
    void accept (NodeVisitor visitor)
By incorporating that signature, because the NodeVisitor class knows 
about specific high level composite node types (why only Image, Link and 
Title?), the low level Lexer jar file would have to drag in a whole lot 
of other stuff. So currently the low level tags only implement (vacuously):
    void accept (Object visitor)
and then the high level Tag class thunks up to the more specific 
signature with a down-cast from Object to NodeVisitor. If NodeVisitor 
were to only handle base types (String, Remark and Tag) this could be 
avoided. The fact that the NodeVisitor class knows about ImageTag, 
LinkTag and TitleTag makes it less useful in the presence of 
user-supplied node types; but that's its inherent flaw.
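
The thunk described above can be sketched roughly as follows. The class names are hypothetical stand-ins for the low level and high level tag classes, and the instanceof guard is an addition of this sketch rather than something the current code is known to do.

```java
import java.util.ArrayList;
import java.util.List;

// Low level node: compiled without any knowledge of visitor classes.
class LexerTag {
    private final String name;

    LexerTag(String name) { this.name = name; }

    String getName() { return name; }

    // Vacuous implementation: nothing to dispatch to at this level.
    public void accept(Object visitor) { }
}

// High level visitor that only knows about a base tag type.
abstract class SketchVisitor {
    public void visitTag(ParserTag tag) { }
}

// High level tag: casts the Object down to the specific visitor type.
class ParserTag extends LexerTag {
    ParserTag(String name) { super(name); }

    @Override
    public void accept(Object visitor) {
        // The guard is an assumption of this sketch; it ignores
        // visitors of an unexpected type instead of throwing.
        if (visitor instanceof SketchVisitor) {
            ((SketchVisitor) visitor).visitTag(this);
        }
    }
}

// Example visitor that records the names of the tags it is shown.
class NameCollector extends SketchVisitor {
    final List<String> names = new ArrayList<>();

    @Override
    public void visitTag(ParserTag tag) { names.add(tag.getName()); }
}

class VisitorDemo {
    public static void main(String[] args) {
        NameCollector collector = new NameCollector();
        new LexerTag("A").accept(collector);    // vacuous, nothing recorded
        new ParserTag("IMG").accept(collector); // dispatched to visitTag
        System.out.println(collector.names);
    }
}
```

Because the visitor here only handles the base tag type, a user-supplied subclass of ParserTag would be visited without any change to the visitor.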

Getting data into user supplied nodes is easy: each tag is presented 
with the attributes and children found by the scanner, what else is 
there? The current implementation does it the other way, each scanner is 
the one that figures out the special data and then creates a new 
specialized tag by some byzantine constructor taking arguments that only 
it can understand. The tag is reduced to regurgitating the simple 
strings it was given. A typical example: FrameScanner has 
extractFrameLocn() and extractFrameName() which it passes into the 
FrameTag constructor. Why not have FrameTag figure this stuff out?
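
As a minimal sketch of the alternative, a tag could derive its own values from the attributes it is handed. The class name, the Map-based attribute storage, and the SRC and NAME lookups are assumptions for illustration; the real FrameTag may store and expose things differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical frame tag that works out its own location and name
// instead of receiving them pre-extracted through a constructor.
class SelfDescribingFrameTag {
    private final Map<String, String> attributes = new HashMap<>();

    // Called by whoever built the tag, with the attributes the scanner found.
    void setAttributes(Map<String, String> found) {
        attributes.clear();
        for (Map.Entry<String, String> entry : found.entrySet()) {
            // Normalize keys so lookups are case-insensitive.
            attributes.put(entry.getKey().toUpperCase(Locale.ROOT), entry.getValue());
        }
    }

    // Location comes from the SRC attribute; empty string if absent.
    String getFrameLocation() {
        String src = attributes.get("SRC");
        return src == null ? "" : src;
    }

    // Name comes from the NAME attribute; null if absent.
    String getFrameName() {
        return attributes.get("NAME");
    }
}

class FrameTagDemo {
    public static void main(String[] args) {
        Map<String, String> found = new HashMap<>();
        found.put("src", "menu.html");
        found.put("name", "left");

        SelfDescribingFrameTag tag = new SelfDescribingFrameTag();
        tag.setAttributes(found);
        System.out.println(tag.getFrameName() + " -> " + tag.getFrameLocation());
    }
}
```

With this shape the scanner needs no frame-specific extraction code, since every tag receives the same kind of input.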

The TagScanner class is abstract, partly because of the signature:
    protected abstract Tag createTag(TagData tagData, Tag tag, String url)
        throws ParserException;
Each scanner has code like:
    public Tag createTag(TagData tagData, CompositeTagData compositeTagData)
        throws ParserException
    {
        return new BulletList(tagData,compositeTagData);
    }
With a 'Prototype' solution, the TagScanner class could implement:
    public Tag createTag(TagData tagData, CompositeTagData compositeTagData)
        throws ParserException
    {
        // Hashtable.get returns Object, so cast to Tag
        Tag tag = (Tag)mBlastocyst.get (tagData.getTagName ());
        if (null == tag)
            // should use the NodeFactory
            tag = new Tag (tagData, compositeTagData);
        else
        {
            tag = (Tag)tag.clone ();
            tag.setData (tagData, compositeTagData);
        }
        return (tag);
    }
which would remove the need for each class to implement it. How would 
you remove the createTag() code from all the scanners without prototypes?
The above is couched in current TagData format, but in reality it would 
be more like:
    tag = (Tag)tag.clone ();
    tag.setAttributes (attributes);
    tag.setChildren (children);
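
A self-contained sketch of that clone-then-populate flow, including a setTagFor style registration, could look like this. The class names, the Map and List types, and the upper-casing of tag names are all assumptions of the sketch, and it uses a plain generic tag as the fallback rather than going through a NodeFactory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical generic tag that can serve as a prototype.
class ProtoTag implements Cloneable {
    private Map<String, String> attributes = new HashMap<>();
    private List<ProtoTag> children = new ArrayList<>();

    void setAttributes(Map<String, String> attributes) { this.attributes = attributes; }
    void setChildren(List<ProtoTag> children) { this.children = children; }
    String getAttribute(String name) { return attributes.get(name); }
    List<ProtoTag> getChildren() { return children; }

    @Override
    public ProtoTag clone() {
        try {
            ProtoTag copy = (ProtoTag) super.clone();
            // Copy the collections so a clone never shares state
            // with the registered prototype.
            copy.attributes = new HashMap<>(attributes);
            copy.children = new ArrayList<>(children);
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: Cloneable is implemented
        }
    }
}

// A user-supplied tag type, plugged in without touching any scanner.
class CustomBodyTag extends ProtoTag { }

// Holds the undifferentiated prototypes, keyed by upper-case tag name.
class PrototypeRegistry {
    private final Map<String, ProtoTag> blastocyst = new HashMap<>();

    void setTagFor(String name, ProtoTag prototype) {
        blastocyst.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    ProtoTag createTag(String name, Map<String, String> attributes, List<ProtoTag> children) {
        ProtoTag prototype = blastocyst.get(name.toUpperCase(Locale.ROOT));
        // Fall back to a generic tag when no prototype is registered.
        ProtoTag tag = (prototype == null) ? new ProtoTag() : prototype.clone();
        tag.setAttributes(attributes);
        tag.setChildren(children);
        return tag;
    }
}

class PrototypeDemo {
    public static void main(String[] args) {
        PrototypeRegistry registry = new PrototypeRegistry();
        registry.setTagFor("BODY", new CustomBodyTag());

        Map<String, String> attributes = new HashMap<>();
        attributes.put("BGCOLOR", "white");
        ProtoTag tag = registry.createTag("body", attributes, new ArrayList<>());
        System.out.println(tag.getClass().getSimpleName());
    }
}
```

Since Object.clone preserves the runtime class, the registry never needs to know about CustomBodyTag or any other subclass.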

Derrick

Joshua Kerievsky wrote:

> Derrick Oswald wrote:
>
>> Yes.  In the transition from using a straight Lexer to get basic 
>> nodes (lexer.nodes package), to using the Parser to get nodes that 
>> can be visited (htmlparser package), the Lexer needs to generate 
>> nodes it was not compiled with.  Hence the Parser replaces the Lexer 
>> as the NodeFactory that the Lexer calls when it needs to create a Node.
>
>
> IMO, the NodeFactory is better off as its own object.  The Parser can 
> use a default instance of it.  Clients can configure the Parser to use 
> a specific NodeFactory.  This is important for decorating nodes and 
> tags.  In addition, we don't want to give the Parser too many 
> responsibilities, as it complicates its design.
>
> At present, we've made some choices about which tags are visitable - 
> i.e. visitable nodes and tags are hard-coded into our NodeVisitor 
> class.  I'm not sure what you mean above when you write "using the 
> Parser to get nodes that can be visited"?
>
>> I'm thinking this concept should be augmented in the Parser's 
>> createTagNode to look up the name of the node (from the attribute 
>> list provided), and create specific types of tags (FormTag, TableTag 
>> etc.) by cloning empty tags from a Hashtable of possible tag types 
>> (possibly called mBlastocyst in reference to undifferentiated stem 
>> cells).
>
>
> Sounds like the Prototype pattern.   The trouble with this approach is 
> getting the right data into the node/tag.  You can clone a tag that 
> has no data, then you got to get the right data into the tag.  Since 
> different tags have different data needs, it gets complicated.  Have 
> you considered these issues?
>
>> This would provide a concrete implementation of createTag in 
>> CompositeTagScanner, removing a lot of near duplicate code from the 
>> scanners, and allow end users to plug in their own tags via a call like
>>   setTagFor ("BODY", new myBodyTag())
>> on the Parser. Details on interaction with the scanners have to be 
>> worked out, but it seems the end user wouldn't have to replace the 
>> scanner to get their own tags out.
>
>
> When you say "this would provide a concrete ...." I don't follow.  Why 
> is a Prototype-based createTagNode method a prerequisite for removing 
> near duplicate code in the scanners?   i.e. couldn't that be done 
> regardless of whether a Prototype solution is used?  What am I missing?
>
> best regards
> jk
>