[Htmlparser-developer] lexer integration

Removed the data package from the parser level tags. Out went TagData, 
CompositeTagData, LinkData and FormData. This means the createTag call 
is now bloated with arguments, but this too shall pass.
Moved a lot of the functionality from the scanners to the tags. Whereas 
before, the scanner would extract all sorts of stuff and pass it to 
special tag constructors and the tag would just hold it, the tag now 
performs these tasks when asked. I also removed a lot of member 
variables, so the tags get and set attribute values directly, which means 
any change shows up in the toHtml() output without any special work.
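
To illustrate the point about tags reading and writing attributes directly, here is a minimal sketch. SketchLinkTag, getLink and setLink are made-up names for illustration, not the real org.htmlparser classes, and the attribute storage is assumed to be a simple ordered map. Because the accessors have no backing member variable, toHtml() picks up a changed value for free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parser tag; not the real htmlparser class.
class SketchLinkTag {
    // The attribute list is the only state; insertion order is preserved.
    private final Map<String, String> attributes = new LinkedHashMap<>();

    public String getAttribute(String name) {
        return attributes.get(name);
    }

    public void setAttribute(String name, String value) {
        attributes.put(name, value);
    }

    // Convenience accessors with no member variable behind them:
    // they read and write the attribute itself.
    public String getLink() {
        return getAttribute("HREF");
    }

    public void setLink(String url) {
        setAttribute("HREF", url);
    }

    // Serializes straight from the attributes, so anything set above
    // appears here. Values are not escaped in this sketch.
    public String toHtml() {
        StringBuilder sb = new StringBuilder("<A");
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(e.getValue()).append('"');
        }
        return sb.append('>').toString();
    }
}
```

The trade-off is that every getter does a lookup instead of returning a cached field, which seems a fair price for not having to keep two copies of the same value in sync.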
Removed lexer level AbstractNode, so there is a Page property on the 
org.htmlparser.AbstractNode now.
Separated tag creation from recursion in NodeFactory interface, so 
people who want to create their own tags won't need to worry about the 
scanning recursion.
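
A rough sketch of what that separation could look like follows. The names SketchNode, SketchNodeFactory and SketchScanner, the createTag signature, and the toy token format are all assumptions made for illustration; the real NodeFactory interface differs. The scanner owns the recursion and hands the finished pieces to the factory, so a custom factory never sees the scanning loop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical node type for the sketch.
class SketchNode {
    final String name;
    final List<SketchNode> children;

    SketchNode(String name, List<SketchNode> children) {
        this.name = name;
        this.children = children;
    }
}

// Creation only: the factory is handed a name and already-scanned children.
interface SketchNodeFactory {
    SketchNode createTag(String name, List<SketchNode> children);
}

// Recursion only: the scanner walks the tokens and defers creation.
class SketchScanner {
    private final SketchNodeFactory factory;

    SketchScanner(SketchNodeFactory factory) {
        this.factory = factory;
    }

    // Toy token format (an assumption): "<x>" opens a tag and any
    // token starting with "</" closes the current one.
    SketchNode scan(String name, Iterator<String> tokens) {
        List<SketchNode> children = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (token.startsWith("</")) {
                break; // end tag: this level is complete
            }
            String childName = token.substring(1, token.length() - 1);
            children.add(scan(childName, tokens));
        }
        return factory.createTag(name, children);
    }
}
```

Someone wanting their own tags would then only supply a SketchNodeFactory, for example a lambda that returns a subclass of SketchNode for certain names.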
It passes 508 of 522 unit tests.

TODO
=====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the 
CompositeTagScannerHelper. It's close, it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode 
should look up the name of the node (from the attribute list provided), 
and create specific types of tags (FormTag, TableTag etc.) by cloning 
empty tags from a Hashtable of possible tag types (possibly called 
mBlastocyst in reference to undifferentiated stem cells).
This would provide a concrete implementation of createTag in 
CompositeTagScanner, removing a lot of near duplicate code from the 
scanners, and allow end users to plug in their own tags via a call like
    setTagFor ("BODY", new myBodyTag())
on the Parser. The end user wouldn't have to create or replace a scanner 
to get their own tags out.  Getting rid of the data package cleared up a 
lot of questions regarding the interaction scanners have with tags. In 
general, the scanner now creates the tag in a very straightforward 
bean-like manner:
        ret = new Div ();
        ret.setPage (page);
        ret.setStartPosition (start);
        ret.setEndPosition (end);
        ret.setAttributesEx (attributes);
        ret.setStartTag (startTag);
        ret.setEndTag (endTag);
        ret.setChildren (children);
This is nearly always the same in every scanner, only the tag name is 
different. The oddball cases have been highlighted with a
    // special step here...
comment in the code.  These special steps mostly revolve around 
meta-information available in scanners only (e.g. base href), or 
handling of nesting with a stack construct. It shouldn't be too much 
trouble to make these all go away.
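
The prototype idea above could be sketched roughly as follows. Nothing here exists in the code base yet: SketchTag, SketchBodyTag and SketchTagRegistry are placeholder names, and only setTagFor and mBlastocyst come from the proposal itself. The registry keeps one empty tag per name and clones it on demand, falling back to a generic tag when nothing is registered.

```java
import java.util.Hashtable;
import java.util.Locale;

// Hypothetical stand-in for a parser tag.
class SketchTag implements Cloneable {
    private String name = "";

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public SketchTag clone() {
        try {
            // Object.clone() preserves the runtime class of the prototype.
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: we implement Cloneable
        }
    }
}

// Stands in for a user-supplied tag such as myBodyTag.
class SketchBodyTag extends SketchTag {
}

class SketchTagRegistry {
    // Undifferentiated prototypes, keyed by upper-cased tag name.
    private final Hashtable<String, SketchTag> mBlastocyst = new Hashtable<>();

    public void setTagFor(String name, SketchTag prototype) {
        mBlastocyst.put(name.toUpperCase(Locale.ENGLISH), prototype);
    }

    // Clone the registered prototype, or fall back to a generic tag.
    public SketchTag createTag(String name) {
        SketchTag prototype = mBlastocyst.get(name.toUpperCase(Locale.ENGLISH));
        SketchTag ret = (prototype == null) ? new SketchTag() : prototype.clone();
        ret.setName(name);
        return ret;
    }
}
```

After registry.setTagFor ("BODY", new SketchBodyTag ()), a createTag ("body") call hands back a fresh SketchBodyTag, and the bean-style setters shown earlier would then be applied to the clone in one place instead of in every scanner.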

Scanners
--------
The script scanner has been replaced. It can be considered as a first 
pass at what needs to be done to replace the generic 
CompositeTagScanner. The use of the underlying lexer makes these 
specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality.
Examples:
testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
testEmptyComment - <!--> was considered a valid remark node

Each needs to be examined, a decision on the 'correct' behaviour made, 
and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than 
providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish 
can jump in.  I'll be working my way from top to bottom of the JUnit 
errors list and committing and notifying the developer list after each of 
them.  So go ahead and do a take from CVS and jump in the middle with 
anything that appeals. Keep the list posted and update your CVS tree 
often (or subscribe to the htmlparser-cvs mailing list for 
interrupt-driven notification rather than polled notification).