[Htmlparser-developer] lexer integration


I've now fixed the easily fixed tests. The remaining 40 or so indicate 
changed functionality: each needs to be examined, a decision on the 
'correct' behaviour made, and the code or test altered accordingly.

TODO
=====

TagData
-------
This has been reworked to allow it to limp along under the new system, 
but it should really be removed. I think the reason for it (reducing the 
number of arguments to tag constructors) no longer applies, and a lot of 
the code would be easier to read if Tags were more bean-like and had 
zero-argument constructors with appropriate accessors.
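
To make the bean-like idea concrete, here is a rough sketch of what such a 
tag might look like. The class name BeanTag and its accessors are made up 
for illustration and are not the existing org.htmlparser classes; it also 
uses generics for brevity, which assumes a newer Java than the code base 
currently targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bean-style tag: a zero-argument constructor plus accessors,
// so no TagData-like argument bundle is needed to construct it.
class BeanTag {
    private String name = "";
    private int startPosition;
    private int endPosition;
    // Attributes kept in insertion order (an assumption of this sketch).
    private final Map<String, String> attributes = new LinkedHashMap<>();

    BeanTag() {
        // Nothing to do: every property is set through its accessor.
    }

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    int getStartPosition() { return startPosition; }
    void setStartPosition(int startPosition) { this.startPosition = startPosition; }

    int getEndPosition() { return endPosition; }
    void setEndPosition(int endPosition) { this.endPosition = endPosition; }

    // Returns null when the attribute is absent.
    String getAttribute(String key) { return attributes.get(key); }
    void setAttribute(String key, String value) { attributes.put(key, value); }
}
```

A scanner would then create the tag empty and fill it in as it goes, e.g. 
tag.setName ("FORM") followed by tag.setAttribute ("action", "/submit").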

Helpers
-------
I desperately want to get rid of the two remaining 'helper' classes. 
They are just obfuscating the code.
The CompositeTagScannerHelper is close to being folded back into the 
CompositeTagScanner. It just needs some more untangling.

AbstractNode
------------
Drop org.htmlparser.lexer.nodes.AbstractNode, fold functionality into 
org.htmlparser.AbstractNode.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode 
should look up the name of the node (from the attribute list provided), 
and create specific types of tags (FormTag, TableTag etc.) by cloning 
empty tags from a Hashtable of possible tag types (possibly called 
mBlastocyst in reference to undifferentiated stem cells).
This would provide a concrete implementation of createTag in 
CompositeTagScanner, removing a lot of near duplicate code from the 
scanners, and allow end users to plug in their own tags via a call like
   setTagFor ("BODY", new myBodyTag())
on the Parser. Details on interaction with the scanners have to be 
worked out, but it seems the end user wouldn't have to replace the 
scanner to get their own tags out.
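
A minimal sketch of the cloning factory described above, to show the shape 
of the idea rather than a finished design. Everything here is hypothetical: 
PrototypeTag, SketchBodyTag and TagRegistry are stand-ins invented for this 
example, the registry is a HashMap rather than a Hashtable, and upper-casing 
the tag name on lookup is an assumption. Only setTagFor, createTagNode and 
the mBlastocyst name come from the proposal.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a tag that can be cloned from a prototype.
class PrototypeTag implements Cloneable {
    private String name = "";

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    @Override
    public PrototypeTag clone() {
        try {
            // Object.clone() preserves the runtime class of the prototype.
            return (PrototypeTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we are Cloneable
        }
    }
}

// Hypothetical user-supplied tag type.
class SketchBodyTag extends PrototypeTag {
}

// Hypothetical registry playing the role the Parser would have.
class TagRegistry {
    // Undifferentiated prototypes, keyed by upper-cased tag name.
    private final Map<String, PrototypeTag> mBlastocyst = new HashMap<>();
    // Fallback prototype for names nobody has registered.
    private final PrototypeTag mGeneric = new PrototypeTag();

    void setTagFor(String name, PrototypeTag prototype) {
        mBlastocyst.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    // Looks up the prototype for the name and returns a fresh clone of it.
    PrototypeTag createTagNode(String name) {
        String key = name.toUpperCase(Locale.ROOT);
        PrototypeTag prototype = mBlastocyst.getOrDefault(key, mGeneric);
        PrototypeTag tag = prototype.clone();
        tag.setName(key);
        return tag;
    }
}
```

With this in place, registry.setTagFor ("BODY", new SketchBodyTag ()) 
followed by registry.createTagNode ("body") hands back a new SketchBodyTag 
instance, while an unregistered name falls back to a plain PrototypeTag. 
The registered prototype itself is never handed out, only copies of it.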

Scanners
--------
The script scanner has been replaced. It can be considered as a first 
pass at what needs to be done to replace the generic 
CompositeTagScanner. The use of the underlying lexer makes these 
specialty scanners much easier.

Unit Tests
----------
The remaining failing unit tests show up the changed functionality.
Examples:
testIncompleteTitle - <title>blah</title </head> used to be 2 nodes
testInvertedCommas - <tag attribute = whatever> used to be acceptable
testEmptyComment - <!--> was considered a valid remark node

Each needs to be examined, a decision on the 'correct' behaviour made, 
and the code or test altered accordingly.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than 
providing any helpful advice. This needs to be reworked completely.

As you can see, there's lots of work to do, so anyone with a death wish 
can jump in.  I'll be working my way from top to bottom of the JUnit 
errors list, committing and notifying the developer list after each of 
them.  So go ahead and do a take from CVS and jump in the middle with 
anything that appeals. Keep the list posted and update your CVS tree 
often (or subscribe to the htmlparser-cvs mailing list for 
interrupt-driven notification rather than polled notification).