[Htmlparser-developer] lexer integration

OK, it's started...

I've integrated the low-level lexer code into the main parser code. Many 
things aren't working anymore.
Of the 448 unit tests, 213 fail and 14 throw exceptions. But 
the upside is that 211 of the tests pass.
So I'm dropping my current snapshot into CVS, opening it up to those who 
may wish to assist. See the TODO section.

Big changes
===========

A lot of files have been removed
--------------------------------
htmlparser/NodeReader.java
    this is the primary class that's being replaced by Lexer; the method
    nextNode() replaces readElement()
htmlparser/RemarkNodeParser.java
    remark nodes are now parsed in the Lexer main loop
htmlparser/parserHelper/AttributeParser.java
    attributes are now parsed by the lexer before the tag is created, and
    manipulated as a Vector of Attribute objects
htmlparser/parserHelper/StringParser.java
    string nodes are now parsed by the lexer
htmlparser/parserHelper/TagParser.java
    tags are now parsed by the lexer
htmlparser/tags/EndTag.java
    this class was replaced by a call to the new isEndTag() method on
    the Tag class

I labeled the repository with the tag "PriorToLexerIntegration" just in 
case you want to retrieve a file that's no longer there.

Class Derivations
-----------------
The StringNode, RemarkNode and tags.Tag classes now derive from their 
lexeme counterparts in lexer.nodes instead of the other way around.

NodeFactory
-----------
The beginnings of a node factory interface are included. This was added 
so the lexer could return 'visitable' nodes to the parser. The parser 
acts as its own node factory, as does the Lexer.

NodeCount
---------
The node count for parsing goes up in most cases because every run of 
whitespace (e.g. a newline) now counts as a StringNode. This has whacked 
out a lot of the tests that were expecting fewer nodes or a certain type 
of node at a particular index.
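
To make the count change concrete, here's a rough sketch. ToyLexer and its 
methods are made up for illustration; this is not the real Lexer, just a 
toy that treats everything between tags as a text node, whitespace 
included.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the lexer, NOT the real htmlparser code.
class ToyLexer {
    // Splits html into tag tokens ("<...>") and text tokens (everything in between).
    static List<String> tokenize(String html) {
        List<String> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) end = html.length() - 1; // unterminated tag: take the rest
                nodes.add(html.substring(i, end + 1));
                i = end + 1;
            } else {
                int end = html.indexOf('<', i);
                if (end < 0) end = html.length();
                nodes.add(html.substring(i, end)); // text, including whitespace-only runs
                i = end;
            }
        }
        return nodes;
    }

    // Counts nodes the way the old tests assumed: whitespace-only text is ignored.
    static int countIgnoringWhitespace(List<String> nodes) {
        int n = 0;
        for (String s : nodes) {
            if (!s.trim().isEmpty()) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> nodes = tokenize("<html>\n<body>\nhi</body>\n</html>");
        System.out.println("all nodes: " + nodes.size());
        System.out.println("ignoring whitespace: " + countIgnoringWhitespace(nodes));
    }
}
```

A test helper along the lines of countIgnoringWhitespace() is one way the 
old expectations could be kept alive; fixing the expected counts is the 
other.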

Attributes
----------
Attributes now maintain their order and case. The count of attributes 
also went up because whitespace is maintained within tags too. The 
storage in a Vector means the element 0 Attribute is actually the name 
of the tag, rather than having the $TAGNAME entry in a Hashtable.
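
Roughly, the layout looks like the sketch below. Attr and AttrDemo are 
hypothetical stand-ins, and how whitespace entries are represented here 
(a null name with the whitespace as the value) is my assumption for the 
example, not necessarily what the real Attribute class does.

```java
import java.util.Vector;

// Hypothetical stand-in for the Attribute class; the real one may differ.
class Attr {
    final String name;   // null for a whitespace-only entry (assumption)
    final String value;  // null when there is no value
    Attr(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

class AttrDemo {
    // Builds the attribute list for: <IMG src="a.png" Alt="x">
    static Vector<Attr> sample() {
        Vector<Attr> attrs = new Vector<>();
        attrs.add(new Attr("IMG", null));    // element 0: the tag name
        attrs.add(new Attr(null, " "));      // whitespace kept as its own entry
        attrs.add(new Attr("src", "a.png"));
        attrs.add(new Attr(null, " "));
        attrs.add(new Attr("Alt", "x"));     // original case preserved
        return attrs;
    }

    static String tagName(Vector<Attr> attrs) {
        return attrs.get(0).name;
    }

    // Case-insensitive lookup that skips element 0 and whitespace entries.
    static String getAttribute(Vector<Attr> attrs, String name) {
        for (int i = 1; i < attrs.size(); i++) {
            Attr a = attrs.get(i);
            if (a.name != null && a.name.equalsIgnoreCase(name)) return a.value;
        }
        return null;
    }

    public static void main(String[] args) {
        Vector<Attr> attrs = sample();
        System.out.println(tagName(attrs) + " has " + attrs.size() + " entries");
        System.out.println("alt = " + getAttribute(attrs, "alt"));
    }
}
```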

TODO
====

visitEndTag()
-------------
The visitEndTag() method on the visitor interface should be put back. I 
shouldn't have removed it when EndTag was removed. Instead, the accept() 
method in Tag should dispatch to visitTag() or visitEndTag() based on 
isEndTag().
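
Something like the following is what I have in mind. ToyVisitor, ToyTag 
and CountingVisitor are trimmed-down, hypothetical shapes, and the 
leading-slash test in isEndTag() is an assumption for the sketch.

```java
// Hypothetical, trimmed-down visitor; the real interface has more methods.
interface ToyVisitor {
    void visitTag(ToyTag tag);
    void visitEndTag(ToyTag tag);
}

class ToyTag {
    private final String rawName; // e.g. "BODY" or "/BODY"

    ToyTag(String rawName) {
        this.rawName = rawName;
    }

    // Assumption: an end tag is one whose raw name starts with '/'.
    boolean isEndTag() {
        return rawName.startsWith("/");
    }

    // One Tag class, two visitor callbacks, chosen by isEndTag().
    void accept(ToyVisitor visitor) {
        if (isEndTag()) {
            visitor.visitEndTag(this);
        } else {
            visitor.visitTag(this);
        }
    }
}

// Example visitor that just tallies what it sees.
class CountingVisitor implements ToyVisitor {
    int tags;
    int endTags;

    public void visitTag(ToyTag tag) { tags++; }
    public void visitEndTag(ToyTag tag) { endTags++; }

    public static void main(String[] args) {
        CountingVisitor v = new CountingVisitor();
        new ToyTag("BODY").accept(v);
        new ToyTag("/BODY").accept(v);
        System.out.println(v.tags + " start, " + v.endTags + " end");
    }
}
```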

Serializable
------------
The Parser needs to be made serializable again. This involves a 
transient field down on the Source, I think, rather than having the 
whole Lexer transient in the Parser.
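
As a sketch of the transient-field idea (ToySource and SerializeDemo are 
hypothetical; the real Source will have a different shape): only the 
non-serializable stream is marked transient, and it gets rebuilt in 
readObject() from the state that was saved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Reader;
import java.io.Serializable;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Source, not the real class.
class ToySource implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String text;        // serialized
    private int position;             // serialized: how many characters were read
    private transient Reader reader;  // not serialized, rebuilt after loading

    ToySource(String text) {
        this.text = text;
        open();
    }

    // (Re)creates the reader and moves it to the saved position.
    private void open() {
        try {
            reader = new StringReader(text);
            reader.skip(position);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int read() {
        try {
            int c = reader.read();
            if (c >= 0) position++;
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serialization hook: restore the normal fields, then rebuild the transient one.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        open();
    }
}

class SerializeDemo {
    static ToySource roundTrip(ToySource source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(source);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ToySource) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ToySource source = new ToySource("<p>");
        source.read(); // consume '<'
        ToySource copy = roundTrip(source);
        System.out.println((char) copy.read()); // the copy resumes after '<'
    }
}
```

With this arrangement anything holding a ToySource can stay serializable 
without marking the whole object transient.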

TagData
-------
This has been reworked to allow it to limp along under the new system, 
but it should really be removed. I think the reason for it (reduce the 
number of arguments to tag constructors) no longer applies, and a lot of 
the code could be easier to read if the Tag was more bean-like and had a 
zero-argument constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just 
obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending 
NodeFactory) that has the signatures for creating all the possible types 
of tags there are, and then this needs to be used by all the scanners to 
create their specific tags.
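
A very rough sketch of the shape, with made-up names and signatures 
throughout (SketchNodeFactory, SketchTagFactory, createLinkTag() and so on 
are placeholders, not the interface that's checked in):

```java
// Placeholder node type so the sketch compiles on its own.
class SketchNode {
    final String kind;
    final String text;

    SketchNode(String kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

// What the lexer needs: the basic lexemes.
interface SketchNodeFactory {
    SketchNode createStringNode(String text);
    SketchNode createRemarkNode(String text);
    SketchNode createTagNode(String name);
}

// What the scanners need in addition: one method per specific tag type.
interface SketchTagFactory extends SketchNodeFactory {
    SketchNode createLinkTag(String href);
    SketchNode createImageTag(String src);
}

// A default implementation; a user could substitute their own subclasses here.
class DefaultSketchFactory implements SketchTagFactory {
    public SketchNode createStringNode(String text) { return new SketchNode("string", text); }
    public SketchNode createRemarkNode(String text) { return new SketchNode("remark", text); }
    public SketchNode createTagNode(String name) { return new SketchNode("tag", name); }
    public SketchNode createLinkTag(String href) { return new SketchNode("link", href); }
    public SketchNode createImageTag(String src) { return new SketchNode("image", src); }
}

// A scanner asks the factory for its tag instead of calling a constructor.
class SketchLinkScanner {
    private final SketchTagFactory factory;

    SketchLinkScanner(SketchTagFactory factory) {
        this.factory = factory;
    }

    SketchNode scan(String href) {
        return factory.createLinkTag(href);
    }

    public static void main(String[] args) {
        SketchNode node = new SketchLinkScanner(new DefaultSketchFactory()).scan("index.html");
        System.out.println(node.kind + ": " + node.text);
    }
}
```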

Scanners
--------
The scanners may not be working, hard to tell without the unit tests 
running. I'm not sure that CompositeTagScanner is completely all right 
yet; it probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce 
capitalized and rearranged output. And parseAndAssertNodeCount() is 
expected not to include so many whitespace nodes. These need to be 
addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than 
providing any helpful advice. This needs to be reworked completely.

As you can see there's lots of work to do, so anyone with a death wish 
can jump in.  I'll be working my way from top to bottom of the TODO list 
and committing and notifying the developer list after each of them.  So 
go ahead and do a take from CVS and jump in the middle with anything 
that appeals. Keep the list posted and update your CVS tree often (or 
subscribe to the htmlparser-cvs mailing list for interrupt-driven 
notification rather than polled notification).

Derrick