[Htmlparser-developer] lexer integration
From: Derrick O. <Der...@Ro...> - 2003-09-29 19:55:06
OK, it's started...
I've integrated the low-level lexer code into the main parser code. Many
things aren't working anymore.
Of the 448 unit tests, 213 fail and 14 show exception faults. But
the upside is that 211 of the tests pass.
So I'm dropping my current snapshot, opening it up to those who may wish
to assist. See the TODO section.
Big changes
===========
A lot of files have been removed
--------------------------------
htmlparser/NodeReader.java
    this is the primary class that's being replaced by Lexer, the method
    nextNode() replaces readElement()
htmlparser/RemarkNodeParser.java
    remark nodes are now parsed in the Lexer main loop
htmlparser/parserHelper/AttributeParser.java
    attributes are now parsed by the lexer before the tag is created,
    manipulated as a Vector of Attribute objects
htmlparser/parserHelper/StringParser.java
    string nodes are now parsed by the lexer
htmlparser/parserHelper/TagParser.java
    tags are now parsed by the lexer
htmlparser/tags/EndTag.java
    this class was replaced by a call to the new isEndTag() method on
    the Tag class
I labeled the repository with tag "PriorToLexerIntegration" just in case
you want to retrieve a file that's no longer there.
Class Derivations
-----------------
The StringNode, RemarkNode and tags.Tag classes now derive from their
lexeme counterparts in lexer.nodes instead of the other way around.
NodeFactory
-----------
The beginnings of a node factory interface are included. This was added
so the lexer could return 'visitable' nodes to the parser. The parser
acts as its own node factory, as does the Lexer.
NodeCount
---------
The node count for parsing goes up in most cases because every stretch
of whitespace (e.g. a newline) now counts as a StringNode. This has whacked
out a lot of the tests that were expecting fewer nodes or a certain type
of node at a particular index.
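
To make the effect on node counts concrete, here is a minimal sketch. It is not the real Lexer: `NodeCountSketch`, `lex()` and `countIgnoringWhitespace()` are hypothetical names invented for illustration, and the splitting rule (everything between '<' and '>' is a tag, everything else is text) is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the htmlparser Lexer.
class NodeCountSketch {
    /** Splits markup into tag nodes and text nodes; text may be pure whitespace. */
    static List<String> lex(String html) {
        List<String> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<') {
                // A tag runs up to and including the next '>'.
                int close = html.indexOf('>', i);
                end = (close < 0) ? html.length() : close + 1;
            } else {
                // Text runs up to the next '<', whitespace included.
                int open = html.indexOf('<', i);
                end = (open < 0) ? html.length() : open;
            }
            nodes.add(html.substring(i, end));
            i = end;
        }
        return nodes;
    }

    /** Counts nodes the way a whitespace-skipping parser would have. */
    static int countIgnoringWhitespace(List<String> nodes) {
        int count = 0;
        for (String node : nodes) {
            if (!node.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> nodes = lex("<p>\n<b>hi</b>\n</p>");
        System.out.println(nodes.size());                   // 7
        System.out.println(countIgnoringWhitespace(nodes)); // 5
    }
}
```

The two newlines account for the difference between 7 and 5, which is the kind of shift that breaks tests asserting a fixed count or a node type at a fixed index.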
Attributes
----------
Attributes now maintain their order and case. The count of attributes
also went up because whitespace is maintained within tags too. The
storage in a Vector means the element 0 Attribute is actually the name
of the tag, rather than having the $TAGNAME entry in a Hashtable.
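
The storage layout described above can be sketched roughly as follows. The `Attribute` class here is a hypothetical stand-in, not the real one, and modelling whitespace as an entry with a null name is purely an assumption made for the example.

```java
import java.util.Vector;

// Hypothetical stand-ins for illustration only; not the htmlparser classes.
class AttributeSketch {
    /** A name/value pair; whitespace is modelled as an entry with a null name. */
    static final class Attribute {
        final String name;  // null for a whitespace-only entry (an assumption)
        final String value; // null when there is no value

        Attribute(String name, String value) {
            this.name = name;
            this.value = value;
        }

        boolean isWhitespace() {
            return name == null;
        }
    }

    /** Builds the list a lexer might produce for: <IMG src="a.gif" ALT="x"> */
    static Vector<Attribute> sample() {
        Vector<Attribute> attributes = new Vector<>();
        attributes.add(new Attribute("IMG", null));    // element 0: the tag name
        attributes.add(new Attribute(null, " "));      // whitespace is kept
        attributes.add(new Attribute("src", "a.gif")); // case is preserved
        attributes.add(new Attribute(null, " "));
        attributes.add(new Attribute("ALT", "x"));     // order is preserved
        return attributes;
    }

    /** The tag name lives at index 0 instead of under a $TAGNAME key. */
    static String tagName(Vector<Attribute> attributes) {
        return attributes.elementAt(0).name;
    }

    /** Case-insensitive lookup that skips the tag name and whitespace entries. */
    static String getAttribute(Vector<Attribute> attributes, String name) {
        for (int i = 1; i < attributes.size(); i++) {
            Attribute attribute = attributes.elementAt(i);
            if (!attribute.isWhitespace() && attribute.name.equalsIgnoreCase(name)) {
                return attribute.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Vector<Attribute> attributes = sample();
        System.out.println(tagName(attributes));             // IMG
        System.out.println(attributes.size());               // 5
        System.out.println(getAttribute(attributes, "alt")); // x
    }
}
```

Two real attributes yield five entries here, which is why tests that assert an attribute count need adjusting.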
TODO
====
visitEndTag()
-------------
The visitEndTag() method on the visitor interface should be put back. I
shouldn't have removed it when EndTag was removed. Instead the accept()
in Tag should dispatch to visitTag() or visitEndTag() based on isEndTag().
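
A minimal sketch of that dispatch, using hypothetical stand-in types: only the method names accept(), visitTag(), visitEndTag() and isEndTag() come from the description above, while the class shapes, constructor and `RecordingVisitor` are invented for illustration.

```java
// Hypothetical stand-ins for illustration only; not the htmlparser classes.
class VisitorDispatchSketch {
    interface Visitor {
        void visitTag(Tag tag);
        void visitEndTag(Tag tag);
    }

    static final class Tag {
        private final String name;
        private final boolean endTag;

        Tag(String name, boolean endTag) {
            this.name = name;
            this.endTag = endTag;
        }

        String getName() {
            return name;
        }

        boolean isEndTag() {
            return endTag;
        }

        /** One Tag class, two visitor callbacks, chosen by isEndTag(). */
        void accept(Visitor visitor) {
            if (isEndTag()) {
                visitor.visitEndTag(this);
            } else {
                visitor.visitTag(this);
            }
        }
    }

    /** Records which callback each tag was routed to. */
    static final class RecordingVisitor implements Visitor {
        final StringBuilder log = new StringBuilder();

        public void visitTag(Tag tag) {
            log.append('<').append(tag.getName()).append('>');
        }

        public void visitEndTag(Tag tag) {
            log.append("</").append(tag.getName()).append('>');
        }
    }

    static String visit(Tag... tags) {
        RecordingVisitor visitor = new RecordingVisitor();
        for (Tag tag : tags) {
            tag.accept(visitor);
        }
        return visitor.log.toString();
    }

    public static void main(String[] args) {
        System.out.println(visit(new Tag("b", false), new Tag("b", true))); // <b></b>
    }
}
```

Existing visitors that override visitEndTag() would keep working this way even though there is no longer a separate EndTag class.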
Serializable
------------
The Parser needs to be made serializable again. This involves a
transient field down on the Source, I think, rather than having the
whole Lexer transient in the Parser.
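
One way this could look, as a sketch only: the `Source`, `Lexer` and `Parser` classes below are stripped-down hypothetical stand-ins, and the `url`, `offset` and `reader` fields are assumptions chosen to show the idea. Only the open stream is marked transient, so the rest of the lexer state survives a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Reader;
import java.io.Serializable;
import java.io.StringReader;

// Hypothetical stand-ins for illustration only; not the htmlparser classes.
class SerializableSketch {
    static final class Source implements Serializable {
        private static final long serialVersionUID = 1L;
        final String url;        // enough information to reopen the stream
        int offset;              // how far reading had progressed
        transient Reader reader; // the open stream itself is not serialized

        Source(String url, Reader reader) {
            this.url = url;
            this.reader = reader;
        }
    }

    static final class Lexer implements Serializable {
        private static final long serialVersionUID = 1L;
        final Source source;

        Lexer(Source source) {
            this.source = source;
        }
    }

    static final class Parser implements Serializable {
        private static final long serialVersionUID = 1L;
        final Lexer lexer; // not transient: the lexer travels with the parser

        Parser(Lexer lexer) {
            this.lexer = lexer;
        }
    }

    /** Serializes the parser to a byte array and reads it back. */
    static Parser roundTrip(Parser parser) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(parser);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Parser) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Source source = new Source("http://example.com/", new StringReader("<html>"));
        source.offset = 3;
        Parser copy = roundTrip(new Parser(new Lexer(source)));
        System.out.println(copy.lexer.source.url);            // http://example.com/
        System.out.println(copy.lexer.source.offset);         // 3
        System.out.println(copy.lexer.source.reader == null); // true
    }
}
```

After deserialization the reader is null, so a real implementation would also need some way to reopen the stream and skip to the saved offset.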
TagData
-------
This has been reworked to allow it to limp along under the new system,
but it should really be removed. I think the reason for it (reduce the
number of arguments to tag constructors) no longer applies, and a lot of
the code could be easier to read if the Tag was more bean-like and had a
zero args constructor with appropriate accessors.
Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just
obfuscating the code.
Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending
NodeFactory) that has the signatures for creating all the possible types
of tags there are, and then this needs to be used by all the scanners to
create their specific tags.
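
A rough sketch of the shape this could take. Every signature below is hypothetical: only the names NodeFactory and TagFactory and the idea that scanners create tags through the factory come from the text above, and LinkTag stands in for any specific tag type.

```java
// Hypothetical stand-ins for illustration only; none of these signatures
// are taken from the htmlparser source.
class TagFactorySketch {
    static class Tag {
        private final String name;

        Tag(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    static final class LinkTag extends Tag {
        private final String link;

        LinkTag(String link) {
            super("A");
            this.link = link;
        }

        String getLink() {
            return link;
        }
    }

    /** Generic creation, as needed by the lexer. */
    interface NodeFactory {
        Tag createTag(String name);
    }

    /** Adds one creation method per specific tag type. */
    interface TagFactory extends NodeFactory {
        LinkTag createLinkTag(String link);
    }

    static final class DefaultTagFactory implements TagFactory {
        public Tag createTag(String name) {
            return new Tag(name);
        }

        public LinkTag createLinkTag(String link) {
            return new LinkTag(link);
        }
    }

    /** A scanner asks the factory for its tag instead of calling a constructor. */
    static final class LinkScanner {
        private final TagFactory factory;

        LinkScanner(TagFactory factory) {
            this.factory = factory;
        }

        Tag scan(String href) {
            return factory.createLinkTag(href);
        }
    }

    public static void main(String[] args) {
        LinkScanner scanner = new LinkScanner(new DefaultTagFactory());
        Tag tag = scanner.scan("http://example.com/");
        System.out.println(tag.getName() + " " + ((LinkTag) tag).getLink()); // A http://example.com/
    }
}
```

Because scanners depend only on the interface, an application could supply its own factory to substitute custom tag subclasses without touching the scanners.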
Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests
running. I'm not sure that CompositeTagScanner is completely all right
yet; it probably needs to be reworked based on the lexer.
Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce
capitalized and rearranged output, which no longer happens now that
attributes keep their order and case. Likewise, tests that call
parseAndAssertNodeCount() do not expect so many whitespace nodes.
These need to be addressed.
Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than
providing any helpful advice. This needs to be reworked completely.
As you can see there's lots of work to do, so anyone with a death wish
can jump in. I'll be working my way from top to bottom of the TODO list
and committing and notifying the developer list after each of them. So
go ahead and do a take from CVS and jump in the middle with anything
that appeals. Keep the list posted and update your CVS tree often (or
subscribe to the htmlparser-cvs mailing list for interrupt-driven
notification rather than polled notification).
Derrick