[Htmlparser-developer] lexer integration
From: Derrick O. <Der...@Ro...> - 2003-09-29 19:55:06
OK, it's started...

I've integrated the low level lexer code into the main parser code.
Many things aren't working anymore. Of the 448 unit tests, 213 fail and 14 show exception faults.
But the upside is 211 of the tests pass.
So I'm dropping my current snapshot, opening it up to those who may wish to assist. See the TODO section.

Big changes
===========

A lot of files have been removed
--------------------------------
htmlparser/NodeReader.java
    this is the primary class that's being replaced by Lexer; the method nextNode() replaces readElement()
htmlparser/RemarkNodeParser.java
    remark nodes are now parsed in the Lexer main loop
htmlparser/parserHelper/AttributeParser.java
    attributes are now parsed by the lexer before the tag is created, and are manipulated as a Vector of Attribute objects
htmlparser/parserHelper/StringParser.java
    string nodes are now parsed by the lexer
htmlparser/parserHelper/TagParser.java
    tags are now parsed by the lexer
htmlparser/tags/EndTag.java
    this class was replaced by a call to the new isEndTag() method on the Tag class

I labeled the repository with the tag "PriorToLexerIntegration" just in case you want to retrieve a file that's no longer there.

Class Derivations
-----------------
The StringNode, RemarkNode and tags.Tag classes now derive from their lexeme counterparts in lexer.nodes instead of the other way around.

NodeFactory
-----------
The beginnings of a node factory interface are included. This was added so the lexer could return 'visitable' nodes to the parser. The parser acts as its own node factory, as does the Lexer.

NodeCount
---------
The node count for parsing goes up in most cases because every run of whitespace (e.g. a newline) now counts as a StringNode. This has whacked out a lot of the tests that were expecting fewer nodes or a certain type of node at a particular index.

Attributes
----------
Attributes now maintain their order and case. The count of attributes also went up because whitespace is maintained within tags too.
The storage in a Vector means the element 0 Attribute is actually the name of the tag, rather than having the $TAGNAME entry in a Hashtable.

TODO
====

visitEndTag()
-------------
The visitEndTag() method on the visitor interface should be put back. I shouldn't have removed it when EndTag was removed. Instead the accept() in Tag should dispatch to visitTag() or visitEndTag() based on isEndTag().

Serializable
------------
The Parser needs to be made serializable again. This involves a transient field down on the Source, I think, rather than having the whole Lexer transient in the Parser.

TagData
-------
This has been reworked to allow it to limp along under the new system, but it should really be removed. I think the reason for it (reducing the number of arguments to tag constructors) no longer applies, and a lot of the code could be easier to read if the Tag was more bean-like and had a zero-args constructor with appropriate accessors.

Helpers
-------
I desperately want to get rid of these 'helper' classes. They are just obfuscating the code.

Node Factory
------------
The factory concept needs to be extended with a TagFactory (extending NodeFactory) that has the signatures for creating all the possible types of tags there are, and then this needs to be used by all the scanners to create their specific tags.

Scanners
--------
The scanners may not be working; it's hard to tell without the unit tests running. I'm not sure that CompositeTagScanner is completely all right yet; it probably needs to be reworked based on the lexer.

Unit Tests
----------
As mentioned, many of the unit tests expect toHtml() to produce capitalized and rearranged output, and parseAndAssertNodeCount() is expected not to include so many whitespace nodes. These need to be addressed.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.
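
To make the visitEndTag() item in the TODO list concrete, here is a minimal sketch of the dispatch it describes. The Tag and NodeVisitor types below are hypothetical, cut-down stand-ins written only for this illustration, not the actual htmlparser classes, and the rule used by isEndTag() (text starting with '/') is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VisitorDispatchSketch {

    /** Hypothetical, cut-down visitor: only the two tag callbacks. */
    interface NodeVisitor {
        void visitTag(Tag tag);
        void visitEndTag(Tag tag);
    }

    /** Hypothetical stand-in for Tag; the real class carries far more state. */
    static class Tag {
        private final String text; // contents between '<' and '>', e.g. "div" or "/div"

        Tag(String text) {
            this.text = text;
        }

        String getText() {
            return text;
        }

        /** Assumption for this sketch: an end tag is one whose text starts with '/'. */
        boolean isEndTag() {
            return text.startsWith("/");
        }

        /** Single dispatch point, so no separate EndTag class is needed. */
        void accept(NodeVisitor visitor) {
            if (isEndTag()) {
                visitor.visitEndTag(this);
            } else {
                visitor.visitTag(this);
            }
        }
    }

    /** Records which callback each tag reached, in order. */
    static List<String> trace(List<Tag> tags) {
        final List<String> calls = new ArrayList<String>();
        NodeVisitor recorder = new NodeVisitor() {
            public void visitTag(Tag tag) {
                calls.add("visitTag:" + tag.getText());
            }

            public void visitEndTag(Tag tag) {
                calls.add("visitEndTag:" + tag.getText());
            }
        };
        for (Tag tag : tags) {
            tag.accept(recorder);
        }
        return calls;
    }

    public static void main(String[] args) {
        // Feed one start tag and one end tag through the same accept() method.
        System.out.println(trace(Arrays.asList(new Tag("div"), new Tag("/div"))));
    }
}
```

With this shape, existing visitors that override visitEndTag() would keep working even though EndTag itself is gone, since the choice of callback is made inside accept().
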

As you can see there's lots of work to do, so anyone with a death wish can jump in. I'll be working my way from top to bottom of the TODO list, committing and notifying the developer list after each item. So go ahead and do a take from CVS and jump in the middle with anything that appeals. Keep the list posted and update your CVS tree often (or subscribe to the htmlparser-cvs mailing list for interrupt-driven notification rather than polled notification).

Derrick