[Htmlparser-developer] lexer integration

Fixed or avoided the remaining failing unit tests.
It's a green bar now, 522 of 522 passing.
I shut up all the excess verbiage from the tests, so they're silent too.

TODO
=====

Helpers
-------
I desperately want to get rid of the last remaining 'helper' class, the 
CompositeTagScannerHelper. It's close; it just needs some more untangling.

Node Factory
------------
The factory concept needs to be extended. The Parser's createTagNode 
should look up the name of the node (from the attribute list provided), 
and create specific types of tags (FormTag, TableTag etc.) by cloning 
empty tags from a Hashtable of possible tag types (possibly called 
mBlastocyst in reference to undifferentiated stem cells).
This would provide a concrete implementation of createTag in 
CompositeTagScanner, removing a lot of near duplicate code from the 
scanners, and allow end users to plug in their own tags via a call like
  setTagFor ("BODY", new myBodyTag())
on the Parser. The end user wouldn't have to create or replace a scanner 
to get their own tags out.  Getting rid of the data package cleared up a 
lot of questions regarding the interaction scanners have with tags. In 
general, the scanner now creates the tag in a very straightforward 
bean-like manner:
      ret = new Div ();
      ret.setPage (page);
      ret.setStartPosition (start);
      ret.setEndPosition (end);
      ret.setAttributesEx (attributes);
      ret.setStartTag (startTag);
      ret.setEndTag (endTag);
      ret.setChildren (children);
This is nearly always the same in every scanner, only the tag name is 
different. The oddball cases have been highlighted with a
  // special step here...
comment in the code.  These special steps mostly revolve around 
meta-information available in scanners only (e.g. base href), or 
handling of nesting with a stack construct. It shouldn't be too much 
trouble to make these all go away.
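
A minimal sketch of the prototype idea described above, assuming a 
stand-in tag class. SketchTag, MyBodyTag and TagRegistry are hypothetical 
names for illustration only; setTagFor, createTagNode and mBlastocyst 
follow the names proposed in this post, not existing code:

```java
import java.util.Hashtable;
import java.util.Locale;

// Hypothetical stand-in for a tag node; not the library's real class.
class SketchTag implements Cloneable {
    private String name;

    String getName() { return name; }
    void setName(String name) { this.name = name; }

    @Override
    public SketchTag clone() {
        try {
            // Object.clone() preserves the runtime class, so subclasses
            // registered as prototypes come back as the same subclass.
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: Cloneable is implemented
        }
    }
}

// A user-supplied tag type, as an end user might plug in.
class MyBodyTag extends SketchTag { }

// Hypothetical registry of empty prototype tags, keyed by upper-case name.
class TagRegistry {
    private final Hashtable<String, SketchTag> mBlastocyst = new Hashtable<>();

    // Register (or replace) the prototype used for a given tag name.
    void setTagFor(String name, SketchTag prototype) {
        mBlastocyst.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    // Clone the registered prototype, or fall back to a generic tag.
    SketchTag createTagNode(String name) {
        SketchTag prototype = mBlastocyst.get(name.toUpperCase(Locale.ROOT));
        SketchTag ret = (prototype == null) ? new SketchTag() : prototype.clone();
        ret.setName(name);
        return ret;
    }
}
```

With this shape, the bean-like setters shown earlier would be applied 
once, to whatever createTagNode returns, instead of being repeated in 
every scanner.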

Scanners
--------
The script scanner has been replaced. It can be considered as a first 
pass at what needs to be done to replace the generic 
CompositeTagScanner. The use of the underlying lexer makes these 
specialty scanners much easier.

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than 
providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP 
constructs and also whitespace on either side of attribute equals signs. 
Currently the latter is handled by a kludgy fixAttributes() method 
applied after a tag is parsed, but it would be better handled in the 
state machine initially. The former isn't handled at all, and would 
involve all nodes possibly having children (a remark or string node can 
have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, 
needs to be handled -->). So some design work needs to be done to analyze 
the state transitions and gating characters.
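
To make the whitespace case concrete, here is a rough sketch of an 
attribute state machine that tolerates whitespace on either side of the 
equals sign. It is not the lexer's actual code; the class name, the state 
names and the choice of an empty string for valueless attributes are all 
assumptions, and JSP is not handled:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not the lexer's real state machine.
class AttributeStateMachine {
    private enum State { BEFORE_NAME, NAME, AFTER_NAME, BEFORE_VALUE, UNQUOTED, QUOTED }

    // Parse the attribute portion of a tag, e.g.: href = "a.html" checked
    static Map<String, String> parse(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        State state = State.BEFORE_NAME;
        char quote = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean space = Character.isWhitespace(ch);
            switch (state) {
                case BEFORE_NAME:
                    if (!space) { name.append(ch); state = State.NAME; }
                    break;
                case NAME:
                    if (ch == '=') state = State.BEFORE_VALUE;
                    else if (space) state = State.AFTER_NAME;
                    else name.append(ch);
                    break;
                case AFTER_NAME: // whitespace seen after a name; '=' may still follow
                    if (ch == '=') {
                        state = State.BEFORE_VALUE;
                    } else if (!space) {
                        store(attributes, name, value); // previous name had no value
                        name.append(ch);
                        state = State.NAME;
                    }
                    break;
                case BEFORE_VALUE: // whitespace after '=' is skipped here
                    if (ch == '"' || ch == '\'') { quote = ch; state = State.QUOTED; }
                    else if (!space) { value.append(ch); state = State.UNQUOTED; }
                    break;
                case UNQUOTED:
                    if (space) { store(attributes, name, value); state = State.BEFORE_NAME; }
                    else value.append(ch);
                    break;
                case QUOTED:
                    if (ch == quote) { store(attributes, name, value); state = State.BEFORE_NAME; }
                    else value.append(ch);
                    break;
            }
        }
        if (state != State.BEFORE_NAME) store(attributes, name, value); // flush at end of input
        return attributes;
    }

    // Record the pending attribute and reset both buffers.
    private static void store(Map<String, String> attributes, StringBuilder name, StringBuilder value) {
        attributes.put(name.toString(), value.toString());
        name.setLength(0);
        value.setLength(0);
    }
}
```

The AFTER_NAME and BEFORE_VALUE states are what absorb the stray 
whitespace, which is the job fixAttributes() currently does after the 
fact.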

Case Sensitive TestCase
-----------------------
Currently all string comparisons made via 
ParserTestCase.assertStringsEqual() are case insensitive. This should be 
turned off by setting ParserTestCase.mCaseInsensitiveComparisons to 
false, and the tests fixed to accommodate.
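
For illustration, a sketch of how such a flag might gate the comparison. 
The flag name comes from this post, but the class name, the method 
signature and the body are assumptions, not the actual ParserTestCase 
code:

```java
// Hypothetical stand-in for the test base class; not the real ParserTestCase.
class StringComparisonSketch {
    // Mirrors the flag named in the post; true reproduces today's lenient behaviour.
    static boolean mCaseInsensitiveComparisons = true;

    // Fail unless the strings match under the currently selected comparison.
    static void assertStringsEqual(String message, String expected, String actual) {
        boolean same = mCaseInsensitiveComparisons
            ? expected.equalsIgnoreCase(actual)
            : expected.equals(actual);
        if (!same) {
            throw new AssertionError(
                message + ": expected <" + expected + "> but was <" + actual + ">");
        }
    }
}
```

With the flag set to false, a test that expects "<BODY>" but gets 
"<body>" starts failing, which is the kind of test that would then need 
fixing.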

toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to 
regurgitate the original HTML via the toHtml() method, so the original 
page is unmodified except for any explicit user edits, e.g. link URL 
edits. But the parser fixes broken HTML without asking, so you can't get 
back an unadulterated page from toHtml(). A lot of test cases assume 
fixed HTML. Either a parameter on toHtml() or another method would be 
needed to provide the choice of the original HTML or the fixed HTML. 
There's some initial work on eliminating the added virtual end tags 
commented out in TagNode, but it will also require a way to remember 
broken tags, like ...<title>The Title</title</head><body>...
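
One possible shape for the parameter variant, sketched with a stand-in 
node class. SketchNode, its fields and the choice of fixed output as the 
default are assumptions for illustration; this only covers suppressing 
virtual end tags and does nothing about remembering broken tags like the 
one above:

```java
// Hypothetical stand-in for a composite tag; not the library's TagNode.
class SketchNode {
    private final String startTag;         // start tag text as found in the page
    private final String content;          // text between start and end tag
    private final String endTag;           // end tag text, real or invented
    private final boolean endTagIsVirtual; // true if the parser added the end tag

    SketchNode(String startTag, String content, String endTag, boolean endTagIsVirtual) {
        this.startTag = startTag;
        this.content = content;
        this.endTag = endTag;
        this.endTagIsVirtual = endTagIsVirtual;
    }

    // verbatim == true: leave out anything the parser invented.
    // verbatim == false: emit the fixed-up HTML, virtual end tag included.
    String toHtml(boolean verbatim) {
        StringBuilder ret = new StringBuilder(startTag).append(content);
        if (!(verbatim && endTagIsVirtual)) {
            ret.append(endTag);
        }
        return ret.toString();
    }

    // Assumed default: fixed HTML, since many existing tests expect it.
    String toHtml() {
        return toHtml(false);
    }
}
```

Keeping the no-argument toHtml() on the fixed output would leave the 
existing test cases alone while callers that want the original page opt 
in explicitly.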

GUI Parser Tool
---------------
Some GUI based parser application showing the HTML parse tree in one 
panel and the HTML text in another, with the tree node selected being 
highlighted in the text, or the text cursor setting the tree node 
selected, would be really good.

As you can see, there's lots of work to do, so anyone with a death wish 
can jump in. Go ahead and do a take from CVS and start on anything that 
appeals. Keep the list posted and update your CVS tree often (or 
subscribe to the htmlparser-cvs mailing list for interrupt-driven 
notification rather than polled notification).