The 'lexer integration' subject line is wearing a little thin, since
it's been a while since the lexer integration issues were completed,
so from now on I'll try to label these messages appropriately.
I've removed the scanners that didn't do anything anymore, leaving
script and jsp scanners.
Instead of registering a scanner to enable returning a specific tag you
now add a tag to a new class called PrototypicalNodeFactory.
These 'prototype' tags are cloned as needed to be returned from the parser.
All known tags are 'registered' by default in a new Parser, which is
similar to having called the old 'registerDOMScanners()', so tags are
fully nested.
This is a change in behaviour, so you will need to recurse into returned
nodes to get at what you want. If you want to return only some of the
derived tags while keeping most as generic tags and a flatter structure,
there are various constructors and manipulators on the factory. See the
javadocs and the examples in the tests package.
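The prototype idea described above can be sketched roughly as follows.
This is a self-contained illustration only: SketchTag, SketchNodeFactory
and their methods are hypothetical names made up for the example, not the
actual PrototypicalNodeFactory API, for which the javadocs remain the
reference.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a tag class; not the library's own tag type.
class SketchTag implements Cloneable {
    final String kind; // e.g. "link" for a derived tag, "generic" otherwise
    String name;       // tag name as found in the page

    SketchTag(String kind) {
        this.kind = kind;
    }

    @Override
    public SketchTag clone() {
        try {
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen, the class is Cloneable
        }
    }
}

// Hypothetical factory: keeps one prototype per tag name and clones on demand.
class SketchNodeFactory {
    private final Map<String, SketchTag> prototypes = new HashMap<>();

    // Tag names are stored upper-cased so lookups are case-insensitive.
    void registerTag(String name, SketchTag prototype) {
        prototypes.put(name.toUpperCase(Locale.ROOT), prototype);
    }

    // Returns a fresh clone of the registered prototype, or a generic tag
    // when nothing is registered under that name.
    SketchTag createTag(String name) {
        SketchTag prototype = prototypes.get(name.toUpperCase(Locale.ROOT));
        SketchTag tag = (prototype == null) ? new SketchTag("generic") : prototype.clone();
        tag.name = name;
        return tag;
    }
}
```

In this sketch, leaving a tag name unregistered is what yields a generic
tag, which mirrors the idea of returning only some derived tags.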
Nearly all the old scanner tests are folded into the tag tests.
I've changed the operation of toString() for CompositeTags. It now
returns an indented listing of children so the mainline from the Parser
looks better.
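An indented listing of this kind can be produced with a simple recursive
walk. The sketch below assumes a hypothetical SketchNode class and a
two-space indent; it is not the CompositeTag code, just the general shape
of the technique.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type used only for this illustration.
class SketchNode {
    final String text;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String text) {
        this.text = text;
    }

    // Adds a child and returns this node so calls can be chained.
    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        append(out, 0);
        return out.toString();
    }

    // Writes this node at the given depth, then recurses into its children.
    private void append(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
        for (SketchNode child : children) {
            child.append(out, depth + 1);
        }
    }
}
```

The same recursion is what a caller needs in order to reach nested nodes
now that tags are fully nested by default.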
TODO
=====
1.3.1
------
It looks like there are enough bugs and requests to warrant another 1.3
point release with some patched files.
I hate to work on a branch, but it may be the only way to get everyone
off my back.
Filters
-------
Implement the new filtering mechanism for NodeList.searchFor().
Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than
providing any helpful advice. This needs to be reworked completely.
Augment Lexer State Machines
----------------------------------------
There are some changes needed in the lexer state machines to handle JSP
constructs and also whitespace on either side of attribute equals signs.
Currently the latter is handled by a kludgy fixAttributes() method
applied after a tag is parsed, but it would be better handled in the
state machine initially. The former isn't handled at all, and would
involve all nodes possibly having children (a remark or string node can
have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>,
needs to be handled -->). So some design work needs to be done to analyze
the state transitions and gating characters.
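As a starting point for that design work, here is a minimal sketch of an
attribute state machine that absorbs whitespace around the equals sign
directly, so no fix-up pass is needed afterwards. The class name, the
state names and the parse() method are all hypothetical, the input is
assumed to be only the attribute portion of a tag, and JSP constructs are
not handled at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the lexer's real state machine.
class AttributeStateMachineSketch {
    private enum State { BEFORE_NAME, IN_NAME, AFTER_NAME, BEFORE_VALUE, IN_QUOTED, IN_UNQUOTED }

    // Returns attributes in source order; valueless attributes map to null.
    static Map<String, String> parse(String input) {
        Map<String, String> attributes = new LinkedHashMap<>();
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        State state = State.BEFORE_NAME;
        char quote = 0;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean space = Character.isWhitespace(c);
            switch (state) {
                case BEFORE_NAME:
                    if (!space) { name.append(c); state = State.IN_NAME; }
                    break;
                case IN_NAME:
                    if (c == '=') state = State.BEFORE_VALUE;
                    else if (space) state = State.AFTER_NAME;
                    else name.append(c);
                    break;
                case AFTER_NAME: // whitespace seen after a name: '=' may still follow
                    if (c == '=') {
                        state = State.BEFORE_VALUE;
                    } else if (!space) {
                        attributes.put(name.toString(), null); // previous name had no value
                        name.setLength(0);
                        name.append(c);
                        state = State.IN_NAME;
                    }
                    break;
                case BEFORE_VALUE: // whitespace after '=' is skipped here
                    if (c == '"' || c == '\'') { quote = c; state = State.IN_QUOTED; }
                    else if (!space) { value.append(c); state = State.IN_UNQUOTED; }
                    break;
                case IN_QUOTED:
                case IN_UNQUOTED:
                    boolean done = (state == State.IN_QUOTED) ? (c == quote) : space;
                    if (done) {
                        attributes.put(name.toString(), value.toString());
                        name.setLength(0);
                        value.setLength(0);
                        state = State.BEFORE_NAME;
                    } else {
                        value.append(c);
                    }
                    break;
            }
        }
        // Flush whatever was in progress when the input ended.
        if (state == State.IN_NAME || state == State.AFTER_NAME) {
            attributes.put(name.toString(), null);
        } else if (state != State.BEFORE_NAME) {
            attributes.put(name.toString(), value.toString());
        }
        return attributes;
    }
}
```

The AFTER_NAME and BEFORE_VALUE states are the two places where the
whitespace is swallowed; the gating characters in this sketch are the
equals sign, the two quote characters and whitespace.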
toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to
regurgitate the original HTML via the toHtml() method, so the original
page is unmodified except for any explicit user edits, e.g. link URL
edits. But the parser fixes broken HTML without asking, so you can't get
back an unadulterated page from toHtml(). A lot of test cases assume
fixed HTML. Either a parameter on toHtml() or another method would be
needed to provide the choice of the original HTML or the fixed HTML.
There's some initial work on eliminating the added virtual end tags
commented out in TagNode, but it will also require a way to remember
broken tags, like ...<title>The Title</title</head><body>...
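One possible shape for the 'parameter on toHtml()' option is sketched
below. Everything here is an assumption made for illustration: the
SketchPiece and SketchPage classes are hypothetical, and so is the idea
of storing the original text next to the repaired text for each piece of
the page (with an empty original for a virtual end tag the parser added).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of one piece of a page; not a class from the parser.
class SketchPiece {
    final String original; // text exactly as read, "" if the parser invented it
    final String fixed;    // text after the parser's repairs

    SketchPiece(String original, String fixed) {
        this.original = original;
        this.fixed = fixed;
    }
}

// Hypothetical container that can render either form of the page.
class SketchPage {
    private final List<SketchPiece> pieces = new ArrayList<>();

    // Records one piece and returns this page so calls can be chained.
    SketchPage add(String original, String fixed) {
        pieces.add(new SketchPiece(original, fixed));
        return this;
    }

    // verbatim == true gives back the page as read, false the repaired page.
    String toHtml(boolean verbatim) {
        StringBuilder out = new StringBuilder();
        for (SketchPiece piece : pieces) {
            out.append(verbatim ? piece.original : piece.fixed);
        }
        return out.toString();
    }
}
```

For the broken title example, a piece recorded as add("</title",
"</title>") would come back broken from toHtml(true) and repaired from
toHtml(false), assuming the repair is simply to close the tag.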
GUI Parser Tool
---------------------
Some GUI based parser application showing the HTML parse tree in one
panel and the HTML text in another, with the tree node selected being
highlighted in the text, or the text cursor setting the tree node
selected, would be really good. A filter builder tool to graphically
construct a program to extract a snippet from an HTML page would blow
people away.
Applications
-----------
Rework all the applications for a better 'out of the box' experience for
new and novice users. Fix all the scripts in /bin (for Unix and Windows)
and add any others that don't exist already.
Clean Up
------------
The integration process needs to be revamped to make use of the $Name: CVS
substitution, so a checkin isn't required every integration.
Block/Inline
----------------
The tag-enders and end-tag-enders lists are only a partial solution to
the HTML specification for block and inline tags. By ensuring block tags
don't overlap, a better parsing job could be done, e.g.
<FORM> .... <TABLE> ... </FORM></TABLE>
would be rearranged as
<FORM> .... <TABLE> ... </TABLE></FORM>
This needs some design work.
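To make the rearrangement concrete, here is a minimal stack-based sketch.
It assumes tags arrive as plain names such as "FORM" and "/FORM", treats
every tag as a block tag, and uses a hypothetical BlockNestingSketch
class; telling block tags from inline ones per the HTML specification is
exactly the design work that is still missing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; not part of the parser.
class BlockNestingSketch {
    // Reorders end tags so that no two tags overlap.
    static List<String> rearrange(List<String> tags) {
        Deque<String> open = new ArrayDeque<>();      // currently open tags, innermost first
        List<String> closedEarly = new ArrayList<>(); // tags we had to close ahead of time
        List<String> out = new ArrayList<>();

        for (String tag : tags) {
            if (!tag.startsWith("/")) {
                open.push(tag);
                out.add(tag);
                continue;
            }
            String name = tag.substring(1);
            if (open.contains(name)) {
                // Close everything opened after 'name' before closing 'name' itself.
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.add("/" + inner);
                    closedEarly.add(inner);
                }
                open.pop();
                out.add(tag);
            } else if (!closedEarly.remove(name)) {
                // An end tag we know nothing about: pass it through untouched.
                out.add(tag);
            }
            // Otherwise this is the late end tag of something already closed: drop it.
        }
        return out;
    }
}
```

Given FORM, TABLE, /FORM, /TABLE this produces FORM, TABLE, /TABLE,
/FORM, which is the rearrangement shown above.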