[Htmlparser-developer] lexer integration
Brought to you by:
derrickoswald
From: Derrick O. <Der...@Ro...> - 2003-11-06 04:06:41
OK, almost ready to get rid of most of the scanner package that shadows
the tag package.
There remains the 'filter' concept to handle, and then all but
TagScanner, CompositeTagScanner and ScriptScanner are obsolete.
The tags now own their 'ids', 'enders' and 'end tag enders' lists, and
the isTagToBeEndedFor() logic now uses information from the tags, not
the scanners.
Nodes are created by cloning from a list of prototypes in the Parser
(NodeFactory), so the scanners no longer create the tags (but they still
create the prototypical ones).
Now, the startTag() *is* the CompositeTag, and the CompositeTagScanner
just adds children to an already differentiated tag.
The scanners have no special actions on behalf of tags anymore. Things
like the LinkProcessor and form ACTION determination have been moved out
of the scanners and into either the Page object or the appropriate tags.
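The prototype scheme can be sketched roughly as follows. The class and method names here are illustrative stand-ins, not the parser's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a tag node; the real tags also carry 'enders' and
// 'end tag enders' lists, so ending logic lives with the tag.
class SketchTag {
    private final String name;
    SketchTag(String name) { this.name = name; }
    String getName() { return name; }
    // The ids a tag answers to (e.g. A, LINK for a link tag).
    String[] getIds() { return new String[] { name.toUpperCase() }; }
    SketchTag copy() { return new SketchTag(name); }  // stands in for clone()
}

// Prototype-based node creation: the factory keeps one prototype tag per
// tag id and clones it whenever the lexer meets that tag on the page.
class PrototypeNodeFactory {
    private final Map<String, SketchTag> prototypes = new HashMap<>();

    // Register a prototype under each of the ids it answers to.
    void register(SketchTag prototype) {
        for (String id : prototype.getIds())
            prototypes.put(id, prototype);
    }

    // Clone the matching prototype; unknown names fall back to a generic tag.
    SketchTag createTag(String name) {
        SketchTag prototype = prototypes.get(name.toUpperCase());
        return (prototype != null) ? prototype.copy() : new SketchTag(name);
    }
}
```

The scanners then only differentiate and fill in children; they no longer construct the tags themselves.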
Other changes:
Made visitor 'node visiting order' the same order as on the page.
Fixed StringBean, which was still looking for end tags with names
starting with a slash, i.e. "/SCRIPT".
Added some debugging support to the lexer, so you can easily base a
breakpoint on a line number in an HTML page.
Fixed all the tests that failed when case sensitivity was turned on. Now
ParserTestCase does case-sensitive comparisons.
Converted native characters in tests to Unicode. Mostly this was the
division sign (\u00f7) used in tests of character entity reference
translation.
Removed deprecated method calls: elementBegin() is now getStartPosition()
and elementEnd() is now getEndPosition().
Also fixed the NodeFactory signatures to take a Page rather than a Lexer.
TODO
====
Filters
-------
Replace the String to String comparison of the 'filter' concept with a
TagFilter interface:
boolean accept (Tag tag);
and allow users to perform something like:
NodeList list = parser.extractAllNodesThatAre (
    new TagFilter () { public boolean accept (Tag tag) { return
        (tag.getClass () == LinkTag.class); } });
And similarly for:
tag.collectInto (NodeList collectionList, TagFilter filter);
nodelist.searchFor (TagFilter filter);
parser.parse (TagFilter filter);
etc.
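A minimal self-contained sketch of how such a filter interface and a collectInto-style traversal might fit together (all names here are hypothetical, standing in for the real Tag and NodeList classes):

```java
import java.util.ArrayList;
import java.util.List;

// The proposed filter concept: a single-method interface that replaces
// String-to-String name comparisons with an arbitrary predicate on tags.
interface TagFilter {
    boolean accept(TagNodeSketch tag);
}

// Minimal stand-in node so the sketch is self-contained.
class TagNodeSketch {
    final String name;
    TagNodeSketch(String name) { this.name = name; }
}

class FilterDemo {
    // collectInto-style helper: gather every tag the filter accepts.
    static List<TagNodeSketch> collect(List<TagNodeSketch> nodes, TagFilter filter) {
        List<TagNodeSketch> out = new ArrayList<>();
        for (TagNodeSketch node : nodes)
            if (filter.accept(node))
                out.add(node);
        return out;
    }
}
```

A caller would then pass an anonymous class (or, today, a lambda) instead of a tag-name string, which makes arbitrary criteria like class identity or attribute values possible.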
Remove Scanners
---------------
Finish off obviating the scanners. Think of a good way to group tags so
that adding one tag to the list of tags returned by the parser also adds
its buddies; for example, the Form scanner now adds the Input, TextArea,
Selection and Option scanners behind the scenes for you. Then replace
the add, remove, get, etc. scanner methods on the parser with comparable
tag-based ones. Alter all the test cases to use the new methods, move
all the unique scanner test cases into tag test cases, and then delete
most of the scannersTests package.
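One possible shape for the 'buddy' grouping, sketched with invented names: a tag declares its companions, and registration adds them transitively, so registering FORM also pulls in INPUT, TEXTAREA, and so on.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// A tag that knows its companion tags (hypothetical sketch class).
class GroupedTag {
    final String name;
    private final List<GroupedTag> buddies = new ArrayList<>();
    GroupedTag(String name) { this.name = name; }
    GroupedTag with(GroupedTag buddy) { buddies.add(buddy); return this; }
    List<GroupedTag> buddies() { return buddies; }
}

// Registering one tag transitively registers its buddies; the visited-set
// check keeps cyclic buddy references from recursing forever.
class TagGroupRegistry {
    private final Set<String> registered = new LinkedHashSet<>();

    void add(GroupedTag tag) {
        if (registered.add(tag.name))
            for (GroupedTag buddy : tag.buddies())
                add(buddy);
    }

    Set<String> names() { return registered; }
}
```

The parser's tag-based add/remove/get methods could then delegate to something like this instead of the scanner bookkeeping.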
Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than
providing any helpful advice. This needs to be reworked completely.
Augment Lexer State Machines
----------------------------
There are some changes needed in the lexer state machines to handle JSP
constructs and also whitespace either side of attribute equals signs.
Currently the latter is handled by a kludgy fixAttributes() method
applied after a tag is parsed, but it would be better handled in the
state machine initially. The former isn't handled at all, and would
involve all nodes possibly having children (a remark or string node can
have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>,
needs to be handled -->). So some design work needs to be done to
analyze the state transitions and gating characters.
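As a toy illustration of folding the whitespace handling into the state machine itself (these states are invented for the sketch, not the lexer's actual ones):

```java
// Toy state machine fragment for attribute parsing that tolerates
// whitespace on either side of '=', instead of repairing the attribute
// afterwards with something like fixAttributes().
class AttributeLexerSketch {
    enum State { IN_NAME, AFTER_NAME, BEFORE_VALUE, IN_VALUE }

    // Returns "name=value" for input like "align  =  center"; a stand-in
    // for emitting one attribute. Quoting and multiple attributes are
    // deliberately ignored in this sketch.
    static String parse(String input) {
        State state = State.IN_NAME;
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        for (char ch : input.toCharArray()) {
            switch (state) {
                case IN_NAME:
                    if (ch == '=') state = State.BEFORE_VALUE;
                    else if (Character.isWhitespace(ch)) state = State.AFTER_NAME;
                    else name.append(ch);
                    break;
                case AFTER_NAME:
                    // whitespace before '=' is simply consumed
                    if (ch == '=') state = State.BEFORE_VALUE;
                    break;
                case BEFORE_VALUE:
                    // whitespace after '=' is simply consumed
                    if (!Character.isWhitespace(ch)) {
                        value.append(ch);
                        state = State.IN_VALUE;
                    }
                    break;
                case IN_VALUE:
                    if (!Character.isWhitespace(ch)) value.append(ch);
                    break;
            }
        }
        return name + "=" + value;
    }
}
```

The real lexer would of course need the full set of transitions (quoted values, a second attribute starting in AFTER_NAME, etc.); the point is only that the whitespace cases become ordinary transitions rather than a post-pass.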
toHtml(verbatim/fixed)
----------------------
One of the design goals for the new Lexer subsystem was to be able to
regurgitate the original HTML via the toHtml() method, so the original
page is unmodified except for any explicit user edits, e.g. link URL
edits. But the parser fixes broken HTML without asking, so you can't get
back an unadulterated page from toHtml(). A lot of test cases assume
fixed HTML. Either a parameter on toHtml() or another method would be
needed to provide the choice of the original HTML or the fixed HTML.
There's some initial work on eliminating the added virtual end tags
commented out in TagNode, but it will also require a way to remember
broken tags, like ...<title>The Title</title</head><body>...
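One possible shape for the choice: a flag on toHtml() plus a per-node record of whether the end tag was virtual (inserted by the parser rather than present on the page). A hypothetical sketch:

```java
// Sketch: a node remembers whether its end tag was virtual, and omits it
// when verbatim output is requested, so toHtml(true) regurgitates the
// page as lexed while toHtml(false) emits the fixed-up HTML.
class TagSketch {
    private final String name;
    private final String text;
    private final boolean virtualEndTag;  // true if the parser invented </name>

    TagSketch(String name, String text, boolean virtualEndTag) {
        this.name = name;
        this.text = text;
        this.virtualEndTag = virtualEndTag;
    }

    String toHtml(boolean verbatim) {
        StringBuilder html = new StringBuilder();
        html.append('<').append(name).append('>').append(text);
        // Emit the end tag only when fixing, or when it really was there.
        if (!verbatim || !virtualEndTag)
            html.append("</").append(name).append('>');
        return html.toString();
    }
}
```

Remembering genuinely broken tags (like the </title missing its '>') would need more than a boolean, presumably the raw lexeme itself, which is why the TagNode work is still commented out.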
GUI Parser Tool
---------------
Some GUI based parser application showing the HTML parse tree in one
panel and the HTML text in another, with the tree node selected being
highlighted in the text, or the text cursor setting the tree node
selected, would be really good.
Applications
------------
Rework all the applications for a better 'out of the box' experience for
new and novice users. Fix all the scripts in /bin (for Unix and Windows)
and add any others that don't already exist.
As you can see there's lots of work to do, so anyone with a death wish
can jump in. Go ahead, check out the tree from CVS, and dive into
anything that appeals. Keep the list posted and update your CVS tree
often (or subscribe to the htmlparser-cvs mailing list for
interrupt-driven notification rather than polled notification).