Thread: [Htmlparser-developer] scanners removed, was: lexer integration

The 'lexer integration' subject line is wearing a little thin, since 
the lexer integration work was finished a while ago,
so from now on I'll try to label these messages appropriately.

I've removed the scanners that no longer did anything, leaving only the 
script and JSP scanners.
Instead of registering a scanner to have the parser return a specific 
tag, you now add a tag to a new class called PrototypicalNodeFactory.
These 'prototype' tags are cloned as needed to be returned from the parser.
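
As a rough illustration of the prototype idea (this is not the actual PrototypicalNodeFactory code; SketchTag, SketchNodeFactory and their methods are names made up for this sketch), a factory can keep one prototype per tag name and hand out a clone each time that name turns up:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not htmlparser classes.
class SketchTag implements Cloneable {
    private final String kind;   // e.g. "generic" or "link"
    private String name;         // tag name as found in the page

    SketchTag(String kind, String name) {
        this.kind = kind;
        this.name = name;
    }

    String getKind() { return kind; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }

    @Override
    public SketchTag clone() {
        try {
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we implement Cloneable
        }
    }
}

class SketchNodeFactory {
    private final Map<String, SketchTag> prototypes = new HashMap<>();
    private final SketchTag generic = new SketchTag("generic", "");

    // Register a prototype under its (case-insensitive) tag name.
    void registerTag(SketchTag prototype) {
        prototypes.put(prototype.getName().toUpperCase(Locale.ROOT), prototype);
    }

    // Return a fresh copy of the matching prototype, or of the generic tag
    // if nothing is registered for that name.
    SketchTag createTag(String name) {
        SketchTag prototype =
            prototypes.getOrDefault(name.toUpperCase(Locale.ROOT), generic);
        SketchTag copy = prototype.clone();
        copy.setName(name);
        return copy;
    }
}
```

Because every returned tag is a copy, editing one never disturbs the registered prototype.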

All known tags are 'registered' by default in a new Parser, which is 
similar to having called the old 'registerDOMScanners()', so tags are 
fully nested.
This is a change in behaviour, so you will need to recurse into returned 
nodes to get at what you want. If you want only some of the derived tags 
returned, keeping most as generic tags and a flatter structure, there 
are various constructors and manipulators on the factory. See the 
javadocs and the examples in the tests package.
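
For anyone unsure what 'recurse into returned nodes' amounts to, here is a minimal sketch. NestedNode and NestedSearch are hypothetical types invented for this example, not the parser's own Node interface, so treat it as the shape of the idea only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration; not the htmlparser Node interface.
class NestedNode {
    final String name;
    final List<NestedNode> children = new ArrayList<>();

    NestedNode(String name, NestedNode... kids) {
        this.name = name;
        for (NestedNode kid : kids) {
            children.add(kid);
        }
    }
}

class NestedSearch {
    // Depth-first walk that collects every node whose name matches,
    // no matter how deeply it is nested.
    static void collect(NestedNode node, String wanted, List<NestedNode> found) {
        if (node.name.equalsIgnoreCase(wanted)) {
            found.add(node);
        }
        for (NestedNode child : node.children) {
            collect(child, wanted, found);
        }
    }
}
```

With fully nested tags, a link inside a table cell is only reachable through a walk like this; looking at the top-level nodes alone would miss it.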

Nearly all the old scanner tests are folded into the tag tests.

I've changed the operation of toString() for CompositeTags. It now 
returns an indented listing of children so the mainline from the Parser 
looks better.
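
The following sketch shows one way an indented listing of children can be produced. IndentedNode is a hypothetical class for this example, and the two-space indent is an assumption; the real CompositeTag output format may well differ:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical composite node; the real CompositeTag output may differ.
class IndentedNode {
    private final String text;
    private final List<IndentedNode> children = new ArrayList<>();

    IndentedNode(String text) {
        this.text = text;
    }

    IndentedNode add(IndentedNode child) {
        children.add(child);
        return this; // allow chaining
    }

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        append(out, 0);
        return out.toString();
    }

    // Write this node, then each child one level (two spaces) deeper.
    private void append(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
        for (IndentedNode child : children) {
            child.append(out, depth + 1);
        }
    }
}
```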

TODO
=====

1.3.1
------
It looks like there are enough bugs and requests to warrant another 1.3 
point release with some patched files.
I hate to work on a branch, but it may be the only way to get everyone 
off my back.

Filters
-------
Implement the new filtering mechanism for NodeList.searchFor().

Documentation
-------------
As of now, it's more likely that the javadocs are lying to you than 
providing any helpful advice. This needs to be reworked completely.

Augment Lexer State Machines
----------------------------------------
There are some changes needed in the lexer state machines to handle JSP 
constructs and also whitespace on either side of attribute equals signs. 
Currently the latter is handled by a kludgy fixAttributes() method 
applied after a tag is parsed, but it would be better handled in the 
state machine in the first place. The former isn't handled at all, and 
would involve all nodes possibly having children (a remark or string 
node can have embedded JSP, e.g. <!-- this remark, created on 
<%@ date() %>, needs to be handled -->). So some design work needs to be 
done to analyze the state transitions and gating characters.
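
To make the attribute half of this concrete, here is a small sketch of a state machine that tolerates whitespace on either side of the equals sign. The class name and the states are invented for this example and are not the Lexer's actual states; it also ignores JSP entirely:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the states below are invented, not the real Lexer's.
class AttributeSketch {
    private enum State {
        BEFORE_NAME, IN_NAME, AFTER_NAME, BEFORE_VALUE, IN_QUOTED, IN_UNQUOTED
    }

    // Parse the text between the tag name and '>' into name/value pairs.
    // Attributes without a value map to the empty string.
    static Map<String, String> parse(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        State state = State.BEFORE_NAME;
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        char quote = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean space = Character.isWhitespace(c);
            switch (state) {
                case BEFORE_NAME:
                    if (!space) { name.append(c); state = State.IN_NAME; }
                    break;
                case IN_NAME:
                    if (c == '=') state = State.BEFORE_VALUE;
                    else if (space) state = State.AFTER_NAME;
                    else name.append(c);
                    break;
                case AFTER_NAME:
                    // Whitespace after a name: an '=' may still follow.
                    if (c == '=') {
                        state = State.BEFORE_VALUE;
                    } else if (!space) {
                        // A new name starts, so the previous one had no value.
                        attributes.put(name.toString(), "");
                        name.setLength(0);
                        name.append(c);
                        state = State.IN_NAME;
                    }
                    break;
                case BEFORE_VALUE:
                    // Whitespace after '=' is skipped, not treated as the end.
                    if (c == '"' || c == '\'') { quote = c; state = State.IN_QUOTED; }
                    else if (!space) { value.append(c); state = State.IN_UNQUOTED; }
                    break;
                case IN_QUOTED:
                case IN_UNQUOTED:
                    boolean done = (state == State.IN_QUOTED) ? c == quote : space;
                    if (done) {
                        attributes.put(name.toString(), value.toString());
                        name.setLength(0);
                        value.setLength(0);
                        state = State.BEFORE_NAME;
                    } else {
                        value.append(c);
                    }
                    break;
            }
        }
        // Flush whatever was in progress when the input ended.
        if (name.length() > 0) {
            attributes.put(name.toString(), value.toString());
        }
        return attributes;
    }
}
```

The AFTER_NAME and BEFORE_VALUE states are what make a separate fix-up pass unnecessary: whitespace moves the machine between states instead of ending the attribute.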

toHtml(verbatim/fixed)
-----------------------------
One of the design goals for the new Lexer subsystem was to be able to 
regurgitate the original HTML via the toHtml() method, so the original 
page is unmodified except for any explicit user edits, e.g. link URL 
edits. But the parser fixes broken HTML without asking, so you can't get 
back an unadulterated page from toHtml(). A lot of test cases assume 
fixed HTML. Either a parameter on toHtml() or another method would be 
needed to provide the choice of the original HTML or the fixed HTML. 
There's some initial work on eliminating the added virtual end tags 
commented out in TagNode, but it will also require a way to remember 
broken tags, like ...<title>The Title</title</head><body>...

GUI Parser Tool
---------------------
A GUI-based parser application showing the HTML parse tree in one 
panel and the HTML text in another, with the tree node selected being 
highlighted in the text, or the text cursor setting the tree node 
selected, would be really good. A filter builder tool to graphically 
construct a program to extract a snippet from an HTML page would blow 
people away.

Applications
-----------
Rework all the applications for a better 'out of the box' experience for 
new and novice users. Fix all the scripts in /bin (for Unix and Windows) 
and add any others that don't exist already.

Clean Up
------------
The integration process needs to be revamped to make use of the $Name: CVS 
substitution, so a checkin isn't required for every integration.

Block/Inline
----------------
The tag-enders and end-tag-enders lists only partially implement the 
HTML specification's rules for block and inline tags. By ensuring block 
tags don't overlap, a better parsing job could be done, e.g.
    <FORM>    ....   <TABLE>   ...   </FORM></TABLE>
would be rearranged as
    <FORM>    ....   <TABLE>   ...   </TABLE></FORM>
This needs some design work.
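
One possible starting point for that design is a plain stack of open tags, sketched below. NestingSketch is a hypothetical class, not code from the parser; it assumes every tag is a block tag, that tag names match in case, and that tokens are already split out and carry no attributes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of stack-based repair; not code from the parser.
class NestingSketch {
    // Takes tokens such as "<FORM>", "text", "</FORM>" and returns them
    // rearranged so that end tags close the innermost open tag first.
    static List<String> repair(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>(); // names of currently open tags
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue; // stray end tag, already closed earlier: drop it
                }
                // Close everything opened after the matching start tag.
                while (!open.peek().equals(name)) {
                    out.add("</" + open.pop() + ">");
                }
                open.pop();
                out.add(token);
            } else {
                if (token.startsWith("<")) {
                    open.push(token.substring(1, token.length() - 1));
                }
                out.add(token);
            }
        }
        return out;
    }
}
```

Fed the FORM/TABLE sequence above, the early </FORM> forces </TABLE> out first, and the later </TABLE> is dropped as a stray. Tags left open at the end of input are not closed by this sketch.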
